Enable Debug Logging for YARN Application in Hive

When troubleshooting Hadoop related issues, we often need to enable DEBUG level logging so that we can see more of what Hadoop is doing when running jobs. This blog post shows how you can enable DEBUG logging for YARN application jobs when you run them through Hive.

Doing so is simple: just add the three lines below in your Hive session before you run your query:

SET mapreduce.map.log.level=DEBUG;
SET mapreduce.reduce.log.level=DEBUG;
SET yarn.app.mapreduce.am.log.level=DEBUG;

This will turn Mapper, Reducer and ApplicationMaster logging up to DEBUG level. If you just need one of them, simply remove the other lines, as shown below.
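
For example, to enable DEBUG for the mappers only:

SET mapreduce.map.log.level=DEBUG;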

Then you can just run your query as normal:

SELECT COUNT(*) FROM tablename;
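
Once the job finishes, you can pull the task logs back from YARN to confirm that DEBUG took effect. A minimal sketch, using the application ID your Hive session prints (here the one that appears in the logs below):

yarn logs -applicationId application_1526873868324_0001 | less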

Before enabling DEBUG, the Mapper log looked like this:

2018-05-22 21:06:30,267 WARN [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: org.apache.commons.configuration.ConfigurationException: Unable to load the configuration from the URL file:/run/cloudera-scm-agent/process/1370-yarn-NODEMANAGER/hadoop-metrics2.properties
2018-05-22 21:06:30,338 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2018-05-22 21:06:30,339 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1526873868324_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@54b83d10)
2018-05-22 21:06:30,686 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.100.142:8020, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/host-10-17-100-142.coe.cloudera.com@CDH511.COM, renewer=yarn, realUser=, issueDate=1527048353000, maxDate=1527653153000, sequenceNumber=787, masterKeyId=173)
2018-05-22 21:06:31,556 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2018-05-22 21:06:32,194 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /yarn/nm2/usercache/hive/appcache/application_1526873868324_0001
2018-05-22 21:06:33,738 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2018-05-22 21:06:34,271 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]

After DEBUG is enabled:

2018-05-22 21:08:00,607 DEBUG [main] org.apache.hadoop.mapred.YarnChild: Child starting
2018-05-22 21:08:01,936 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Rate of successful kerberos logins and latency (milliseconds)], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,968 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Rate of failed kerberos logins and latency (milliseconds)], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,968 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[GetGroups], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,969 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field private org.apache.hadoop.metrics2.lib.MutableGaugeLong org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Renewal failures since startup], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,969 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field private org.apache.hadoop.metrics2.lib.MutableGaugeInt org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Renewal failures since last successful login], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,970 DEBUG [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: UgiMetrics, User and group related metrics
2018-05-22 21:08:02,125 DEBUG [main] org.apache.hadoop.security.SecurityUtil: Setting hadoop.security.token.service.use_ip to true
2018-05-22 21:08:02,169 DEBUG [main] org.apache.hadoop.security.Groups:  Creating new Groups object
2018-05-22 21:08:02,211 DEBUG [main] org.apache.hadoop.util.Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:351)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:376)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:168)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:132)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:100)
	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:435)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:337)
	at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:374)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:78)
2018-05-22 21:08:02,425 DEBUG [main] org.apache.hadoop.util.Shell: setsid exited with exit code 0
2018-05-22 21:08:02,521 DEBUG [main] org.apache.hadoop.security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000; warningDeltaMs=5000
2018-05-22 21:08:02,548 DEBUG [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: from system property: null
2018-05-22 21:08:02,549 DEBUG [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: from environment variable: null
2018-05-22 21:08:02,614 DEBUG [main] org.apache.commons.configuration.ConfigurationUtils: ConfigurationUtils.locate(): base is null, name is hadoop-metrics2-maptask.properties
2018-05-22 21:08:02,616 DEBUG [main] org.apache.commons.configuration.ConfigurationUtils: ConfigurationUtils.locate(): base is null, name is hadoop-metrics2.properties
2018-05-22 21:08:02,617 DEBUG [main] org.apache.commons.configuration.ConfigurationUtils: Loading configuration from the context classpath (hadoop-metrics2.properties)
2018-05-22 21:08:02,620 WARN [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: org.apache.commons.configuration.ConfigurationException: Unable to load the configuration from the URL file:/run/cloudera-scm-agent/process/1370-yarn-NODEMANAGER/hadoop-metrics2.properties

Hope the above helps!

YARN Job Failed with Error: “Split metadata size exceeded 10000000”

When you run a really big job in Hive, it might fail with the following error:

2016-06-28 18:55:36,830 INFO [Thread-58] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Job init failed : org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Split metadata size exceeded 10000000. Aborting job job_1465344841306_1317
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1568)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1432)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1390)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1057)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1500)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1496)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1429)
Caused by: java.io.IOException: Split metadata size exceeded 10000000. Aborting job job_1465344841306_1317
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:53)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1563)
... 17 more

This indicates that the value of mapreduce.job.split.metainfo.maxsize (default 10000000) is too small for your job.

There are two options to fix this:

1. Set the value of mapreduce.job.split.metainfo.maxsize to “-1” (unlimited) for this job only, just before running it:

SET mapreduce.job.split.metainfo.maxsize=-1;

This should remove the limit. Be warned, however, that it effectively allows the job to create unlimited split metadata; if there are not enough resources on your cluster, it has the potential to bring down hosts.

2. The safer way is to increase the value, for example to double the default of 10000000:

SET mapreduce.job.split.metainfo.maxsize=20000000;

You can increase the value gradually and monitor your cluster to make sure the larger limit does not bring down your machines.

I have seen other posts on Google suggesting to set mapreduce.job.split.metainfo.maxsize in the mapred-site.xml configuration file. In my opinion, this limit only affects the small number of queries that run against very BIG data sets, so it is better to set the value at the job level, so that no cluster restart is required. If you do prefer a cluster-wide default, see the snippet below.
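
A minimal sketch of that mapred-site.xml entry, with an example value only:

<property>
  <name>mapreduce.job.split.metainfo.maxsize</name>
  <value>20000000</value>
</property>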

Please note that if you are using MapReduce V1, the setting should be mapreduce.jobtracker.split.metainfo.maxsize instead, which does the same thing.

Hope this helps.

Hive Query Failed with Error: “Killing the Job. mapResourceReqt: 1638 maxContainerCapability:1200”

This article explains how to fix the following error when running a Hive query:

MAP capability required is more than the supported max container capability in the cluster. 
Killing the Job. mapResourceRequest: <memory:1638, vCores:1> maxContainerCapability:<memory:1200, vCores:2>

This error might not be obvious; it is caused by the following YARN and MapReduce memory settings not being configured properly:

mapreduce.map.memory.mb = 1638
yarn.scheduler.maximum-allocation-mb = 1200
yarn.nodemanager.resource.memory-mb = 1300

The solution is to change the above mentioned properties so that they satisfy the following relationship:

mapreduce.map.memory.mb <= yarn.scheduler.maximum-allocation-mb <= yarn.nodemanager.resource.memory-mb

Here, mapreduce.map.memory.mb is the amount of memory requested for each map task's container. yarn.nodemanager.resource.memory-mb is the total memory a NodeManager offers to all containers running on its host. yarn.scheduler.maximum-allocation-mb is the maximum memory the ResourceManager will grant to any single container, so it must be at least as large as the per-task request and should not exceed what a NodeManager can offer. In the values above, each map task asks for 1638 MB while the largest allowed container is only 1200 MB, so the job is killed. Once the values are consistent, the problem will be solved. Hope this helps.
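
For example, keeping the map request at 1638 MB, a consistent (illustrative) set of values would be:

mapreduce.map.memory.mb = 1638
yarn.scheduler.maximum-allocation-mb = 2048
yarn.nodemanager.resource.memory-mb = 2048

Alternatively, lower mapreduce.map.memory.mb to 1200 or less and leave the two YARN limits unchanged.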

Sqoop Fails with FileNotFoundException in CDH

The following exceptions occur when executing Sqoop on a cluster managed by Cloudera Manager:

15/05/11 20:42:55 WARN security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/mnt/var/opt/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/sqoop/lib/hsqldb-1.8.0.10.jar
15/05/11 20:42:55 ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/mnt/var/opt/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/sqoop/lib/hsqldb-1.8.0.10.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:267)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:388)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:481)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:198)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:171)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:268)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:665)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

This happens because Sqoop needs the YARN client configuration, which is deployed through a YARN Gateway role.

To fix this problem, in Cloudera Manager, go to:

YARN > Instances > Add Roles Instances > Gateway (Select Hosts) > (Click on target hosts) > OK

Then “Deploy Client Configuration” is required; go to:

YARN > Instances > Actions (top right corner) > Deploy Client Configuration

Now run the Sqoop command again and it should work.
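
As a sanity check, a minimal Sqoop import along these lines (the connection string, credentials and table name below are hypothetical) should now submit its MapReduce job without the FileNotFoundException:

sqoop import \
  --connect jdbc:mysql://mysql-host.example.com/testdb \
  --username sqoop_user \
  --password secret \
  --table customers \
  --target-dir /user/hive/warehouse/customers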

The reason the YARN Gateway is needed is that when Sqoop runs on a particular host, that host has to know where the ResourceManager is, whether jobs run in local or cluster mode, and so on, so that it can submit MapReduce jobs. Deploying the client configuration gives the host that information, so it knows what to do when Sqoop runs, since Sqoop jobs are actually MapReduce jobs.
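
To double check that the deployment worked, you can inspect the client configuration directory on the Sqoop host (assuming the standard CDH location, /etc/hadoop/conf) and confirm the ResourceManager settings are present:

# List the alternatives-managed client configuration directory
ls /etc/hadoop/conf
# The deployed yarn-site.xml should contain yarn.resourcemanager.* properties
grep yarn.resourcemanager /etc/hadoop/conf/yarn-site.xml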

The solution is not obvious, but I hope it helps.