Hive Query Failed with Token Renewer Error | Hive on Spark

If you run Hive on Spark on certain CDH versions, you might run into issues when Hive tries to renew HDFS delegation tokens. A typical error message, which you can find in the HiveServer2 server logs, looks like this:

2017-06-27 17:04:08,836 INFO org.apache.hive.spark.client.SparkClientImpl: [stderr-redir-1]: 17/06/27 17:04:08 
WARN security.UserGroupInformation: PriviledgedActionException as:testuser (auth:PROXY) via 
hive/example.hadoop.com@REALM.COM (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: 
testuser tries to renew a token with renewer hive

A quick search should locate the corresponding upstream Hive JIRA for this issue: HIVE-15485. From that JIRA you can also see that the problem was introduced by HIVE-14383. Spark needs the principal and keytab passed in via the --principal and --keytab options, and it handles renewal by copying the keytab to the cluster and performing the Kerberos login inside the application. However, --principal and --keytab cannot be combined with --proxy-user in spark-submit, so at the moment Hive on Spark can support either token renewal or impersonation, but not both.
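To illustrate the conflict, below is a hedged sketch of the two kinds of spark-submit invocations involved; the principal, keytab path, and application jar are placeholders for illustration, not values taken from Hive on Spark itself:

# Token renewal: Spark logs in from the keytab and renews delegation tokens itself.
spark-submit --master yarn \
  --principal hive/example.hadoop.com@REALM.COM \
  --keytab /etc/hive/conf/hive.keytab \
  app.jar

# Impersonation: the job runs as the proxied end user instead.
spark-submit --master yarn \
  --proxy-user testuser \
  app.jar

# spark-submit does not accept --principal/--keytab together with --proxy-user,
# which is why Hive on Spark can have token renewal or impersonation, but not both.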

The only way to avoid this issue is to upgrade CDH to a version that contains the fix for HIVE-15485, which is included in the following releases:

CDH5.8.5
CDH5.9.2
CDH5.10.1, CDH5.10.2
CDH5.11.0, CDH5.11.1

HIVE-14383, which introduced the problem, is present in the following CDH releases:

CDH5.8.3, CDH5.8.4, CDH5.8.5
CDH5.9.1, CDH5.9.2
CDH5.10.0, CDH5.10.1, CDH5.10.2 
CDH5.11.0, CDH5.11.1

As a result, the following CDH releases are currently affected by this issue:

CDH5.8.3, CDH5.8.4, CDH5.9.1, CDH5.10.0

To avoid this issue in Hive on Spark, please deploy the latest maintenance release for your CDH version.

Hope the above helps.

Read Files under Sub-Directories for Hive and Impala

Sometimes you might want to store data under sub-directories in HDFS and have Hive or Impala read from those sub-directories. For example, suppose you have the following directory structure:

root hdfs     231206 2017-06-30 02:45 /test/table1/000000_0
root hdfs          0 2017-06-30 02:45 /test/table1/child_directory
root hdfs     231206 2017-06-30 02:45 /test/table1/child_directory/000000_0

By default, Hive only looks for files in the root of the specified directory, which in my test case is /test/table1. However, Hive can also read all data under the table directory's sub-directories. This can be achieved by updating the following settings:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
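
As a hedged example, the settings can be applied in a single session before querying; the table name test.table1 is a placeholder for a table whose location is /test/table1:

# Apply the recursive-read settings for this session, then query the table;
# with the settings on, files in both /test/table1 and /test/table1/child_directory are read.
hive -e "
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SELECT COUNT(*) FROM test.table1;
"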

Impala, on the other hand, currently does not support reading files from a table's sub-directories. This has been reported in the upstream JIRA IMPALA-1944. There is no immediate plan to support this feature, but it might appear in a future Impala release.

Hope the above information is useful.

Impala Auto Update Metadata Support

Many CDH users have requested that Impala support automatic metadata updates, so that they do not need to run “INVALIDATE METADATA” every time a table is created or data is updated through other components, like Hive or Pig.
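For context, this is a hedged sketch of the manual step Impala requires today after a change made through Hive; the database/table name and the Impala daemon host are placeholders:

# A table is created (or data is loaded) through Hive...
hive -e "CREATE TABLE staging.new_table (id INT, name STRING)"

# ...and Impala does not see it until the metadata is invalidated manually.
impala-shell -i impalad-host -q "INVALIDATE METADATA staging.new_table"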

This has been reported upstream and is tracked via the JIRA IMPALA-3124. It is not yet fixed, and it requires detailed design work to make sure it behaves correctly.

There is no ETA at this stage for when this feature will be added. I advise anyone interested in it to add comments and votes to the JIRA: the more votes it gets, the better the chance that it will be prioritized.

Hope the above information is helpful.

Oozie Server Failed to Start with Error java.lang.NoSuchFieldError: EXTERNAL_PROPERTY

This issue happens in the CDH distribution of Hadoop managed by Cloudera Manager (and possibly in other distributions as well, since it stems from a known upstream JIRA, but I have not tested them). Oozie fails to start after enabling Oozie HA through the Cloudera Manager user interface.

The full error message from Oozie’s process stdout.log file (found under the /var/run/cloudera-scm-agent/process/XXX-oozie-OOZIE_SERVER/logs directory) looks like the following:

Wed Jan 25 11:07:41 GST 2017 
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera 
using 5 as CDH_VERSION 
using /var/lib/oozie/tomcat-deployment as CATALINA_BASE 
Copying JDBC jar from /usr/share/java/oracle-connector-java.jar to /var/lib/oozie 

ERROR: Oozie could not be started 

REASON: java.lang.NoSuchFieldError: EXTERNAL_PROPERTY 

Stacktrace: 
----------------------------------------------------------------- 
java.lang.NoSuchFieldError: EXTERNAL_PROPERTY 
at org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector._findTypeResolver(JacksonAnnotationIntrospector.java:777) 
at org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findPropertyTypeResolver(JacksonAnnotationIntrospector.java:214) 
at org.codehaus.jackson.map.ser.BeanSerializerFactory.findPropertyTypeSerializer(BeanSerializerFactory.java:370) 
at org.codehaus.jackson.map.ser.BeanSerializerFactory._constructWriter(BeanSerializerFactory.java:772) 
at org.codehaus.jackson.map.ser.BeanSerializerFactory.findBeanProperties(BeanSerializerFactory.java:586) 
at org.codehaus.jackson.map.ser.BeanSerializerFactory.constructBeanSerializer(BeanSerializerFactory.java:430) 
at org.codehaus.jackson.map.ser.BeanSerializerFactory.findBeanSerializer(BeanSerializerFactory.java:343) 
at org.codehaus.jackson.map.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:287) 
at org.codehaus.jackson.map.ser.StdSerializerProvider._createUntypedSerializer(StdSerializerProvider.java:782) 
at org.codehaus.jackson.map.ser.StdSerializerProvider._createAndCacheUntypedSerializer(StdSerializerProvider.java:735) 
at org.codehaus.jackson.map.ser.StdSerializerProvider.findValueSerializer(StdSerializerProvider.java:344) 
at org.codehaus.jackson.map.ser.StdSerializerProvider.findTypedValueSerializer(StdSerializerProvider.java:420) 
at org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:601) 
at org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256) 
at org.codehaus.jackson.map.ObjectMapper._configAndWriteValue(ObjectMapper.java:2566) 
at org.codehaus.jackson.map.ObjectMapper.writeValue(ObjectMapper.java:2056) 
at org.apache.oozie.util.FixedJsonInstanceSerializer.serialize(FixedJsonInstanceSerializer.java:65) 
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalRegisterService(ServiceDiscoveryImpl.java:201) 
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.registerService(ServiceDiscoveryImpl.java:186) 
at org.apache.oozie.util.ZKUtils.advertiseService(ZKUtils.java:217) 
at org.apache.oozie.util.ZKUtils.<init>(ZKUtils.java:141) 
at org.apache.oozie.util.ZKUtils.register(ZKUtils.java:154) 
at org.apache.oozie.service.ZKLocksService.init(ZKLocksService.java:70) 
at org.apache.oozie.service.Services.setServiceInternal(Services.java:386) 
at org.apache.oozie.service.Services.setService(Services.java:372) 
at org.apache.oozie.service.Services.loadServices(Services.java:305) 
at org.apache.oozie.service.Services.init(Services.java:213) 
at org.apache.oozie.servlet.ServicesLoader.contextInitialized(ServicesLoader.java:46) 
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4210) 
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4709) 
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:802) 
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779) 
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:583) 
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:944) 
at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:779) 
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:505) 
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1322) 
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:325) 
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142) 
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1068) 
at org.apache.catalina.core.StandardHost.start(StandardHost.java:822) 
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1060) 
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) 
at org.apache.catalina.core.StandardService.start(StandardService.java:525) 
at org.apache.catalina.core.StandardServer.start(StandardServer.java:759) 
at org.apache.catalina.startup.Catalina.start(Catalina.java:595) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:606) 
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) 
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) 

To fix the issue, please follow the steps below (a consolidated command sketch follows the list):

  1. Delete or move the following files under CDH’s parcel directory (most likely they are symlinks):
    /opt/cloudera/parcels/CDH/lib/oozie/libserver/hive-exec.jar
    /opt/cloudera/parcels/CDH/lib/oozie/libtools/hive-exec.jar
    
  2. Download the hive-exec-{cdh version}-core.jar file from the Cloudera repository. For example, for CDH5.8.2, go to:
    https://repository.cloudera.com/cloudera/cloudera-repos/org/apache/hive/hive-exec/1.1.0-cdh5.8.2/

    and put the file under the following directories on the Oozie server:

    /opt/cloudera/parcels/CDH/lib/oozie/libserver/
    /opt/cloudera/parcels/CDH/lib/oozie/libtools/
    
  3. Download kryo-2.22.jar from the maven repository:
    http://repo1.maven.org/maven2/com/esotericsoftware/kryo/kryo/2.22/kryo-2.22.jar

    and put it under the following directories on the Oozie server:

    /opt/cloudera/parcels/CDH/lib/oozie/libserver/
    /opt/cloudera/parcels/CDH/lib/oozie/libtools/
    
  4. Finally, restart the Oozie service
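
For convenience, here is a hedged consolidation of the steps above as shell commands for CDH5.8.2, run on the Oozie server host; the exact hive-exec jar file name is inferred from the repository path and should be verified against your CDH version:

# Step 1: move the existing hive-exec.jar symlinks aside (or delete them)
mv /opt/cloudera/parcels/CDH/lib/oozie/libserver/hive-exec.jar /tmp/
mv /opt/cloudera/parcels/CDH/lib/oozie/libtools/hive-exec.jar /tmp/

# Step 2: download the core hive-exec jar from the Cloudera repository and copy it in
wget https://repository.cloudera.com/cloudera/cloudera-repos/org/apache/hive/hive-exec/1.1.0-cdh5.8.2/hive-exec-1.1.0-cdh5.8.2-core.jar
cp hive-exec-1.1.0-cdh5.8.2-core.jar /opt/cloudera/parcels/CDH/lib/oozie/libserver/
cp hive-exec-1.1.0-cdh5.8.2-core.jar /opt/cloudera/parcels/CDH/lib/oozie/libtools/

# Step 3: download kryo-2.22.jar from the Maven repository and copy it in
wget http://repo1.maven.org/maven2/com/esotericsoftware/kryo/kryo/2.22/kryo-2.22.jar
cp kryo-2.22.jar /opt/cloudera/parcels/CDH/lib/oozie/libserver/
cp kryo-2.22.jar /opt/cloudera/parcels/CDH/lib/oozie/libtools/

# Step 4: restart the Oozie service (for example, through Cloudera Manager)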

This is a known Oozie issue, reported in the upstream JIRA OOZIE-2621, which has been resolved and targeted for the 4.3.0 release.

Hope this helps.

Unable to Import Data as Parquet into Encrypted HDFS Zone | Sqoop Parquet Import

Recently I discovered an issue in Sqoop: when importing data into a Hive table whose location is in an encrypted HDFS zone, the import fails with a “can’t be moved into an encryption zone” error:

Command:

sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable --as-parquetfile

Errors:

2017-05-24 13:38:51,539 INFO [Thread-84] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 
Setting job diagnostics to Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://nameservice1/tmp/staging/.
temp/job_1495453174050_1035/mr/job_1495453174050_1035 to 
hdfs://nameservice1/user/hive/warehouse/staging.db/hiveTable
        at org.kitesdk.data.spi.filesystem.FileSystemUtil.stageMove(FileSystemUtil.java:117)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:406)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:62)
        at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:387)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
/tmp/staging/.temp/job_1495453174050_1035/mr/job_1495453174050_1035/964f7b5e-2f55-421d-bfb6-7613cc4bf26e.parquet 
can't be moved into an encryption zone.
        at org.apache.hadoop.hdfs.server.namenode.EncryptionZoneManager.checkMoveValidity(EncryptionZoneManager.java:284)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedRenameTo(FSDirectory.java:564)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.renameTo(FSDirectory.java:478)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3929)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3891)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3856)

This is caused by a known Sqoop bug: SQOOP-2943. Sqoop uses the Kite SDK to generate Parquet files, and the Kite SDK writes them under the /tmp directory on the fly. Because /tmp is not encrypted while the Hive warehouse directory is, the final move of the Parquet file from /tmp into the warehouse fails. The import only fails with the Parquet format; text file format works as expected.
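As a hedged way to confirm the encryption-zone layout described above (this requires HDFS admin privileges; the zone path and key name shown are illustrative only):

# List the configured encryption zones; the warehouse path appears here while /tmp
# does not, which is why the rename across the zone boundary is rejected.
hdfs crypto -listZones
# Example output (illustrative):
# /user/hive/warehouse  warehouse-key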

Currently SQOOP-2943 is not fixed and there is no direct workaround.

For the time being, the workaround is to:

  1. Import the data in text file format into a temporary Hive table, and then use a Hive query to copy the data into the destination Parquet table (a sketch of this option follows the list); or
  2. Import the data as Parquet files into a temporary directory outside of the Hive warehouse, and then again use Hive to copy the data into the destination Parquet table.
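
Below is a hedged sketch of the first workaround, reusing the placeholders from the failing command above; the temporary table name hiveTable_txt is an assumption for illustration:

# Step 1: import as a text table (text imports into the encrypted warehouse work fine)
sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable_txt

# Step 2: copy from the text table into the destination Parquet table inside the
# encrypted warehouse, then drop the temporary table
hive -e "
INSERT OVERWRITE TABLE staging.hiveTable SELECT * FROM staging.hiveTable_txt;
DROP TABLE staging.hiveTable_txt;
"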