Hive Query Failed with Token Renewer Error | Hive on Spark

If you run Hive on Spark on certain CDH versions, you might hit failures when Hive tries to renew HDFS delegation tokens. The error below shows up in the HiveServer2 server log:

2017-06-27 17:04:08,836 INFO org.apache.hive.spark.client.SparkClientImpl: [stderr-redir-1]: 17/06/27 17:04:08 
WARN security.UserGroupInformation: PriviledgedActionException as:testuser (auth:PROXY) via 
hive/example.hadoop.com@REALM.COM (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: 
testuser tries to renew a token with renewer hive

A bit of searching turns up the corresponding upstream Hive JIRA for this issue, HIVE-15485, and from that JIRA you can see that the problem was introduced by HIVE-14383. Spark needs the principal and keytab passed in via the --principal and --keytab options, and it handles renewal by copying the keytab to the cluster and logging in to Kerberos inside the application. However, --principal and --keytab cannot be combined with --proxy-user in spark-submit, so at the moment Hive on Spark can support either token renewal or impersonation, but not both.
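To make the conflict concrete, here is a minimal sketch of the two spark-submit modes involved. This is an illustration rather than the exact command HiveServer2 builds internally, and the jar name and keytab path are placeholders:

# Renewal mode: Spark ships the keytab and logs in to Kerberos itself.
spark-submit --master yarn \
  --principal hive/example.hadoop.com@REALM.COM \
  --keytab /path/to/hive.keytab \
  your-app.jar

# Impersonation mode: the job runs on behalf of the end user.
spark-submit --master yarn \
  --proxy-user testuser \
  your-app.jar

# --principal/--keytab and --proxy-user do not work together,
# so only one of the two behaviours is available at a time.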

The only way to avoid this issue is to upgrade CDH to a version that carries the fix for HIVE-15485, which is included in the following releases:

CDH5.8.5
CDH5.9.2
CDH5.10.1, CDH5.10.2
CDH5.11.0, CDH5.11.1

HIVE-14383, which introduced the problem, is present in the following CDH releases:

CDH5.8.3, CDH5.8.4, CDH5.8.5
CDH5.9.1, CDH5.9.2
CDH5.10.0, CDH5.10.1, CDH5.10.2 
CDH5.11.0, CDH5.11.1

That leaves the following CDH releases affected by this issue:

CDH5.8.3, CDH5.8.4, CDH5.9.1, CDH5.10.0

Please deploy the latest maintenance release of your CDH release line to avoid this issue in Hive on Spark.

Hope the above helps.

Hive on Spark query failed with ConnectTimeoutException

Recently I have been dealing with an issue where a Hive on Spark job intermittently failed with a ConnectTimeoutException. The connection timed out while the ApplicationMaster was trying to connect back to HiveServer2 on a random port, and it failed only 2 seconds after the connection attempt started. See the stack trace below for details:

17/05/03 03:20:06 INFO yarn.ApplicationMaster: Waiting for spark context initialization
17/05/03 03:20:06 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
17/05/03 03:20:06 INFO client.RemoteDriver: Connecting to: <hs2-host>:35915
17/05/03 03:20:08 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: 
io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915 
java.util.concurrent.ExecutionException: io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915 
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) 
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156) 
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:606) 
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542) 
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915 
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:220) 
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) 
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) 
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) 
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
at java.lang.Thread.run(Thread.java:745) 
17/05/03 03:20:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.util.concurrent.ExecutionException: 
io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915) 
17/05/03 03:20:16 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application. 
17/05/03 03:20:16 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.util.concurrent.ExecutionException: 
io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915) 
17/05/03 03:20:16 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1492040605432_11445 
17/05/03 03:20:16 INFO util.ShutdownHookManager: Shutdown hook called

We can see from the log above that the timeout fired 2 seconds after the connection attempt, which is a suspiciously short period for a connection timeout.

After digging further into the code, it turned out that this timeout is controlled by Hive's hive.spark.client.connect.timeout setting. Its default value is 1000ms, only 1 second, which explains the behaviour.

The issue only happens when the cluster is under heavy load and HiveServer2 cannot respond to the ApplicationMaster within 1 second, at which point the connection times out.

To work around the issue, simply increase the timeout to, say, 5 seconds before running the query:

SET hive.spark.client.connect.timeout=5000;

-- your query here
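The same property can also be set per session from the command line. Below is a minimal sketch using Beeline, where the JDBC URL, Kerberos principal, and query file name are placeholders for your own values:

beeline -u "jdbc:hive2://<hs2-host>:10000/default;principal=hive/_HOST@REALM.COM" \
  --hiveconf hive.spark.client.connect.timeout=5000 \
  -f my_query.hql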

I have reported this issue upstream as HIVE-16794, and I will submit a patch to increase this default timeout soon.