Hive on Spark query failed with ConnectTimeoutException

Recently I have been dealing with an issue that Hive on Spark job intermittently failed with ConnectionTimeouException. The connection timed out when the ApplicationMaster is trying to communicate back to HiveServer2 on a random port and failed immediately after 2 seconds of trying to connect. See below stack trace for details:

17/05/03 03:20:06 INFO yarn.ApplicationMaster: Waiting for spark context initialization
17/05/03 03:20:06 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
17/05/03 03:20:06 INFO client.RemoteDriver: Connecting to: <hs2-host>:35915
17/05/03 03:20:08 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: 
io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915 
java.util.concurrent.ExecutionException: io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915 
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) 
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156) 
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:606) 
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542) 
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915 
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:220) 
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) 
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) 
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) 
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
at java.lang.Thread.run(Thread.java:745) 
17/05/03 03:20:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.util.concurrent.ExecutionException: 
io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915) 
17/05/03 03:20:16 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application. 
17/05/03 03:20:16 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.util.concurrent.ExecutionException: 
io.netty.channel.ConnectTimeoutException: connection timed out: <hs2-host>/172.19.22.11:35915) 
17/05/03 03:20:16 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1492040605432_11445 
17/05/03 03:20:16 INFO util.ShutdownHookManager: Shutdown hook called

We can see from above log that the timeout happened 2 seconds after the attempted connection, which does not make too much sense that the timeout value is such short period.

After digging further into code, it turned out that this timeout is controlled by hive’s setting called hive.spark.client.connect.timeout. The default value for this setting is 1000ms, which is only 1 second, which explained the cause.

This issue only happens when cluster is on high load and HiveServer2 is not able to respond back to ApplicationMaster within 1 second and then connection will timeout.

To by pass this issue, we can simply increase this timeout value to, say, 5 seconds:

SET hive.spark.client.connect.timeout=5000;

# Your query here

I have reported such issue upstream in JIRA: HIVE-16794, and I will submit a patch to increase this timeout setting soon.

Leave a Reply

Your email address will not be published. Required fields are marked *