Oracle Number(1,0) field maps to Boolean in Spark

Recently I was working on an issue where, when importing data from Oracle into a Hive table using Spark, data of type Number(1,0) in Oracle was implicitly converted into the Boolean data type. On CDH5.5.x this worked correctly; however, after upgrading to CDH5.10.x, the issue appeared. See below Hive table output after import:

Before upgrade:

SELECT column1 FROM test_table limit 2;

After upgrade:

SELECT column1 FROM test_table limit 2;

After digging further, I discovered that this change was introduced by SPARK-16625, due to the integration required for Spark to work correctly with Oracle.

Since the change was intended, the following are the suggested workarounds:

  1. Cast the Boolean to a type of your choosing in the Spark code, before writing it to the Hive table.
  2. Make sure that the mapped column in Hive is also of a compatible data type, for example TinyInt rather than String, so that the values True and False will be mapped to 1 and 0 respectively, rather than the string values “True” and “False” (the reason the column got “False” and “True” values was that the column was of the String data type).
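For the first workaround, here is a minimal PySpark sketch that casts the Boolean column back to TinyInt before writing to Hive. The JDBC options, table, and column names below are placeholders for illustration; adapt them to your job.

```python
# Sketch only: assumes an existing SparkSession `spark`; all connection
# details and names below are made-up placeholders.
from pyspark.sql.functions import col

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
      .option("dbtable", "TEST_TABLE")
      .option("user", "username")
      .option("password", "password")
      .load())

# Number(1,0) now arrives as Boolean; cast it back before writing.
fixed = df.withColumn("column1", col("column1").cast("tinyint"))
fixed.write.mode("overwrite").saveAsTable("test_table")
```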

Hope the above helps.

Spark jobs failed with delegation token renewal error

An Oozie Spark job failed with the following error:

Job aborted due to stage failure: Task 103 in stage 194576.0 failed 4 times, most recent failure: Lost task 103.3 in stage 194576.0 
(TID 119674041, ): org.apache.hadoop.ipc.RemoteException($InvalidToken): 
token (token for sparkpse: HDFS_DELEGATION_TOKEN owner=@HADOOP.CHARTER.COM, renewer=yarn, realUser=, issueDate=1482494610879, 
maxDate=1483099410879, sequenceNumber=274718, masterKeyId=166) can't be found in cache 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke( 
at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)

This is caused by a long-running Spark job in a kerberized environment: checkpointing fails because the delegation token is not renewed properly.

The workaround is to add “--conf spark.hadoop.fs.hdfs.impl.disable.cache=true” to the Spark job command line parameters to disable the token cache on the Spark side.
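For example, when submitting directly with spark-submit (the jar name and master below are placeholders):

```shell
spark-submit \
  --master yarn \
  --conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
  your-long-running-job.jar
```

For an Oozie Spark action, the same flag can be passed through the action’s spark-opts element.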

How to load a different version of Spark into Oozie

This article explains the steps needed to load Spark 2 into Oozie under CDH5.9.x, which ships with Spark 1.6. Although this was tested under CDH5.9.0, it should be similar for earlier releases.

Please follow the steps below:

  1. Locate the current shared-lib directory by running:
    oozie admin -oozie http://<oozie-server-host>:11000/oozie -sharelibupdate

    you will get something like below:

    [ShareLib update status]
    host = http://<oozie-server-host>:11000/oozie
    status = Successful
    sharelibDirOld = hdfs://<oozie-server-host>:8020/user/oozie/share/lib/lib_20161202183044
    sharelibDirNew = hdfs://<oozie-server-host>:8020/user/oozie/share/lib/lib_20161202183044

    This tells me that the current sharelib directory is /user/oozie/share/lib/lib_20161202183044

  2. Create a new directory for spark2.0 under this directory:
    hadoop fs -mkdir /user/oozie/share/lib/lib_20161202183044/spark2

  3. Put all your Spark 2 jars under this directory. Please also make sure that oozie-sharelib-spark-4.1.0-cdh5.9.0.jar is there too
  4. Update the sharelib by running:

    oozie admin -oozie http://<oozie-server-host>:11000/oozie -sharelibupdate
  5. Confirm that spark2 has been added to the shared lib path:

    oozie admin -oozie http://<oozie-server-host>:11000/oozie -shareliblist

    you should get something like below:

    [Available ShareLib]
  6. Go back to spark workflow and add the following configuration under Spark action:

  7. Save workflow and run to test if it will pick up the correct JARs now.
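The configuration in step 6 points the Spark action at the new spark2 directory via Oozie’s per-action sharelib override. A sketch of what goes inside the Spark action’s configuration block in the workflow (equivalently, oozie.action.sharelib.for.spark=spark2 can be set in job.properties):

```xml
<configuration>
  <property>
    <name>oozie.action.sharelib.for.spark</name>
    <value>spark2</value>
  </property>
</configuration>
```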

Please be advised that although this can work, it will put the Spark action in Oozie into a configuration not supported by Cloudera, because it is not tested, and it is not recommended. But if you are still willing to go ahead, the steps above should help.

Enabling Snappy Support for Shark/Spark

Shark/Spark will not be able to read Snappy compressed data out of the box. In my previous post, I explained how to enable Snappy compression in Hadoop 2.4. Once this is done, enabling Snappy in Spark is dead simple: all you need to do is set an environment variable for it.

In $SPARK_HOME/conf/, add the following:
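The exact line depends on where your Hadoop native libraries live; here is a sketch of the spark-env.sh entry, assuming they are under $HADOOP_HOME/lib/native (adjust the path to your installation):

```shell
# $SPARK_HOME/conf/spark-env.sh
# Make Hadoop's native libraries (including libsnappy) visible to Spark.
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native
```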


Of course, this assumes that you have Hadoop’s native Snappy library in the specified directory.

After distributing this conf file to all the slaves and restarting the cluster, Shark will be able to create or read Snappy compressed data through Spark.

Getting “protobuf” Error When Using Shark 0.9.1 with Hadoop 2.4

Recently I needed to experiment with Shark so that I could compare its performance with Hive at work. In order to do this, I needed to set up a single-node cluster on one of my VirtualBox machines.

Installing Hadoop and Hive was quite smooth; you can refer back to my previous post for the instructions. However, I encountered problems when trying to install Shark and Spark.

Spark runs OK, no problems, but Shark kept failing when I tried to run some queries:

shark> select * from test;
17.096: [Full GC 71198K->24382K(506816K), 0.3150970 secs]
Exception in thread "main" java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(
	at Method)
	at java.lang.ClassLoader.loadClass(
	at sun.misc.Launcher$AppClassLoader.loadClass(
	at java.lang.ClassLoader.loadClass(
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(
	at java.lang.Class.privateGetPublicMethods(
	at java.lang.Class.privateGetPublicMethods(
	at java.lang.Class.getMethods(
	at sun.misc.ProxyGenerator.generateClassFile(
	at sun.misc.ProxyGenerator.generateProxyClass(
	at java.lang.reflect.Proxy.getProxyClass0(
	at java.lang.reflect.Proxy.newProxyInstance(
	at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(
	at org.apache.hadoop.ipc.RPC.getProtocolProxy(
	at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(
	at org.apache.hadoop.hdfs.DFSClient.<init>(
	at org.apache.hadoop.hdfs.DFSClient.<init>(
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
	at org.apache.hadoop.fs.FileSystem.createFileSystem(
	at org.apache.hadoop.fs.FileSystem.access$200(
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
	at org.apache.hadoop.fs.FileSystem$Cache.get(
	at org.apache.hadoop.fs.FileSystem.get(
	at org.apache.hadoop.fs.Path.getFileSystem(
	at org.apache.hadoop.hive.ql.Context.getScratchDir(
	at org.apache.hadoop.hive.ql.Context.getMRScratchDir(
	at org.apache.hadoop.hive.ql.Context.getMRTmpFileURI(
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(
	at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:137)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(
	at shark.SharkDriver.compile(SharkDriver.scala:215)
	at org.apache.hadoop.hive.ql.Driver.compile(
	at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:338)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(
	at shark.SharkCliDriver$.main(SharkCliDriver.scala:235)
	at shark.SharkCliDriver.main(SharkCliDriver.scala)

It is not obvious what’s happening, but it was caused by the bundled “protobuf” package. It can be found in the jar file “hive-exec-0.11.0-shark-0.9.1.jar”, which is located at $SHARK_HOME/lib_managed/jars/edu.berkeley.cs.shark/hive-exec. All you need to do is remove the package completely:

cd $SHARK_HOME/lib_managed/jars/edu.berkeley.cs.shark/hive-exec
unzip hive-exec-0.11.0-shark-0.9.1.jar
rm -f com/google/protobuf/*
zip -r hive-exec-0.11.0-shark-0.9.1.jar *
rm -fr com javaewah/ javax/ javolution/ META-INF/ org/

That’s it. Restart your “shark” session and it should be working now.
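If you prefer not to unzip and re-zip by hand, the same jar surgery can be sketched with Python’s standard zipfile module, rewriting the jar without any entries under com/google/protobuf/ (the function name here is mine; the jar path in the comment is from the steps above):

```python
import shutil
import zipfile

def strip_package(jar_path, package_prefix):
    """Rewrite a jar in place, dropping every entry under package_prefix."""
    tmp_path = jar_path + ".tmp"
    with zipfile.ZipFile(jar_path) as src, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            # Copy everything except the conflicting protobuf classes.
            if not item.filename.startswith(package_prefix):
                dst.writestr(item, src.read(item.filename))
    shutil.move(tmp_path, jar_path)

# Example:
# strip_package("hive-exec-0.11.0-shark-0.9.1.jar", "com/google/protobuf/")
```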