Enabling Snappy Support for Shark/Spark

Shark/Spark cannot read Snappy compressed data out of the box. In my previous post, I explained how to enable Snappy compression in Hadoop 2.4. Once that is done, enabling Snappy in Spark is dead simple: all you need to do is set an environment variable for it.

In $SPARK_HOME/conf/spark-env.sh, add the following:

export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native

Of course, this assumes that you have Hadoop’s native Snappy library, libsnappy.so, in the specified directory.
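A quick way to double-check is to list the native library directory and confirm libsnappy.so is actually there; hadoop checknative can also report whether Snappy support is available:

# should show libsnappy.so (and friends)
ls -l $HADOOP_HOME/lib/native | grep snappy

# reports which native codecs Hadoop can load, including snappy
hadoop checknative -a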

After distributing this conf file to all the slaves and restarting the cluster, Shark will be able to create or read Snappy compressed data through Spark.
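For a small standalone cluster, that distribution and restart can look something like the sketch below. It assumes the slave hostnames are listed in $SPARK_HOME/conf/slaves and that you use the standalone start/stop scripts; adjust paths and host names for your own setup.

# copy the updated spark-env.sh to every slave
for host in $(cat $SPARK_HOME/conf/slaves); do
  scp $SPARK_HOME/conf/spark-env.sh $host:$SPARK_HOME/conf/
done

# restart the standalone cluster so the new library path takes effect
$SPARK_HOME/sbin/stop-all.sh
$SPARK_HOME/sbin/start-all.sh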

Getting “protobuf” Error When Using Shark 0.9.1 with Hadoop 2.4

Recently I needed to experiment with Shark so that I could compare its performance against Hive at work. To do this, I needed to set up a single-node cluster on one of my VirtualBox machines.

Installing Hadoop and Hive went quite smoothly; you can refer back to my earlier post for the instructions. However, I ran into problems when trying to install Shark and Spark.

Spark ran fine with no problems, but Shark kept failing when I tried to run some queries:


shark> select * from test;
17.096: [Full GC 71198K->24382K(506816K), 0.3150970 secs]
Exception in thread "main" java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
	at java.lang.Class.privateGetPublicMethods(Class.java:2651)
	at java.lang.Class.privateGetPublicMethods(Class.java:2661)
	at java.lang.Class.getMethods(Class.java:1467)
	at sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
	at sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
	at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
	at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
	at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
	at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:334)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:241)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:141)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:576)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:521)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.hadoop.hive.ql.Context.getScratchDir(Context.java:180)
	at org.apache.hadoop.hive.ql.Context.getMRScratchDir(Context.java:231)
	at org.apache.hadoop.hive.ql.Context.getMRTmpFileURI(Context.java:288)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1274)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1059)
	at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:137)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:279)
	at shark.SharkDriver.compile(SharkDriver.scala:215)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
	at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:338)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
	at shark.SharkCliDriver$.main(SharkCliDriver.scala:235)
	at shark.SharkCliDriver.main(SharkCliDriver.scala)

It is not obvious from the stack trace what is happening, but the error is caused by the protobuf classes bundled inside the jar file “hive-exec-0.11.0-shark-0.9.1.jar”, located under $SHARK_HOME/lib_managed/jars/edu.berkeley.cs.shark/hive-exec. Hadoop 2.4 is built against protobuf 2.5, while this jar ships an older protobuf, which is what triggers the VerifyError on getUnknownFields(). All you need to do is remove the bundled protobuf package from the jar completely:

cd $SHARK_HOME/lib_managed/jars/edu.berkeley.cs.shark/hive-exec
# unpack the jar, drop the bundled protobuf classes, then rebuild the jar from scratch
unzip hive-exec-0.11.0-shark-0.9.1.jar
rm -rf com/google/protobuf
rm -f hive-exec-0.11.0-shark-0.9.1.jar
zip -r hive-exec-0.11.0-shark-0.9.1.jar *
# clean up the extracted files
rm -fr com hive-exec-log4j.properties javaewah/ javax/ javolution/ META-INF/ org/
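Before restarting Shark, you can confirm the bundled protobuf classes are really gone by listing the jar contents; the grep below should return nothing:

# no output means the protobuf classes were removed successfully
unzip -l hive-exec-0.11.0-shark-0.9.1.jar | grep com/google/protobuf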

That’s it. Restart Shark and it should be working now.