Shark/Spark will not be able to read Snappy-compressed data out of the box. In my previous post, I explained how to enable Snappy compression in Hadoop 2.4. Once this is done, enabling Snappy in Spark is dead simple: all you need to do is set an environment variable for it.
In $SPARK_HOME/conf/spark-env.sh, add the following:
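A sketch of what that line would look like, assuming Hadoop's native libraries were installed under `$HADOOP_HOME/lib/native` as in a typical Hadoop 2.4 setup (the exact path, and whether your Spark version uses `SPARK_LIBRARY_PATH`, depends on your installation):

```shell
# $SPARK_HOME/conf/spark-env.sh
# Point Spark's executors at the directory containing libsnappy.so
# (assumed path; adjust to wherever your Hadoop native libraries live)
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native
```

On newer Spark versions, the equivalent setting is `spark.executor.extraLibraryPath` in `spark-defaults.conf`.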
Of course, this assumes that you have Hadoop's native Snappy library libsnappy.so in the specified directory.
After distributing this conf file to all the slaves and restarting the cluster, Shark will be able to create and read Snappy-compressed data through Spark.