How to Index LZO files in Hadoop

Today I was trying to index an LZO file using the following hadoop command:

hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test

However, it failed with the following error:

16/09/10 03:05:51 INFO mapreduce.Job: Task Id : attempt_1473404927068_0005_m_000000_0, Status : FAILED
Error: java.lang.NullPointerException
       	at com.hadoop.mapreduce.LzoSplitRecordReader.initialize(
       	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(
       	at org.apache.hadoop.mapred.MapTask.runNewMapper(
       	at org.apache.hadoop.mapred.YarnChild$
       	at Method)
       	at org.apache.hadoop.mapred.YarnChild.main(

After some research, it turned out that I did not have the LZO codec enabled in my cluster. I did the following to resolve the issue:

– Add the gplextras parcel to Cloudera Manager’s parcel configuration:

where x.x.x should match your CDH version; in my case it is 5.7.0


– Install GPL Extras Parcel through Cloudera Manager as normal
– Add the following codecs to Compression Codecs (io.compression.codecs) in Cloudera Manager > HDFS > Configuration:



– Restart Cluster
– Deploy Client Cluster Configuration
– Install native-lzo library on all hosts in the cluster

sudo yum install lzop
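For reference, after this change the codec list should include the two LZO classes. A sketch of the resulting io.compression.codecs value in core-site.xml terms, assuming the stock CDH codecs were already present (your existing list may differ):

```xml
<property>
  <name>io.compression.codecs</name>
  <!-- existing codecs kept as-is; the two com.hadoop.compression.lzo entries are the addition -->
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
```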

After the above changes, the index job should finish without issues:

hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test
16/09/10 03:22:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
16/09/10 03:22:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 820f83d05b8d916e89dbb72d6ef129113b277303]
16/09/10 03:22:12 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs:// to indexing list (no index currently exists)
16/09/10 03:22:12 INFO Configuration.deprecation: is deprecated. Instead, use
16/09/10 03:22:12 INFO client.RMProxy: Connecting to ResourceManager at
16/09/10 03:22:12 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 119 for hive on
16/09/10 03:22:12 INFO security.TokenCache: Got dt for hdfs://; Kind: HDFS_DELEGATION_TOKEN, Service:, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/09/10 03:22:12 INFO input.FileInputFormat: Total input paths to process : 1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: number of splits:1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service:, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:13 INFO impl.YarnClientImpl: Submitted application application_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.Job: The url to track the job:
16/09/10 03:22:13 INFO mapreduce.Job: Running job: job_1473502403576_0002
16/09/10 03:22:22 INFO mapreduce.Job: Job job_1473502403576_0002 running in uber mode : false
16/09/10 03:22:22 INFO mapreduce.Job:  map 0% reduce 0%
16/09/10 03:22:35 INFO mapreduce.Job:  map 22% reduce 0%
16/09/10 03:22:38 INFO mapreduce.Job:  map 36% reduce 0%
16/09/10 03:22:41 INFO mapreduce.Job:  map 50% reduce 0%
16/09/10 03:22:44 INFO mapreduce.Job:  map 62% reduce 0%
16/09/10 03:22:47 INFO mapreduce.Job:  map 76% reduce 0%
16/09/10 03:22:50 INFO mapreduce.Job:  map 90% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job:  map 100% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job: Job job_1473502403576_0002 completed successfully
16/09/10 03:22:53 INFO mapreduce.Job: Counters: 30
       	File System Counters
       		FILE: Number of bytes read=0
       		FILE: Number of bytes written=118080
       		FILE: Number of read operations=0
       		FILE: Number of large read operations=0
       		FILE: Number of write operations=0
       		HDFS: Number of bytes read=85196
       		HDFS: Number of bytes written=85008
       		HDFS: Number of read operations=2
       		HDFS: Number of large read operations=0
       		HDFS: Number of write operations=4
       	Job Counters
       		Launched map tasks=1
       		Data-local map tasks=1
       		Total time spent by all maps in occupied slots (ms)=28846
       		Total time spent by all reduces in occupied slots (ms)=0
       		Total time spent by all map tasks (ms)=28846
       		Total vcore-seconds taken by all map tasks=28846
       		Total megabyte-seconds taken by all map tasks=29538304
       	Map-Reduce Framework
       		Map input records=10626
       		Map output records=10626
       		Input split bytes=138
       		Spilled Records=0
       		Failed Shuffles=0
       		Merged Map outputs=0
       		GC time elapsed (ms)=32
       		CPU time spent (ms)=3960
       		Physical memory (bytes) snapshot=233476096
       		Virtual memory (bytes) snapshot=1564155904
       		Total committed heap usage (bytes)=176160768
       	File Input Format Counters
       		Bytes Read=85058
       	File Output Format Counters
       		Bytes Written=0

Enable Snappy Compression For Flume

Snappy is a compression/decompression library developed by Google. It aims for very high speed with reasonable compression (output may be larger than that of other standard compression algorithms, but compression and decompression are much faster). Snappy ships with Hadoop, unlike LZO compression, which is excluded due to licensing issues. To enable Snappy in your Flume installation, follow the steps below:

Install on Red Hat systems:

$ sudo yum install hadoop-0.20-native

Install on Ubuntu systems:

$ sudo apt-get install hadoop-0.20-native

This should create a directory under /usr/lib/hadoop/lib/native/ containing the native Hadoop libraries.

Then create an environment configuration file for Flume:

$ cp /usr/lib/flume/bin/ /usr/lib/flume/bin/

And update the last line in the file to be:

For a 32-bit platform:

$ export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-i386-32

For a 64-bit platform:

$ export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
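If the same script has to work on both platforms, the choice can be automated. A small sketch, assuming the native-library layout shown above:

```shell
# Pick the native-library directory based on the machine architecture,
# instead of hard-coding the 32- or 64-bit path.
native_dir() {
    case "$(uname -m)" in
        x86_64) echo "/usr/lib/hadoop/lib/native/Linux-amd64-64" ;;
        *)      echo "/usr/lib/hadoop/lib/native/Linux-i386-32" ;;
    esac
}
export JAVA_LIBRARY_PATH="$(native_dir)"
```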

Next, update Flume’s configuration file under “/etc/flume/conf/flume-site.xml” on the collector node to:

    <description>Writes formatted data compressed in specified codec to
    dfs. Value is None, GzipCodec, DefaultCodec (deflate), BZip2Codec, SnappyCodec
    or any other Codec Hadoop is aware of </description>
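The <property> wrapper around that description did not survive in this post; assuming the Flume 0.9.x property name flume.collector.dfs.compress.codec (verify against your own flume-conf.xml), the full setting would look roughly like:

```xml
<property>
  <!-- property name assumed from Flume 0.9.x defaults; check your flume-conf.xml -->
  <name>flume.collector.dfs.compress.codec</name>
  <value>SnappyCodec</value>
</property>
```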

And then finally restart the flume-node:

$ /etc/init.d/flume-node restart

Your next file update in HDFS will look something like the following:

-rw-r--r--   3 flume supergroup          0 2011-10-21 14:01 /data/traffic/Y2011_M9_W37_D254/R0_P0/C1_20111021-140124175+1100.955183363700204.00000244.snappy.tmp
-rw-r--r--   3 flume supergroup   35156526 2011-10-20 16:51 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-164928958+1100.780424004236302.00000018.snappy
-rw-r--r--   3 flume supergroup     830565 2011-10-20 17:15 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-171423368+1100.781918413572302.00000018.snappy
-rw-r--r--   3 flume supergroup          0 2011-10-20 17:19 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-171853599+1100.782188644505302.00000042.snappy.tmp
-rw-r--r--   3 flume supergroup    1261171 2011-10-20 17:37 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-173728225+1100.783303271088302.00000018.snappy
-rw-r--r--   3 flume supergroup    2128701 2011-10-20 17:40 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-174024045+1100.783479090669302.00000046.snappy

Happy Fluming!

Compile Hadoop LZO Compression Library on CentOS

To compile and install Hadoop’s LZO compression library on CentOS, follow the steps below:

Download the Hadoop LZO source from Kevin’s Hadoop LZO Project.

If you are using an Ant version older than 1.7, download the latest Ant binary package from Apache Ant; otherwise you will get the following error when compiling:

/root/kevinweil-hadoop-lzo-4c5a227/build.xml:510: Class doesn't support the nested "typefound" element.

Install lzo-devel:

yum install lzo-devel.x86_64
yum install lzop.x86_64

Unzip Apache Ant and the Hadoop LZO source somewhere you have access to.

cd kevinweil-hadoop-lzo-4c5a227

Do the following if your Ant version is less than 1.7:

export ANT_HOME=<path_to_ant_downloaded_dir>


<path_to_ant> compile-native tar

In my case that is:

~/apache-ant-1.8.2/bin/ant compile-native tar

Carefully check the compiler output. If you see errors like this:

     [exec] checking for strerror_r... yes
     [exec] checking whether strerror_r returns char *... yes
     [exec] checking for mkdir... yes
     [exec] checking for uname... yes
     [exec] checking for memset... yes
     [exec] checking for JNI_GetCreatedJavaVMs in -ljvm... no
     [exec] checking jni.h usability... 
     [exec] configure: error: Native java headers not found. Is $JAVA_HOME set correctly?
     [exec] no
     [exec] checking jni.h presence... no
     [exec] checking for jni.h... no

/..../kevinweil-hadoop-lzo-4c5a227/build.xml:247: exec returned: 1
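This means the configure script could not locate the JNI headers. A quick way to check whether a JDK directory actually contains them — the /usr/lib/jvm/java default below is just an example path, not universal:

```shell
# check_jni: report whether a JDK directory contains the JNI headers
# (include/jni.h) that the native build needs.
check_jni() {
    if [ -f "$1/include/jni.h" ]; then
        echo "jni.h found under $1"
    else
        echo "jni.h missing under $1 -- point JAVA_HOME at a full JDK"
    fi
}
check_jni "${JAVA_HOME:-/usr/lib/jvm/java}"
```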

Then you will need to find your Java path and update the build.xml file, adding a “JAVA_HOME” setting at line 247; in my case it is “/usr/lib/jvm/java/”:

<exec dir="${build.native}" executable="sh" failonerror="true">
   <env key="OS_NAME" value="${}"/>
   <env key="JAVA_HOME" value="/usr/lib/jvm/java/" />
   <env key="OS_ARCH" value="${os.arch}"/>
   <env key="JVM_DATA_MODEL" value="${}"/>
   <env key="NATIVE_SRCDIR" value="${native.src.dir}"/>
   <arg line="${native.src.dir}/configure"/>
</exec>

and re-run the compiler:

<path_to_ant> compile-native tar

If everything goes well, you should get “BUILD SUCCESSFUL” message at the end of the compile process.

Now do

ls -al build

in the current directory and you will see the following files generated:

drwxr-xr-x 9 root root    4096 Oct 19 17:21 .
drwxr-xr-x 6 root root    4096 Oct 19 17:21 ..
drwxr-xr-x 4 root root    4096 Oct 19 17:21 classes
drwxr-xr-x 3 root root    4096 Oct 19 17:21 docs
drwxr-xr-x 6 root root    4096 Oct 19 17:21 hadoop-lzo-0.4.14
-rw-r--r-- 1 root root   62239 Oct 19 17:21 hadoop-lzo-0.4.14.jar
-rw-r--r-- 1 root root 1824851 Oct 19 17:21 hadoop-lzo-0.4.14.tar.gz
drwxr-xr-x 5 root root    4096 Oct 19 16:59 ivy
drwxr-xr-x 3 root root    4096 Oct 19 17:12 native
drwxr-xr-x 2 root root    4096 Oct 19 17:12 src
drwxr-xr-x 3 root root    4096 Oct 19 17:12 test

The most important one is hadoop-lzo-0.4.14.jar, which can be copied to Hadoop’s library directory, ready to be used.