How to Index LZO files in Hadoop

Today I was trying to index an LZO file using the following hadoop command:

hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test

However, it failed with the following error:

16/09/10 03:05:51 INFO mapreduce.Job: Task Id : attempt_1473404927068_0005_m_000000_0, Status : FAILED
Error: java.lang.NullPointerException
    at com.hadoop.mapreduce.LzoSplitRecordReader.initialize(LzoSplitRecordReader.java:50)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

After some research, it turned out that I did not have the LZO codec enabled in my cluster. I did the following to resolve the issue:

– Add the GPL Extras parcel to Cloudera Manager’s parcel configuration:
https://archive.cloudera.com/gplextras5/parcels/x.x.x

where x.x.x should match your CDH version; in my case it is 5.7.0.
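
For example, assuming CDH 5.7.0, the remote parcel repository URL becomes:

https://archive.cloudera.com/gplextras5/parcels/5.7.0/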

– Install the GPL Extras parcel through Cloudera Manager as normal
– Add the following codecs to Compression Codec (io.compression.codecs) in Cloudera Manager > HDFS > Configuration:

com.hadoop.compression.lzo.LzopCodec
com.hadoop.compression.lzo.LzoCodec
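
For reference, a minimal sketch of how the resulting io.compression.codecs property might look in the generated core-site.xml; the other codecs listed here are the usual defaults and may differ in your cluster:

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec,com.hadoop.compression.lzo.LzoCodec</value>
  </property>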

– Restart the cluster
– Deploy client configuration
– Install the native LZO library on all hosts in the cluster:

sudo yum install lzop

After the above changes, the index job should finish without issues:

hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test

16/09/10 03:22:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
16/09/10 03:22:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 820f83d05b8d916e89dbb72d6ef129113b277303]
16/09/10 03:22:12 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://host-10-17-101-195.coe.cloudera.com:8020/tmp/lzo_test/test.txt.lzo to indexing list (no index currently exists)
16/09/10 03:22:12 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/09/10 03:22:12 INFO client.RMProxy: Connecting to ResourceManager at host-10-17-101-195.coe.cloudera.com/10.17.101.195:8032
16/09/10 03:22:12 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 119 for hive on 10.17.101.195:8020
16/09/10 03:22:12 INFO security.TokenCache: Got dt for hdfs://host-10-17-101-195.coe.cloudera.com:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.101.195:8020, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/09/10 03:22:12 INFO input.FileInputFormat: Total input paths to process : 1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: number of splits:1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.101.195:8020, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:13 INFO impl.YarnClientImpl: Submitted application application_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.Job: The url to track the job: http://host-10-17-101-195.coe.cloudera.com:8088/proxy/application_1473502403576_0002/
16/09/10 03:22:13 INFO mapreduce.Job: Running job: job_1473502403576_0002
16/09/10 03:22:22 INFO mapreduce.Job: Job job_1473502403576_0002 running in uber mode : false
16/09/10 03:22:22 INFO mapreduce.Job: map 0% reduce 0%
16/09/10 03:22:35 INFO mapreduce.Job: map 22% reduce 0%
16/09/10 03:22:38 INFO mapreduce.Job: map 36% reduce 0%
16/09/10 03:22:41 INFO mapreduce.Job: map 50% reduce 0%
16/09/10 03:22:44 INFO mapreduce.Job: map 62% reduce 0%
16/09/10 03:22:47 INFO mapreduce.Job: map 76% reduce 0%
16/09/10 03:22:50 INFO mapreduce.Job: map 90% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job: map 100% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job: Job job_1473502403576_0002 completed successfully
16/09/10 03:22:53 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=118080
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=85196
        HDFS: Number of bytes written=85008
        HDFS: Number of read operations=2
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=28846
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=28846
        Total vcore-seconds taken by all map tasks=28846
        Total megabyte-seconds taken by all map tasks=29538304
    Map-Reduce Framework
        Map input records=10626
        Map output records=10626
        Input split bytes=138
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=32
        CPU time spent (ms)=3960
        Physical memory (bytes) snapshot=233476096
        Virtual memory (bytes) snapshot=1564155904
        Total committed heap usage (bytes)=176160768
    File Input Format Counters
        Bytes Read=85058
    File Output Format Counters
        Bytes Written=0
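
To double-check the result, you can list the input directory; DistributedLzoIndexer writes an .index file next to each .lzo file it processes:

hadoop fs -ls /tmp/lzo_test

For the test file above you should now see /tmp/lzo_test/test.txt.lzo.index alongside /tmp/lzo_test/test.txt.lzo.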

Enable Snappy Compression For Flume

Snappy is a compression/decompression library developed by Google. It aims for very high speed and reasonable compression (its output may be larger than that of other standard compression algorithms, but it compresses and decompresses much faster). Snappy ships with Hadoop, unlike LZO compression, which is excluded due to licensing issues. To enable Snappy in your Flume installation, follow the steps below:

Install on Red Hat systems:

$ sudo yum install hadoop-0.20-native

Install on Ubuntu systems:

$ sudo apt-get install hadoop-0.20-native

This should create a directory under /usr/lib/hadoop/lib/native/ which contains some native hadoop libraries.
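
To verify, list the native directory; on a 64-bit host you should see a Linux-amd64-64 subdirectory (Linux-i386-32 on 32-bit), matching the paths used below:

$ ls /usr/lib/hadoop/lib/native/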

Then create the environment configuration file for Flume:

$ cp /usr/lib/flume/bin/flume-env.sh.template /usr/lib/flume/bin/flume-env.sh

And update the last line in the file to one of the following:

For 32-bit platform

$ export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-i386-32

For 64-bit platform

$ export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

Next, update Flume’s configuration file under “/etc/flume/conf/flume-site.xml” on the collector node to include:

  <property>
    <name>flume.collector.dfs.compress.codec</name>
    <value>SnappyCodec</value>
    <description>Writes formatted data compressed in specified codec to
    dfs. Value is None, GzipCodec, DefaultCodec (deflate), BZip2Codec, SnappyCodec
    or any other Codec Hadoop is aware of </description>
  </property>
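
If the collector does not have a flume-site.xml yet, a minimal sketch of the whole file containing just this override would look like the following (keep any other site overrides you already have):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>flume.collector.dfs.compress.codec</name>
    <value>SnappyCodec</value>
  </property>
</configuration>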

And finally, restart the flume-node service:

$ /etc/init.d/flume-node restart

Your next file update in HDFS will look something like the following:

-rw-r--r--   3 flume supergroup          0 2011-10-21 14:01 /data/traffic/Y2011_M9_W37_D254/R0_P0/C1_20111021-140124175+1100.955183363700204.00000244.snappy.tmp
-rw-r--r--   3 flume supergroup   35156526 2011-10-20 16:51 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-164928958+1100.780424004236302.00000018.snappy
-rw-r--r--   3 flume supergroup     830565 2011-10-20 17:15 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-171423368+1100.781918413572302.00000018.snappy
-rw-r--r--   3 flume supergroup          0 2011-10-20 17:19 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-171853599+1100.782188644505302.00000042.snappy.tmp
-rw-r--r--   3 flume supergroup    1261171 2011-10-20 17:37 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-173728225+1100.783303271088302.00000018.snappy
-rw-r--r--   3 flume supergroup    2128701 2011-10-20 17:40 /data/traffic/Y2011_M9_W37_D254/R0_P0/C2_20111020-174024045+1100.783479090669302.00000046.snappy

Happy Fluming..

Compile Hadoop LZO Compression Library on CentOS

To compile and install Hadoop’s LZO compression library on CentOS, follow the steps below:

Download the Hadoop LZO source from Kevin Weil’s hadoop-lzo project on GitHub.

If you are using an Ant version older than 1.7, please download the latest Ant binary package from Apache Ant first; otherwise you will get the following error when compiling:

BUILD FAILED
/root/kevinweil-hadoop-lzo-4c5a227/build.xml:510: Class org.apache.tools.ant.taskdefs.ConditionTask doesn't support the nested "typefound" element.

Install lzo-devel and lzop:

yum install lzo-devel.x86_64
yum install lzop.x86_64

Unzip Apache Ant and the Hadoop LZO compression library somewhere you have access to:

unzip kevinweil-hadoop-lzo-4c5a227.zip
cd kevinweil-hadoop-lzo-4c5a227

Do the following if your Ant version is older than 1.7:

export ANT_HOME=<path_to_ant_downloaded_dir>
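
For example, if you unpacked Ant 1.8.2 into your home directory (the same path used further below):

export ANT_HOME=~/apache-ant-1.8.2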

Then run:

<path_to_ant> compile-native tar

In my case that is:

~/apache-ant-1.8.2/bin/ant compile-native tar

Carefully check the compiler output. If you see errors like this:

     [exec] checking for strerror_r... yes
     [exec] checking whether strerror_r returns char *... yes
     [exec] checking for mkdir... yes
     [exec] checking for uname... yes
     [exec] checking for memset... yes
     [exec] checking for JNI_GetCreatedJavaVMs in -ljvm... no
     [exec] checking jni.h usability... 
     [exec] configure: error: Native java headers not found. Is $JAVA_HOME set correctly?
     [exec] no
     [exec] checking jni.h presence... no
     [exec] checking for jni.h... no

BUILD FAILED
/..../kevinweil-hadoop-lzo-4c5a227/build.xml:247: exec returned: 1

Then you will need to find your Java path and update the build.xml file to add a “JAVA_HOME” setting to the exec block on line 247; in my case it is “/usr/lib/jvm/java/”:

<exec dir="${build.native}" executable="sh" failonerror="true">
   <env key="OS_NAME" value="${os.name}"/>
   <env key="JAVA_HOME" value="/usr/lib/jvm/java/" />
   <env key="OS_ARCH" value="${os.arch}"/>
   <env key="JVM_DATA_MODEL" value="${sun.arch.data.model}"/>
   <env key="NATIVE_SRCDIR" value="${native.src.dir}"/>
   <arg line="${native.src.dir}/configure"/>
</exec>
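
Before re-running the build, it is worth confirming that the JNI headers really exist under the JAVA_HOME you chose, for example:

ls /usr/lib/jvm/java/include/jni.h

If that file is missing, JAVA_HOME is probably pointing at a JRE rather than a full JDK.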

Then re-run the compile:

<path_to_ant> compile-native tar

If everything goes well, you should get a “BUILD SUCCESSFUL” message at the end of the compile process.

Now do

ls -al build

in the current directory and you will see the following files generated:

drwxr-xr-x 9 root root    4096 Oct 19 17:21 .
drwxr-xr-x 6 root root    4096 Oct 19 17:21 ..
drwxr-xr-x 4 root root    4096 Oct 19 17:21 classes
drwxr-xr-x 3 root root    4096 Oct 19 17:21 docs
drwxr-xr-x 6 root root    4096 Oct 19 17:21 hadoop-lzo-0.4.14
-rw-r--r-- 1 root root   62239 Oct 19 17:21 hadoop-lzo-0.4.14.jar
-rw-r--r-- 1 root root 1824851 Oct 19 17:21 hadoop-lzo-0.4.14.tar.gz
drwxr-xr-x 5 root root    4096 Oct 19 16:59 ivy
drwxr-xr-x 3 root root    4096 Oct 19 17:12 native
drwxr-xr-x 2 root root    4096 Oct 19 17:12 src
drwxr-xr-x 3 root root    4096 Oct 19 17:12 test

The most important one is hadoop-lzo-0.4.14.jar, which can be copied to Hadoop’s library directory and is then ready to be used.
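
For example, on a 64-bit host the jar and the native libraries built under build/native/ could be copied into place roughly like this (paths are indicative; adjust them to your actual build output and Hadoop layout):

cp build/hadoop-lzo-0.4.14.jar /usr/lib/hadoop/lib/
cp build/native/Linux-amd64-64/lib/libgplcompression* /usr/lib/hadoop/lib/native/Linux-amd64-64/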