How to confirm Dynamic Partition Pruning works in Impala

This article explains how to confirm that Impala’s new Dynamic Partition Pruning feature is effective in CDH5.7.x. Dynamic Partition Pruning is a new feature introduced in CDH5.7.x / Impala 2.5, where information about partitions is collected at run time and Impala prunes unnecessary partitions in ways that were impractical …

How to Index LZO files in Hadoop

Today I was trying to index an LZO file using the following hadoop command:

    hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test

However, it failed with the following error:

    16/09/10 03:05:51 INFO mapreduce.Job: Task Id : attempt_1473404927068_0005_m_000000_0, Status : FAILED
    Error: java.lang.NullPointerException
        at com.hadoop.mapreduce.LzoSplitRecordReader.initialize(LzoSplitRecordReader.java:50)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at …

How to redirect parquet’s log message into STDERR rather than STDOUT

This article explains the steps needed to redirect Parquet’s log messages from STDOUT to STDERR, so that Hive query output is not polluted when the user wants to capture the query result on the command line. In Parquet’s code base, logging information is written directly to STDOUT, …

Beeline options need to be placed before “-e” option

Recently I needed to deal with an issue where users tried to specify “--incremental=true” as a beeline command line option, because beeline was failing with an OutOfMemory error when fetching results from HiveServer2. This option should help with the OOM problem; however, it did not in this particular case. …
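
As a rough illustration of the ordering the title describes, the sketch below builds a beeline command line with “--incremental=true” placed before “-e”. The JDBC URL and the query are hypothetical placeholders, not values from the original post, and the command is only printed rather than executed.

```shell
# Hypothetical connection URL and query, for illustration only.
JDBC_URL="jdbc:hive2://localhost:10000/default"
QUERY="SELECT 1"

# Options such as --incremental=true go before -e, as the title advises.
BEELINE_CMD="beeline -u $JDBC_URL --incremental=true -e \"$QUERY\""

# Print the command for review instead of running it.
echo "$BEELINE_CMD"
```

Keeping the query as the last argument makes it easy to check at a glance that every option precedes “-e”.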