This article explains how to redirect Parquet's log messages from STDOUT to STDERR, so that Hive query results are not polluted when a user captures them on the command line.
Parquet's code base writes its logging information directly to STDOUT. This can cause applications to fail, because the log messages are captured together with the query result. See the example below:
1. A table stored in TEXT file format works as expected:
$ test=`hive -e "SELECT * FROM default.test"`
$ echo $test
2 5 4 3 2 1 5 4 3 2
2. However, if you do the same thing for a Parquet table, the result is different:
$ test_parquet=`hive -e "SELECT * FROM default.test_parquet"`
$ echo $test_parquet
2 5 4 3 2 1 5 4 3 2 16/08/2016 5:55:32 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl 16/08/2016 5:55:32 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records. 16/08/2016 5:55:32 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block 16/08/2016 5:55:32 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 15 ms. row count = 10
So if an application tries to use the variable $test_parquet, it will run into problems because of the WARNING and INFO messages mixed into the result.
This problem has been reported upstream as JIRA HIVE-13954. However, at the time of writing (CDH 5.8.1), the fix has not yet been backported into CDH.
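The difference between the two streams can be reproduced without Hive. The following is a minimal sketch in which `noisy_query` and `quiet_query` are hypothetical stand-ins for a Hive query (they are not part of Hive or Parquet); it only illustrates that command substitution captures STDOUT and leaves STDERR alone:

```shell
#!/bin/sh
# Hypothetical stand-ins for a Hive query; neither function calls Hive.

# Writes result rows AND a log line to STDOUT (the problematic behaviour).
noisy_query() {
    printf '2\n5\n4\n'
    echo "INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block"
}

# Writes result rows to STDOUT and the same log line to STDERR.
quiet_query() {
    printf '2\n5\n4\n'
    echo "INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block" >&2
}

# Command substitution only captures STDOUT.
polluted=$(noisy_query)   # rows plus the log line end up in the variable
clean=$(quiet_query)      # only the rows; the log line goes to the terminal

echo $polluted
echo $clean
```

The `$(...)` form used here behaves the same as the backticks used in the examples above.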
To work around the problem, follow the steps below:
- Save the following content to a file:
#===============
parquet.handlers= java.util.logging.ConsoleHandler
.level=INFO
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=[%1$tc] %4$s: %2$s - %5$s %6$s%n
#===============
and put it anywhere you like on the client machine where you will run the Hive CLI. In my test I put it under /tmp/parquet/parquet-logging2.properties
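One way to create the file from the shell is sketched below. The variable name `PARQUET_LOG_CONF` is an assumption introduced for illustration; the path and the property lines are the ones given above. The heredoc delimiter is quoted so that the shell leaves sequences such as `%1$tc` untouched:

```shell
#!/bin/sh
# PARQUET_LOG_CONF is a hypothetical variable name; the default path is the one used in this article.
PARQUET_LOG_CONF="${PARQUET_LOG_CONF:-/tmp/parquet/parquet-logging2.properties}"

# Create the target directory if it does not exist yet.
mkdir -p "$(dirname "$PARQUET_LOG_CONF")"

# Quoted 'EOF' prevents the shell from expanding $t, $s etc. inside the format string.
cat > "$PARQUET_LOG_CONF" <<'EOF'
parquet.handlers= java.util.logging.ConsoleHandler
.level=INFO
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=[%1$tc] %4$s: %2$s - %5$s %6$s%n
EOF
```

The `#===============` delimiter lines are left out here; in a properties file they would only be comments.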
- Run the following command in the shell before you run the Hive CLI:
Change the path to the properties file to match where you saved it.
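The command for this step does not appear in the text above. A plausible form is sketched here under two assumptions: that the Hive CLI launcher passes the `HADOOP_CLIENT_OPTS` environment variable to the client JVM, and that the properties file is picked up through the standard `java.util.logging.config.file` system property. Treat it as a sketch and check it against your own environment:

```shell
#!/bin/sh
# Path to the properties file saved in the previous step; adjust as needed.
PARQUET_LOG_CONF="/tmp/parquet/parquet-logging2.properties"

# Assumption: the Hive CLI launcher reads HADOOP_CLIENT_OPTS for client JVM options.
# Any value already set is kept and the logging option is appended to it.
export HADOOP_CLIENT_OPTS="${HADOOP_CLIENT_OPTS:+$HADOOP_CLIENT_OPTS }-Djava.util.logging.config.file=${PARQUET_LOG_CONF}"
```

The setting only applies to the current shell session, so it has to be exported again (or added to a profile script) in every session that runs the Hive CLI.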
- Run your Hive CLI command:
test_parquet=`hive -e "SELECT * FROM default.test_parquet"`
The query result alone will be saved in "$test_parquet", as expected.
Note: Hive CLI is now deprecated. We strongly advise connecting to HS2 through a JDBC or ODBC driver to get proper results; we do not recommend parsing results from the Hive CLI output.
Note: This workaround only works with the CDH version of Hive, because the upstream version uses different package names for the Parquet classes than CDH does.
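For a shell script, one way to follow this advice is Beeline, the JDBC command-line client shipped with Hive. The sketch below is illustrative only: the host name, port, and the helper name `run_hs2_query` are hypothetical, and a secured cluster would need extra connection parameters (for example a Kerberos principal) that are not shown.

```shell
#!/bin/sh
# Hypothetical HiveServer2 JDBC URL; replace host, port and database with your own.
HS2_URL="${HS2_URL:-jdbc:hive2://hs2-host.example.com:10000/default}"

# Hypothetical helper: runs one query through Beeline and prints only the
# result rows on STDOUT (no header, no progress messages, tab-separated).
run_hs2_query() {
    beeline -u "$HS2_URL" --silent=true --showHeader=false --outputformat=tsv2 -e "$1"
}

# Usage (requires a reachable HiveServer2):
#   test_parquet=$(run_hs2_query "SELECT * FROM default.test_parquet")
```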