Hive unable to read Snappy files generated by Hive and Flume together

This article explains a workaround for the Hive query failure that occurs when Snappy files generated by Hive and by Flume sit under the same directory.

The following steps reproduce the issue:

  1. Start with a Hive table (from_flume) whose data is ingested by Flume
  2. Create another table with the same column structure (from_hive)
  3. Insert data into the new table by selecting from the old table. This works, and SELECT COUNT(*) returns the correct result:
    INSERT INTO from_hive SELECT * FROM from_flume;
    
  4. At this stage, SELECT queries work on both the old and the new table
  5. Copy the data generated by Flume into the new table’s location, so that the Flume files and the Hive files sit under the same table location
  6. Running SELECT * against the new table then fails with the following error:
    Error: java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:226)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:136)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:105)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:224)
    ... 11 more
    Caused by: java.lang.IndexOutOfBoundsException
    at java.io.DataInputStream.readFully(DataInputStream.java:192)
    at org.apache.hadoop.io.Text.readWithKnownLength(Text.java:319)
    at org.apache.hadoop.io.Text.readFields(Text.java:291)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2245)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2229)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
    ... 15 more
    

This happens because the Snappy files generated by Hive and by Flume are not compatible. Each set works on its own under a separate table; the query fails only when both are placed under the same table or partition.

This comes down to how Hive and Flume create these files. Both are Snappy-compressed SequenceFiles, but their headers declare different key and value classes:

Flume Source:


SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable)org.apache.hadoop.io.compress.SnappyCodec

Select Source:


SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)org.apache.hadoop.io.compress.SnappyCodec

So when Hive uses the CombineHiveInputFormat class (the default) to read these files, one mapper can read multiple files. If a Flume-generated file and a Hive-generated file end up in the same mapper, Hive cannot read them together properly because of their different structures.
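To check which kind of file is which, the header fields can be read programmatically. The following is a minimal Python sketch, not part of Hive or Flume: the function names (`read_sequencefile_header`, `synthetic_header`) are hypothetical, and it assumes a SequenceFile header of version 5 or later whose class names are shorter than 128 bytes, so that each name is preceded by a single length byte. Those length bytes are the stray `!`, `"` and `)` characters visible in the dumps above (33, 34 and 41). The demo parses header bytes built in the script from the class names shown above, not bytes read from a real file, and the block-compression flag in them is an assumed value.

```python
import io


def _read_byte(stream):
    """Read exactly one byte and return it as an int."""
    data = stream.read(1)
    if len(data) != 1:
        raise ValueError("truncated header")
    return data[0]


def _read_name(stream):
    """Read a length-prefixed UTF-8 class name (single-byte length only)."""
    length = _read_byte(stream)
    if length > 127:
        raise ValueError("multi-byte length prefix not handled by this sketch")
    raw = stream.read(length)
    if len(raw) != length:
        raise ValueError("truncated header")
    return raw.decode("utf-8")


def read_sequencefile_header(stream):
    """Parse the leading fields of a SequenceFile header from a binary stream."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = _read_byte(stream)
    if version < 5:
        raise ValueError("header version %d not handled by this sketch" % version)
    key_class = _read_name(stream)
    value_class = _read_name(stream)
    compressed = _read_byte(stream) != 0
    block_compressed = _read_byte(stream) != 0
    # The codec class name is only present when the file is compressed.
    codec = _read_name(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


def synthetic_header(key_class, value_class, codec):
    """Build demo header bytes (synthetic, not read from a real file)."""
    def name(text):
        raw = text.encode("utf-8")
        return bytes([len(raw)]) + raw
    # Version 6, compressed flag set, block-compression flag set (assumed).
    return b"SEQ\x06" + name(key_class) + name(value_class) + b"\x01\x01" + name(codec)


SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"
samples = {
    "flume": synthetic_header("org.apache.hadoop.io.LongWritable",
                              "org.apache.hadoop.io.BytesWritable", SNAPPY),
    "hive": synthetic_header("org.apache.hadoop.io.BytesWritable",
                             "org.apache.hadoop.io.Text", SNAPPY),
}
for label, data in samples.items():
    header = read_sequencefile_header(io.BytesIO(data))
    print(label, header["key_class"], header["value_class"], header["codec"])
```

For a real file, the same function can be given a local copy opened with `open(path, "rb")`. If two files under one table location report different key or value classes, they are candidates for the failure described here.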

The solution is to set hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat, which stops Hive from using the CombineHiveInputFormat class to combine multiple Snappy files when reading. This ensures that each mapper reads only one file. The side effect is that more mappers are used, or that files are processed sequentially:


SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
