Impala Reported Corrupt Parquet File After Failed With OutOfMemory Error

Recently I was dealing with an issue where Impala reported a corrupt Parquet file after a query failed with an OutOfMemory error; if the query does not fail, no corruption is reported. See the error message below, reported in the Impala Daemon logs:
Memory limit exceeded
HdfsParquetScanner::ReadDataPage() failed to allocate 65535 bytes for decompressed data.
Corrupt Parquet file 'hdfs://nameservice1/path/to/file/914164e7120e6076-cdae1be60000001f_169433548_data.0.parq': column 'client_ord_id' had 1024 remaining values but expected 0 _
[Executed: 4/29/2017 5:28:58 AM] [Execution: 588ms]
This is reported in the upstream JIRA IMPALA-5197, and it can happen in the following scenarios:
  • Query failed with OOM error
  • There is a LIMIT clause in the query (see the example right after this list)
  • Query is manually cancelled by the user
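To make the LIMIT case concrete, here is a minimal sketch (my_db.orders is a made-up table name; client_ord_id is borrowed from the log above). Once the LIMIT is satisfied, Impala can stop the Parquet scanners mid-file, and the leftover-values check then logs the spurious corruption warning even though the file is fine:

-- Hypothetical query: stopping the scan early via LIMIT can trigger the
-- "had N remaining values but expected 0" message described in IMPALA-5197.
SELECT client_ord_id
FROM my_db.orders
LIMIT 10;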
These corruption messages do not mean the file is actually corrupted; they are caused by the Impala bug mentioned earlier, IMPALA-5197. If the failure is caused by an OutOfMemory error, simply increase the memory limit for the query and try again:
SET MEM_LIMIT=10g;
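A slightly fuller sketch of the same workaround in impala-shell (the 10g value and the query are placeholders; size the limit to your own workload and substitute the query that actually failed):

-- Raise the per-query memory limit for this session; MEM_LIMIT accepts
-- values such as 500m, 10g or a plain byte count.
SET MEM_LIMIT=10g;

-- Re-run the query that previously hit the OOM error (placeholder query,
-- my_db.orders is a hypothetical table).
SELECT client_ord_id, COUNT(*)
FROM my_db.orders
GROUP BY client_ord_id;

-- Setting MEM_LIMIT back to 0 restores the default (no query-level limit).
SET MEM_LIMIT=0;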
For the other two causes, we will need to wait for IMPALA-5197 to be fixed.

Update: IMPALA-5197 has been fixed since CDH5.12.0, as well as in CDH5.10.2, CDH5.9.3 and CDH5.11.2.
