Impala Reported Corrupt Parquet File After Failed With OutOfMemory Error

Recently I was dealing with an issue that impala reported Corrupt Parquet File after it failed with OutOfMemory error, however, if it does not fail, no corruption will be reported.

See below error message reportd in Impala Daemon logs:

Memory limit exceeded
HdfsParquetScanner::ReadDataPage() failed to allocate 65535 bytes for decompressed data.
Corrupt Parquet file 'hdfs://nameservice1/path/to/file/914164e7120e6076-cdae1be60000001f_169433548_data.0.parq': column 'client_ord_id' had 1024 remaining values but expected 0 _
[Executed: 4/29/2017 5:28:58 AM] [Execution: 588ms]
When an impala query failed with OOM error, it also reported corrupted parquet file:

HdfsParquetScanner::ReadDataPage() failed to allocate 65535 bytes for decompressed data.
Corrupt Parquet file 'hdfs://nameservice1/path/to/file/914164e7120e6076-cdae1be60000001f_169433548_data.0.parq': column 'client_ord_id' had 1024 remaining values but expected 0 _
[Executed: 4/29/2017 5:28:58 AM] [Execution: 588ms]

This is reported in the upstream JIRA: IMPALA-5197, this can happen in the following scenarios:

  • Query failed with OOM error
  • There is a LIMIT clause in the query
  • Query is manually cancelled by the user

Those corrupt messages do not mean the file is really corrupted, it is caused by an Impala bug that mentioned earlier IMPALA-5197.

If it is caused by OutOfMemory error, simply increase the memory limit for the query and try again:

SET MEM_LIMIT=10g;

For the other two causes, we will need to wait for IMPALA-5197 to be fixed