Impala query failed with error: “Incompatible Parquet Schema”

Yesterday, I was dealing with an issue where a very simple Impala SELECT query failed with an “Incompatible Parquet schema” error. I have confirmed the following workflow that triggers the error:

  1. A Parquet file is created by an external library
  2. The Parquet file is loaded into a Hive/Impala table
  3. Querying the table through Impala fails with the error message below:
    incompatible Parquet schema for column 'db_name.tbl_name.col_name'. 
    Column type: DECIMAL(19, 0), Parquet schema:\noptional byte_array col_name [i:2 d:1 r:0]
  4. The same query works well in Hive

This is because Impala currently does not support all of the DECIMAL representations that Parquet supports. Parquet allows DECIMAL values to be stored using the following physical types:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a warning
  • fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

Please refer to the Parquet Logical Type Definitions page for details.

However, Impala only supports fixed_len_byte_array and none of the others. This has been reported in the upstream JIRA: IMPALA-2494.

The only workaround for now is to generate the Parquet file using a supported representation for the DECIMAL columns, or simply to create the Parquet file through either Hive or Impala.
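As a minimal sketch of the second workaround (the table names below are placeholders based on the error message above), the data can be rewritten through Hive, which by default stores DECIMAL columns as fixed_len_byte_array, the representation Impala supports:

-- Run from Hive, which can still read the externally generated file.
-- db_name.tbl_name is the problem table; tbl_name_fixed is a hypothetical replacement.
CREATE TABLE db_name.tbl_name_fixed LIKE db_name.tbl_name;

-- Rewriting the data stores the DECIMAL(19,0) column in a representation
-- that Impala can read.
INSERT OVERWRITE TABLE db_name.tbl_name_fixed SELECT * FROM db_name.tbl_name;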

Impala Query fails with NoSuchObjectException error

In the last few months, I have seen CDH users frequently hitting a NoSuchObjectException error from Impala queries. This happens when running a query against a table with INT partition columns, and it fails with the message below:

WARNINGS:
ImpalaRuntimeException: Error making 'alter_partitions' RPC to Hive Metastore:
CAUSED BY: InvalidOperationException: Alter partition operation failed: NoSuchObjectException(message:partition values=[2017, 6, 1, 8])

We have confirmed that the table has four partition columns of integer type, and that selecting an individual partition works.
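For example, a query like the following (the table and partition column names follow the example later in this post) returns data without error when filtering on a single partition:

-- Selecting one specific partition works fine in Impala.
SELECT COUNT(*) FROM test_tbl
WHERE year = 2017 AND month = 6 AND day = 1 AND hour = 8;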

The following scenario triggers this error:

  • Partition columns use the INT data type
  • Partition data was inserted from Hive with leading zeros in the partition values, using a query like the one below:
    INSERT OVERWRITE TABLE test_tbl PARTITION (year = '2017', month = '06'....) .....
    
  • Partition data is created under an HDFS location like the one below:
    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=06/day=01/hour=08
    
  • When the table is queried through Impala, since the data type is INT, Impala converts the values from “06” to 6, “01” to 1, etc., and looks for the location:
    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=6/day=1/hour=8

    instead of:

    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=06/day=01/hour=08

    hence triggering the NoSuchObjectException error.

To fix the issue, there are two options:

  1. Convert the data type of the partition columns to STRING instead of INT:
    ALTER TABLE test_tbl PARTITION COLUMN (year string);
    ALTER TABLE test_tbl PARTITION COLUMN (month string);
    ALTER TABLE test_tbl PARTITION COLUMN (day string);
    ALTER TABLE test_tbl PARTITION COLUMN (hour string);
    
  2. If the integer types need to be kept, then we will need to rebuild the table into a new one whose partition locations have no leading zeros. This can be done by running the following queries from Impala:
    CREATE TABLE new_test_tbl LIKE test_tbl;
    
    INSERT OVERWRITE TABLE new_test_tbl PARTITION (year, month, day, hour) SELECT * FROM test_tbl;
    

    The new table will have the leading zeros removed from its partition directories, and we can then switch over to using the new table. When writing more data into the new table through Hive, please be sure to remove all leading zeros from the partition values to prevent the issue from happening again, as in the sketch below.
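For example, a Hive insert that writes into the new table without leading zeros could look like this (the source table and column list are hypothetical):

-- Hive: use unquoted integer partition values (6, not '06') so the
-- partition directories are created without leading zeros.
INSERT OVERWRITE TABLE new_test_tbl PARTITION (year = 2017, month = 6, day = 1, hour = 8)
SELECT col1, col2 FROM source_tbl;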

The steps above should resolve the issue. Hope they help.

Read Files under Sub-Directories for Hive and Impala

Sometimes you might want to store data under sub-directories in HDFS and have Hive or Impala read from those sub-directories. For example, say you have the following directory structure:

root hdfs     231206 2017-06-30 02:45 /test/table1/000000_0
root hdfs          0 2017-06-30 02:45 /test/table1/child_directory
root hdfs     231206 2017-06-30 02:45 /test/table1/child_directory/000000_0

By default, Hive only looks for files in the root of the specified directory, which in my test case is /test/table1. However, Hive can also read all data under the table root’s sub-directories. This can be enabled by updating the following settings:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
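
As a minimal end-to-end sketch (the table name and column below are made up for illustration), a Hive session that also picks up the file under the child directory would look like this:

-- Enable recursive reads for this session.
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

-- Hypothetical external table pointing at the parent directory; files in
-- /test/table1 and /test/table1/child_directory are both read.
CREATE EXTERNAL TABLE table1 (line STRING)
LOCATION '/test/table1';

SELECT COUNT(*) FROM table1;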

Impala, on the other hand, currently does not support reading files from a table’s sub-directories. This has been reported in the upstream JIRA IMPALA-1944. There is currently no immediate plan to support this feature, but it might come in a future release of Impala.

Hope the above information is useful.

Impala Auto Update Metadata Support

Lots of CDH users have requested that Impala support automatic metadata updates, so that they do NOT need to run “INVALIDATE METADATA” every time a table is created or data is updated through other components, like Hive or Pig.
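For reference, the manual workaround is to run INVALIDATE METADATA after a table is created outside of Impala, and REFRESH after data files are added to an existing table (the table names below are placeholders):

-- After creating a new table through Hive or Pig:
INVALIDATE METADATA db_name.new_table;

-- After adding or changing data files of an existing table outside of Impala:
REFRESH db_name.existing_table;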

I would like to share that this has been reported upstream and is tracked via JIRA: IMPALA-3124. It is currently still not implemented, as it requires a detailed design to make sure it works properly.

There is no ETA at this stage on when this feature will be added. I would advise anyone interested in this feature to add comments to the JIRA. The more votes it gets, the better the chance that it will be prioritized.

Hope the above information is helpful.

Impala Reported Corrupt Parquet File After Failing With OutOfMemory Error

Recently I was dealing with an issue where Impala reported a corrupt Parquet file after a query failed with an OutOfMemory error; however, if the query does not fail, no corruption is reported.

See the error message below, reported in the Impala Daemon logs when a query failed with the OOM error:

Memory limit exceeded
HdfsParquetScanner::ReadDataPage() failed to allocate 65535 bytes for decompressed data.
Corrupt Parquet file 'hdfs://nameservice1/path/to/file/914164e7120e6076-cdae1be60000001f_169433548_data.0.parq': column 'client_ord_id' had 1024 remaining values but expected 0 _
[Executed: 4/29/2017 5:28:58 AM] [Execution: 588ms]

This has been reported in the upstream JIRA IMPALA-5197, and it can happen in the following scenarios:

  • Query failed with OOM error
  • There is a LIMIT clause in the query
  • Query is manually cancelled by the user

These corruption messages do not mean the file is actually corrupt; they are caused by the Impala bug mentioned earlier, IMPALA-5197.

If the failure is caused by an OutOfMemory error, simply increase the memory limit for the query and try again:

SET MEM_LIMIT=10g;

For the other two causes, we will need to wait for IMPALA-5197 to be fixed.