Yesterday, I was dealing with an issue where a very simple Impala SELECT query failed with an "Incompatible Parquet schema" error. I confirmed the following workflow that triggered the error:
- A Parquet file is created by an external library
- The Parquet file is loaded into a Hive/Impala table
- Querying the table through Impala fails with the error message below
incompatible Parquet schema for column 'db_name.tbl_name.col_name'. Column type: DECIMAL(19, 0), Parquet schema:
optional byte_array col_name [i:2 d:1 r:0]
- The same query works well in Hive
This is because Impala currently does not support all of the decimal representations that Parquet allows. Parquet can store decimals using the following physical types:
- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
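As a rough illustration of the rules above (the function name and its return convention are my own, not part of any Parquet library), the smallest physical type that can hold a decimal of a given precision could be computed like this:

```python
import math

def smallest_parquet_storage(precision):
    """Pick the smallest Parquet physical type for a decimal of the
    given precision, following the rules listed above.

    Returns "int32", "int64", or ("fixed_len_byte_array", n) where n
    is the minimum array length in bytes.
    """
    if 1 <= precision <= 9:
        return "int32"
    if precision <= 18:
        return "int64"
    # A fixed_len_byte_array of length n can store
    # floor(log10(2^(8*n - 1) - 1)) base-10 digits, so search for the
    # smallest n that covers the requested precision.
    n = 1
    while math.floor(math.log10(2 ** (8 * n - 1) - 1)) < precision:
        n += 1
    return ("fixed_len_byte_array", n)
```

For the DECIMAL(19, 0) column in the error above, this yields a 9-byte fixed_len_byte_array, since 8 bytes top out at 18 digits.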
Please refer to the Parquet Logical Type Definitions page for details.
However, Impala only supports the fixed_len_byte_array representation and none of the others. This limitation has been reported in the upstream JIRA: IMPALA-2494
The only workaround for now is to generate the Parquet file using a representation of the DECIMAL column that Impala supports, or simply to create the Parquet file through Hive or Impala in the first place.