Yesterday, I was dealing with an issue where a very simple Impala SELECT query failed with an “Incompatible Parquet schema” error. I confirmed that the following workflow triggers the error:

  1. A Parquet file is created by an external library
  2. The Parquet file is loaded into a Hive/Impala table
  3. Querying the table through Impala fails with the error message below

    incompatible Parquet schema for column 'db_name.tbl_name.col_name'. 
    Column type: DECIMAL(19, 0), Parquet schema:\noptional byte_array col_name [i:2 d:1 r:0]
    

  4. The same query works well in Hive

This happens because Impala currently does not support all of the decimal representations that Parquet supports. Parquet currently supports the following:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a warning
  • fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
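To make the fixed_len_byte_array limit concrete, here is a small sketch that evaluates the formula above for a few array lengths. The function name `max_decimal_digits` is my own and not part of any Parquet or Impala API; it only does the arithmetic.

```python
def max_decimal_digits(n_bytes: int) -> int:
    """Largest base-10 precision that fits in n_bytes bytes, i.e.
    floor(log10(2**(8*n - 1) - 1)) from the list above."""
    if n_bytes < 1:
        raise ValueError("n_bytes must be at least 1")
    largest = 2 ** (8 * n_bytes - 1) - 1
    # For a positive integer x, floor(log10(x)) == len(str(x)) - 1.
    # Using the digit count avoids floating-point rounding for large n.
    return len(str(largest)) - 1


if __name__ == "__main__":
    for n in (4, 8, 9, 16):
        print(f"{n} bytes -> up to {max_decimal_digits(n)} digits")
```

For example, 4 bytes give 9 digits and 8 bytes give 18 digits, which lines up with the int32 and int64 limits in the list.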

Please refer to the Parquet Logical Type Definitions page for details.

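However, Impala supports only fixed_len_byte_array and none of the others. In the error above, the column is stored as byte_array (the binary representation), which is why Impala rejects it. This has been reported upstream in JIRA IMPALA-2494.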

The only workaround for now is to create the Parquet file using a supported representation for the Decimal column, or simply to create the Parquet file through either Hive or Impala.
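If your external library lets you choose the physical type, you will need to know how long the fixed_len_byte_array has to be for a given precision. The sketch below simply inverts the Parquet formula floor(log_10(2^(8*n - 1) - 1)); the function name `min_bytes_for_precision` is hypothetical, and whether and how a particular writer exposes this setting is something to check in that library's documentation.

```python
def min_bytes_for_precision(precision: int) -> int:
    """Smallest fixed_len_byte_array length n (in bytes) such that
    floor(log10(2**(8*n - 1) - 1)) >= precision."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    n = 1
    # len(str(x)) - 1 equals floor(log10(x)) for a positive integer x.
    while len(str(2 ** (8 * n - 1) - 1)) - 1 < precision:
        n += 1
    return n


if __name__ == "__main__":
    for p in (9, 18, 19, 38):
        print(f"DECIMAL({p}, s) needs at least {min_bytes_for_precision(p)} bytes")
```

For the DECIMAL(19, 0) column from the error message, this gives 9 bytes, one more than an int64 can hold.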

8 Comments

  1. Louis

    Hi Eric,

    I have exactly this issue when writing a partition of a Parquet table via Spark. My schema is a combination of strings and decimal(38,17) data types.
    Can you elaborate on the supported specs for Decimal columns?
    Which of those are not supported by Impala?
    Thank you very much for your help.

    Best,

    Louis

  2. krish

    Hello Eric,

    We recently upgraded Cloudera from 5.6.0 to 5.8.3, and since then we have been facing two issues. Since we are out of support from Cloudera, we are hoping for some resolution steps from you.

    1. While running an INSERT OVERWRITE query, it throws an error like “Spilling has been disabled due to no usable scratch space”.
    2. parq’ has an incompatible Parquet schema for column ”. Column type: STRING, Parquet schema:

    1. Eric Lin

      Hi Krish,

      Thanks for visiting my blog and posting questions.

      For 1, this means you have not set up Impala’s scratch directory correctly, so the disk spilling feature won’t work.

      If you are using Cloudera Manager, please go to CM > Impala > Configuration > “Impala Daemon Scratch Directories” and confirm whether any values have been set.

      For 2, this looks like you have a mismatched column type between Impala/Hive and the Parquet file. Your comment seems to have been cut off, as I don’t see anything after “Parquet schema:”. Can you check the data type of that column in Parquet and then update the table in Hive/Impala to match it?

      Cheers
      Eric

  3. krish

    Eric, thanks for the reply.

    For 1, this means you have not set up Impala’s scratch directory correctly, so the disk spilling feature won’t work.
    Response: the scratch directories have been set, and it was working fine with CDH 5.6. After the upgrade to 5.8.3, some of the INSERT OVERWRITE queries are throwing the spilling error.

    We have already set them like this:

    /datanode1/impala/impalad until /datanode10/

  4. krish

    Hello Eric,

    I found something: after upgrading Cloudera Manager from 5.6 to 5.8.3, all the components have been upgraded to 5.8.3, but supervisord is still at version 5.6, like this:

    CDH 5.8.3, Parcels — After upgrade
    Supervisord 3.0-cm5.6.0 —After upgrade

    Basically to start and stop the process, we rely on an open source system called supervisord. It takes care of redirecting log files, notifying us of process failure, setuid’ing to the right user.

    So, I’m assuming that when we run INSERT OVERWRITE, supervisord tries to set the permissions and ownership of the scratch directory as CDH 5.8.3 expects, but since supervisord is still at 5.6.0, it breaks with 5.8.3 and is still listening to Cloudera agent 5.6.0. Is this the reason for the issue “Spilling has been disabled due to no usable scratch space”?

    Please let me know if my understanding is correct. If yes, can we do a hard restart of the Cloudera agent so that supervisord is updated from 5.6.0 to 5.8.3?

    1. Eric Lin

      Hi Krish,

      I highly doubt that the supervisord version issue would cause Impala to behave like this, but regardless, you need to get supervisord to the correct version. Yes, try a hard restart to see if it helps.

      Do you have the Impala daemon log with the error message that you can share with me? For example, you could put it temporarily in a Dropbox shared folder and remove it afterwards.

      I would just like to check exactly what Impala says.

      Cheers
      Eric
