Impala Failed to Read Parquet Decimal Data

If you are using the Hive JDBC/ODBC driver to create Parquet data via Hive, and you see the error below when reading the data from Impala:

File 'hdfs://namespace1/path/to/parquet/file/15110390563421' column 'column1' has a scale that does not match the table metadata scale.
File metadata scale: 0 Table metadata scale: 2

It is most likely that you have hit the known issue HIVE-15519.

Due to this bug, the DECIMAL scale value returned by the driver is incorrect (it returns null), which causes the data to be written to the Parquet file with the wrong scale for those columns. Impala then complains that the column's definition in the table metadata does not match the column type stored in the Parquet file, because the scale values differ.
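
To confirm that the file itself carries the wrong scale, you can inspect the Parquet footer directly. Below is a minimal sketch using the parquet-tools utility (assuming it is available on a host that can reach the file); the path is the example path from the error above, and the jar name depends on the version you have installed:

# Dump the footer metadata, which includes the DECIMAL precision/scale recorded for each column
hadoop jar parquet-tools-<version>.jar meta hdfs://namespace1/path/to/parquet/file/15110390563421

# Or print just the schema with its logical type annotations
hadoop jar parquet-tools-<version>.jar schema hdfs://namespace1/path/to/parquet/file/15110390563421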

Currently there is no workaround available for this issue; a patch is required, or you will have to wait for it to be fixed in a future release. At the time of writing, the latest Cloudera CDH release is 5.13.1.

Drop Impala UDF function returned NoSuchObjectException error

Impala UDFs work a bit differently from Hive UDFs, as they are written in different languages. Recently I encountered an issue where dropping an Impala UDF written in C++ failed with the error below:

DROP FUNCTION udf_testing.has_vowels;
Query: DROP FUNCTION udf_testing.has_vowels
ERROR:
ImpalaRuntimeException: Error making 'dropFunction' RPC to Hive Metastore:
CAUSED BY: NoSuchObjectException: Function has_vowels does not exist

The “SHOW CREATE FUNCTION” query, however, worked OK, as shown below:

SHOW CREATE FUNCTION has_vowels;
Query: SHOW CREATE FUNCTION has_vowels
+-----------------------------------------------------------------------------+
| result                                                                      |
+-----------------------------------------------------------------------------+
| CREATE FUNCTION udf_testing.has_vowels(STRING)                              |
|  RETURNS BOOLEAN                                                            |
|  LOCATION 'hdfs://nameservice1/user/hive/impala-udf/libudfsample.so'        |
|  SYMBOL='_Z9HasVowelsPN10impala_udf15FunctionContextERKNS_9StringValE'      |
|                                                                             |
+-----------------------------------------------------------------------------+
Fetched 1 row(s) in 0.03s

After some research, it turned out that when dropping functions in Impala, you also need to specify the function's parameter types. So the query below works:

DROP FUNCTION udf_testing.has_vowels(STRING);
Query: DROP FUNCTION udf_testing.has_vowels(STRING)

This is not immediately obvious, but it is documented on Cloudera's official documentation site: DROP FUNCTION Statement.
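
If you do not remember the exact signature, you can list the functions in the database first; the signature column in the output includes the argument types that DROP FUNCTION expects. A minimal sketch, using the udf_testing database from the example above:

-- List the UDFs in the database; the signature column shows the argument types
-- (the exact output layout varies between Impala versions)
SHOW FUNCTIONS IN udf_testing;

-- Then drop the function using its full signature
DROP FUNCTION udf_testing.has_vowels(STRING);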

Hope the above helps.

Setting up Cloudera ODBC driver on Windows 10

Recently I have seen lots of CDH users having trouble setting up the Hive/Impala ODBC driver on a Windows 10 machine to connect to a remote Kerberized cluster. The connection keeps failing with Kerberos-related error messages like the ones below:

[Cloudera][Hardy] (34) Error from server: SASL(-1):
generic failure: GSSAPI Error: Unspecified GSS failure.
Minor code may provide more information (Credential cache is empty).

OR

[Cloudera][ImpalaODBC] (100) Error from the Impala Thrift API:
SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.
Minor code may provide more information (No credentials cache found)

To help CDH users get it working without much hassle, I have compiled the list of steps below for reference. I have tested them on a Windows 10 VM.

1. For Kerberos authentication to work, you need a valid Kerberos ticket on your client machine, which in this case is Windows 10. Hence, you will need to download and install the MIT Kerberos client so that you can authenticate yourself against the remote cluster, much like running “kinit” on Linux.

To get the tool, please visit http://web.mit.edu/kerberos/dist and follow the download links.
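
Once the MIT client is installed, you can obtain a ticket either through the MIT Kerberos Ticket Manager GUI or from a command prompt. A minimal sketch, assuming your principal is user@EXAMPLE.COM (replace with your own user and realm) and the MIT bin directory is on your PATH:

:: Obtain a ticket from the KDC (you will be prompted for your password)
kinit user@EXAMPLE.COM

:: Verify the ticket cache; make sure this runs the MIT klist, not the built-in Windows klist
klist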

2. In order for the client machine to talk to the remote KDC server that holds the principal database, we need a valid krb5 configuration file on the client side. This file normally sits at /etc/krb5.conf on Linux; on Windows 10, it should be at C:\ProgramData\MIT\Kerberos5\krb5.ini. Please take the krb5.conf file from your cluster and copy it to this location on your Windows machine. Be aware that the file name on Windows should be krb5.ini, not krb5.conf. Also note that C:\ProgramData is a hidden directory, so you will need to unhide it first in File Explorer before you can access the files underneath it.
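
For reference, below is a minimal sketch of what the file typically contains. EXAMPLE.COM and the kdc/admin_server hostnames are placeholders only; use the values from your cluster's krb5.conf and keep any additional settings it contains:

[libdefaults]
  default_realm = EXAMPLE.COM

[realms]
  EXAMPLE.COM = {
    kdc = kdc01.example.com
    admin_server = kdc01.example.com
  }

[domain_realm]
  .example.com = EXAMPLE.COM
  example.com = EXAMPLE.COM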

3. Make sure that you connect to the correct port number. For Hive, it is normally 10000 by default. For Impala, it should be 21050, NOT 21000, which is the port used by impala-shell.

If you have a load balancer set up in front of either Hive or Impala, the port number could also be different; please consult your system admin for the correct port number if this is the case.
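
A quick way to check that the host and port are reachable from your Windows machine is PowerShell's Test-NetConnection; impala-host.example.com below is a placeholder for your actual HiveServer2, Impala daemon or load balancer host:

# Should report TcpTestSucceeded : True if the port is reachable
Test-NetConnection -ComputerName impala-host.example.com -Port 21050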

4. Add a Windows system variable KRB5CCNAME with the value “C:\krb5\krb5cc”, where “krb5cc” is the file name for the Kerberos ticket cache. The name can be anything, but krb5cc or krb5cache is commonly used. To do so, please follow the steps below (a command-line alternative is shown after the list):

a. open “File Explorer”
b. right click on “This PC”
c. select “Properties”
d. next to “Computer name”, click on “Change settings”
e. click on “Advanced” tab and then “Environment Variables”
f. under “System Variables”, click on “New”
g. enter “KRB5CCNAME” in “Variable name” and “C:\krb5\krb5cc” in “Variable value” (without double quotes)
h. click on “OK” and then “OK” again
i. restart Windows
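
If you prefer the command line, the same can be done from an elevated command prompt (Run as Administrator); this is just a sketch of an alternative to the GUI steps above:

:: Create the directory that will hold the ticket cache file
mkdir C:\krb5

:: Set KRB5CCNAME machine-wide (/M requires an elevated prompt), then restart Windows
setx KRB5CCNAME "C:\krb5\krb5cc" /M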

5. If you have SSL enabled for either Hive or Impala, you will also need to tick “Enable SSL” for the ODBC driver. This can be found in the driver's “SSL Options” popup window.

Please note that “SSL Options” is only available in newer versions of the ODBC driver; if you do not see this option, please upgrade the driver to the latest version. At the time of writing, the Hive ODBC Driver is at version 2.5.24.

That should be it. The above are the steps most commonly missed by Windows users when trying to connect to Hive or Impala via ODBC. If you have encountered other problems that need extra steps, please leave a comment below and I will update my post.

Hope the above helps.

Impala query failed with error: “Incompatible Parquet Schema”

Yesterday, I was dealing with an issue where a very simple Impala SELECT query failed with an “Incompatible Parquet schema” error. I confirmed that the following workflow triggers the error:

  1. A Parquet file is created by an external library (outside of Hive/Impala)
  2. The Parquet file is loaded into a Hive/Impala table
  3. Querying the table through Impala fails with the error message below:
    incompatible Parquet schema for column 'db_name.tbl_name.col_name'.
    Column type: DECIMAL(19, 0), Parquet schema:
    optional byte_array col_name [i:2 d:1 r:0]
  4. The same query works fine in Hive

This is because Impala currently does not support all of the DECIMAL representations that Parquet supports. Currently, Parquet allows DECIMAL to be stored using the following physical types:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a warning
  • fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

Please refer to the Parquet Logical Type Definitions page for details.

However, Impala only supports fixed_len_byte_array and none of the others. This has been reported in the upstream JIRA: IMPALA-2494.

The only workaround for now is to generate a Parquet file that uses the supported representation for the DECIMAL column, or simply to create the Parquet file through either Hive or Impala.
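
For example, since Hive can still read the file (step 4 above), one option is to rebuild the data as a Parquet table written by Hive, whose writer should store DECIMAL columns in the fixed_len_byte_array form that Impala supports. A minimal sketch run from Hive, where ext_tbl and fixed_tbl are hypothetical table names:

-- Run from Hive, which can still read the externally generated file
CREATE TABLE fixed_tbl STORED AS PARQUET AS
SELECT * FROM ext_tbl;

-- Then, in Impala, pick up the new table before querying it
INVALIDATE METADATA fixed_tbl;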

Impala Query fails with NoSuchObjectException error

In the last few months, I have seen CDH users frequently hitting a NoSuchObjectException error from Impala queries. This happens when running a query against a table whose partition columns are of INT type, and it fails with the message below:

WARNINGS:
ImpalaRuntimeException: Error making 'alter_partitions' RPC to Hive Metastore:
CAUSED BY: InvalidOperationException: Alter partition operation failed: NoSuchObjectException(message:partition values=[2017, 6, 1, 8])

We confirmed that the table has four partition columns of integer type, and that selecting individual partitions works.

The following scenario will trigger this error:

  • Partition columns with INT data type
  • Partition data was inserted from Hive with zero-prefixed values, using something like the query below:
    INSERT OVERWRITE TABLE test_tbl PARTITION (year = '2017', month = '06'....) .....
    
  • Partition data will then be created under an HDFS location like the one below:
    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=06/day=01/hour=08
    
  • When querying through Impala, since the data type is INT, Impala will convert the values from “06” to 6, “01” to 1, etc., and will be looking for the location:
    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=6/day=1/hour=8
    

    instead of:

    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=06/day=01/hour=08
    

    hence the NoSuchObjectException error is triggered.

To fix the issue, there are two options:

  1. Convert the data type of the partition columns to STRING instead of INT:
    ALTER TABLE test_tbl PARTITION COLUMN (year string);
    ALTER TABLE test_tbl PARTITION COLUMN (month string);
    ALTER TABLE test_tbl PARTITION COLUMN (day string);
    ALTER TABLE test_tbl PARTITION COLUMN (hour string);
    
  2. If the integer type needs to be kept, then we will need to rebuild the table into a new one and store the data in locations without leading zeros. This can be done by running the following queries from Impala:
    CREATE TABLE new_test_tbl LIKE test_tbl;
    
    INSERT OVERWRITE TABLE new_test_tbl PARTITION (year, month, day, hour) SELECT * FROM test_tbl;
    

    The new table will have the leading zeros removed from its partition values, and we can then switch over to using the new table. When writing more data into the new table through Hive, please be sure to remove all leading zeros to prevent the issue from happening again. You can verify the rebuilt partitions as shown below.
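
    A quick sanity check from Impala, using the example table name above, confirms that the rebuilt partition values no longer contain leading zeros:

    -- Partition values should now read month=6, day=1, hour=8 rather than 06/01/08
    SHOW PARTITIONS new_test_tbl;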

The above steps should resolve the issue. Hope they help.