Impala query failed with error “Failed to parse view-definition statement” after upgrade

If you have just upgraded from CDH 5.11 or earlier to CDH 5.12 or later (Impala 2.8 or earlier to Impala 2.9 or later), and you notice that Impala queries against certain VIEWs fail with the error below:

AnalysisException: Failed to load metadata for table: 'db.view_name' CAUSED BY: 
TableLoadingException: Failed to parse view-definition statement of view: db.view_name

Then please read on.

First, confirm whether your VIEW definition references a word that is now a reserved keyword, such as “SORT”, by running “SHOW CREATE TABLE view_name”. If the output looks something like below:

CREATE VIEW default.test_view AS SELECT a, sort FROM default.test

Notice the unquoted word “sort” in the query. If it is there, you are hitting the issue discussed in this post.

This is because CDH 5.12.x / Impala 2.9 introduced IMPALA-4166, which adds support for the “SORT BY” clause and makes “SORT” a keyword in Impala from that version onwards. Because SORT is now a keyword, any unquoted reference to it elsewhere in a query makes the statement invalid, so the view definition can no longer be parsed and the query fails.

You can see the stacktrace below in the Impala Coordinator log:

CAUSED BY: TableLoadingException: Failed to parse view-definition statement of view: db.view_name
@ 0x83efb9 impala::Status::Status()
@ 0xb747c2 impala::JniUtil::GetJniExceptionMsg()
@ 0xa7dfcb impala::Frontend::GetExecRequest()
@ 0xa98fc8 impala::ImpalaServer::ExecuteInternal()
@ 0xaa0278 impala::ImpalaServer::Execute()
@ 0xadf28e impala::ImpalaServer::query()
@ 0xd71cd5 beeswax::BeeswaxServiceProcessor::process_query()
@ 0xd74fd4 beeswax::BeeswaxServiceProcessor::dispatchCall()
@ 0x80ecdc apache::thrift::TDispatchProcessor::process()
@ 0x1b596ab apache::thrift::server::TThreadPoolServer::Task::run()
@ 0x1b40fc9 apache::thrift::concurrency::ThreadManager::Worker::run()
@ 0x9d5c69 impala::ThriftThread::RunRunnable()
@ 0x9d6a42 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xbd6fe2 impala::Thread::SuperviseThread()
@ 0xbd7744 boost::detail::thread_data<>::run()
@ 0xe6417a (unknown)

There are two ways to fix the issue:

1. Quote the column named “sort” with backticks (“`”), by dropping and re-creating the view, like below:

DROP VIEW default.test_view;
Query: DROP VIEW default.test_view
Fetched 0 row(s) in 4.35s

CREATE VIEW default.test_view AS SELECT a, `sort` FROM default.test;
Query: CREATE VIEW default.test_view AS SELECT a, `sort` FROM default.test
Fetched 0 row(s) in 3.87s

SELECT * FROM default.test_view;
Query: SELECT * FROM default.test_view
Fetched 0 row(s) in 3.63s

2. DROP and re-CREATE the VIEW in beeline, without modifying the view creation query, if your view is complex.

This works because Hive adds the backticks (“`”) for you automatically, which Impala does not.
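
For reference, a minimal beeline session for this might look like the sketch below; the HiveServer2 host and port are placeholders for your own environment:

-- connect to HiveServer2 first, for example:
--   beeline -u "jdbc:hive2://hiveserver2-host:10000/default"

DROP VIEW default.test_view;

-- Hive quotes the reserved word automatically when it stores the view definition
CREATE VIEW default.test_view AS SELECT a, sort FROM default.test;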

This should help to resolve your cluster upgrade issue.

If this post helps, please share your experience below in the comments, thanks.

Impala Failed to Read Parquet Decimal Data

If you are using the Hive JDBC/ODBC driver to create Parquet data via Hive, and you see the error below when reading the data from Impala:

File 'hdfs://namespace1/path/to/parquet/file/15110390563421' column 'column1' has a scale that does not match the table metadata scale. 
File metadata scale: 0 Table metadata scale: 2 when we try to execute in Impala.

It is most likely that you hit a known issue HIVE-15519.

Due to this bug, the DECIMAL scale value returned by the driver is incorrect (it returns null), so the data is written to the Parquet file with the wrong scale for those columns. Impala then complains that the column definition in the table metadata does not match the column type stored in the Parquet file, because the scale values differ.
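
To confirm you are hitting this, you can compare the scale declared in the table metadata against the scale written into the Parquet file itself. A rough sketch, where the table and column names are placeholders matching the error message:

-- in impala-shell or beeline: check the scale declared in the table metadata;
-- the affected column should show something like DECIMAL(10,2), i.e. scale 2
DESCRIBE db_name.tbl_name;

-- the Parquet file itself reports scale 0 for the same column; this can be
-- verified with "parquet-tools meta <file>" after copying the file out of HDFS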

Currently there is no workaround available for this issue; a patch is required, or you will have to wait for it to be fixed upstream. At the time of writing, the latest Cloudera CDH release is 5.13.1.

Drop Impala UDF function returned NoSuchObjectException error

Impala UDFs work a bit differently from Hive UDFs, as they are written in different languages (C++ rather than Java). Recently I encountered an issue where dropping an Impala UDF written in C++ failed with the error below:

DROP FUNCTION udf_testing.has_vowels;
Query: DROP FUNCTION udf_testing.has_vowels
ERROR:
ImpalaRuntimeException: Error making 'dropFunction' RPC to Hive Metastore:
CAUSED BY: NoSuchObjectException: Function has_vowels does not exist

The “SHOW CREATE FUNCTION” query worked OK as below:

SHOW CREATE FUNCTION has_vowels;
Query: SHOW CREATE FUNCTION has_vowels
+-----------------------------------------------------------------------------+
| result                                                                      |
+-----------------------------------------------------------------------------+
| CREATE FUNCTION udf_testing.has_vowels(STRING)                              |
|  RETURNS BOOLEAN                                                            |
|  LOCATION 'hdfs://nameservice1/user/hive/impala-udf/libudfsample.so'        |
|  SYMBOL='_Z9HasVowelsPN10impala_udf15FunctionContextERKNS_9StringValE'      |
|                                                                             |
+-----------------------------------------------------------------------------+
Fetched 1 row(s) in 0.03s

After some research, it turned out that when dropping a function in Impala, you also need to specify the function's parameter types. So the query below works:

DROP FUNCTION udf_testing.has_vowels(STRING);
Query: DROP FUNCTION udf_testing.has_vowels(STRING)

This is not immediately obvious, but it is documented on Cloudera's official documentation site: DROP FUNCTION Statement.
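
If you do not remember the parameter types the function was created with, you can list the signatures first and then drop the function using the exact signature reported, for example:

-- list the functions in the database together with their signatures
SHOW FUNCTIONS IN udf_testing;

-- then drop the function using the full signature
DROP FUNCTION udf_testing.has_vowels(STRING);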

Hope the above helps.

Setting up Cloudera ODBC driver on Windows 10

Recently I have seen lots of CDH users having trouble setting up the Hive/Impala ODBC drivers on a Windows 10 machine to connect to a remote Kerberized cluster. The connection keeps failing with Kerberos-related error messages like the ones below:

[Cloudera][Hardy] (34) Error from server: SASL(-1):
generic failure: GSSAPI Error: Unspecified GSS failure.
Minor code may provide more information (Credential cache is empty).

OR

[Cloudera][ImpalaODBC] (100) Error from the Impala Thrift API:
SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.
Minor code may provide more information (No credentials cache found)

To help CDH users get it working without much hassle, I have compiled the list of steps below for reference. I have tested this on a Windows 10 VM.

1. For Kerberos authentication to work, you need a valid Kerberos ticket on your client machine, which is Windows 10. Hence, you will need to download and install the MIT Kerberos client tool so that you can authenticate yourself against the remote cluster, much like running “kinit” on Linux.

To get the tool, please visit http://web.mit.edu/kerberos/dist and follow the download links.
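
Once installed, you can obtain and verify a ticket either from the MIT Kerberos Ticket Manager GUI or from a command prompt, roughly as below. The principal and realm are placeholders for your own environment, and the MIT Kerberos bin directory is assumed to be on your PATH:

:: obtain a Kerberos ticket, similar to kinit on Linux
kinit username@EXAMPLE.COM

:: list the cached tickets to confirm the ticket was obtained
klist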

2. In order for the client machine to talk to the remote KDC server that holds the principal database, we need a valid krb5 configuration file on the client side. This file normally sits under /etc/krb5.conf on Linux; on Windows 10 it should be under C:\ProgramData\MIT\Kerberos5\krb5.ini. Please copy the krb5.conf file from your cluster to this location on your Windows machine, keeping in mind that the file name on Windows should be krb5.ini, not krb5.conf. Also note that C:\ProgramData is a hidden directory, so you will need to unhide it first in File Explorer before you can access the files underneath it.
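
For reference, a minimal krb5.ini might look like the sketch below; the realm and KDC host names are placeholders, and the real values should come from your cluster's /etc/krb5.conf:

[libdefaults]
  default_realm = EXAMPLE.COM

[realms]
  EXAMPLE.COM = {
    kdc = kdc-host.example.com
    admin_server = kdc-host.example.com
  }

[domain_realm]
  .example.com = EXAMPLE.COM
  example.com = EXAMPLE.COM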

3. Make sure that you connect to the correct port number. For Hive it is normally 10000 by default; for Impala it should be 21050, NOT 21000, which is the port used by impala-shell.

If you have a load balancer set up for either Hive or Impala, the port number could be different; please consult your system administrator for the correct port number in that case.

4. Add a Windows system variable KRB5CCNAME with the value “C:\krb5\krb5cc”, where “krb5cc” is the file name for the Kerberos ticket cache. The name can be anything, but krb5cc or krb5cache is commonly used. To do so, follow the steps below (a command-line alternative is shown after these steps):

a. open “File Explorer”
b. right click on “This PC”
c. select “Properties”
d. next to “Computer name”, click on “Change settings”
e. click on “Advanced” tab and then “Environment Variables”
f. under “System Variables”, click on “New”
g. enter “KRB5CCNAME” in “Variable name” and “C:\krb5\krb5cc” in “Variable value” (without double quotes)
h. click on “OK” and then “OK” again
i. restart Windows
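
Alternatively, the same system variable can be set from an administrator command prompt, roughly as below (the cache path matches the value used above):

:: create the folder that will hold the ticket cache, if it does not exist yet
mkdir C:\krb5

:: set KRB5CCNAME as a system-wide (machine) environment variable
setx KRB5CCNAME "C:\krb5\krb5cc" /M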

5. If you have SSL enabled for either Hive or Impala, you will also need to tick “Enable SSL” for the ODBC driver. This option can be found in the “SSL Options” popup window of the driver configuration dialog.

Please note that “SSL Options” is only available in newer versions of the ODBC driver; if you do not see this option, please upgrade the driver to the latest version. At the time of writing, the Hive ODBC Driver is at version 2.5.24.

That should be it. The above are the steps Windows users most commonly miss when trying to connect to Hive or Impala via ODBC. If you have encountered other problems that need extra steps, please leave a comment below and I will update my post.

Hope the above helps.

Impala query failed with error: “Incompatible Parquet Schema”

Yesterday I was dealing with an issue where a very simple Impala SELECT query failed with an “Incompatible Parquet schema” error. I confirmed the following workflow triggers the error:

  1. A Parquet file is created by an external library
  2. The Parquet file is loaded into a Hive/Impala table
  3. Querying the table through Impala fails with the error message below

    incompatible Parquet schema for column 'db_name.tbl_name.col_name'.
    Column type: DECIMAL(19, 0), Parquet schema:
    optional byte_array col_name [i:2 d:1 r:0]
    
  4. The same query works well in Hive

This is because Impala currently does not support all of the DECIMAL representations that Parquet allows. Parquet allows DECIMAL values to be stored in the following physical types:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a warning
  • fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

Please refer to the Parquet Logical Type Definitions page for details.

However, Impala only supports fixed_len_byte_array and none of the others. This has been reported in the upstream JIRA IMPALA-2494.
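
To illustrate, the error above shows the DECIMAL column stored as a plain byte_array (binary), whereas a file that Impala can read would declare the column roughly like the sketch below. The column name and byte length are illustrative only:

message example_schema {
  // DECIMAL(19,0) stored as fixed_len_byte_array, which Impala can read;
  // 9 bytes is enough to hold 19 base-10 digits of unscaled value
  optional fixed_len_byte_array(9) col_name (DECIMAL(19,0));
}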

The only workaround for now is to create the Parquet file using a supported representation for the DECIMAL columns, or simply to re-create the Parquet file through either Hive or Impala.
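
For example, since the same query works in Hive, one way to apply this workaround is to let Hive re-write the data into a new Parquet table and then query that copy from Impala. A rough sketch, where the table names are placeholders:

-- run in Hive (beeline), since Hive can read the original file:
CREATE TABLE db_name.tbl_name_fixed STORED AS PARQUET AS
SELECT * FROM db_name.tbl_name;

-- then in impala-shell, pick up the new table and query it:
INVALIDATE METADATA db_name.tbl_name_fixed;
SELECT col_name FROM db_name.tbl_name_fixed LIMIT 10;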