Impala Query Failed with ERROR “AnalysisException: ORDER BY expression not produced by aggregation output”

Recently, I discovered a bug in Impala where using an expression in the ORDER BY clause can cause the query to fail with the below error message:

ERROR: AnalysisException: ORDER BY expression not produced by aggregation output (missing from GROUP BY clause?): (CASE WHEN TRUE THEN 1 ELSE a END)

The customer used a very complicated query, and I managed to simplify it down to something like the below:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a int);

SELECT (
    CASE
       WHEN (1 = 1)
       THEN 1
       ELSE a
    END) AS b
FROM test
GROUP BY 1
ORDER BY (
    CASE
       WHEN (1 = 1)
       THEN 1
       ELSE a
    END);

This can be reproduced from CDH 5.13.x onwards. Since I could also reproduce it in the latest CDH 5.15.x at the time of writing, I went ahead and created a bug report in the upstream JIRA: IMPALA-7083.

As mentioned in the JIRA, the workaround is to disable ENABLE_EXPR_REWRITES via:

SET ENABLE_EXPR_REWRITES=false;

This option is on by default in recent releases.
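To confirm the option has taken effect in your current session, you can list the query options from impala-shell (a quick sanity check; the exact output format varies by version):

SET;   -- in impala-shell, lists all query options and their current values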

Another workaround, which is a better approach in my opinion, is to replace the expression in the ORDER BY clause with its ordinal position:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a int);

SELECT (
    CASE
       WHEN (1 = 1)
       THEN 1
       ELSE a
    END) AS b
FROM test
GROUP BY 1
ORDER BY 1;

This also makes the query simpler and easier to read.

However, lots of users run queries through third-party software like SAS and have no control over the query generation; in that case, setting ENABLE_EXPR_REWRITES to false is the way to go.
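If you manage Impala through Cloudera Manager, one way to apply the workaround to every session, rather than per query, is to pass it as a default query option to the Impala daemons. A sketch, assuming the standard impalad startup flag and the Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve) in Cloudera Manager:

# Add to the Impala Daemon command line safety valve, then restart Impala:
-default_query_options=ENABLE_EXPR_REWRITES=false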

Hope the above helps.

How to Control Impala Daemon’s Memory Limit

This article explains the Impala daemon's two processes and how to control the maximum memory each process can use.

The Impala Daemon has two different processes running. One is written in C++ and used by the backend, mainly for query processing. The other is written in Java and used by the frontend, for query compilation, storing metadata information, etc.; it is embedded into the backend's C++ process, hence the two share the same process ID. As a result, the way to control how much memory each process can take is quite different between the two.
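You can see this embedding for yourself on an Impala daemon host: the operating system shows a single process, yet standard JVM tooling works against that same PID. A small illustration, using the same pgrep pattern as the jcmd example later in this post:

# One OS process covers both backend and frontend:
pgrep -f IMPALAD
# e.g. 12821
# The embedded JVM answers on the same PID:
sudo -u impala jcmd $(pgrep -f IMPALAD) VM.version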

Memory Limit for C++ Process:

To control the memory limit for the C++ backend process, so that each Impala daemon will not overcommit itself when running queries, Cloudera Manager provides a native configuration option. Simply go to Cloudera Manager Home Page > Impala > Configuration > Impala Daemon Memory Limit.

Just update the value, save, and then restart Impala. To confirm the change takes effect, you can navigate to the Impala daemon's web UI at http://<impalad-host>:25000/varz and search for “mem_limit”.
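You can also check from the command line; a minimal sketch, assuming the same <impalad-host> placeholder and the default web UI port of 25000:

curl -s http://<impalad-host>:25000/varz | grep mem_limit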

Memory Limit for Java Process:

By default, Impala will use a quarter of the host's physical memory, or 32GB, whichever is smaller, for its frontend Java process, which is used mainly for query compilation and storing metadata information.

Normally you do not need to change this. However, should you think that it uses too much memory, or not enough, you can change it using the following steps:

1. Go to Cloudera Manager Home Page > Impala > Configuration > Impala Daemon Environment Advanced Configuration Snippet (Safety Valve)
2. Enter the below into the text box (a concrete example follows after these steps):

JAVA_TOOL_OPTIONS=-Xmx?g

Where “?” is the number you choose for the amount of memory, in GB, for Impala's frontend.

3. Save and then restart Impala
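For example, to cap the frontend JVM heap at 4GB (4 is purely an illustrative value; size it according to your metadata volume):

JAVA_TOOL_OPTIONS=-Xmx4g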

To confirm that the change takes effect, run the below command on the Impala daemon's host:

sudo -u impala jcmd $(pgrep -f IMPALAD) VM.flags

You might need to add the path to Java's bin directory if the “jcmd” command returns a “command not found” error.

Sample output looks like below:

12821:
-XX:CICompilerCount=2 -XX:InitialHeapSize=62914560 -XX:MaxHeapSize=994050048 \
-XX:MaxNewSize=331350016 -XX:MinHeapDeltaBytes=524288 -XX:NewSize=20971520 \
-XX:OldSize=41943040 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC

You can compare the value of -XX:MaxHeapSize with the value you set in JAVA_TOOL_OPTIONS to make sure they match.
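Note that -XX:MaxHeapSize is reported in bytes, so a quick conversion helps when comparing it against the -Xmx value you set; using the sample output above:

echo $(( 994050048 / 1024 / 1024 ))   # 948, i.e. a max heap of roughly 948MB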

Impala query failed with error “Specified block size is less than configured minimum value”

Currently, Impala supports the Parquet file format pretty well. For those of you not familiar with it, Parquet is a free and open-source column-oriented data store in the Apache Hadoop ecosystem, and it provides easy and quick access to data with a large number of columns. You can refer to the official Apache Parquet website for more details.

However, even though Parquet is good at storing data with a large number of columns and can retrieve column data quickly, there is still a limit on how many columns you can store before a processing engine like Impala stops working properly.

Recently, I discovered a bug in Impala where, if a table has too many columns (more than 10,000), queries against it will fail with the below error:

hdfsOpenFile(hdfs://nameservice1/user/hive/warehouse/default.db/table/_impala_insert_staging/f44e0332a3ec1af9_55c692eb00000000/.dh4e0332q3ac1af9-55c692wb00000003_1471427586_dir/dh4e0332q3ac1af9-55c692wb00000003_1471427586_data.0.parq): 
FileSystem#create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
RemoteException: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): -1130396776 < 1048576
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2705)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2658)

The reason is that when a Parquet table has 10K+ columns, Impala tries to estimate the memory required to process the data, and the estimate overflows the 32-bit integer variable used in Impala's code, producing the negative value shown in the error above. This has been reported in the upstream JIRA: IMPALA-7044.
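You can see the wraparound in the error message itself: adding 2^32 to the negative block size recovers a value consistent with a single int32 overflow, well above the maximum a signed 32-bit integer can hold (2147483647):

echo $(( -1130396776 + 4294967296 ))   # 3164570520, the intended block size (~2.9GB)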

There is no workaround at this stage other than reducing the number of columns in the Parquet table.

Currently, the maximum number of columns Impala can handle is around 8K-10K, depending on the column types, so the table has to be re-designed to use fewer columns, for example as sketched below.
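One common way to do that is vertical partitioning: split the wide table into two (or more) narrower tables that share a row key, and join them back only when columns from both halves are needed. A hypothetical sketch (the table and column names are made up):

CREATE TABLE metrics_part1 (id BIGINT, c1 INT /* ... first ~6000 columns ... */) STORED AS PARQUET;
CREATE TABLE metrics_part2 (id BIGINT, c6001 INT /* ... remaining columns ... */) STORED AS PARQUET;

-- Join the two halves on the shared key when a query needs columns from both:
SELECT p1.c1, p2.c6001
FROM metrics_part1 p1
JOIN metrics_part2 p2 ON p1.id = p2.id;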

Hope the above information is helpful.

Unable to Insert data into VARCHAR data type in Impala

In this blog post, I will talk about an issue where an Impala user is not able to directly insert a STRING value into a table that has a VARCHAR column type. To reproduce, see the below test case:

CREATE TABLE test (a varchar(20));
INSERT INTO test SELECT 'a';
ERROR: AnalysisException: Possible loss of precision for target table 'default.test'.
Expression ''a'' (type: STRING) would need to be cast to VARCHAR(20) for column 'a'

I would like to point out that this is currently working as expected: Impala was designed to enforce data integrity by stopping users from converting the STRING data type into VARCHAR behind the scenes, where they would be unaware of any data loss.

This has been reported in the upstream JIRA: IMPALA-1333, and the plan is to allow such implicit casting between STRING and VARCHAR so that third-party software can handle this case more easily. It will probably be implemented as a flag that can be turned ON and OFF at the session level when needed, rather than blindly allowing implicit conversion, as that carries the risk of unexpected data loss or corruption.

For now, if you know that your STRING data will fit into the VARCHAR's length, then, as mentioned in the JIRA, you can run the below query to overcome the issue:

INSERT INTO test SELECT CAST('a' AS varchar(20));
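The same pattern works when inserting from an existing table; a sketch assuming a hypothetical source_table with a STRING column s whose values fit within 20 characters (note that the cast truncates longer values silently):

INSERT INTO test SELECT CAST(s AS VARCHAR(20)) FROM source_table;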

Impala query failed with error “Failed to parse view-definition statement” after upgrade

If you have just upgraded CDH or Impala, from CDH 5.11 or earlier to CDH 5.12 or later, or from Impala 2.8 or earlier to Impala 2.9 or later, and you notice that your Impala queries against certain VIEWs fail with the below error:

AnalysisException: Failed to load metadata for table: 'db.view_name' CAUSED BY: 
TableLoadingException: Failed to parse view-definition statement of view: db.view_name

Then please read on.

Firstly, confirm whether your VIEW references a keyword like “SORT”, by running “SHOW CREATE TABLE view_name”. If the output looks something like below:

CREATE VIEW default.test_view AS SELECT a, sort FROM default.test

Notice the word “sort” in the query; if it is there, you are hitting the issue I am talking about in this post.

This is because CDH 5.12.x / Impala 2.9 introduced IMPALA-4166, which adds support for the “SORT BY” operation in Impala and makes “SORT” a keyword from this version onwards. Because SORT is now a keyword, any unquoted reference to it elsewhere in a query makes the query invalid and causes it to fail.
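For reference, below is the kind of statement the new keyword enables; a sketch of the CREATE TABLE extension added by IMPALA-4166 (see the Impala documentation for the full grammar):

-- SORT BY tells Impala to sort rows by the listed columns when writing data:
CREATE TABLE sorted_example (id INT, name STRING)
SORT BY (id)
STORED AS PARQUET;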

You can see the below stacktrace in the Impala coordinator log:

CAUSED BY: TableLoadingException: Failed to parse view-definition statement of view: db.view_name
@ 0x83efb9 impala::Status::Status()
@ 0xb747c2 impala::JniUtil::GetJniExceptionMsg()
@ 0xa7dfcb impala::Frontend::GetExecRequest()
@ 0xa98fc8 impala::ImpalaServer::ExecuteInternal()
@ 0xaa0278 impala::ImpalaServer::Execute()
@ 0xadf28e impala::ImpalaServer::query()
@ 0xd71cd5 beeswax::BeeswaxServiceProcessor::process_query()
@ 0xd74fd4 beeswax::BeeswaxServiceProcessor::dispatchCall()
@ 0x80ecdc apache::thrift::TDispatchProcessor::process()
@ 0x1b596ab apache::thrift::server::TThreadPoolServer::Task::run()
@ 0x1b40fc9 apache::thrift::concurrency::ThreadManager::Worker::run()
@ 0x9d5c69 impala::ThriftThread::RunRunnable()
@ 0x9d6a42 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xbd6fe2 impala::Thread::SuperviseThread()
@ 0xbd7744 boost::detail::thread_data<>::run()
@ 0xe6417a (unknown)

To fix the issue, there are two ways:

1. Quote the “sort” column with backticks (“`”), by dropping and re-creating the view, like below:

DROP VIEW default.test_view;
Query: DROP VIEW default.test_view
Fetched 0 row(s) in 4.35s

CREATE VIEW default.test_view AS SELECT a, `sort` FROM default.test;
Query: CREATE VIEW default.test_view AS SELECT a, `sort` FROM default.test
Fetched 0 row(s) in 3.87s

SELECT * FROM default.test_view;
Query: SELECT * FROM default.test_view
Fetched 0 row(s) in 3.63s

2. DROP and re-CREATE the VIEW in beeline, without modifying the VIEW creation query; this is handy if your view is complex (a sketch follows below).

This works because Hive adds the backticks for you automatically, which Impala does not.
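A minimal sketch of the second option, per the workaround above (the JDBC URL is a placeholder for your HiveServer2 host):

beeline -u 'jdbc:hive2://<hs2-host>:10000/default'

-- then, inside beeline, re-create the view with the original statement;
-- Hive stores the view definition with backticks added around "sort":
DROP VIEW default.test_view;
CREATE VIEW default.test_view AS SELECT a, sort FROM default.test;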

This should help to resolve your cluster upgrade issue.

If this post helps, please share your experience below in the comments, thanks.