Impala Query Failed with ERROR “AnalysisException: ORDER BY expression not produced by aggregation output”

Recently, I discovered a bug in Impala where, if you use an expression in the ORDER BY clause, the query can fail with the error message below:

ERROR: AnalysisException: ORDER BY expression not produced by aggregation output (missing from GROUP BY clause?): (CASE WHEN TRUE THEN 1 ELSE a END)

The customer's query was very complicated, but I managed to simplify it to something like the following:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a int);

SELECT (
    CASE
       WHEN (1 = 1)
       THEN 1
       ELSE a
    END) AS b
FROM test
GROUP BY 1
ORDER BY (
    CASE
       WHEN (1 = 1)
       THEN 1
       ELSE a
    END);

This can be reproduced from CDH 5.13.x onward. Since I could also reproduce it in the latest CDH 5.15.x at the time of writing, I went ahead and created a bug report in the upstream JIRA: IMPALA-7083.

As mentioned in the JIRA, the workaround is to disable ENABLE_EXPR_REWRITES via:

SET ENABLE_EXPR_REWRITES=false;

This option is on by default in recent releases.

Another workaround, which in my opinion is the better approach, is to replace the ORDER BY expression with its ordinal position:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a int);

SELECT (
    CASE
       WHEN (1 = 1)
       THEN 1
       ELSE a
    END) AS b
FROM test
GROUP BY 1
ORDER BY 1;

This also makes the query simpler and easier to read.
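If the SQL is generated by a tool you can post-process, this rewrite can even be automated. Below is a minimal Python sketch (a hypothetical helper, not part of Impala) that replaces any ORDER BY expression textually matching a SELECT-list expression with its 1-based ordinal:

```python
import re

def rewrite_order_by_with_ordinals(select_exprs, order_by_exprs):
    """Replace each ORDER BY expression that matches a SELECT-list
    expression with its 1-based ordinal position.

    Comparison ignores whitespace and case, which is a simplification;
    real SQL expression normalization is more involved.
    """
    def normalize(expr):
        return re.sub(r"\s+", " ", expr).strip().lower()

    positions = {normalize(e): i + 1 for i, e in enumerate(select_exprs)}
    return [
        str(positions[normalize(e)]) if normalize(e) in positions else e
        for e in order_by_exprs
    ]

select_list = ["(CASE WHEN (1 = 1) THEN 1 ELSE a END)"]
order_by = ["(CASE WHEN (1 = 1) THEN 1 ELSE a END)"]
print(rewrite_order_by_with_ordinals(select_list, order_by))  # ['1']
```

Expressions that do not appear in the SELECT list are left untouched, since no ordinal exists for them.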

However, lots of users run queries through third-party software like SAS and have no control over the query generation; in such cases, setting ENABLE_EXPR_REWRITES to false is the way to go.

Hope the above helps.

How to Control Impala Daemon’s Memory Limit

This article explains Impala daemon’s processes and how to control the maximum memory each process can use.

An Impala daemon runs two different processes. One is written in C++ and used by the backend, mainly for query processing. The other is written in Java and used by the frontend, for query compilation, storing metadata information, etc.; it is embedded into the backend's C++ process, so the two share the same process ID. As a result, the way to control how much memory each process can use is quite different between the two.

Memory Limit for C++ Process:

To control the memory limit for the C++ backend process, so that each Impala daemon does not overcommit memory when running queries, Cloudera Manager provides a native configuration option. Simply go to Cloudera Manager Home Page > Impala > Configuration > Impala Daemon Memory Limit, as shown in the screenshot below:

Just update the value, save, and then restart Impala. To confirm the change takes effect, navigate to the Impala daemon's web UI at http://<impalad-host>:25000/varz and search for "mem_limit".
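The same check can be scripted against the /varz page. The snippet below is a sketch that extracts a startup flag from /varz-style text; the sample content is an assumption for illustration, so in practice you would fetch the real page from your Impala daemon's host:

```python
import re

# Sample lines as they might appear on an Impala daemon's /varz page
# (assumed format for illustration; fetch the real page in practice).
varz_text = """
--mem_limit=17179869184
--be_port=22000
"""

def get_flag(text, name):
    """Extract a startup flag value from /varz-style output."""
    match = re.search(r"--%s=(\S+)" % re.escape(name), text)
    return match.group(1) if match else None

mem_limit = int(get_flag(varz_text, "mem_limit"))
print(mem_limit / 1024 ** 3)  # bytes -> GB, here 16.0
```

Comparing this value against what you set in Cloudera Manager confirms the restart picked up the change.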

Memory Limit for Java Process:

By default, Impala uses a quarter of the host's physical memory, or 32GB, whichever is smaller, for its frontend Java process, which is mainly used for query compilation and storing metadata information.

Normally you do not need to change this. However, if you think it is using too much memory, or not enough, you can change it using the following steps:

1. Go to Cloudera Manager Home Page > Impala > Configuration > Impala Daemon Environment Advanced Configuration Snippet (Safety Valve)
2. Enter the following into the text box:

JAVA_TOOL_OPTIONS=-Xmx?g

Where "?" is the number of GB of memory you want to allocate to Impala's Java process.

3. Save the change and then restart Impala

To confirm that the change takes effect, run the command below on the Impala daemon's host:

sudo -u impala jcmd $(pgrep -f IMPALAD) VM.flags

You might need to add the path to Java's bin directory if the "jcmd" command returns a "command not found" error.

Sample output looks like the following:

12821:
-XX:CICompilerCount=2 -XX:InitialHeapSize=62914560 -XX:MaxHeapSize=994050048 \
-XX:MaxNewSize=331350016 -XX:MinHeapDeltaBytes=524288 -XX:NewSize=20971520 \
-XX:OldSize=41943040 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC

You can compare the value of -XX:MaxHeapSize with the value you set in JAVA_TOOL_OPTIONS to make sure they match.
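Since -XX:MaxHeapSize is reported in bytes while -Xmx is usually set in GB, a quick conversion helps with the comparison. A small sketch, using the sample jcmd output above:

```python
import re

# Sample jcmd VM.flags output from above; MaxHeapSize is in bytes.
jcmd_output = (
    "-XX:CICompilerCount=2 -XX:InitialHeapSize=62914560 "
    "-XX:MaxHeapSize=994050048 -XX:MaxNewSize=331350016"
)

def max_heap_gb(vm_flags):
    """Return -XX:MaxHeapSize converted from bytes to GB."""
    match = re.search(r"-XX:MaxHeapSize=(\d+)", vm_flags)
    return int(match.group(1)) / 1024 ** 3

print(round(max_heap_gb(jcmd_output), 2))  # 0.93
```

In this sample the heap is about 0.93GB, i.e. no explicit -Xmx was applied; after setting JAVA_TOOL_OPTIONS=-Xmx?g you would expect this number to match your chosen value.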

Oozie Spark Actions Fail with Error "Spark config without '=': --conf"

Currently, Oozie provides an easy interface for Spark1 jobs via the Spark1 action, so that users do not have to embed spark-submit into a shell action. However, I recently discovered a bug in Oozie's parsing of Spark configurations, which caused it to generate an incorrect spark-submit command when submitting Spark jobs. Checking the Oozie launcher's stderr.log, I found the error below:

Error: Spark config without '=': --conf
Run with --help for usage help or --verbose for debug output
Intercepting System.exit(1)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]

Also, checking stdout.log, I could see the incorrect Spark command below:

  --conf
  spark.yarn.security.tokens.hive.enabled=false
  --conf
  --conf
  spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*:$PWD/*
  --conf
  spark.driver.extraClassPath=$PWD/*

You can see that Oozie generated a double "--conf" for the Spark command. This explains the error we saw earlier: "Spark config without '=': --conf".
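The failure mode is easy to reproduce with a few lines of Python that mimic, in simplified form, spark-submit's validation of --conf arguments (this is a sketch of the check, not the real SparkSubmit code):

```python
def check_spark_confs(args):
    """Mimic spark-submit's argument check: every --conf must be
    followed by a key=value token. Returns the first offending token,
    or None if all confs are well formed."""
    i = 0
    while i < len(args):
        if args[i] == "--conf":
            if i + 1 >= len(args) or "=" not in args[i + 1]:
                return args[i + 1] if i + 1 < len(args) else None
            i += 2  # consume the flag and its key=value
        else:
            i += 1
    return None

# The argument list Oozie generated, with the duplicated --conf:
bad_args = [
    "--conf", "spark.yarn.security.tokens.hive.enabled=false",
    "--conf", "--conf",
    "spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*",
]
print(check_spark_confs(bad_args))  # --conf
```

The second "--conf" is consumed as the value of the first, contains no "=", and triggers exactly the "Spark config without '='" error.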

This is caused by a known issue reported upstream: OOZIE-2923.

This is a bug on the Oozie side: it wrongly parses the following configs:

--conf spark.executor.extraClassPath=...
--conf spark.driver.extraClassPath=...

The workaround is to remove the "--conf" in front of the first instance of spark.executor.extraClassPath, so that it will be added by Oozie. For example, if you have the following:

<spark-opts>
--files /etc/hive/conf/hive-site.xml 
--driver-memory 4G 
--executor-memory 2G 
... 
--conf spark.yarn.security.tokens.hive.enabled=false 
--conf spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*
</spark-opts>

Simply remove the first "--conf" before spark.executor.extraClassPath, so it becomes:

<spark-opts>
--files /etc/hive/conf/hive-site.xml 
--driver-memory 4G 
--executor-memory 2G 
... 
--conf spark.yarn.security.tokens.hive.enabled=false  
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*
</spark-opts>

This will allow you to avoid the issue.

However, the downside is that if you later upgrade to a version of CDH that contains the fix for this issue, you will need to re-add the "--conf".

OOZIE-2923 affects CDH 5.10.x, CDH 5.11.0, and CDH 5.11.1. CDH 5.11.2, CDH 5.12.x, and later releases contain the fix.