Impala Query Profile Explained – Part 1

If you work with Impala but have no idea how to interpret Impala query PROFILEs, it is very hard to understand what’s going on and how to make your queries run at their full potential. I think this is the case for lots of Impala users, so I would like to write a simple blog post to share my experience, in the hope that it can help anyone who would like to learn more.

This is Part 1 of the series, so I will stick to the basics and just cover the main things to look out for when examining a PROFILE.

First things first: how do you collect an Impala query PROFILE? Well, there are a couple of ways. The simplest is to run “PROFILE” right after your query in impala-shell, like below:

[impala-daemon-host.com:21000] > SELECT COUNT(*) FROM sample_07;
Query: SELECT COUNT(*) FROM sample_07
Query submitted at: 2018-09-14 15:57:35 (Coordinator: https://impala-daemon-host.com:25000)
Query progress can be monitored at: https://impala-daemon-host.com:25000/query_plan?query_id=36433472787e1cab:29c30e7800000000
+----------+
| count(*) |
+----------+
| 823      |
+----------+
Fetched 1 row(s) in 6.68s

[impala-daemon-host.com:21000] > PROFILE; <-- Simply run "PROFILE" as a query
Query Runtime Profile:
Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
.....

You can also collect the profile from the Cloudera Manager web UI by navigating to CM > Impala > Queries, locating the query you just ran and clicking “Query Details”.

Then scroll down a bit to locate the “Download Profile” button:

Last, but not least, you can navigate to the Impala Daemon’s web UI and download the profile from there. Go to the Impala Daemon that acted as the coordinator for the query:

https://{impala-daemon-url}:25000/queries

The list of queries will be displayed:

Click through the “Details” link and then the “Profile” tab:
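If you prefer the command line, you can fetch the same pages with curl. Below is just a convenience sketch: the query ID is the one printed by impala-shell earlier, and the exact debug endpoints can vary between Impala versions, so check the links your coordinator’s /queries page actually exposes:

# List recent queries known to the coordinator
curl -sk "https://impala-daemon-host.com:25000/queries"

# Fetch the profile page for one query (query_id as printed by impala-shell)
curl -sk "https://impala-daemon-host.com:25000/query_profile?query_id=36433472787e1cab:29c30e7800000000"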

All right, now that we have the PROFILE, let’s dive into the details.

Below is the snippet of the query PROFILE we will go through today: the Summary section at the top of the PROFILE:

Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)
User: hive@VPC.CLOUDERA.COM
Connected User: hive@VPC.CLOUDERA.COM
Delegated User:
Network Address: ::ffff:172.26.26.117:58834
Default Db: default
Sql Statement: SELECT COUNT(*) FROM sample_07
Coordinator: impala-daemon-url.com:22000
Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0
Plan:
----------------

Let’s break it into sections and walk through them one by one. There are a few pieces of information here that are used more often than others:

a. Query ID:

Query (id=36433472787e1cab:29c30e7800000000):

This is useful for identifying relevant query information in the Impala Daemon logs. Simply search for this query ID and you can find out what the query was doing behind the scenes; this is especially useful for finding related error messages.
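For example, a quick search on the coordinator host could look like the below; the log path is the typical CDH default, so adjust it to your installation:

# Search the coordinator's impalad log for everything related to this query
grep '36433472787e1cab:29c30e7800000000' /var/log/impalad/impalad.INFO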

b. Session Type:

Session Type: BEESWAX

This tells us where the connection came from. BEESWAX means the query was run from the impala-shell client. If you run it from Hue, the type will be “HIVESERVER2”, since Hue connects via the HiveServer2 Thrift interface.

c. Start and End time:

Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000

This is useful for telling how long the query ran. Please keep in mind that this time includes session idle time. If you run a simple query that returns in a few seconds from Hue, the time here might show longer than normal, because Hue keeps the query handle open until the session is closed or the user runs another query. If run through impala-shell, however, the start and end time should match the actual run time, since impala-shell closes the query handle straight away after the query finishes. In this example, End Time minus Start Time comes to about 6.68 seconds, matching the “Fetched 1 row(s) in 6.68s” reported by impala-shell above.

d. Query status:

Query Status: OK

This tells you whether the query finished successfully or not. OK means it did. If there were errors, they will normally show up here: cancelled by user, session timeout, exceptions, etc.

e. Impala version:

Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)

This confirms the version of Impala that ran the query. If it does not match your installation, something is not set up properly.

f. User information:

User: hive@XXX.XXXXXX.COM
Connected User: hive@XXX.XXXXXX.COM
Delegated User:

You can find out who ran the query from this session, so you know who to blame :).

g. DB selected on connection:

Default Db: default

Not used a lot, but good to know.

h. The query that produced this PROFILE:

Sql Statement: SELECT COUNT(*) FROM sample_07

You will need this information if you are helping others troubleshoot, as you need to know how the query was constructed and what tables are involved. In many cases a simple rewrite of the query will resolve issues or boost query performance.

i. The Impala daemon that ran the query, which we call the Coordinator:

Coordinator: impala-daemon-host.com:22000

This is an important piece of information, as it determines which host to collect the Impala daemon log from, should you wish to check the INFO, WARNING and ERROR level logs.
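On that host, the logs usually live under the standard CDH location (adjust the paths if your installation differs):

# INFO, WARNING and ERROR logs of the coordinator daemon
ls /var/log/impalad/impalad.INFO /var/log/impalad/impalad.WARNING /var/log/impalad/impalad.ERROR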

j. Query Options used for this query:

Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0

This section tells you which QUERY OPTIONS were applied to the current query, if any. It is useful for spotting user-level or pool-level overrides that affect the query. One example: the Impala Daemon’s memory limit is set at, say, 120GB, but a small query still fails with an OutOfMemory error. This is the place to check whether a user accidentally set MEM_LIMIT in their session to a lower value that could result in the OutOfMemory error.
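To inspect or override these options in your own impala-shell session, something like the below works (the MEM_LIMIT value is purely illustrative):

-- List all query options and their current values for this session
SET;

-- A session-level override like this shows up in the PROFILE's
-- "Query Options" line and can cause OutOfMemory errors if set too low
SET MEM_LIMIT=2g;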

This concludes Part 1 of the series, which explained the Summary section of the PROFILE and the basic information it provides. In the next part of the series, I will explain the Query Plan and the Execution Summary sections of the PROFILE in detail.

If you have any comments or suggestions, please let me know in the comments section below. Thanks!

Impala Query Failed with ERROR “AnalysisException: ORDER BY expression not produced by aggregation output”

Recently, I discovered a bug in Impala: when you use an expression in the ORDER BY clause, the query fails with the error message below:

ERROR: AnalysisException: ORDER BY expression not produced by aggregation output (missing from GROUP BY clause?): (CASE WHEN TRUE THEN 1 ELSE a END)

The customer used a very complicated query, which I managed to simplify to something like the below:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a INT);

SELECT (
    CASE
        WHEN (1 = 1)
        THEN 1
        ELSE a
    END) AS b
FROM test
GROUP BY 1
ORDER BY (
    CASE
        WHEN (1 = 1)
        THEN 1
        ELSE a
    END);

This can be reproduced from CDH 5.13.x onward. Since I could also reproduce it in the latest CDH 5.15.x at the time of writing, I went ahead and created a bug report in the upstream JIRA: IMPALA-7083.

As mentioned in the JIRA, the workaround is to disable ENABLE_EXPR_REWRITES via:

SET ENABLE_EXPR_REWRITES=false;

This option is on by default in recent releases.

Another workaround, which in my opinion is the better approach, is to replace the expression in the ORDER BY clause with its position number:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a INT);

SELECT (
    CASE
        WHEN (1 = 1)
        THEN 1
        ELSE a
    END) AS b
FROM test
GROUP BY 1
ORDER BY 1;

This also makes the query simpler and easier to read.

However, many users run queries through third-party software like SAS and have no control over how the queries are generated; in that case, setting ENABLE_EXPR_REWRITES to false is the way to go.

Hope the above helps.

How to Control Impala Daemon’s Memory Limit

This article explains Impala daemon’s processes and how to control the maximum memory each process can use.

The Impala Daemon runs two different components: one written in C++, used by the backend, mainly for query processing; the other written in Java, used by the frontend for query compilation, storing metadata information and so on. The Java component is embedded into the backend’s C++ process, so they share the same process ID. As a result, the way to control how much memory each one can use is quite different.
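You can see this for yourself on any Impala Daemon host; a tiny sketch using the same pgrep pattern as the jcmd command later in this post:

# Only one impalad PID shows up: the frontend JVM is embedded inside the
# C++ backend process rather than running as a separate process
pgrep -f IMPALAD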

Memory Limit for C++ Process:

To control the memory limit for the C++ backend process, so that each Impala Daemon does not overcommit itself when running queries, Cloudera Manager provides a native configuration. Simply go to Cloudera Manager Home Page > Impala > Configuration > Impala Daemon Memory Limit; see the screenshot below:

Just update the value, save, and then restart Impala. To confirm the change takes effect, you can navigate to the Impala Daemon’s web UI at http://{impala-daemon-url}:25000/varz and search for “mem_limit”:
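The same check can be done from the command line. A quick sketch, assuming the debug web UI is reachable on the default port 25000 (replace the hostname with your coordinator’s):

# Print the effective mem_limit flag from the daemon's /varz page
curl -s http://impala-daemon-host.com:25000/varz | grep mem_limit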

Memory Limit for Java Process:

By default, Impala uses a quarter of the host’s physical memory, or 32GB, whichever is smaller, for its frontend Java process, which is mainly used for query compilation and storing metadata information.

Normally you do not need to change this. However, should you find that it uses too much memory, or not enough, you can change it using the following steps:

1. Go to Cloudera Manager Home Page > Impala > Configuration > Impala Daemon Environment Advanced Configuration Snippet (Safety Valve)
2. Enter the below into the text box:

JAVA_TOOL_OPTIONS=-Xmx?g

where “?” is the amount of memory, in GB, that you choose to give to Impala’s frontend (see the example after these steps).

3. Save and then restart Impala.
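For instance, to cap the frontend JVM at 8GB (8 here is purely an illustrative figure; pick a value that suits your metadata size):

JAVA_TOOL_OPTIONS=-Xmx8g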

To confirm that the change takes effect, run the below command on the Impala Daemon’s host:

sudo -u impala jcmd $(pgrep -f IMPALAD) VM.flags

You might need to prepend the path to Java’s bin directory if the “jcmd” command returns a “command not found” error, as in the sketch below.
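A sketch, assuming a typical JDK install location (adjust the path to match your environment):

# Call jcmd via the JDK's bin directory when it is not on the PATH
sudo -u impala /usr/java/default/bin/jcmd $(pgrep -f IMPALAD) VM.flags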

Sample output looks like below:

12821:
-XX:CICompilerCount=2 -XX:InitialHeapSize=62914560 -XX:MaxHeapSize=994050048 \
-XX:MaxNewSize=331350016 -XX:MinHeapDeltaBytes=524288 -XX:NewSize=20971520 \
-XX:OldSize=41943040 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC

You can compare the value of -XX:MaxHeapSize with the value you set in JAVA_TOOL_OPTIONS to make sure they match.

Impala query failed with error “Specified block size is less than configured minimum value”

Currently Impala supports the Parquet file format pretty well. For those of you not familiar with it, Parquet is a free and open-source column-oriented data store in the Apache Hadoop ecosystem, and it provides easy and quick access to data with a large number of columns. You can refer to the official Apache Parquet website for more details.

However, even though Parquet is good at storing data with a large number of columns and can retrieve column data quickly, there is still a limit to how many columns you can store before processing engines like Impala stop working properly.

Recently I discovered a bug in Impala: when a table has too many columns (more than 10,000), the Impala query will fail with the below error:

hdfsOpenFile(hdfs://nameservice1/user/hive/warehouse/default.db/table/_impala_insert_staging/f44e0332a3ec1af9_55c692eb00000000/.dh4e0332q3ac1af9-55c692wb00000003_1471427586_dir/dh4e0332q3ac1af9-55c692wb00000003_1471427586_data.0.parq): 
FileSystem#create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
RemoteException: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): -1130396776 < 1048576
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2705)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2658)

The reason is that when a Parquet table has 10K+ columns, Impala tries to estimate the memory required to process the data, and the estimate overflows the 32-bit integer variable used in the Impala code, so a negative value is returned, which causes the error above. The numbers in the message line up with this: the negative block size -1130396776, read back as an unsigned 32-bit value, corresponds to 3,164,570,520 bytes (about 3.16GB), which is past the signed 32-bit maximum of 2,147,483,647. This has been reported in the upstream JIRA: IMPALA-7044.

There is no workaround at this stage other than reducing the number of columns in the Parquet table.

Currently the maximum number of columns Impala can handle is around 8K-10K, depending on the column types, so the table has to be redesigned to use fewer columns.
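If you want to check how close an existing table is to that limit, a quick way from the shell (the table name is illustrative; -B makes impala-shell print one plain-text row per column):

# Count a table's columns: DESCRIBE prints one row per column
impala-shell -q "DESCRIBE default.sample_07" -B | wc -l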

Hope the above information is helpful.

Unable to Insert data into VARCHAR data type in Impala

In this blog post, I will talk about an issue where an Impala user is not able to insert directly into a table that has a VARCHAR column. To reproduce, see the test case below:

CREATE TABLE test (a varchar(20));
INSERT INTO test SELECT 'a';
ERROR: AnalysisException: Possible loss of precision for target table 'default.test'.
Expression ''a'' (type: STRING) would need to be cast to VARCHAR(20) for column 'a'

I would like to point out that this is currently working as expected, as Impala was designed to enforce data integrity by stopping users from converting the STRING data type into VARCHAR behind the scenes, where the user would be unaware of any data loss.

This has been reported in the upstream JIRA: IMPALA-1333, and the plan is to allow such implicit casting between STRING and VARCHAR so that third-party software can handle this case more easily. It will probably be implemented as a flag that can be turned ON and OFF at the session level when needed, rather than blindly allowing implicit conversion, since that carries the risk of unexpected data loss or corruption.

For now, if you know that your STRING data will fit within the VARCHAR’s length, then, as mentioned in the JIRA, you can run the below query to overcome the issue:

INSERT INTO test SELECT CAST('a' AS varchar(20));
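Keep in mind that the explicit CAST silently truncates values longer than the target length, which is exactly the data loss the AnalysisException guards against. A quick illustration with a made-up literal:

-- The 26-character literal is silently cut down to 20 characters
INSERT INTO test SELECT CAST('abcdefghijklmnopqrstuvwxyz' AS VARCHAR(20));
SELECT * FROM test;  -- the new row reads 'abcdefghijklmnopqrst'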