Impala Query Profile Explained – Part 1

If you work with Impala, but have no idea how to interpret the Impala query PROFILEs, it would be very hard to understand what’s going on and how to make your query run at its full potential. I think this is the case for lots of Impala users, so I would like to write a simple blog post to share my experience and hope that it can help with anyone who like to learn more.

This is the Part 1 of the series, so I will go with the basics and just cover the main things to look out for when examining the PROFILE.

So first thing first, how do you collect Impala query PROFILE? Well, there are a couple of ways. The simplest way is to just run “PROFILE” after your query in impala-shell, like below:

[impala-daemon-host.com:21000] > SELECT COUNT(*) FROM sample_07;
Query: SELECT COUNT(*) FROM sample_07
Query submitted at: 2018-09-14 15:57:35 (Coordinator: https://impala-daemon-host.com:25000)
dQuery progress can be monitored at: https://impala-daemon-host.com:25000/query_plan?query_id=36433472787e1cab:29c30e7800000000
+----------+
| count(*) |
+----------+
| 823      |
+----------+
Fetched 1 row(s) in 6.68s

[impala-daemon-host.com:21000] > PROFILE; <-- Simply run "PROFILE" as a query
Query Runtime Profile:
Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
.....

You can also collect from Cloudera Manager Web UI, by navigating to CM > Impala > Queries, locate the query you just ran and click on “Query Details”

Then scroll down a bit to locate “Download Profile” button:

Last, but not least, you can navigate to Impala Daemon’s web UI and download from there. Go to the Impala Daemon that is used as the coordinator to run the query:

https://{impala-daemon-url}:25000/queries

The list of queries will be displayed:

Click through the “Details” link and then to “Profile” tab:

All right, so we have the PROFILE now, let’s dive into the details.

Below is the snippet of Query PROFILE we will go through today, which is the Summary section at the top of the PROFILE:

Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)
User: hive@VPC.CLOUDERA.COM
Connected User: hive@VPC.CLOUDERA.COM
Delegated User:
Network Address: ::ffff:172.26.26.117:58834
Default Db: default
Sql Statement: SELECT COUNT(*) FROM sample_07
Coordinator: impala-daemon-url.com:22000
Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0
Plan:
----------------

Let’s break it into sections and walk through one by one. There are a few important information here that used more often:

a. Query ID:

Query (id=36433472787e1cab:29c30e7800000000):

This is useful to identify relevant Query related information from Impala Daemon logs. Simply search this query ID and you can find out what it was doing behind the scene, especially useful for finding out related error messages.

b. Session Type:

Session Type: BEESWAX

This can tell us where the connection is from. BEESWAX means that the query ran from impala-shell client. If you run from Hue, the type will be “HIVESERVER2” since Hue connects via HiveServer2 thrift.

c. Start and End time:

Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000

This is useful to tell how long the query ran for. Please keep it in mind that this time includes session idle time. So if you run a simple query that returns in a few seconds in Hue, since Hue keeps session open until session is closed or user runs another query, so the time here might show longer time than normal. The start and end time should match exactly the run time if run through impala-shell however, since impala-shell closes query handler straightaway after query finishes.

d. Query status:

Query Status: OK

This tells if the query finished successfully or not. OK means good. If there are errors, normally will show here, for example, cancelled by user, session timeout, Exceptions etc.

e. Impala version:

Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)

This confirms the version that is used to run the query, if you see this is not matching with your installation, then something is not setup properly.

f. User information:

User: hive@XXX.XXXXXX.COM
Connected User: hive@XXX.XXXXXX.COM
Delegated User:

You can find out who ran the query from this session, so you know who to blame :).

g. DB selected on connection:

Default Db: default

Not used a lot, but good to know.

h. The query that used to return this PROFILE:

Sql Statement: SELECT COUNT(*) FROM sample_07

You will need this info if you are helping others to troubleshoot, as you need to know how query was constructed and what tables are involved. In lots of cases that a simple rewrite of the query will help to resolve issues or boost query performance.

i. The impala daemon that is used to run the query, what we called the Coordinator:

Coordinator: impala-daemon-host.com:22000

This is important piece of information, as you will determine which host to get the impala daemon log should you wish to check for INFO, WARNING and ERROR level logs.

j. Query Options used for this query:

Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0

This section tells you what kind of QUERY OPTIONS being applied to the current query, if there are any. This is useful to see if there is any user level, or pool level overrides that will affect this query. One example would be if Impala Daemon’s memory is set at, say 120GB, but a small query still fails with OutOfMemory error. This is the place you will check if user accidentally set MEM_LIMIT in their session to a lower value that could results in OutOfMemory error.

This concludes the part 1 of the series to explain the Summary section of the query to understand the basic information. In the next part of the series, I will explain in detail on Query Plan as well as the Execution Summary of the PROFILE.

Any comments or suggestions, please let me know from the comments section below. Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *