In the last episode, Part 3 of the series, I used an example to go through the Impala Profile in more detail. In this part of the series, I will use a real life production system Impala Profile, explain the steps I used to read through the Profile when given an issue and slowly reached the conclusion of the real cause of the problem. If you missed the the last 3 parts of my posts, you can go back and have a look again:
- Impala Query Profile Explained – Part 1
- Impala Query Profile Explained – Part 2
- Impala Query Profile Explained – Part 3
OK, let’s get started. Since the Profile itself is quite large, as it involved several Impala Daemons to run, it will be ugly on the page if I include the full content here. So, please download from this link and examine it by opening from your computer or browser.
The problem with the query was that for whatever reason, the same query used to be able to finish under a few minutes, but now it took more than 1 hour to finish. This profile was just one example, in fact, ALL queries running through this cluster had the exact same issue at the time. So please spend sometime to go through this Profile and see if you are able to capture any useful information and understand the situation here.
Now, let me go through in more detail on the steps I used to troubleshoot this particular issue.
1. Since user complained query took longer than normal, the first thing I wanted to check was how long? So very obvious I looked for the Start and End time at the beginning of the Profile:
Start Time: 2020-01-03 07:33:42.928171000 End Time: 2020-01-03 08:47:55.745537000
I noticed that it took 1 hour and 14 minutes to finish the query, which matched what user complained.
2. I noticed that the query failed with EXCEPTION, due to user cancellation:
Query State: EXCEPTION Query Status: Cancelled
So it was likely that user was not patience anymore and had to cancel the query as it took too long. Nothing to worry about here.
3. Moved along, I checked how complex the query was:
Sql Statement: SELECT COUNT(*),NVL(SUM(NVL(NUMBER_OF_CALLS,0)),0) FROM cef_hpmn WHERE settlement_month IN ( SELECT DISTINCT dd.month_id FROM ext_call_event_fact cef, date_dim dd WHERE CAST(settlement_date_id AS bigint) = dd.date_id ) AND process_date = TO_TIMESTAMP ('01/02/2020', 'MM/dd/yyyy')
I have re-formatted for readability.
I could see that the query was pretty simple, a straight SELECT statement with a sub-query IN condition.
4. Next thing I could see was the Coordinator host:
If I could not conclude anything from this file, then the next step should be to get the impala daemon log on this host. But for now, I just had to continue, as I did not have the log yet.
5. Next block from the Profile I reached the Query Plan section, as I mentioned in my previous post, I normally would skip this section first, and jump to the Summary section, to understand how long each operator took and see if there could be any obvious information that might tell me a bit more:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ----------------------------------------------------------------------------------------------------------- 12:AGGREGATE 1 0.000ns 0.000ns 0 1 16.00 KB 10.00 MB FINALIZE 11:EXCHANGE 1 0.000ns 0.000ns 28 1 232.00 KB 0 UNPA... 06:AGGREGATE 29 61.358ms 110.536ms 28 1 13.60 MB 10.00 MB 05:HASH JOIN 29 223.055ms 345.295ms 277.41M 116.66M 2.00 MB 1.94 MB LEFT... |--10:EXCHANGE 29 29.977us 67.789us 2 261 24.00 KB 0 BROADCAST | 09:AGGREGATE 29 1.569ms 2.903ms 2 261 1.96 MB 10.00 MB FINALIZE | 08:EXCHANGE 29 33.880us 419.580us 58 261 240.00 KB 0 HASH(.. | 04:AGGREGATE 29 9s751ms 3m53s 58 261 4.58 MB 10.00 MB STREAMING | 03:HASH JOIN 29 1s099ms 2s635ms 411.27M 411.27M 3.35 MB 1.94 MB INNER .. | |--07:EXCHANGE 29 142.532us 334.794us 8.07K 8.07K 304.00 KB 0 BROADCAST | | 02:SCAN HDFS 1 1.700ms 1.700ms 8.07K 8.07K 332.94 KB 32.00 MB xx.. | 01:SCAN HDFS 29 44s391ms 13m18s 411.27M 411.27M 525.95 MB 4.38 GB xx.. 00:SCAN HDFS 29 3m7s 56m13s 277.41M 116.66M 7.83 MB 160.00 MB xx..
There are a few key things that I am looking for here ( I have re-formatted again for readability and removed the information that I was not interested in):
- Read in the reverse order, from bottom to top, as it is the order that Impala does the operation
- Compare “Avg Time” and “Max Time” columns
- Compare “#Rows” and “Est. #Rows” columns
- Check “Detail” column to see what type of JOINs were for each operation
Immediately, I noticed there was a big difference between “Avg Time” and “Max Time” for SCAN HDFS operator. Average time took 3 minutes and 7 seconds, but Maximum time from one of the hosts, out of 29 hosts, took 56 minutes and 13 seconds. Kept reading, I also noticed the exact same issue for the second SCAN HDFS operation, 44 seconds vs 13 minutes and 18 seconds.
So my next thought was to identify which host/hosts performed much slower than others and whether it was from the same host. To do so, I used string “id=0” to search through the Profile. “0” is the operator number from the beginning against each line in the Summary section “00:SCAN HDFS”. This “id=[\d]+” string will be attached to each operator in the detailed breakdown section down below in the Profile. Remember to remove any leading 0s.
I searched for the first instance of “id=0” from the beginning of the file, and reached below section:
HDFS_SCAN_NODE (id=0) .... - ScannerThreadsTotalWallClockTime: 20.0m (1200982115995) - MaterializeTupleTime(*): 226ms (226568242) - ScannerThreadsSysTime: 322ms (322168172) - ScannerThreadsUserTime: 6.76s (6758158482) - ScannerThreadsVoluntaryContextSwitches: 10,907 (10907) - TotalRawHdfsOpenFileTime(*): 8.6m (517759170560) - TotalRawHdfsReadTime(*): 3.4m (201957505069) - TotalReadThroughput: 749.9 KiB/s (767874) - TotalTime: 3.1m (187289950304)
I noticed TotalTime was 3.1 minutes, which matched the 3 minutes and 7 seconds that I saw in the Summary section, so this was the Average Fragment. To confirm, I scrolled back and saw below:
Averaged Fragment F00
Continue searching the file, I came to below section (second instance of “id=0”):
HDFS_SCAN_NODE (id=0) .... - ScannerThreadsTotalWallClockTime: 10.4m (626435081910) - MaterializeTupleTime(*): 278ms (278689886) - ScannerThreadsSysTime: 266ms (266094000) - ScannerThreadsUserTime: 5.75s (5748833000) - ScannerThreadsVoluntaryContextSwitches: 11,285 (11285) - TotalRawHdfsOpenFileTime(*): 7.8m (468388283839) - TotalRawHdfsReadTime(*): 1.9m (114598713147) - TotalReadThroughput: 731.0 KiB/s (748535) - TotalTime: 2.1m (125005670562)
This one told me it took 2.1 minutes, which was faster than average of 3.1 minutes, and scrolling back to confirm the host:
Fragment F00 Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)
Now, I could see that there were three things that I was looking for:
Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000) HDFS_SCAN_NODE (id=0) - TotalTime: 2.1m (125005670562)
I thought it would be easier if I could use simple “grep” to filter out everything. Since the Profile was nicely indented, I used below egrep command to get what I was after:
egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=0\)|^ - TotalTime: ' profile-example.txt
It yielded result below:
... Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 2.1m (125005670562) Instance 94481a81355e51e4:51fd9f9500000057 (host=xxxxx-xxx-cdh-cdn003.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.9m (114395426955) Instance 94481a81355e51e4:51fd9f9500000058 (host=xxxxx-xxx-cdh-cdn020.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.5m (92671503850) Instance 94481a81355e51e4:51fd9f950000003d (host=xxxxx-xxx-cdh-cdn012.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.4m (86459970122) Instance 94481a81355e51e4:51fd9f950000004b (host=xxxxx-xxx-cdh-cdn014.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.4m (82187347776) Instance 94481a81355e51e4:51fd9f9500000050 (host=xxxxx-xxx-cdh-cdn006.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.4m (82146306944) Instance 94481a81355e51e4:51fd9f950000004f (host=xxxxx-xxx-cdh-cdn024.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (80468400288) Instance 94481a81355e51e4:51fd9f950000004d (host=xxxxx-xxx-cdh-cdn022.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (79714897965) Instance 94481a81355e51e4:51fd9f9500000043 (host=xxxxx-xxx-cdh-cdn017.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (78877950983) Instance 94481a81355e51e4:51fd9f9500000052 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (77593734314) Instance 94481a81355e51e4:51fd9f950000003c (host=xxxxx-xxx-cdh-cdn019.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (76164245478) Instance 94481a81355e51e4:51fd9f9500000045 (host=xxxxx-xxx-cdh-cdn007.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.3m (75588331159) Instance 94481a81355e51e4:51fd9f9500000044 (host=xxxxx-xxx-cdh-cdn010.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (73596530464) Instance 94481a81355e51e4:51fd9f9500000042 (host=xxxxx-xxx-cdh-cdn018.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (72946574082) Instance 94481a81355e51e4:51fd9f9500000055 (host=xxxxx-xxx-cdh-cdn026.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (69918383242) Instance 94481a81355e51e4:51fd9f9500000054 (host=xxxxx-xxx-cdh-cdn011.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.2m (69355611992) Instance 94481a81355e51e4:51fd9f9500000051 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.1m (68527129814) Instance 94481a81355e51e4:51fd9f9500000048 (host=xxxxx-xxx-cdh-cdn016.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.1m (67249633571) Instance 94481a81355e51e4:51fd9f9500000047 (host=xxxxx-xxx-cdh-cdn013.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.1m (63989781076) Instance 94481a81355e51e4:51fd9f9500000041 (host=xxxxx-xxx-cdh-cdn028.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (62739870946) Instance 94481a81355e51e4:51fd9f950000003f (host=xxxxx-xxx-cdh-cdn025.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (62136511127) Instance 94481a81355e51e4:51fd9f950000004c (host=xxxxx-xxx-cdh-cdn005.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (61943905274) Instance 94481a81355e51e4:51fd9f9500000046 (host=xxxxx-xxx-cdh-cdn027.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (61955797776) Instance 94481a81355e51e4:51fd9f950000004e (host=xxxxx-xxx-cdh-cdn021.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 1.0m (60045780252) Instance 94481a81355e51e4:51fd9f9500000040 (host=xxxxx-xxx-cdh-cdn029.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 58.05s (58048904552) Instance 94481a81355e51e4:51fd9f950000004a (host=xxxxx-xxx-cdh-cdn023.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 57.34s (57338024825) Instance 94481a81355e51e4:51fd9f9500000049 (host=xxxxx-xxx-cdh-cdn008.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 53.13s (53130104765) Instance 94481a81355e51e4:51fd9f9500000056 (host=xxxxx-xxx-cdh-cdn004.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 43.24s (43238668974) Instance 94481a81355e51e4:51fd9f950000003e (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=0) - TotalTime: 56.2m (3373973559713)
I have omitted other irrelevant information and only left the ones that I was interested. Now I could see clearly which host was the bottleneck. It was host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, which took 56.2 minutes, while ALL other hosts took around 40 seconds to 2 minutes.
Now, I remember another HDFS SCAN had the same symptom, which was operator 1 (01:SCAN HDFS), so I did the same “egrep” command (remember that the indentation for different operators might be different, so I needed to search those again in the Profile first and copy paste exactly the amount of white spaces before them to get the result I wanted):
egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=1\)|^ - TotalTime: ' profile-example.txt
And again result confirmed the same:
.... Instance 94481a81355e51e4:51fd9f950000000c (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=1) - TotalTime: 13.3m (798747290751) ... Instance 94481a81355e51e4:51fd9f9500000007 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=1) - TotalTime: 28.16s (28163113611) Instance 94481a81355e51e4:51fd9f9500000018 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000) ... HDFS_SCAN_NODE (id=1) - TotalTime: 23.29s (23285966387) ...
It was clear that, again the same host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com had the exact same problem, that it ran much slower than other hosts, 13.3 minutes vs 28.16 seconds.
I then came to conclusion that something happened on the host and needed to be fixed.
To confirm my theory as the result of the above investigation, I asked the user to shutdown Impala Daemon on this host and test their query again, and they confirmed back that issue was resolved. And later on they updated me and said that they had identified hardware issues on that host and it had been decommissioned for maintenance.
I hope above steps I took to troubleshoot this particular Impala Query performance issue can be useful and help in someway to understand how best to tackle Impala query issues.