In the last episode, Part 3 of the series, I used an example to go through the Impala Profile in more detail. In this part of the series, I will use a real life production system Impala Profile, explain the steps I used to read through the Profile when given an issue and slowly reached the conclusion of the real cause of the problem. If you missed the the last 3 parts of my posts, you can go back and have a look again:

OK, let’s get started. Since the Profile itself is quite large, as it involved several Impala Daemons to run, it will be ugly on the page if I include the full content here. I will go through section by section and explain on what information I was looking for when troubleshooting the issue.

The problem with the query was that for whatever reason, the same query used to be able to finish under a few minutes, but now it took more than 1 hour to finish. This profile was just one example, in fact, ALL queries running through this cluster had the exact same issue at the time. So please spend sometime to go through this Profile and see if you are able to capture any useful information and understand the situation here.

Now, let me go through in more detail on the steps I used to troubleshoot this particular issue.

1. Since user complained query took longer than normal, the first thing I wanted to check was how long? So very obvious I looked for the Start and End time at the beginning of the Profile:

    Start Time: 2020-01-03 07:33:42.928171000
    End Time: 2020-01-03 08:47:55.745537000

I noticed that it took 1 hour and 14 minutes to finish the query, which matched what user complained.

2. I noticed that the query failed with EXCEPTION, due to user cancellation:

    Query State: EXCEPTION
    Query Status: Cancelled

So it was likely that user was not patience anymore and had to cancel the query as it took too long. Nothing to worry about here.

3. Moved along, I checked how complex the query was:

    Sql Statement: 
SELECT COUNT(*),NVL(SUM(NVL(NUMBER_OF_CALLS,0)),0) 
FROM xxx_hpmm 
WHERE xxxx_month IN (
    SELECT DISTINCT dd.month_id 
    FROM  xxx_xxx_xxx cef, date_xxx dd 
    WHERE CAST(xxxx_date_id AS bigint) = dd.xxx_id
) 
AND xxx_date = TO_TIMESTAMP ('01/02/2020', 'MM/dd/yyyy')

I have re-formatted for readability.
I could see that the query was pretty simple, a straight SELECT statement with a sub-query IN condition.

4. Next thing I could see was the Coordinator host:

Coordinator: xxxxx-xxx-cdh-cdn003.xxx.XXXXXX.com:22000

If I could not conclude anything from this file, then the next step should be to get the impala daemon log on this host. But for now, I just had to continue, as I did not have the log yet.

5. Next block from the Profile I reached the Query Plan section, as I mentioned in my previous post, I normally would skip this section first, and jump to the Summary section, to understand how long each operator took and see if there could be any obvious information that might tell me a bit more:

Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                             
-----------------------------------------------------------------------------------------------------------
12:AGGREGATE              1    0.000ns    0.000ns        0           1   16.00 KB       10.00 MB  FINALIZE
11:EXCHANGE               1    0.000ns    0.000ns       28           1  232.00 KB              0  UNPA...
06:AGGREGATE             29   61.358ms  110.536ms       28           1   13.60 MB       10.00 MB  
05:HASH JOIN             29  223.055ms  345.295ms  277.41M     116.66M    2.00 MB        1.94 MB  LEFT... 
|--10:EXCHANGE           29   29.977us   67.789us        2         261   24.00 KB              0  BROADCAST
|  09:AGGREGATE          29    1.569ms    2.903ms        2         261    1.96 MB       10.00 MB  FINALIZE
|  08:EXCHANGE           29   33.880us  419.580us       58         261  240.00 KB              0  HASH(..
|  04:AGGREGATE          29    9s751ms      3m53s       58         261    4.58 MB       10.00 MB  STREAMING
|  03:HASH JOIN          29    1s099ms    2s635ms  411.27M     411.27M    3.35 MB        1.94 MB  INNER ..
|  |--07:EXCHANGE        29  142.532us  334.794us    8.07K       8.07K  304.00 KB              0  BROADCAST 
|  |  02:SCAN HDFS        1    1.700ms    1.700ms    8.07K       8.07K  332.94 KB       32.00 MB  xx..             
|  01:SCAN HDFS          29   44s391ms     13m18s  411.27M     411.27M  525.95 MB        4.38 GB  xx.. 
00:SCAN HDFS             29       3m7s     56m13s  277.41M     116.66M    7.83 MB      160.00 MB  xx..

There are a few key things that I am looking for here ( I have re-formatted again for readability and removed the information that I was not interested in):

  • Read in the reverse order, from bottom to top, as it is the order that Impala does the operation
  • Compare “Avg Time” and “Max Time” columns
  • Compare “#Rows” and “Est. #Rows” columns
  • Check “Detail” column to see what type of JOINs were for each operation

Immediately, I noticed there was a big difference between “Avg Time” and “Max Time” for SCAN HDFS operator. Average time took 3 minutes and 7 seconds, but Maximum time from one of the hosts, out of 29 hosts, took 56 minutes and 13 seconds. Kept reading, I also noticed the exact same issue for the second SCAN HDFS operation, 44 seconds vs 13 minutes and 18 seconds.

So my next thought was to identify which host/hosts performed much slower than others and whether it was from the same host. To do so, I used string “id=0” to search through the Profile. “0” is the operator number from the beginning against each line in the Summary section “00:SCAN HDFS”. This “id=[\d]+” string will be attached to each operator in the detailed breakdown section down below in the Profile. Remember to remove any leading 0s.

I searched for the first instance of “id=0” from the beginning of the file, and reached below section:

          HDFS_SCAN_NODE (id=0)
 ....
            - ScannerThreadsTotalWallClockTime: 20.0m (1200982115995)
              - MaterializeTupleTime(*): 226ms (226568242)
              - ScannerThreadsSysTime: 322ms (322168172)
              - ScannerThreadsUserTime: 6.76s (6758158482)
            - ScannerThreadsVoluntaryContextSwitches: 10,907 (10907)
            - TotalRawHdfsOpenFileTime(*): 8.6m (517759170560)
            - TotalRawHdfsReadTime(*): 3.4m (201957505069)
            - TotalReadThroughput: 749.9 KiB/s (767874)
            - TotalTime: 3.1m (187289950304)

I noticed TotalTime was 3.1 minutes, which matched the 3 minutes and 7 seconds that I saw in the Summary section, so this was the Average Fragment. To confirm, I scrolled back and saw below:

    Averaged Fragment F00

Continue searching the file, I came to below section (second instance of “id=0”):

            HDFS_SCAN_NODE (id=0)
....
              - ScannerThreadsTotalWallClockTime: 10.4m (626435081910)
                - MaterializeTupleTime(*): 278ms (278689886)
                - ScannerThreadsSysTime: 266ms (266094000)
                - ScannerThreadsUserTime: 5.75s (5748833000)
              - ScannerThreadsVoluntaryContextSwitches: 11,285 (11285)
              - TotalRawHdfsOpenFileTime(*): 7.8m (468388283839)
              - TotalRawHdfsReadTime(*): 1.9m (114598713147)
              - TotalReadThroughput: 731.0 KiB/s (748535)
              - TotalTime: 2.1m (125005670562)

This one told me it took 2.1 minutes, which was faster than average of 3.1 minutes, and scrolling back to confirm the host:

    Fragment F00
      Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)

Now, I could see that there were three things that I was looking for:

      Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 2.1m (125005670562)

I thought it would be easier if I could use simple “grep” to filter out everything. Since the Profile was nicely indented, I used below egrep command to get what I was after:

egrep ' Instance .*\)|^            HDFS_SCAN_NODE \(id=0\)|^              - TotalTime: ' profile-example.txt

It yielded result below:

...
      Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-cdn002.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 2.1m (125005670562)
      Instance 94481a81355e51e4:51fd9f9500000057 (host=xxxxx-xxx-cdh-cdn003.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.9m (114395426955)
      Instance 94481a81355e51e4:51fd9f9500000058 (host=xxxxx-xxx-cdh-cdn020.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.5m (92671503850)
      Instance 94481a81355e51e4:51fd9f950000003d (host=xxxxx-xxx-cdh-cdn012.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.4m (86459970122)
      Instance 94481a81355e51e4:51fd9f950000004b (host=xxxxx-xxx-cdh-cdn014.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.4m (82187347776)
      Instance 94481a81355e51e4:51fd9f9500000050 (host=xxxxx-xxx-cdh-cdn006.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.4m (82146306944)
      Instance 94481a81355e51e4:51fd9f950000004f (host=xxxxx-xxx-cdh-cdn024.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (80468400288)
      Instance 94481a81355e51e4:51fd9f950000004d (host=xxxxx-xxx-cdh-cdn022.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (79714897965)
      Instance 94481a81355e51e4:51fd9f9500000043 (host=xxxxx-xxx-cdh-cdn017.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (78877950983)
      Instance 94481a81355e51e4:51fd9f9500000052 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (77593734314)
      Instance 94481a81355e51e4:51fd9f950000003c (host=xxxxx-xxx-cdh-cdn019.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (76164245478)
      Instance 94481a81355e51e4:51fd9f9500000045 (host=xxxxx-xxx-cdh-cdn007.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.3m (75588331159)
      Instance 94481a81355e51e4:51fd9f9500000044 (host=xxxxx-xxx-cdh-cdn010.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (73596530464)
      Instance 94481a81355e51e4:51fd9f9500000042 (host=xxxxx-xxx-cdh-cdn018.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (72946574082)
      Instance 94481a81355e51e4:51fd9f9500000055 (host=xxxxx-xxx-cdh-cdn026.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (69918383242)
      Instance 94481a81355e51e4:51fd9f9500000054 (host=xxxxx-xxx-cdh-cdn011.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.2m (69355611992)
      Instance 94481a81355e51e4:51fd9f9500000051 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.1m (68527129814)
      Instance 94481a81355e51e4:51fd9f9500000048 (host=xxxxx-xxx-cdh-cdn016.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.1m (67249633571)
      Instance 94481a81355e51e4:51fd9f9500000047 (host=xxxxx-xxx-cdh-cdn013.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.1m (63989781076)
      Instance 94481a81355e51e4:51fd9f9500000041 (host=xxxxx-xxx-cdh-cdn028.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (62739870946)
      Instance 94481a81355e51e4:51fd9f950000003f (host=xxxxx-xxx-cdh-cdn025.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (62136511127)
      Instance 94481a81355e51e4:51fd9f950000004c (host=xxxxx-xxx-cdh-cdn005.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (61943905274)
      Instance 94481a81355e51e4:51fd9f9500000046 (host=xxxxx-xxx-cdh-cdn027.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (61955797776)
      Instance 94481a81355e51e4:51fd9f950000004e (host=xxxxx-xxx-cdh-cdn021.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 1.0m (60045780252)
      Instance 94481a81355e51e4:51fd9f9500000040 (host=xxxxx-xxx-cdh-cdn029.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 58.05s (58048904552)
      Instance 94481a81355e51e4:51fd9f950000004a (host=xxxxx-xxx-cdh-cdn023.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 57.34s (57338024825)
      Instance 94481a81355e51e4:51fd9f9500000049 (host=xxxxx-xxx-cdh-cdn008.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 53.13s (53130104765)
      Instance 94481a81355e51e4:51fd9f9500000056 (host=xxxxx-xxx-cdh-cdn004.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 43.24s (43238668974)
      Instance 94481a81355e51e4:51fd9f950000003e (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=0)
              - TotalTime: 56.2m (3373973559713)

I have omitted other irrelevant information and only left the ones that I was interested. Now I could see clearly which host was the bottleneck. It was host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, which took 56.2 minutes, while ALL other hosts took around 40 seconds to 2 minutes.

Now, I remember another HDFS SCAN had the same symptom, which was operator 1 (01:SCAN HDFS), so I did the same “egrep” command (remember that the indentation for different operators might be different, so I needed to search those again in the Profile first and copy paste exactly the amount of white spaces before them to get the result I wanted):

egrep ' Instance .*\)|^            HDFS_SCAN_NODE \(id=1\)|^              - TotalTime: ' profile-example.txt

And again result confirmed the same:

....
      Instance 94481a81355e51e4:51fd9f950000000c (host=xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=1)
              - TotalTime: 13.3m (798747290751)
...
      Instance 94481a81355e51e4:51fd9f9500000007 (host=xxxxx-xxx-cdh-cdn001.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=1)
              - TotalTime: 28.16s (28163113611)
      Instance 94481a81355e51e4:51fd9f9500000018 (host=xxxxx-xxx-cdh-cdn009.xxx.XXXXXX.com:22000)
...
            HDFS_SCAN_NODE (id=1)
              - TotalTime: 23.29s (23285966387)
...

It was clear that, again the same host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com had the exact same problem, that it ran much slower than other hosts, 13.3 minutes vs 28.16 seconds.

I then came to conclusion that something happened on the host and needed to be fixed.

To confirm my theory as the result of the above investigation, I asked the user to shutdown Impala Daemon on this host and test their query again, and they confirmed back that issue was resolved. And later on they updated me and said that they had identified hardware issues on that host and it had been decommissioned for maintenance.

I hope above steps I took to troubleshoot this particular Impala Query performance issue can be useful and help in someway to understand how best to tackle Impala query issues.

Leave a Reply

Your email address will not be published. Required fields are marked *