Impala Query Profile Explained – Part 3

All right, in Part 2 of the Impala Profile series, I talked about how to read the basic Impala query plan and the Execution Summary from the profile. If you missed the earlier parts, you can find the links below:

Impala Query Profile Explained – Part 1
Impala Query Profile Explained – Part 2

In this third part of the series, I will still focus on the Query Plan and the Execution Summary, but this time using a more complicated query against real-life data downloaded from Kaggle's Flight Delays dataset.

This dataset contains three files:

flights.csv
airlines.csv
airports.csv
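
The profile below assumes these CSVs were loaded into a database called flight_delay. As a minimal sketch (not the exact DDL used here), this is one way to expose such a CSV as an Impala table, using the two columns from airlines.csv and a hypothetical HDFS path:

    CREATE DATABASE IF NOT EXISTS flight_delay;

    CREATE EXTERNAL TABLE flight_delay.airlines (
      iata_code STRING,
      airline   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/flight_delay/airlines'          -- hypothetical path to the uploaded CSV
    TBLPROPERTIES ('skip.header.line.count'='1');   -- skip the CSV header row

The flights table follows the same pattern, just with a much longer column list.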

The query that generated the profile is below:

SELECT 
    a.airline as airline_name, 
    COUNT(IF(cancelled = 1, 1, NULL)) AS cancelled, 
    COUNT(1) AS total, 
    CONCAT(CAST(CAST(COUNT(IF(cancelled = 1, 1, NULL)) / COUNT(1) AS DECIMAL(8,4)) * 100 AS STRING), "%") AS cancelled_rate 
FROM flights f 
JOIN airlines a 
  ON (f.airline = a.iata_code) 
GROUP BY a.airline ORDER BY a.airline 
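
If you want to capture a profile like the one analyzed below yourself, impala-shell can print it right after the query runs; the PROFILE command dumps the profile of the most recently executed query in the session:

    [impala-shell] > SELECT ... ;   -- run the query above first
    [impala-shell] > PROFILE;       -- print the full profile of that query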

This query joins the flights and airlines tables and generates a report showing the cancellation rate for each airline across all flights during 2015. The result looks like below:

+------------------------------+-----------+---------+----------------+
| airline_name                 | cancelled | total   | cancelled_rate |
+------------------------------+-----------+---------+----------------+
| Alaska Airlines Inc.         | 669       | 172521  | 0.3800%        |
| American Airlines Inc.       | 10919     | 725984  | 1.5000%        |
| American Eagle Airlines Inc. | 15025     | 294632  | 5.0900%        |
| Atlantic Southeast Airlines  | 15231     | 571977  | 2.6600%        |
| Delta Air Lines Inc.         | 3824      | 875881  | 0.4300%        |
| Frontier Airlines Inc.       | 588       | 90836   | 0.6400%        |
| Hawaiian Airlines Inc.       | 171       | 76272   | 0.2200%        |
| JetBlue Airways              | 4276      | 267048  | 1.6000%        |
| Skywest Airlines Inc.        | 9960      | 588353  | 1.6900%        |
| Southwest Airlines Co.       | 16043     | 1261855 | 1.2700%        |
| Spirit Air Lines             | 2004      | 117379  | 1.7000%        |
| US Airways Inc.              | 4067      | 198715  | 2.0400%        |
| United Air Lines Inc.        | 6573      | 515723  | 1.2700%        |
| Virgin America               | 534       | 61903   | 0.8600%        |
+------------------------------+-----------+---------+----------------+

And the full query plan and execution summary from the profile are below:

F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=0B mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
08:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: a.airline ASC
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=3 row-size=52B cardinality=14
|
F02:PLAN FRAGMENT [HASH(a.airline)] hosts=4 instances=4
Per-Host Resources: mem-estimate=22.00MB mem-reservation=13.94MB
04:SORT
|  order by: a.airline ASC
|  mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
|  tuple-ids=3 row-size=52B cardinality=14
|
07:AGGREGATE [FINALIZE]
|  output: count:merge(if(cancelled = 1, 1, NULL)), count:merge(*)
|  group by: a.airline
|  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB
|  tuple-ids=2 row-size=52B cardinality=14
|
06:EXCHANGE [HASH(a.airline)]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=2 row-size=52B cardinality=14
|
F00:PLAN FRAGMENT [RANDOM] hosts=4 instances=4
Per-Host Resources: mem-estimate=187.94MB mem-reservation=3.94MB
03:AGGREGATE [STREAMING]
|  output: count(if(cancelled = 1, 1, NULL)), count(*)
|  group by: a.airline
|  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB
|  tuple-ids=2 row-size=52B cardinality=14
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: f.airline = a.iata_code
|  fk/pk conjuncts: f.airline = a.iata_code
|  runtime filters: RF000 <- a.iata_code
|  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
|  tuple-ids=0,1 row-size=73B cardinality=5819079
|
|--05:EXCHANGE [BROADCAST]
|  |  mem-estimate=0B mem-reservation=0B
|  |  tuple-ids=1 row-size=54B cardinality=14
|  |
|  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B
|  01:SCAN HDFS [flight_delay.airlines a, RANDOM]
|     partitions=1/1 files=1 size=341B
|     stats-rows=14 extrapolated-rows=disabled
|     table stats: rows=14 size=341B
|     column stats: all
|     mem-estimate=32.00MB mem-reservation=0B
|     tuple-ids=1 row-size=54B cardinality=14
|
00:SCAN HDFS [flight_delay.flights f, RANDOM]
   partitions=1/1 files=1 size=564.96MB
   runtime filters: RF000 -> f.airline
   stats-rows=5819079 extrapolated-rows=disabled
   table stats: rows=5819079 size=564.96MB
   column stats: all
   mem-estimate=176.00MB mem-reservation=0B
   tuple-ids=0 row-size=19B cardinality=5819079
----------------
    Estimated Per-Host Mem: 253689856
    Per Host Min Reservation: host-xxx.xxxx.com:22000(17.88 MB) host-xxx.xxxx.com:22000(17.88 MB) host-xxx.xxxx.com:22000(17.88 MB) host-xxx.xxxx.com:22000(17.88 MB) 
    Request Pool: root.hive
    Admission result: Admitted immediately
    ExecSummary: 
Operator              #Hosts   Avg Time   Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                  
-------------------------------------------------------------------------------------------------------------------------
08:MERGING-EXCHANGE        1    4s122ms    4s122ms     14          14          0              0  UNPARTITIONED           
04:SORT                    4  249.999us  999.996us     14          14   12.02 MB       12.00 MB                          
07:AGGREGATE               4    2.750ms    4.000ms     14          14    1.99 MB       10.00 MB  FINALIZE                
06:EXCHANGE                4    4s100ms    4s137ms     55          14          0              0  HASH(a.airline)         
03:AGGREGATE               4  280.499ms  339.002ms     55          14   10.11 MB       10.00 MB  STREAMING               
02:HASH JOIN               4  177.749ms  184.999ms  5.82M       5.82M   10.05 MB        1.94 MB  INNER JOIN, BROADCAST   
|--05:EXCHANGE             4    0.000ns    0.000ns     14          14          0              0  BROADCAST               
|  01:SCAN HDFS            1   97.000ms   97.000ms     14          14  177.00 KB       32.00 MB  flight_delay.airlines a 
00:SCAN HDFS               4    2s052ms    3s278ms  5.82M       5.82M   40.06 MB      176.00 MB  flight_delay.flights f  
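
As a side note, you can get this level of plan detail (memory estimates, tuple ids and so on) without actually running the query, by raising EXPLAIN_LEVEL in impala-shell before calling EXPLAIN:

    [impala-shell] > SET EXPLAIN_LEVEL=2;   -- "extended" plan output
    [impala-shell] > EXPLAIN SELECT ... ;   -- same query as above, plan only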

OK, let’s jump to the Execution Summary section first this time, as it is easier to read and it is normally the first section I check when helping CDH users troubleshoot Impala query issues. I will assume that you have read Part 2 of the series, so you know what was explained there, and I will go a bit faster this time.

So, from the above Execution Summary, we can see what happened during the lifetime of the query execution (remember to read from the bottom up):

  1. It took an average of 2 seconds (2s052ms) to scan the HDFS data for table flight_delay.flights
  2. Impala estimated 5.82 million rows in table flight_delay.flights, and the scan returned exactly that number, which means the table stats are up-to-date (5.82M; see the COMPUTE STATS note after this list)
  3. Impala estimated that 176MB of memory was needed to scan table flight_delay.flights, but in fact only 40MB was used. This is expected, as the memory estimate can never match actual usage exactly; the goal is to get as close as possible
  4. Since the data is big, Impala was able to split the file and perform the scan operation on 4 hosts, so that the load was distributed
  5. Impala also scanned the other table, flight_delay.airlines. Again, the estimated and actual row counts match, so table stats are up-to-date for flight_delay.airlines as well. And since the table only has 14 rows, it took only 97 milliseconds to scan
  6. As the table is small, with only 14 rows, Impala used just 1 host for that scan operation
  7. The next step was to broadcast the smaller table, flight_delay.airlines, to all nodes in the cluster that would take part in the query, in my case 4 hosts (BROADCAST)
  8. Once the broadcast was done, Impala performed the hash join between flight_delay.flights and flight_delay.airlines, which took an average of 177 milliseconds and used about 10MB of memory
  9. Since we called the COUNT aggregate function, Impala performed a streaming AGGREGATE operation, which, again, ran on 4 hosts, took an average of 280ms, used about 10MB of memory, and returned 55 rows. This is more than the final 14 rows because each host produced its own partial, pre-merge aggregates
  10. Since the above step was performed on 4 worker nodes, Impala then needed to merge their results, which was achieved by exchanging the data (hash-partitioned on a.airline) between the nodes and performing the final aggregation on the intermediate results
  11. A SORT operation was performed after step 10 completed, as we had “ORDER BY” in our query
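
The up-to-date stats seen in steps 2 and 5 do not happen by themselves; they come from running COMPUTE STATS after data is loaded. To keep them fresh for these two tables:

    COMPUTE STATS flight_delay.flights;
    COMPUTE STATS flight_delay.airlines;

    -- verify what the planner will see
    SHOW TABLE STATS flight_delay.flights;
    SHOW COLUMN STATS flight_delay.flights;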

You can match the operator numbers, like 00, 01, 02 etc., in the Summary section to the numbers in the Query Plan section, which will tell you more details about each operation. I covered those details in Part 2, so please refer to the previous post if you need a refresher.
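
By the way, if you only want this summary table rather than the full profile, impala-shell can print it on its own right after a query finishes:

    [impala-shell] > SUMMARY;   -- exec summary of the most recent query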

Now, let’s jump further down the PROFILE and have a look at the Planner Timeline and the Query Timeline:

    Planner Timeline
      Analysis finished: 3ms (3389346)
      Equivalence classes computed: 3ms (3600838)
      Single node plan created: 4ms (4625920)
      Runtime filters computed: 4ms (4734686)
      Distributed plan created: 5ms (5120630)
      Lineage info computed: 13ms (13666462)
      Planning finished: 15ms (15712999)
    Query Timeline
      Query submitted: 0ns (0)
      Planning finished: 16ms (16999947)
      Submit for admission: 17ms (17999944)
      Completed admission: 17ms (17999944)
      Ready to start on 4 backends: 18ms (18999941)
      All 4 execution backends (10 fragment instances) started: 28ms (28999909)
      Rows available: 4.28s (4280986646)
      First row fetched: 4.31s (4308986559)

Each line is pretty much self-explanatory. We can see that query planning took 15ms, the query was submitted for admission at the 17ms mark, execution on the worker nodes was fully started at 28ms, rows became available at 4.28 seconds, and the first row was fetched by the client at 4.31 seconds. This gives you a very clear overview of how long each stage took. If any stage is slow, it will be obvious, and we can then drill down further to see what might have happened.

Since my query was fast, there is not much of interest here. Let’s have a look at the timeline from another, real production Impala query profile:

    Query Compilation: 16.268ms
       - Metadata of all 1 tables cached: 1.786ms (1.786ms)
       - Analysis finished: 6.162ms (4.376ms)
       - Value transfer graph computed: 6.537ms (374.918us)
       - Single node plan created: 7.955ms (1.417ms)
       - Runtime filters computed: 8.274ms (318.815us)
       - Distributed plan created: 8.430ms (156.307us)
       - Lineage info computed: 9.664ms (1.234ms)
       - Planning finished: 16.268ms (6.603ms)
    Query Timeline: 35m46s
       - Query submitted: 0.000ns (0.000ns)
       - Planning finished: 22.001ms (22.001ms)
       - Submit for admission: 23.001ms (1.000ms)
       - Completed admission: 23.001ms (0.000ns)
       - Ready to start on 2 backends: 24.001ms (1.000ms)
       - All 2 execution backends (2 fragment instances) started: 36.001ms (12.000ms)
       - Rows available: 5m51s (5m51s)
       - First row fetched: 5m52s (950.045ms)
       - Last row fetched: 35m46s (29m53s)
       - Released admission control resources: 35m46s (1.000ms)
       - Unregister query: 35m46s (30.001ms)
     - ComputeScanRangeAssignmentTimer: 0.000ns

This was taken from a case where an Impala query took a long time to run and the customer wanted to find out why. From the Query Timeline, we can clearly see that it took almost 6 minutes (5m51s) from the start of execution (All 2 execution backends started) until data was available (Rows available). This 6 minutes of execution could be normal; with lots of joins over a large dataset, it is common for a query to run for several minutes.

However, we can notice that it took Impala another 30 minutes to pass the data back to the client: the first row was fetched at the 6-minute mark, but the last row only at the 36-minute mark. From here, we could suspect a networking issue between the Impala coordinator and the client (as the data fetch is initiated by the client, like impala-shell or Hue, against the Impala coordinator host). Another possibility is that the client was consuming the results and performing other actions, like printing them to the screen; if the returned dataset is big, that operation can be time-consuming.
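
One quick way to test the second theory is to rerun the query with impala-shell’s pretty-printing turned off and the results written straight to a file, which takes client-side rendering out of the picture (the query text is abbreviated here):

    impala-shell -B -q "SELECT ..." -o /tmp/result.txt
    # -B disables pretty-printing, -q runs the given query, -o writes the rows to a file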

So this section of the PROFILE can point us in the right direction when looking for the bottleneck.

This concludes Part 3 of the Impala Profile series. In the next post, I will explain how to relate the operator numbers shown in the Query Plan section to the bottom sections of the profile, which show the detailed metrics of each operation, both averaged and individually per host.

Hope to see you again!

Comments

  1. Stewart Wen

    “(as data fetch happens from Client, like impala-shell or Hue, to Impala Coordinator host)”
    Is this a mistake? It seems more reasonable that the data fetch happens from the coordinator host to the client.

    1. Eric Lin

      Hi Stewart,

      Firstly, thanks for reading my post and posting your comment.

      However, the data fetch does happen from the client side, as the API call is triggered from there to the Impala coordinator host, effectively saying: I want more data, please give me more. It is the returning of the data that happens on the server side.

      Please let me know if you still think otherwise.

      Cheers
      Eric

  2. Bruno

    Hi Eric! I wish to thank you for your didactic article. It was very clarifying for me. To prove that I read it very carefully, I noticed a little mistake: when you analyzed the summary, you said that AGGREGATE had returned 55 rows, but this was the number estimated by Impala. The actual number was only 14, the maximum possible in this case, since it can’t be larger than the number of rows of the smaller table, which holds the column used for the aggregation.

    1. Eric Lin

      Hi Bruno,

      Thanks for adding a comment on my site, and I am really sorry for the delayed response.

      Based on the Summary, the estimated row count for the aggregation was 14 and the actual number was 55. Since it was a JOIN, it is possible for the result to return more than 14 rows, depending on what is in the other table.

      I still do not see the issue here, but please correct me again if you think otherwise.

      Cheers
      Eric
