SELECT COUNT(*) FROM table WHERE col = 'value';with 2GB of data takes almost 30 minutes to finish in a reasonable sized cluster like 10 nodes, how do you determine the cause of the slowness? There are many possible causes of this issue, however, I think the following three are a great start:
- The number of partitions
- The number of files per partition
- The size of each files under each partition compared with block size
CREATE TABLE table_test LIKE original_table; SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; INSERT OVERWRITE TABLE table_test PARTITION (rundate, qsequence) SELECT * FROM original_table;This assumes that the number of partitions are reasonable, if not, you can update the query so that the number of partitions generated in the new table can be reduced, while data still maintained. So the goal here is to:
- Reduce the number of partitions per table
- Reduce the number of files per partition
- Increase the size of each file under each partition so that they will be close to the block size configured in the cluster (default of 128MB in latest CDH)