As we know, Hadoop/HDFS/MapReduce/Impala is designed to store and process large amount of data, in terms of TBs or PBs. And we also know that having too many small files will hurt query performance, because NameNode needs to store millions of metadata to hold the information about files being stored in HDFS, queries against tables with so many partitions/files on HDFS will add extra overhead to retrieve the list of files and read them one by one, and hence the performance of the query will be affected. It also has very good chance to hitting the file descriptor limit and cause query failure.

So, does it mean that we should keep the file size as big as possible? Of course not! Having large file stored against any table will also hurt performance. The reason being that in most of cases, Hadoop users will compress data stored in HDFS, so that disk space can be saved. And if you have large compressed file, it will take time to decompress it and cause query slowness.

To prove this theory, I have performed below test in my CDH environment.

  1. I have a CSV file with 565M plain Text file in size and 135M after bzip2 compressed, downloaded from Kaggle’s Flight Delay Dataset.
  2. I created one table named bzip2_smallfiles_4 with 4 such bzip2 files, another one named bzip2_smallfiles_8 with 8 such files
  3. I then also concatenated this text file 4 times to generated a single text file, compressed it using bzip2 and size become about 510MB in size, also created one table named bzip2_bigfile_4 on top of it
  4. the same as 3. but I made the file bigger by concatenating it 8 times, the file is 1.1GB in size after compression and a new table called bzip2_bigfile_8 is created
  5. I then ran “SELECT COUNT(*) FROM” query against those 4 tables one by one to compare the result

With no surprise, I can see that query against the table bzip2_bigfile_8 is the slowest:

bzip2_smallfiles_4:
a. 4 hosts to run the query
b. took around 53 seconds
c. max scan took 52 seconds
d. max decompression took 49 seconds

Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  26s464ms  52s687ms  23.28M          -1  40.32 MB      160.00 MB  test.bzip2_smallfiles_4

    Query Timeline
...
      Rows available: 53.86s (53861836202)
      First row fetched: 53.87s (53869836178)
      Unregister query: 53.87s (53874836163)

    Fragment F00
      Instance fc48dc3e014eb7a5:7d7a2dc100000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2 
            - DecompressionTime: 49.45s (49449847498)

bzip2_smallfiles_8
a. 4 hosts to run the query
b. took around 54.69s
c. max scan took 54s196ms
d. max decompression took 51.18s

Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  52s514ms  54s196ms  46.55M          -1  40.32 MB      160.00 MB  test.bzip2_smallfiles_8 

    Query Timeline
...
      Rows available: 54.36s (54359822792)
      First row fetched: 54.68s (54683821736)
      Unregister query: 54.69s (54688821720)

    Fragment F00
      Instance 5642f67b9a975652:c19438dc00000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2 
            - DecompressionTime: 51.18s (51183849937)

bzip2_bigfile_4:
a. 4 hosts to run the query
b. took around 1 minute 50 seconds to run
c. max scan time took 1m49s
d. max decompression took 1.7 minutes

Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  27s394ms      1m49s  23.28M          -1  40.15 MB      176.00 MB  test.bzip2_bigfile_4 

    Query Timeline
...
      Rows available: 1.8m (109781665214)
      First row fetched: 1.8m (110408663300)
      Unregister query: 1.8m (110413663284)

    Fragment F00
      Instance 4545c110dbca4c9c:6cd1db1100000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2 
            - DecompressionTime: 1.7m (104339662922)

bzip2_bigfile_8:
a. 4 hosts to run the query again
b. took around 3.6m to run
c. max scan time was 3m35s
d. max decompression time was 3.4 minutes

Operator       Hosts  Avg Time  Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail 
00:SCAN HDFS        4  53s902ms      3m35s  46.55M          -1  40.32 MB      176.00 MB  test.bzip2_bigfile_8 

    Query Timeline
...
      Rows available: 3.6m (215992297509)
      First row fetched: 3.6m (216480295920)
      Unregister query: 3.6m (216484295907)

    Fragment F00
      Instance 8f42a3b6ca6cf1cf:72fd65e100000004 (host=xxxx:22000)
        AGGREGATION_NODE (id=1)
          HDFS_SCAN_NODE (id=0)
            File Formats: TEXT/BZIP2:2
             - DecompressionTime: 3.4m (203596406406)

The reason I chose bzip2 compression format is because bzip2 is splittable, proved by the fact that all my test queries were using 4 hosts to run, even for those two big single bzip2 files.

As you can see, it takes almost 4 minutes to decompress the largest 1.1GB bzip2 file in order for Impala to read it. For table bzip2_smallfiles_8, even though we got more files to decompress, because we can do this in parallel from multiple hosts, it won’t affect the performance much.

So in conclusion, too many small files (we are talking about in KBs or a few MBs) is a No No in Hadoop, however, having too little files with big compressed size is also a No No. Ideally we should have file size as close to the block size (by default in CDH is 256MB) as possible, so that the performance can be optimized.

Leave a Reply

Your email address will not be published. Required fields are marked *