Disk might be cheap, but when it comes to deal with TB of data, you might want to consider compression for your data storage.

When you want to create a table with compression enabled, you will need to use “STORED AS SEQUENCEFILE” when create a table in Hive:

CREATE TABLE compressed_table (data STRING)
   ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY '\t'
   STORED AS SEQUENCEFILE;

However, you will not be able to use “LOAD DATA” command to load data from text file into this compressed table, Hive will complain about the file format.

There is a trick to by pass this, however. What you need to do is to create a temp table to hold the data from file as “LOAD DATA” command will work for normal text file storage, and “INSERT OVERWRITE INTO TABLE” will also work for compressed table. Follow the steps below:

CREATE TABLE tmp_table (data STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
 
CREATE TABLE compressed_table (data STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
   STORED AS SEQUENCEFILE;
 
LOAD DATA LOCAL INPATH '/tmp/file.txt' INTO TABLE tmp_table;

INSERT OVERWRITE TABLE compressed_table SELECT * FROM tmp_table; 

-- then drop the tmp table
DROP TABLE tmp_table;

Another way ( faster ) is to use the external table which will save the time to load data into the tmp_table, but you will need to put the file into HDFS first:

CREATE EXTERNAL TABLE IF NOT EXISTS tmp_table (data STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
   LOCATION 'hdfs://hadoop-namenode:8020/directory_name';
 
CREATE TABLE compressed_table (data STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
   STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE compressed_table SELECT * FROM tmp_table; 

-- then drop the tmp table
DROP TABLE tmp_table;

Hope the new Hive release will make our life easier by fixing these limitations.

Leave a Reply

Your email address will not be published. Required fields are marked *