Today I discovered a bug that Hive can not recognise the existing data for a newly added column to a partitioned external table. In this post, I explained the steps to re-produced as well as the workaround to the issue.

Firstly I prepared the data in text format call test.txt, tab delimited:

col1	col2	col3
row21	row22	row23

Put the file under HDFS using hdfs user:

su - hdfs hdfs dfs -mkdir test
su - hdfs hdfs dfs -mkdir test/p=p1
su - hdfs hdfs dfs -put test.txt test/p=p1

Then I ran the following Hive commands (if you did the first step, you can simply copy all commands and the paste in Hive cli in one go):

DROP TABLE IF EXISTS test_external;
CREATE EXTERNAL TABLE test_external (col1 STRING) PARTITIONED BY (p string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/hdfs/test';
ALTER TABLE test_external ADD PARTITION (p='p1') LOCATION '/user/hdfs/test/p=p1';
ALTER TABLE test_external ADD COLUMNS (col2 STRING);
SELECT * FROM test_external;

And I got the following output:

col1	NULL	p1
row21	NULL	p1

You can see that the output shows the second column “col2” are NULL.

This confirms the bug. The bug was reported on 13th of Jan, 2014, but still not yet fixed. If you are interested, you can have a look at New columns after table alter result in null values despite data.

There are two choices as workarounds:

1. We can DROP the partition and the re”ADD” the partition to trick hive to read it properly (because it is an EXTERNAL table):

ALTER TABLE test_external DROP PARTITION (p='p1');
ALTER TABLE test_external ADD PARTITION (p='p1') LOCATION '/user/hdfs/test/p=p1';
SELECT * FROM test_external;

Now we can see Hive gives us correct output:

col1	col2	p1
row21	row22	p1

2. In Hive 1.1, which was shipped with CDH5.4, comes with a new feature to apply a new column to individual partitions as well as ALL partitions. This feature indirectly fixes the issue we mentioned in this post.

For more information:

Support partial partition spec for certain ALTER PARTITION statements
Support “alter table .. add/replace columns cascade”

So, in the 5th query we run in the original test:

ALTER TABLE test_external ADD COLUMNS (col2 STRING);

we can simply add word “CASCADE” to the end:

ALTER TABLE test_external ADD COLUMNS (col2 STRING) CASCADE;

Now run all the commands again, with modified queries (actually just added one word):

DROP TABLE IF EXISTS test_external;
CREATE EXTERNAL TABLE test_external (col1 STRING) PARTITIONED BY (p string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/hdfs/test';
ALTER TABLE test_external ADD PARTITION (p='p1') LOCATION '/user/hdfs/test/p=p1';
ALTER TABLE test_external ADD COLUMNS (col2 STRING) CASCADE;
SELECT * FROM test_external;

We will get:

col1	col2	p1
row21	row22	p1

Same as choice 1. Of course, this has to be run in Hive 1.1 and above, otherwise the second last command will fail.

    2 Comments

    1. hivers

      nice blog ERIC..!! i m newbie to hive. i hv kind of same issue. i have added a column and shows NULL value. i want to put DATE part from timestamp column to newly created column. i tried with below query:

      ALTER TABLE table_2 ADD COLUMNS(DATE_COL string);
      insert into table_2 (DATE_COL) as select substring(TIMESTAMP_COL, -19, 10) from table_1 ;

      this is working bt still it shows NULL values in newly created date_col.

      table_1 has 13 columns, table_2 has 14 columns (13 + DATE_COL).

      TIMESTAMP_COL :- STRING

      DATE_COL – STRING.

      please tell me how to solve this problem.

    Leave a Reply

    Your email address will not be published. Required fields are marked *