Unable to query Hive parquet table after altering column type

Currently, Hive does not support changing column types for Parquet tables, due to performance concerns with the proposed fix (more on that below). I have put together the following test case to reproduce the bug:

DROP TABLE IF EXISTS test;
CREATE TABLE test (a INT, b DOUBLE) STORED AS PARQUET;
INSERT OVERWRITE TABLE test VALUES (1000, 1000);
SELECT * FROM test;
ALTER TABLE test CHANGE a a DOUBLE;
SELECT * FROM test;

The following is the output:

0: jdbc:hive2://10.17.80.41:10000/default> drop table if exists test;
No rows affected (0.408 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> create table test (a int, b double) stored as parquet;
No rows affected (0.28 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> insert overwrite table test values (1000, 1000);
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1444272676566_0015
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.80.40:8020, Ident: (HDFS_DELEGATION_TOKEN token 78 for hive)
INFO  : The url to track the job: http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0015/
INFO  : Starting Job = job_1444272676566_0015, Tracking URL = http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0015/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1444272676566_0015
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2015-10-08 04:14:07,188 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-10-08 04:14:17,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.49 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 490 msec
INFO  : Ended Job = job_1444272676566_0015
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Moving data to: hdfs://host-10-17-80-40.coe.cloudera.com:8020/user/hive/warehouse/test/.hive-staging_hive_2015-10-08_04-13-54_259_91084841227249311-1/-ext-10000 from hdfs://host-10-17-80-40.coe.cloudera.com:8020/user/hive/warehouse/test/.hive-staging_hive_2015-10-08_04-13-54_259_91084841227249311-1/-ext-10002
INFO  : Loading data to table default.test from hdfs://host-10-17-80-40.coe.cloudera.com:8020/user/hive/warehouse/test/.hive-staging_hive_2015-10-08_04-13-54_259_91084841227249311-1/-ext-10000
INFO  : Table default.test stats: [numFiles=1, numRows=1, totalSize=272, rawDataSize=2]
No rows affected (25.223 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> select * from test;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1444272676566_0016
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.80.40:8020, Ident: (HDFS_DELEGATION_TOKEN token 79 for hive)
INFO  : The url to track the job: http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0016/
INFO  : Starting Job = job_1444272676566_0016, Tracking URL = http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0016/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1444272676566_0016
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2015-10-08 04:14:36,141 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-10-08 04:14:46,540 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.82 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 820 msec
INFO  : Ended Job = job_1444272676566_0016
+-------+---------+--+
|   a   |    b    |
+-------+---------+--+
| 1000  | 1000.0  |
+-------+---------+--+
1 row selected (28.204 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> alter table test change a a double;
No rows affected (0.378 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> select * from test;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1444272676566_0017
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.80.40:8020, Ident: (HDFS_DELEGATION_TOKEN token 81 for hive)
INFO  : The url to track the job: http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0017/
INFO  : Starting Job = job_1444272676566_0017, Tracking URL = http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0017/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1444272676566_0017
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2015-10-08 04:15:04,794 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-10-08 04:15:39,006 Stage-1 map = 100%,  reduce = 0%
ERROR : Ended Job = job_1444272676566_0017 with errors
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
0: jdbc:hive2://10.17.80.41:10000/default>

After some research, I found two JIRAs tracking this issue: HIVE-6784 and HIVE-12080.

The original issue, HIVE-6784, was created a while back; however, the upstream developers rejected the patch due to concerns about its performance impact, so the issue was never fixed.

A newer JIRA, HIVE-12080, was created recently. It is essentially the same as HIVE-6784, but it requests that the issue be fixed without introducing any performance hit.

So, at the time of writing, there is no fix for this problem yet. The workaround is to regenerate the table by running "INSERT OVERWRITE {table_name} SELECT * FROM {table_name}", so that the underlying Parquet files are rewritten with the new column type.

For the time being, just regenerate the table and you will be good to go.
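As a sketch, the workaround applied to the `test` table from the reproduction above would look like this (depending on your Hive version, you may need an explicit CAST on the changed column if the plain rewrite cannot read the old files):

```sql
-- Rewrite the table in place so the Parquet files are regenerated
-- with the current column types from the metastore schema.
INSERT OVERWRITE TABLE test SELECT * FROM test;

-- Once the files are rewritten, queries against the table work again.
SELECT * FROM test;
```

Note that INSERT OVERWRITE replaces all existing data in the table, so for large tables this rewrite can be expensive; schedule it accordingly.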
