Today I discovered a bug in Hive: when a user submits an "INSERT OVERWRITE" query with dynamic partitioning, Hive does not take an "exclusive" lock on the underlying table; it only takes a "shared" lock. To confirm the problem, I created a simple table:
CREATE TABLE test (a int) PARTITIONED BY (p string) STORED AS TEXTFILE;
Then I issued a dynamic partitioning query:
INSERT OVERWRITE TABLE test PARTITION (p) SELECT COUNT(1), 'p1' FROM sample_table;
Then, in another terminal, I checked the lock status:
SHOW LOCKS test;
The result was:
default@test	SHARED
Finally, in another terminal, I issued the same INSERT OVERWRITE again:
INSERT OVERWRITE TABLE test PARTITION (p) SELECT COUNT(1), 'p1' FROM sample_table;
Both INSERT OVERWRITE statements ran concurrently, which is not correct: two overwrites racing on the same partition can easily corrupt data if you are not careful. I confirmed this under CDH 5.3 and the latest CDH 5.4, and I have filed a JIRA issue for the engineering team to fix it. There is no simple workaround at this stage; just be mindful of this issue until the next update arrives.
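To see why a shared lock is not enough here, the following toy Python simulation (my own sketch, not Hive code) models an overwrite as "delete the old files, then write the new ones". Under shared locks both writers can interleave those two steps, leaving a mix of both writers' output; an exclusive lock serializes them so exactly one overwrite wins. The file names and writer IDs are invented for illustration.

```python
import threading

# Toy model of a partition: a list of data files.
def overwrite_steps(files, writer_id):
    """An INSERT OVERWRITE is roughly: delete old files, then write new
    ones. Yield after each step so two writers can be interleaved."""
    files.clear()                                  # step 1: drop old data
    yield
    files.append(f"part-0000_writer{writer_id}")   # step 2: write new data
    yield

# SHARED locks admit both writers at once; interleave their steps.
mixed = []
w1, w2 = overwrite_steps(mixed, 1), overwrite_steps(mixed, 2)
next(w1)          # writer 1 deletes
next(w2)          # writer 2 deletes
next(w2, None)    # writer 2 writes its file
next(w1, None)    # writer 1 writes its file on top
# The partition now mixes both writers' output -- neither overwrite "won".
print(mixed)      # ['part-0000_writer2', 'part-0000_writer1']

# An EXCLUSIVE lock serializes the writers: exactly one file survives.
clean = []
lock = threading.Lock()

def locked_overwrite(writer_id):
    with lock:
        for _ in overwrite_steps(clean, writer_id):
            pass

threads = [threading.Thread(target=locked_overwrite, args=(i,)) for i in (1, 2)]
for t in threads: t.start()
for t in threads: t.join()
print(clean)      # a single file, e.g. ['part-0000_writer2']
```

This is exactly the hazard in the test above: with only a SHARED lock on the table, nothing stops the two INSERT OVERWRITE jobs from interleaving their delete and write phases.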
