Another bug identified today in Impala while helping customers solving a weird Impala issue.

The problem is that “SHOW COLUMN STATS” command in Impala shows incorrect stats information, either shows “-1” for distinct values or the number is not matching with real distinct values:

query: show column stats test
+--------+------------+------------------+--------+----------+----------+
| Column | Type       | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+------------+------------------+--------+----------+----------+
| a      | INT        | 3                | -1     | 4        | 4        |
| b      | INT        | 3                | -1     | 4        | 4        |
| c      | TIMESTAMP  | -1               | -1     | 4        | 4        |
+--------+------------+------------------+--------+----------+----------+

Identified this is a bug in Hive HMS API when renaming tables, and old table name’s stats becomes stale. This bug exists in CDH5.3.x. In CDH5.4.x, column stats got deleted after a table is renamed.

So, to avoid facing this issue:

1) Avoid renaming tables
2) If renaming tables can’t be avoided, manually remove column stats from table “TAB_COL_STATS” in the HMS DB after table is renamed

DELETE FROM "TAB_COL_STATS" WHERE "TABLE_NAME" = '<table_name>' AND "TBL_ID" = <table_id>;

And then re-compute the stats for the table. You can find out the “TBL_ID” from “TBLS” table in the HMS DB.

Please refer to the follow upstream JIRA issues:

HIVE-9720: Metastore does not properly migrate column stats when renaming a table across databases.
HIVE-9866: Changing a column’s type doesn’t change column stats type in metastore

Hope this helps.

Leave a Reply

Your email address will not be published. Required fields are marked *