Incremental Stats got reset in Impala

Incremental Stats got reset in Impala

Today I found an Impala bug that you will lose stats in both Impala and Hive in CDH5.3.x, to re-produce this bug, I did the following steps:
-- create table in hive
CREATE TABLE test (a int) PARTITIONED BY (d int) STORED AS PARQUETFILE; 
INSERT OVERWRITE TABLE test PARTITION (d) SELECT a, 1 AS d FROM test_beeline2 WHERE d = 2 LIMIT 1;

-- impala
[10.17.74.162:21000] > SHOW PARTITIONS test;
Query: show partitions test
+-------+-------+--------+------+--------------+--------+-------------------+
| d     | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
Fetched 1 row(s) in 0.03s
[10.17.74.162:21000] > INVALIDATE METADATA;
Query: invalidate metadata

Fetched 0 row(s) in 4.49s
[10.17.74.162:21000] > SHOW PARTITIONS test;
Query: show partitions test
+-------+-------+--------+------+--------------+---------+-------------------+
| d     | #Rows | #Files | Size | Bytes Cached | Format  | Incremental stats |
+-------+-------+--------+------+--------------+---------+-------------------+
| 1     | 1     | 1      | 130B | NOT CACHED   | PARQUET | false             |
| Total | -1    | 1      | 130B | 0B           |         |                   |
+-------+-------+--------+------+--------------+---------+-------------------+
Fetched 2 row(s) in 0.63s
[10.17.74.162:21000] > COMPUTE INCREMENTAL STATS test PARTITION (d=1);
Query: compute incremental stats test partition (d=1)
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 3.50s
[10.17.74.162:21000] > SHOW PARTITIONS test;
Query: show partitions test
+-------+-------+--------+------+--------------+---------+-------------------+
| d     | #Rows | #Files | Size | Bytes Cached | Format  | Incremental stats |
+-------+-------+--------+------+--------------+---------+-------------------+
| 1     | 1     | 1      | 130B | NOT CACHED   | PARQUET | true              |
| Total | 1     | 1      | 130B | 0B           |         |                   |
+-------+-------+--------+------+--------------+---------+-------------------+
Fetched 2 row(s) in 0.02s

-- INSERT data from Hive
INSERT OVERWRITE TABLE test PARTITION (d) SELECT a, 2 as d FROM test_beeline2 WHERE d = 2 LIMIT 2;

-- impala
[10.17.74.162:21000] > INVALIDATE METADATA;
Query: invalidate metadata

Fetched 0 row(s) in 4.00s
[10.17.74.162:21000] > SHOW PARTITIONS test;
Query: show partitions test
+-------+-------+--------+------+--------------+---------+-------------------+
| d     | #Rows | #Files | Size | Bytes Cached | Format  | Incremental stats |
+-------+-------+--------+------+--------------+---------+-------------------+
| 1     | -1    | 1      | 130B | NOT CACHED   | PARQUET | true              |
| 2     | 2     | 1      | 130B | NOT CACHED   | PARQUET | false             |
| Total | 1     | 2      | 260B | 0B           |         |                   |
+-------+-------+--------+------+--------------+---------+-------------------+
Fetched 3 row(s) in 0.02s
You can see that stats got cleared when you INVALIDATE METADATA in Impala. This is caused by when Hive hive.stats.autogather is set to true, hive generates partition stat (filecount, row count, etc.) after creating it. If you run “compute incremental stats” in impala again. you will get the same RowCount, so the following check will not be satisfied and StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK will not be set.
      // Update table stats
      if (existingRowCount == null || !existingRowCount.equals(newRowCount)) {
        // The existing row count value wasn't set or has changed.
        msPartition.putToParameters(StatsSetupConst.ROW_COUNT, newRowCount);
        msPartition.putToParameters(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK,
            StatsSetupConst.TRUE);
        updatedPartition = true;
      }
Later in Hive, partition will be reset:
    public static boolean updatePartitionStatsFast(PartitionSpecProxy.PartitionIterator part, Warehouse wh,
      boolean madeDir, boolean forceRecompute) throws MetaException {
...
        if(!params.containsKey(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK)) {
          // invalidate stats requiring scan since this is a regular ddl alter case
          for (String stat : StatsSetupConst.statsRequireCompute) {
            params.put(stat, "-1");
          }
          params.put(StatsSetupConst.COLUMN_STATS_ACCURATE, StatsSetupConst.FALSE);
        }
so if partition stats already exists but not computed by impala, compute incremental stats will cause stats been reset back to -1. The workaround is to just COMPUTE STATS once, either in Hive by setting “SET hive.stats.autogather=true;”, or setting “SET hive.stats.autogather=false;” and run compute incremental stats in Impala. CDH5.4.x should have contained the fix so it won’t happen in latest versions. This was in-directly fixed by HIVE-8250 – Truncating table doesnt invalidate stats.

Loading

Leave a Reply

Your email address will not be published. Required fields are marked *

My new Snowflake Blog is now live. I will not be updating this blog anymore but will continue with new contents in the Snowflake world!