Locate Hive Tables with Same HDFS Location

When you create Hive table, you can specify the LOCATION of the table, regardless if the table is managed or external table. The syntax is like below:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  ....
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]

If LOCATION clause is omitted, then Hive will use default location under /user/hive/warehouse and then use the DB and Table name specified in the table create statement.

Hive, however, will not stop you from creating different tables that link to the same HDFS location. This can cause issue sometimes if user make mistakes.

Currently, Hive provides no functionality to allow user to report back on how many tables have the same HDFS location, in case he/she wants to do clean up and make sure if tables were created correctly.

The only way I can see is to query the backend database that Hive uses, typically MySQL, Oracle and PostgreSQL etc.

Below is the query that you can run to retrieve the list of tables in Hive that have the same HDFS location:

SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE, s.LOCATION 
    FROM TBLS t 
    JOIN SDS s ON (t.SD_ID = s.SD_ID) 
    JOIN DBS d on (t.DB_ID = d.DB_ID) 
    JOIN (
        SELECT LOCATION 
        FROM SDS 
        GROUP BY LOCATION 
        HAVING COUNT(*) > 1 AND LOCATION IS NOT NULL
    ) l ON (l.LOCATION = s.LOCATION);

Please note that this query was based on MySQL, you might need to adjust it accordingly if you use other Databases.

I have created some tables and my test result looks like below:

+-------------+----------+----------------+---------------------------------------------------------+
| NAME        | TBL_NAME | TBL_TYPE       | LOCATION                                                |
+-------------+----------+----------------+---------------------------------------------------------+
| db1         | test     | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| default     | mytest   | EXTERNAL_TABLE | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| db2         | test3    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
| db2         | test4    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
+-------------+----------+----------------+---------------------------------------------------------+
4 rows in set (0.13 sec)

Hope above query can help!

Leave a Reply

Your email address will not be published. Required fields are marked *