When working with Hive, there are many scenarios where we need to move data from one cluster to another. For instance, we sometimes need to copy production data into an alpha/dev environment for testing. Luckily, Hive provides two easy commands to do it.
Since version 0.8, Hive has supported EXPORT and IMPORT commands, which allow you to export a table's metadata together with its data to a directory in HDFS; that directory can then be imported into another database or Hive instance.
The command looks like this:
EXPORT TABLE <table_name> TO 'path/to/hdfs';
Please note that the path should not start with a "/"; otherwise Hive will complain about an invalid path or that the target is not an empty directory. The data will be written to the /user/<user_name>/path/to/hdfs directory in HDFS (which, of course, must be writable by the current user).
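As a concrete sketch (the employee and page_view table names, the partition column, and the paths here are all hypothetical), EXPORT also accepts an optional PARTITION clause when you only need part of a table:

```sql
-- Export an entire table; the data and metadata land under
-- /user/<user_name>/tmp/export/employee in HDFS.
EXPORT TABLE employee TO 'tmp/export/employee';

-- Export just one partition of a partitioned table.
EXPORT TABLE page_view PARTITION (dt='2014-01-01')
TO 'tmp/export/page_view';
```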
The next step is to copy the data across to the other cluster; for that you can use the "distcp" command from Hadoop:
hadoop distcp hdfs://<source_namenode>:8020/path/to/hdfs hdfs://<target_namenode>:8020/path/to/hdfs
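For example (the namenode host names, user directory, and the 8020 port below are placeholders for your own clusters), a transfer might look like this; the "-update" flag lets you safely re-run the copy if it was interrupted:

```shell
# Copy the exported directory from the source cluster to the target cluster.
# Host names, port, and paths are placeholders; adjust them for your clusters.
hadoop distcp -update \
    hdfs://source-nn.example.com:8020/user/alice/tmp/export/employee \
    hdfs://target-nn.example.com:8020/user/alice/tmp/export/employee
```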
Once the copy is finished, you can then use the IMPORT command on the new cluster to load the data into a new table:
IMPORT TABLE <table_name> FROM 'path/to/another/hdfs';
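IMPORT is also a bit more flexible than the basic form above: per the Hive wiki, you can rename the table on import or import it as an external table (the table names and location here are hypothetical):

```sql
-- Import the exported data under a new table name.
IMPORT TABLE employee_copy FROM 'tmp/export/employee';

-- Import as an external table, placing the data at an explicit location.
IMPORT EXTERNAL TABLE employee_ext FROM 'tmp/export/employee'
LOCATION '/warehouse/employee_ext';
```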
Now you will have exactly the same data in the new cluster as in the old one. The transfer should be smooth; the only caveat is that if the data is large, "distcp" may take a while to copy it across.
For more information, you can have a look at the Apache Hive Confluence page on EXPORT and IMPORT.