This article explains why Impala and Hive return different timestamp values on the same table that was created and value inserted from Hive. It also outlines the steps to force Impala to apply local time zone conversion when reading timestamp field stored in Parquet file format.
When Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads data out, it converts back to local time.
Impala, however on the other hand, does no conversion when reads the timestamp field out, hence, UTC time is returned instead of local time.
Both behaviors are by design and work in the right way. More information can be found at: TIMESTAMP Data Type
However, Impala can be set to apply the conversion as well to the timestamp field stored in Parquet file format (only available in Cloudera Manager 5.4), which is also mentioned in the link above. To do this, follow the steps below:
- Go to Impala Services home page
- Click on “Configuration“
- On the left side under “Filters“, click “Impala Daemon” under “Scope” and “Advanced” under “Category“
- Locate “Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve)“, and then enter the following:
- Save the changes
- Restart all Impala Daemons
To confirm that the change takes effect, follow the steps below:
- Go to Impala Home page
- Click on “Instances” tab
- Click on any “Impala Daemon” link (make sure you have restarted all of them)
- Under “Summary” > “Quick Links“, click on “Impala Daemon Web UI“
- A new page will open, click on the last tab on the top of the page named “/varz“
- Search “convert_legacy_hive_parquet_utc_timestamps” and confirm that it is set to “true”: –convert_legacy_hive_parquet_utc_timestamps=true
This enables Impala to do the time zone conversion when reading timestamp field from Parquet file.
Please be warned that this will have some performance hit if you go with this path, please refer to upstream Impala JIRA: IMPALA-3316 for more details.