Sqoop Teradata import truncates timestamp microseconds information

Last week, while I was working on Sqoop with Teradata, I noticed one bug that the microseconds part of a Timestamp field got truncated after importing into HDFS. The following is the steps to re-produce the issue:

1. Create a table in Teradata:

CREATE TABLE vmtest.test (a integer, b timestamp(6) 
FORMAT 'yyyy-mm-ddbhh:mi:ss.s(6)') PRIMARY INDEX (a);

INSERT INTO vmtest.test VALUES (1, '2017-03-14 15:20:20.711001');

2. And sqoop import command:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test --delete-target-dir \
    --as-textfile --fields-terminated-by "," --table test

3. data stored in HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711

Notice the microseconds part truncated from 711001 to 711

This is caused by a bug in TDCH (TeraData Connector for Hadoop) from Teradata, which is used by Cloudera Connector Powered by Teradata.

The workaround is to make sure that the timestamp value is in String format before passing it to Sqoop, so that no conversion will happen. Below Sqoop command is an example:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test \
    --delete-target-dir --as-textfile --fields-terminated-by "," \
    --query "SELECT a, cast(cast(b as format 'YYYY-MM-DD HH:MI:SS.s(6)') as char(40)) from test WHERE \$CONDITIONS" \
    --split-by a

After importing, data is stored in HDFS correctly:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711001

As mentioned above, this is a bug in Teradata connector, we have to wait for it to be fixed in TDCH. At the time of writing, the issue still exists in CDH5.8.x.

Leave a Reply

Your email address will not be published. Required fields are marked *