Sqoop Teradata import truncates timestamp microseconds information

Last week, while I was working on Sqoop with Teradata, I noticed a bug where the microseconds part of a Timestamp field was truncated after importing into HDFS. The following steps reproduce the issue:

1. Create a table in Teradata:

CREATE TABLE vmtest.test (a integer, b timestamp(6) 
FORMAT 'yyyy-mm-ddbhh:mi:ss.s(6)') PRIMARY INDEX (a);

INSERT INTO vmtest.test VALUES (1, '2017-03-14 15:20:20.711001');

2. Run the Sqoop import command:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test --delete-target-dir \
    --as-textfile --fields-terminated-by "," --table test

3. The data is stored in HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711

Notice that the microseconds part was truncated from 711001 to 711.

This is caused by a bug in TDCH (Teradata Connector for Hadoop), which is used by the Cloudera Connector Powered by Teradata.
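
To illustrate the kind of precision loss involved, here is a simplified, hypothetical sketch (not the actual TDCH code path): java.sql.Timestamp carries fractional seconds down to nanoseconds, but any conversion that goes through a millisecond-granularity representation, such as a SimpleDateFormat pattern ending in .SSS, keeps only the first 3 fractional digits:

import java.sql.Timestamp;
import java.text.SimpleDateFormat;

public class TimestampPrecisionDemo {
    public static void main(String[] args) {
        // Full microsecond value, as stored in a Teradata TIMESTAMP(6) column
        Timestamp ts = Timestamp.valueOf("2017-03-14 15:20:20.711001");

        // Timestamp.toString() preserves all fractional digits
        System.out.println(ts);             // 2017-03-14 15:20:20.711001

        // A millisecond-granularity format keeps only 3 fractional digits
        SimpleDateFormat ms = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        System.out.println(ms.format(ts));  // 2017-03-14 15:20:20.711
    }
}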

The workaround is to convert the timestamp value to a string on the Teradata side before passing it to Sqoop, so that no conversion happens inside the connector. The Sqoop command below is an example:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test \
    --delete-target-dir --as-textfile --fields-terminated-by "," \
    --query "SELECT a, cast(cast(b as format 'YYYY-MM-DD HH:MI:SS.s(6)') as char(40)) from test WHERE \$CONDITIONS" \
    --split-by a

After importing, the data is stored in HDFS correctly:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711001

As mentioned above, this is a bug in the Teradata connector, and we have to wait for it to be fixed in TDCH. At the time of writing, the issue still exists in CDH 5.8.x.

Sqoop Teradata import truncates timestamp fractional seconds information

In the last few weeks, I have been working with Teradata Support on a Sqoop issue where a value of the Timestamp(6) data type in Teradata loses the last 3 digits of its fractional seconds after being imported into HDFS using a Sqoop command.

The following test case reproduces the issue:

  1. Create a table in Teradata:

    CREATE TABLE vmtest.test (a integer, b timestamp(6)
    FORMAT 'yyyy-mm-ddbhh:mi:ss.s(6)') PRIMARY INDEX (a);
    
    INSERT INTO vmtest.test VALUES (1, '2016-04-05 11:27:24.699022');
    
  2. Run the Sqoop import command:

    sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
        --username dbc --password dbc --target-dir /tmp/test --delete-target-dir \
        --as-textfile --fields-terminated-by "," --table test
    
  3. The data is stored in HDFS as below:

    [cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
    1,2016-04-05 11:27:24.699
    

Notice that the fractional seconds were truncated from 699022 to 699.

This is caused by a bug in TDCH (Teradata Connector for Hadoop), which is used by the Cloudera Connector Powered by Teradata.

The workaround is to convert the timestamp value to a string on the Teradata side before passing it to Sqoop, so that no conversion happens inside the connector. The Sqoop command below is an example:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test \
    --delete-target-dir --as-textfile --fields-terminated-by "," \
    --query "SELECT a, cast(cast(b as format 'YYYY-MM-DD HH:MI:SS.s(6)') as char(40)) from test WHERE \$CONDITIONS" \
    --split-by a

After importing, the data is stored in HDFS correctly:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2016-04-05 11:27:24.699022

Sqoop job failed with “Malformed option in options file” error

Recently I discovered a bug in Sqoop where it wrongly detects a malformed query in an options file, even though the query is actually correct. Take the following Sqoop command as an example:

sqoop --options-file /path/to/options-file.txt

where /path/to/options-file.txt contains the following content:

import
--connect
jdbc:mysql://......
--username
XXXXXXXX
--password
XXXXXXXX
--query
SELECT * FROM test_table WHERE a = 'b'
....

Sqoop will fail with the following error:

16/11/22 16:08:59 ERROR sqoop.Sqoop: Error while expanding arguments
java.lang.Exception: Malformed option in options file(/path/to/options-file.txt): 
SELECT * FROM test_table WHERE a = 'b'
at org.apache.sqoop.util.OptionsFileUtil.removeQuoteCharactersIfNecessary(OptionsFileUtil.java:170)
at org.apache.sqoop.util.OptionsFileUtil.removeQuotesEncolosingOption(OptionsFileUtil.java:136)
at org.apache.sqoop.util.OptionsFileUtil.expandArguments(OptionsFileUtil.java:90)
at com.cloudera.sqoop.util.OptionsFileUtil.expandArguments(OptionsFileUtil.java:33)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:199)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Malformed option in options file(/path/to/options-file.txt): SELECT * FROM test_table WHERE a = 'b'

This is caused by a bug in the Sqoop code that checks each option in the options file for beginning and ending quotes. If an option does not start with a quote but ends with one, Sqoop fails with the error above.

In the example above, the query ends with a single quote (the one after 'b'), so Sqoop wrongly treats it as a malformed option and throws an exception. Take a look at the function that performs the quote check:

https://github.com/apache/sqoop/blob/release-1.4.6-rc3/src/java/org/apache/sqoop/util/OptionsFileUtil.java#L156-L175

  private static String removeQuoteCharactersIfNecessary(String fileName,
      String option, char quote) throws Exception {
    boolean startingQuote = (option.charAt(0) == quote);
    boolean endingQuote = (option.charAt(option.length() - 1) == quote);

    if (startingQuote && endingQuote) {
      if (option.length() == 1) {
        throw new Exception("Malformed option in options file("
            + fileName + "): " + option);
      }
      return option.substring(1, option.length() - 1);
    }

    if (startingQuote || endingQuote) {
       throw new Exception("Malformed option in options file("
           + fileName + "): " + option);
    }

    return option;
  }

Based on the above code, a Sqoop command using an options file will fail if any option either only starts or only ends with a quote (single or double). For the query in our options file, startingQuote is false (the option starts with 'S') but endingQuote is true (the trailing single quote after b), so the second check throws the exception. I have created the upstream JIRA SQOOP-3061 for this bug and provided a patch to fix the issue.

Currently, the workaround is to NOT end the query with a quote (either double or single), so simply adding something like “AND 1=1” will resolve the issue.
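
For example, the options file from above (with the same placeholder values) would become:

import
--connect
jdbc:mysql://......
--username
XXXXXXXX
--password
XXXXXXXX
--query
SELECT * FROM test_table WHERE a = 'b' AND 1=1
....

The query option now neither starts nor ends with a quote, so removeQuoteCharactersIfNecessary returns it unchanged.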

Does Sqoop Mainframe Import Support SFTP?

Currently Sqoop leverages the Apache Commons Net FTP library, with some customisations to handle the way a mainframe returns FTP listings. The current Apache Commons Net library does NOT support SFTP, but it does support FTPS.

FTPS and SFTP are very different protocols so we would have to use a different underlying library to incorporate SFTP capability into the mainframe module:

  • FTPS: FTPS is FTP with SSL for security. It uses a control channel and opens new connections for the data transfer. As it uses SSL, it requires a certificate.
  • SFTP: SFTP (SSH File Transfer Protocol/Secure File Transfer Protocol) was designed as an extension of SSH to provide file transfer capability, so it usually uses only the SSH port for both data and control.

In most SSH server installations you will have SFTP support, but FTPS would need the additional configuration of a supported FTP server.

However, even though Apache Commons Net supports FTPS, Sqoop currently only uses the class org.apache.commons.net.ftp.FTPClient, NOT org.apache.commons.net.ftp.FTPSClient, which is the class that provides FTPS support. Please check the source code in MainframeFTPClientUtils.java.

So, the conclusion is that currently Sqoop does NOT support either FTPS or SFTP, and the only protocol it supports is FTP.
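
For reference, the difference is visible at the API level. The sketch below is only a hypothetical illustration of plain FTP versus FTPS with Commons Net (host and credentials are made up); it is not code from Sqoop's mainframe module:

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPSClient;

public class FtpVsFtpsSketch {
    public static void main(String[] args) throws Exception {
        // What Sqoop's mainframe module uses today: plain FTP
        FTPClient ftp = new FTPClient();
        ftp.connect("mainframe-host");       // hypothetical host
        ftp.login("user", "password");
        ftp.disconnect();

        // What FTPS would roughly look like: FTPSClient extends FTPClient,
        // so listing and retrieval code stays the same, but the control
        // connection is wrapped in SSL/TLS
        FTPSClient ftps = new FTPSClient();  // explicit FTPS
        ftps.connect("mainframe-host");
        ftps.login("user", "password");
        ftps.execPBSZ(0);                    // protection buffer size
        ftps.execPROT("P");                  // encrypt the data channel too
        ftps.disconnect();
    }
}

SFTP, on the other hand, is not covered by Commons Net at all, which is why a different underlying library would be needed for it.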

How to import BLOB data into HBase directly using Sqoop

Recently I was dealing with an issue where I was not able to import BLOB data correctly into HBase from an Oracle database. All other columns were imported successfully; however, the BLOB column failed to appear in the HBase table.

My test table has three columns: ID (int), DATA_S (VARCHAR2) and DATA_B (BLOB). The following is the original command, which failed to import the BLOB column:

sqoop import -Dsqoop.hbase.add.row.key=true \
    --connect jdbc:oracle:thin:@//<oracle-host>:1521/orcl \
    --username USERNAME --password password --table TEST_TABLE \
    --hbase-create-table --hbase-table test_case_no_bulkload \
    --column-family cf --split-by ID \
    --hbase-row-key ID

Result as below:

ROW          COLUMN+CELL
 1           column=cf:DATA_S, timestamp=1475219854006, value=Test
 1           column=cf:ID, timestamp=1475219854006, value=1
 2           column=cf:DATA_S, timestamp=1475219894990, value=Test MOre
 2           column=cf:ID, timestamp=1475219894990, value=2
2 row(s) in 0.7070 seconds

You can see that the DATA_B column is missing in the destination HBase table.

The fix here is to use HBase bulk load; see the command below:

sqoop import -Dsqoop.hbase.add.row.key=true \
    --connect jdbc:oracle:thin:@//<oracle-host>:1521/orcl \
    --username USERNAME --password password --table TEST_TABLE \
    --hbase-create-table --hbase-table test_case_bulkload \
    --column-family cf --split-by ID \
    --hbase-row-key ID \
    --hbase-bulkload

See the new result:

hbase(main):002:0> scan 'test_case_bulkload'
ROW       COLUMN+CELL
 1        column=cf:DATA_B, timestamp=1475220177891, value=2f 72 6f 6f 74 2f 31 30 39 33 33 31 2f 6c 65 69 73 61 5f 63 68 72 69 73 74 6d 61 73 5f 66 61 6c 73 65 5f 63 6f 6c 6f 72 2e 70 6e 67
 1        column=cf:DATA_S, timestamp=1475220177891, value=Test
 2        column=cf:DATA_B, timestamp=1475220177891, value=41 6e 6f 74 68 65 72 20 74 65 73 74
 2        column=cf:DATA_S, timestamp=1475220177891, value=Test MOre
2 row(s) in 0.0620 seconds

You can see that column DATA_B has now been created in HBase. However, this leads to another problem: do you notice that the ID column is missing in HBase? We expect the ID column to be created as part of the column family, because we specified “-Dsqoop.hbase.add.row.key=true” in the import command.

This is caused by a known issue reported by my colleague; see SQOOP-2952 for details.

Basically, the issue is that when --hbase-bulkload is used, -Dsqoop.hbase.add.row.key=true is ignored and the row key is not created as part of the column family in the new HBase table.

If you do not need the row key to be duplicated inside the column family, then --hbase-bulkload is the solution for you; otherwise you will have to wait for SQOOP-2952 to be resolved.
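
Note that even with --hbase-bulkload, the ID is not lost: it is stored as the HBase row key itself. Below is a minimal sketch using the standard HBase Java client (assuming the table and column family names from the example above) that reads a row back and recovers the ID from the row key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadBulkloadedRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test_case_bulkload"))) {

            Result result = table.get(new Get(Bytes.toBytes("1")));

            // The ID is not stored under the cf column family,
            // but it is available as the row key
            String id = Bytes.toString(result.getRow());
            String dataS = Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("DATA_S")));
            byte[] dataB = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("DATA_B"));

            System.out.println("row key (ID) = " + id);      // 1
            System.out.println("DATA_S = " + dataS);         // Test
            System.out.println("DATA_B length = " + dataB.length);
        }
    }
}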

Hope this helps.