Oracle Number(1,0) field maps to Boolean in Spark

Recently I was working on a issue that when importing data from Oracle into Hive table using Spark, the data of type Number(1,0) in Oracle was implicitly converted into Boolean data type. Before was on CDH5.5.x, it worked correctly, however, after upgrading to CDH5.10.x, the issue happened. See below Hive table output after import:

Before upgrade:

SELECT column1 FROM test_table limit 2;

After upgrade:

SELECT column1 FROM test_table limit 2;

After digging further, I discovered that this change was introduced by SPARK-16625, due to the integration required for Spark to work correctly with Oracle.

Since the change was intended, the following is the suggested workarounds:

  1. Cast the Boolean to a type of your choosing in the Spark code, before writing it to the Hive table
  2. Make sure that the mapped column in Hive is also of compatible data type, for example, TinyInt, rather than String, so that the value of True or False will be mapped to 1 or 0 respectively, rather than string value of “True” or “False” (the reason that the column got “False” and “True” values were because the column was of String data type)

Hope above helps.

How to import BLOB data into HBase directly using Sqoop

Recently I was dealing with an issue that I was not able to import BLOB data correctly into HBase from Oracle database. All other columns were imported successfully, however, the BLOB column failed to appear in HBase table.

My test table has three columns, ID:int, DATA_S:VARCHAR2 and DATA_B:BLOB. The following was the original command that failed to import BLOB column:

sqoop import -Dsqoop.hbase.add.row.key=true \
    --connect jdbc:oracle:thin:@//<oracle-host>:1521/orcl \
    --username USERNAME --password password --table TEST_TABLE \ 
    --hbase-create-table --hbase-table test_case_no_bulkload \
    --column-family cf --split-by ID \
    --hbase-row-key ID

Result as below:

 1           column=cf:DATA_S, timestamp=1475219854006, value=Test
 1           column=cf:ID, timestamp=1475219854006, value=1
 2           column=cf:DATA_S, timestamp=1475219894990, value=Test MOre
 2           column=cf:ID, timestamp=1475219894990, value=2
2 row(s) in 0.7070 seconds

You can see that DATA_B column was missing in the destination HBase table.

The fix here is to use the bulkload for HBase, see the command below:

sqoop import -Dsqoop.hbase.add.row.key=true \
    --connect jdbc:oracle:thin:@//<oracle-host>:1521/orcl \
    --username USERNAME --password password --table TEST_TABLE \ 
    --hbase-create-table --hbase-table test_case_bulkload \
    --column-family cf --split-by ID \
    --hbase-row-key ID \

See the new result:

hbase(main):002:0> scan 'test_case_bulkload'
 1        column=cf:DATA_B, timestamp=1475220177891, value=2f 72 6f 6f 74 2f 31 30 39 33 33 31 2f 6c 65 69 73 61 5f 63 68 72 69 73 74 6d 61 73 5f 66 61 6c 73 65 5f 63 6f 6c 6f 72 2e 70 6e 67
 1        column=cf:DATA_S, timestamp=1475220177891, value=Test
 2        column=cf:DATA_B, timestamp=1475220177891, value=41 6e 6f 74 68 65 72 20 74 65 73 74
 2        column=cf:DATA_S, timestamp=1475220177891, value=Test MOre
2 row(s) in 0.0620 seconds

You can see that column DATA_B has been created in HBase. However, it comes to another problem. Do you notice the ID column in HBase is missing? We expect the ID column should be created as part of the column family because we specified “-Dsqoop.hbase.add.row.key=true” in the import command.

This is caused by a known issue that is reported by my colleague, see SQOOP-2952 for details.

Basically the issue is when using –hbase-bulkload, -Dsqoop.hbase.add.row.key=true will be ignored and the key will not be created as part of column family for the new HBase table.

If you do not care about the row key to be in the column family, then –hbase-bulkload is the solution for you, otherwise you will have to wait for SQOOP-2952 to be resolved.

Hope this helps.

Unable to import Oracle table with CLOB column into HDFS using Sqoop

If you encountered the following error while importing Oracle table into HDFS:

org.apache.sqoop.mapreduce.TextExportMapper: Exception: Could not buffer record
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.YarnChild$
at Method)
at org.apache.hadoop.mapred.YarnChild.main(
Caused by: java.lang.CloneNotSupportedException: com.cloudera.sqoop.lib.ClobRef
at java.lang.Object.clone(Native Method)
at org.apache.sqoop.lib.LobRef.clone(
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(

Chances are that you are hitting a Sqoop bug: SQOOP-117

At the time of writing, it is not fixed yet, but we can have a very simple workaround for this issue. Simply use –map-column-java option supported by Sqoop, information can be found on official Sqoop User Guide. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Hope this helps.