My Patch for SQOOP-3330 Committed

In the last few weeks, I have been working on a Sqoop issue where the "--append" option does not work well with the "-Dmapreduce.output.basename" parameter. The goal of setting "-Dmapreduce.output.basename" is to ask Sqoop to generate custom file names for the target files in HDFS, rather than using the default "part" prefix for all files.

However, Sqoop has a bug: when moving files from the temporary directory to the target directory in HDFS, it does not respect the mapreduce.output.basename setting, with the result that no data is moved to the final destination.

This can be confirmed by turning on DEBUG logging for Sqoop (using --verbose):

sqoop import -Dmapreduce.output.basename="eric-test" --connect jdbc:mysql://mysql-host.com/test --username root --password 'root' --table test --target-dir /tmp/ericlin-test/sqoop/test --fields-terminated-by '\t' --verbose --append

18/05/28 22:24:44 INFO util.AppendUtils: Appending to directory test
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: _SUCCESS ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00000 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00001 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00002 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Deleting temporary folder 14935e396acc4ea7b9a6236c66064c9b_test

From the output, you can see that Sqoop ignored all the files generated with the prefix "eric-test".
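The filtering behaviour at fault can be illustrated with a simplified sketch (plain Python for illustration, not the actual AppendUtils source; the function name and pattern details are assumptions): files are appended only when they match the expected output basename, so with the hard-coded default "part", every "eric-test-*" file is skipped.

```python
import re

def files_to_append(filenames, basename="part"):
    """Return the map/reduce output files that should be moved into
    the target directory, matching the configured output basename."""
    # MapReduce names its output files <basename>-m-<task id> (map-only
    # jobs) or <basename>-r-<task id> (reduces), with a 5-digit task id.
    pattern = re.compile(re.escape(basename) + r"-[mr]-\d{5}")
    return [name for name in filenames if pattern.match(name)]
```

Honouring mapreduce.output.basename, as the SQOOP-3330 patch does, amounts to passing the configured basename through instead of relying on the hard-coded default.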

I submitted an upstream JIRA, SQOOP-3330, and after a few review cycles my patch was finally accepted and committed to Sqoop's trunk code base. I am looking forward to the fix being backported into CDH in the near future.

My Patch for SQOOP-3042 Committed

I have received a lot of complaints from Cloudera customers that after a Sqoop job finishes, the generated table class and JAR files are not cleaned up. By default, they are saved under /tmp/sqoop-{username}/compile, to be used by the currently running job. They are not needed after the job finishes, so they should be cleaned up.

The content of the directory looks like below:

[root@localhost ~]# ll /tmp/sqoop-hadoop/compile/
total 16
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:56 1496d8f8400052af2a7d3ede2cfe496d
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:45 6360b964ea0c1fdf6bf6aaed7a35b986
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:45 d4ccb83934494ba2874b5c6d1b51d2ac
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:50 df37a566defbfac477f6f309cf227dec
[root@localhost ~]# ll /tmp/sqoop-hadoop/compile/1496d8f8400052af2a7d3ede2cfe496d
total 56
-rw-rw-r--. 1 hadoop hadoop   620 Jun  6 08:56 SQOOP_3042$1.class
-rw-rw-r--. 1 hadoop hadoop   617 Jun  6 08:56 SQOOP_3042$2.class
-rw-rw-r--. 1 hadoop hadoop   620 Jun  6 08:56 SQOOP_3042$3.class
-rw-rw-r--. 1 hadoop hadoop   516 Jun  6 08:56 SQOOP_3042.avsc
-rw-rw-r--. 1 hadoop hadoop 10389 Jun  6 08:56 SQOOP_3042.class
-rw-rw-r--. 1 hadoop hadoop   237 Jun  6 08:56 SQOOP_3042$FieldSetterCommand.class
-rw-rw-r--. 1 hadoop hadoop  6063 Jun  6 08:56 SQOOP_3042.jar
-rw-rw-r--. 1 hadoop hadoop 12847 Jun  6 08:56 SQOOP_3042.java

I created an upstream JIRA, SQOOP-3042, in November 2016 to track and fix this. I provided a patch at the time, but it was never reviewed due to a lack of reviewers.

After getting help from Cloudera Sqoop engineers in our Budapest team, I finally got the JIRA progressed in the last few weeks, and it was committed to Sqoop trunk yesterday. Details can be seen here: https://github.com/apache/sqoop/commit/0cfbf56713f7574568ea3754f6854e82f5540954

The fix adds a new command line option, "--delete-compile-dir", so that users can instruct Sqoop to remove those temporary directories after the job finishes. Making this an opt-in option avoids changing Sqoop's default behaviour, while still letting users who want the cleanup get it.

An example command would look like below:

sqoop import --connect jdbc:mysql://localhost/test --username root --password pass --table SQOOP_3042 --target-dir /tmp/erictest --delete-target-dir --verbose --delete-compile-dir

And in --verbose mode you can see the message below, verifying that the directory and its files are removed:

....
18/06/06 17:39:27 INFO mapreduce.ImportJobBase: Transferred 52 bytes in 29.6139 seconds (1.7559 bytes/sec)
18/06/06 17:39:27 INFO mapreduce.ImportJobBase: Retrieved 4 records.
18/06/06 17:39:27 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@6f1fba17
18/06/06 17:39:28 DEBUG util.DirCleanupHook: Removing directory: /tmp/sqoop-hadoop/compile/a9d8a87bc02a5f823a82014c49516736 in the clean up hook.
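The DirCleanupHook message above comes from a JVM shutdown hook that runs when the Sqoop process exits. A rough analogue of the same pattern, sketched in Python with atexit (illustrative only; the function name is made up and this is not Sqoop's code):

```python
import atexit
import os
import shutil

def register_compile_dir_cleanup(compile_dir):
    """Mimic Sqoop's DirCleanupHook: remove the generated-code
    directory when the process exits."""
    def _cleanup():
        if os.path.isdir(compile_dir):
            # Best-effort removal; a failed cleanup should not
            # turn a successful job into a failed one.
            shutil.rmtree(compile_dir, ignore_errors=True)
    atexit.register(_cleanup)
    return _cleanup  # also returned so it can be invoked eagerly
```

Registering the hook only when the user asks for it is what makes the behaviour opt-in.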

Sqoop job failed with ClassNotFoundException

In the last few weeks, I was dealing with an issue where importing data from DB2 into HDFS kept failing with a NoClassDefFoundError. The command details were as follows:

sqoop import \
    --connect jdbc:db2://<db2-host-url>:3700/db1 \
    --username user1 \
    --password changeme \
    --table ZZZ001$.part_table \
    --target-dir /path/in/hdfs \
    --fields-terminated-by '\001' \
    -m 1 \
    --validate

And the error message was:

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
        at org.apache.sqoop.mapreduce.db.DBRecordReader.createValue(DBRecordReader.java:197)
        at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:230)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
        ... 14 more
Caused by: java.lang.NoClassDefFoundError: ZZZ001$_part_table$1
        at ZZZ001$_part_table.init0(ZZZ001$_part_table.java:43)
        at ZZZ001$_part_table.<init>(ZZZ001$_part_table.java:159)
        ... 19 more
Caused by: java.lang.ClassNotFoundException: ZZZ001$_part_table$1
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 21 more

Looking at the error message, the class name ZZZ001$_part_table$1 looked highly suspicious. The cause was that the table name in DB2 itself contained a "$": ZZZ001$.part_table. When Sqoop generated the class, the name became ZZZ001$_part_table, which is not a safe Java class name: "$" is reserved by convention for compiler-generated inner-class names, so the generated inner class ZZZ001$_part_table$1 could not be resolved at runtime.

To bypass this issue, the workaround is to force Sqoop to generate a custom class name by passing the "--class-name" parameter. The new command becomes:

sqoop import \
    --connect jdbc:db2://<db2-host-url>:3700/db1 \
    --username user1 \
    --password changeme \
    --table ZZZ001$.part_table \
    --target-dir /path/in/hdfs \
    --fields-terminated-by '\001' \
    -m 1 \
    --class-name ZZZ001_part_table \
    --validate
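To see why a sanitised name side-steps the problem, here is a hypothetical sketch (Python; this is not Sqoop's actual code generator) of turning a table name into a safe Java class name by replacing every character that is not a letter, digit, or underscore, and treating "$" as unsafe as well:

```python
def to_java_class_name(table_name):
    """Turn a table name into a safe Java class name.

    '$' is legal in a JVM class name but is reserved by convention
    for compiler-generated inner classes (Outer$1 and friends), so
    it is replaced along with any other non-identifier character.
    """
    cleaned = []
    for ch in table_name:
        if ch.isalnum() or ch == "_":
            cleaned.append(ch)
        else:
            cleaned.append("_")
    name = "".join(cleaned)
    # A Java class name must not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name
```

With a name like this, the anonymous inner classes the compiler emits ($1, $2, ...) no longer collide with "$" characters that came from the table name itself.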

Hope the above helps.

Unable to Import Data as Parquet into Encrypted HDFS Zone

Recently I discovered an issue in Sqoop: when importing data into a Hive table whose location is in an encrypted HDFS zone, the Sqoop command fails with the following errors:

Command:

sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable --as-parquetfile

Errors:

2017-05-24 13:38:51,539 INFO [Thread-84] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 
Setting job diagnostics to Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://nameservice1/tmp/staging/.
temp/job_1495453174050_1035/mr/job_1495453174050_1035 to 
hdfs://nameservice1/user/hive/warehouse/staging.db/hiveTable
        at org.kitesdk.data.spi.filesystem.FileSystemUtil.stageMove(FileSystemUtil.java:117)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:406)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:62)
        at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:387)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
/tmp/staging/.temp/job_1495453174050_1035/mr/job_1495453174050_1035/964f7b5e-2f55-421d-bfb6-7613cc4bf26e.parquet 
can't be moved into an encryption zone.
        at org.apache.hadoop.hdfs.server.namenode.EncryptionZoneManager.checkMoveValidity(EncryptionZoneManager.java:284)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedRenameTo(FSDirectory.java:564)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.renameTo(FSDirectory.java:478)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3929)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3891)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3856)

After some research, I found out that it is caused by a known Sqoop bug: SQOOP-2943. This happens because Sqoop currently uses the Kite SDK to generate the Parquet files, and the Kite SDK stages them under the /tmp directory. Because /tmp is not in an encryption zone while the Hive warehouse directory is, the final move of the Parquet files from /tmp into the warehouse fails due to the encryption.

The import only fails with the Parquet format; the text file format currently works as expected.
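Conceptually, the NameNode's rename check behaves like the sketch below (an illustrative Python approximation, not the EncryptionZoneManager implementation): a rename is allowed only when the source and destination resolve to the same encryption zone, or neither is in one.

```python
def can_rename(src, dst, encryption_zones):
    """Approximate HDFS's checkMoveValidity: a rename across an
    encryption-zone boundary is rejected, because the file's data
    encryption key belongs to its zone."""
    def zone_of(path):
        # Find the deepest encryption zone root containing the path.
        matches = [z for z in encryption_zones
                   if path == z or path.startswith(z.rstrip("/") + "/")]
        return max(matches, key=len) if matches else None
    return zone_of(src) == zone_of(dst)
```

In the failure above, the staged file sits under the unencrypted /tmp while the destination is inside the encrypted warehouse, so the rename is rejected.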

As SQOOP-2943 is not fixed at this stage and there is no direct workaround, I would suggest one of the following two methods for importing the data into a Hive Parquet table inside the encrypted warehouse:

  • Import the data in text file format into a Hive temporary table inside the (encrypted) Hive warehouse, then use a Hive query to copy the data into the destination Parquet table
  • Import the data as Parquet files into a non-encrypted temporary directory outside the Hive warehouse, then again use Hive to copy the data into the destination Parquet table inside the (encrypted) Hive warehouse

Hope the above helps anyone who has run into similar issues.
