In the last few weeks, I have been working on an issue in Sqoop that Sqoop’s “–append” options does not work well with parameter “-Dmapreduce.output.basename”. The goal of adding “-Dmapreduce.output.basename” is to ask Sqoop to generate custom file names for the target file in HDFS, rather than using the default “part” prefix for all files.
However, Sqoop has a bug that when trying to move the file from temp folder to target folder in HDFS, it does not respect the setting for mapreduce.output.basename, hence resulted no data being moved to the final destination.
This can be confirmed by turning on the DEBUG logging for Sqoop (using –verbose):
sqoop import -Dmapreduce.output.basename="eric-test" \ --connect jdbc:mysql://mysql-host.com/test \ --username root --password 'root' --table test \ --target-dir /tmp/ericlin-test/sqoop/test --fields-terminated-by '\t' \ --verbose --append 18/05/28 22:24:44 INFO util.AppendUtils: Appending to directory test 18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: _SUCCESS ignored 18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00000 ignored 18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00001 ignored 18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00002 ignored 18/05/28 22:24:44 DEBUG util.AppendUtils: Deleting temporary folder 14935e396acc4ea7b9a6236c66064c9b_test
From the output, you can see that Sqoop ignored all the files generated with prefix “eric-test”.
I have submitted a upstream JIRA SQOOP-3330, and after a few review cycles, my patch finally get accepted and committed to Sqoop’s trunk code base. I am looking forward for the fix to be backported into CDH in the near future.
Update (28th of April, 2020): This JIRA is confirmed to be backported into CDH since version 6.1.0. So if you face this issue, upgrade to latest CDH/CDP would have the fix. The workaround is to simply add “part_” prefix to mapreduce.output.basename, so like “part_eric_test” to force Sqoop to copy those files, as older Sqoop will look for filenames start with “part”.