Sqoop job failed with “Malformed option in options file” error

Recently I discovered a bug in Sqoop where it wrongly reports a malformed query in an options file even though the query is in fact perfectly valid. Take the following Sqoop command as an example:

sqoop --options-file /path/to/options-file.txt

where the file /path/to/options-file.txt contains the following content:

import
--connect
jdbc:mysql://......
--username
XXXXXXXX
--password
XXXXXXXX
--query
SELECT * FROM test_table WHERE a = 'b'
....

Sqoop will fail with the following error:

16/11/22 16:08:59 ERROR sqoop.Sqoop: Error while expanding arguments
java.lang.Exception: Malformed option in options file(/path/to/options-file.txt): 
SELECT * FROM test_table WHERE a = 'b'
at org.apache.sqoop.util.OptionsFileUtil.removeQuoteCharactersIfNecessary(OptionsFileUtil.java:170)
at org.apache.sqoop.util.OptionsFileUtil.removeQuotesEncolosingOption(OptionsFileUtil.java:136)
at org.apache.sqoop.util.OptionsFileUtil.expandArguments(OptionsFileUtil.java:90)
at com.cloudera.sqoop.util.OptionsFileUtil.expandArguments(OptionsFileUtil.java:33)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:199)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Malformed option in options file(/path/to/options-file.txt): SELECT * FROM test_table WHERE a = 'b'

This is caused by a bug in the Sqoop code that checks for beginning and ending quotes around each option in the options file. If an option does not start with a quote but ends with one, Sqoop fails with the error above.

In the example above, the query ends with a single quote (‘), so Sqoop wrongly treats it as an invalid option and an exception is thrown. Have a look at the function that performs the quote check on the option string:

https://github.com/apache/sqoop/blob/release-1.4.6-rc3/src/java/org/apache/sqoop/util/OptionsFileUtil.java#L156-L175

  private static String removeQuoteCharactersIfNecessary(String fileName,
      String option, char quote) throws Exception {
    boolean startingQuote = (option.charAt(0) == quote);
    boolean endingQuote = (option.charAt(option.length() - 1) == quote);

    if (startingQuote && endingQuote) {
      if (option.length() == 1) {
        throw new Exception("Malformed option in options file("
            + fileName + "): " + option);
      }
      return option.substring(1, option.length() - 1);
    }

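    // BUG (SQOOP-3061): an option that only starts OR only ends with a
    // quote character is rejected here, even though a query can
    // legitimately end with a quote, e.g. WHERE a = 'b'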
    if (startingQuote || endingQuote) {
       throw new Exception("Malformed option in options file("
           + fileName + "): " + option);
    }

    return option;
  }

Based on the above code, a Sqoop command using an options file will fail if any option only starts or only ends with a quote (single or double). I have created upstream JIRA SQOOP-3061 for this bug and provided a patch to fix the issue.

Currently, the workaround is to NOT end the query with a quote character, either double or single; simply appending something like “AND 1=1” to the query resolves the issue.
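For example, the --query entry in the options file above becomes:

--query
SELECT * FROM test_table WHERE a = 'b' AND 1=1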

How to setup multiple KDCs through Cloudera Manager

Currently Cloudera Manager does not natively support setting up multiple KDCs in the krb5.conf file. This article explains the workarounds available using the existing features provided by Cloudera Manager.

This article also assumes that your krb5.conf file is managed by Cloudera Manager.

If you are using a Cloudera Manager version prior to 5.7, follow the steps below:

  1. Go to CM > Administration > Settings > click on “Kerberos” on Filters on the left side > locate “KDC Server Host”, enter the KDC host in the text field:
    kdc-host1.com
    
  2. On the same page, locate “Advanced Configuration Snippet (Safety Valve) for the Default Realm in krb5.conf”, and enter the following into the text area:
    kdc = kdc-host2.com
    
  3. Save and then “Deploy Kerberos Client Configuration” (you might need to stop all services first before you can do this).
    The [realms] section in the krb5.conf will be updated as below:

    [realms]
    TEST.COM = {
      kdc = kdc-host1.com
      admin_server = kdc-host1.com
      kdc = kdc-host2.com
    }
    

If you are using CM 5.7 or above, you can also do the following (the above steps still work):

  1. Go to CM > Administration > Settings > click on “Kerberos” on Filters on the left side > locate “KDC Server Host”, empty the KDC host in the text field, so that it contains no value
  2. On the same page, locate “Advanced Configuration Snippet (Safety Valve) for the Default Realm in krb5.conf”, and enter the following into the text area:
    kdc = kdc-host1.com
    kdc = kdc-host2.com
    admin_server = kdc-host1.com
    
  3. Save and then “Deploy Kerberos Client Configuration” (you might need to stop all services first before you can do this).
    The [realms] section in the krb5.conf will be updated as below:

    [realms]
    TEST.COM = {
      kdc = kdc-host1.com
      kdc = kdc-host2.com
      admin_server = kdc-host1.com
    }
    

The reason the second option does not work prior to CM 5.7 is that older versions of CM generate the following line in krb5.conf when the KDC Server Host field is empty:

kdc =

which will break the syntax in krb5.conf file.
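To confirm the change after deployment, a quick sanity check (assuming the client configuration is deployed to the default path /etc/krb5.conf and you have a test principal in the TEST.COM realm):

# both KDC entries should now be present in the deployed file
grep -E 'kdc =|admin_server =' /etc/krb5.conf

# authentication should still succeed against the realm
kinit user@TEST.COM
klist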

Impala query failed with error “IllegalStateException”

This article examines ONE of the possible causes of an Impala query failing with an IllegalStateException error.

Recently I was dealing with an Impala issue where running a simple SELECT query against a table failed with an IllegalStateException error:

SELECT * FROM <table_name>;
Query: SELECT * FROM <table_name>
ERROR: IllegalStateException: null

Checking the impala daemon log shows the following stacktrace:

I1114 17:13:04.425281 18380 jni-util.cc:177] java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at com.cloudera.impala.analysis.SlotRef.getIdsHelper(SlotRef.java:206)
at com.cloudera.impala.analysis.Expr.getIds(Expr.java:887)
at com.cloudera.impala.analysis.Analyzer.materializeSlots(Analyzer.java:2063)
at com.cloudera.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:131)
at com.cloudera.impala.planner.Planner.createPlan(Planner.java:58)
at com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:897)
at com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:147)

Based on similar issues I had seen in the past, the most common cause of this error is an INVALID_TYPE column in the table that Impala is trying to query. Check the DDL of the table in question by running the following query:

SHOW CREATE TABLE <table_name>;

[impala-daemon-host:21000] > SHOW CREATE TABLE test;
Query: SHOW CREATE TABLE test
+---------------------------------+
| result                          |
+---------------------------------+
| CREATE EXTERNAL TABLE test (    |
| col1 INVALID_TYPE
.....

You can see that Impala complained about INVALID_TYPE for column “col1”. If you run the same query in beeline, you will get the correct output:

> SHOW CREATE TABLE test;
Query: SHOW CREATE TABLE test
+---------------------------------+
| result                          |
+---------------------------------+
| CREATE EXTERNAL TABLE test (    |
| col1 date
.....

This is because Impala currently does not support the “Date” data type; please refer to the following docs:

Impala Data Types
HiveQL Features not Available in Impala

The only solution for now is to use the “Timestamp” data type, which is supported by Impala.
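If you control the table definition, one possible way to apply this is to change the column type in Hive and then refresh Impala's metadata. A minimal sketch, using the example table “test” from above and assuming the underlying data is compatible with the type change:

-- in Hive (beeline): change the DATE column to TIMESTAMP
ALTER TABLE test CHANGE col1 col1 TIMESTAMP;

-- in impala-shell: pick up the updated table definition
INVALIDATE METADATA test;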

Does Sqoop Mainframe Import Support SFTP?

Currently Sqoop leverages Apache Commons Net FTP library with some customisations to handle the way mainframe returns FTP listings. The current Apache Commons Net library does NOT support SFTP but does support FTPS.

FTPS and SFTP are very different protocols so we would have to use a different underlying library to incorporate SFTP capability into the mainframe module:

  • FTPS: FTPS is FTP with SSL for security. It uses a control channel and opens new connections for the data transfer. As it uses SSL, it requires a certificate.
  • SFTP: SFTP (SSH File Transfer Protocol/Secure File Transfer Protocol) was designed as an extension of SSH to provide file transfer capability, so it usually uses only the SSH port for both data and control.

In most SSH server installations you will have SFTP support, but FTPS would need the additional configuration of a supported FTP server.

However, even though Apache Commons Net supports FTPS, Sqoop currently only uses the class org.apache.commons.net.ftp.FTPClient, NOT org.apache.commons.net.ftp.FTPSClient, which is the one that supports FTPS. Please check the source code in MainframeFTPClientUtils.java.
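To illustrate the difference, below is a minimal sketch (not Sqoop code; the host name is a placeholder) of the two Commons Net classes involved:

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPSClient;

public class FtpVsFtpsSketch {
  public static void main(String[] args) throws Exception {
    // plain FTP client -- the only class Sqoop's mainframe module uses today
    FTPClient ftp = new FTPClient();
    ftp.connect("mainframe.example.com");
    ftp.disconnect();

    // FTPS client (FTP over SSL/TLS) -- available in Commons Net,
    // but not referenced by MainframeFTPClientUtils
    FTPSClient ftps = new FTPSClient();
    ftps.connect("mainframe.example.com");
    ftps.disconnect();
  }
}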

So, the conclusion is that currently Sqoop does NOT support either FTPS or SFTP, and the only protocol it supports is FTP.

Hive MetaStore migration from Embedded Postgres to MySQL

Recently I was dealing with a case where a customer wanted to migrate from Cloudera Manager’s Embedded PostgreSQL to MySQL, because the Embedded PostgreSQL provided by Cloudera Manager is not recommended for production use; external databases such as MySQL, PostgreSQL or Oracle are highly recommended instead. The Embedded PostgreSQL should only be considered for development or testing clusters.

For more information, please refer to Cloudera’s documentation regarding Cloudera Manager and Managed Service Datastores.

So, this article focuses on how to migrate from the Embedded PostgreSQL to MySQL. The migration is not a straight dump-and-restore because of incompatibilities between the Hive database schemas under MySQL and PostgreSQL (see HIVE-7018 for details; at the time of writing HIVE-7018 is not available in any CDH release). Assuming that you are using Cloudera Manager, please follow the steps below:

  1. Stop the cluster so that no changes are made to it during the migration
  2. Back up the old PostgreSQL database, including schema and data (a sample backup command is shown after this list)
  3. Create a new database and user in MySQL, with proper permissions:
    CREATE DATABASE metastore DEFAULT CHARACTER SET utf8;
    GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
    
  4. Update the database connection details in Cloudera Manager > Hive > Configuration page to use the new database
  5. Go to Hive > “Actions” menu on the top right corner and click on “Create Hive MetaStore Database Tables”
  6. Export the PostgreSQL DB without schema definition, i.e., data only, using the following command:
    pg_dump -U <username> --column-inserts --format p -h <postgres-host> -p <port-number> -f hive-in.sql -a <database>
    

    where the “-a” flag means “data only”

  7. Download a PHP program from http://www.lightbox.ca/pg2mysql/pg2mysql-1.9.tar.bz2. This program helps convert the SQL dump from PostgreSQL format to MySQL format
  8. Install php-cli on your machine. If you use CentOS, simply run:
    sudo yum install php-cli
    
  9. Run the following command to generate the hive-out.sql file (tested on PHP 5.3):
    php pg2mysql-1.9/pg2mysql_cli.php hive-in.sql hive-out.sql InnoDB
    
  10. Open the hive-out.sql file in any editor and add:
    SET FOREIGN_KEY_CHECKS=0;
    

    to the beginning of the file and:

    SET FOREIGN_KEY_CHECKS=1;
    

    to the end of the file, so that no foreign keys will be checked during data import.

  11. Update the table names in the following three INSERT statements to upper case, so change from:
    INSERT INTO next_compaction_queue_id (ncq_next) VALUES (1);
    INSERT INTO next_lock_id (nl_next) VALUES (1);
    INSERT INTO next_txn_id (ntxn_next) VALUES (1);
    

    to:

    INSERT INTO NEXT_COMPACTION_QUEUE_ID (ncq_next) VALUES (1);
    INSERT INTO NEXT_LOCK_ID (nl_next) VALUES (1);
    INSERT INTO NEXT_TXN_ID (ntxn_next) VALUES (1);
    

    because those table names are upper case in the MySQL version of the schema.

  12. Remove the following line in the hive-out.sql file:
    INSERT INTO VERSION (VER_ID, SCHEMA_VERSION, VERSION_COMMENT) VALUES (1, '1.1.0', 'Hive release version 1.1.0');
    

    Because we used Cloudera Manager to generate the schema in MySQL at step 5, this table has already been populated automatically.

  13. Copy the hive-out.sql file to the MySQL server host
  14. Log into MySQL and then run the following commands:
    USE metastore;
    SOURCE /path/to/hive-out.sql;
    
  15. You are now ready to “Start” the Hive service and verify that everything works (a quick sanity check is shown after this list)
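For step 2, a sample backup command using pg_dump (mirroring the placeholders from step 6; the embedded PostgreSQL credentials can be found in Cloudera Manager):

# full backup of the Hive metastore database, schema and data,
# in pg_dump's custom format
pg_dump -U <username> -h <postgres-host> -p <port-number> -F c -f metastore-backup.dump <database>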
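For step 15, a quick sanity check from the mysql client, using the database created in step 3:

USE metastore;
-- spot-check that the metastore content arrived
SELECT COUNT(*) FROM DBS;
SELECT COUNT(*) FROM TBLS;
-- this row was created by Cloudera Manager in step 5
SELECT * FROM VERSION;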