How to enable HiveServer2 audit log through Cloudera Manager

This article explains the steps required to enable audit logging for HiveServer2, so that all queries run through HiveServer2 will be audited into a central log file.

Please follow the steps below:

  1. Go to Cloudera Manager home page > Hive > Configuration
  2. Tick “Enable Audit Collection”
  3. Ensure the “Audit Log Directory” location points to a path that has enough disk space
  4. Go to Cloudera Manager home page > click on “Cloudera Management Service” > Instances
  5. Click on “Add Role Instances” button on the top right corner of the page
  6. Choose a host for Navigator Audit Server & Navigator Metadata Server
  7. Then follow the on-screen instructions to finish adding the new roles
  8. Once the roles are added successfully, Cloudera Manager will ask you to restart a few services, including Hive
  9. Go ahead and restart Hive

After restarting, Hive’s audit log will be enabled and written to the /var/log/hive/audit directory by default.
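As a quick sanity check after the restart, you can confirm on the HiveServer2 host that audit files are being written and that the directory still has room. A minimal sketch (the exact file names may differ between releases):

# Run on the HiveServer2 host; adjust the path if you changed "Audit Log Directory"
ls -lh /var/log/hive/audit/
df -h /var/log/hive/audit/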

Please note that you are not required to start the Navigator services, so if you don’t need them running, you can leave them in the STOPPED state and Hive’s audit logs should still function as normal. However, Navigator must be installed for the audit log to function properly, because some libraries shipped with Navigator are required for auditing to work.

Sqoop Teradata import truncates timestamp microseconds information

Last week, while I was working on Sqoop with Teradata, I noticed a bug where the microseconds part of a Timestamp field got truncated after importing into HDFS. The following are the steps to reproduce the issue:

1. Create a table in Teradata:

CREATE TABLE vmtest.test (a integer, b timestamp(6) 
FORMAT 'yyyy-mm-ddbhh:mi:ss.s(6)') PRIMARY INDEX (a);

INSERT INTO vmtest.test VALUES (1, '2017-03-14 15:20:20.711001');

2. Run the Sqoop import command:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test --delete-target-dir \
    --as-textfile --fields-terminated-by "," --table test

3. The data is stored in HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711

Notice that the microseconds part is truncated from 711001 to 711.
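You can confirm that the full precision was actually stored on the Teradata side, so the truncation happens during the import rather than the insert. A quick check (the expected output is inferred from the INSERT above, not captured from the original environment):

-- Should return: 1, 2017-03-14 15:20:20.711001
SELECT a, b FROM vmtest.test;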

This is caused by a bug in TDCH (Teradata Connector for Hadoop) from Teradata, which is used by the Cloudera Connector Powered by Teradata.

The workaround is to make sure that the timestamp value is in String format before passing it to Sqoop, so that no conversion will happen. The Sqoop command below is an example:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test \
    --delete-target-dir --as-textfile --fields-terminated-by "," \
    --query "SELECT a, cast(cast(b as format 'YYYY-MM-DD HH:MI:SS.s(6)') as char(40)) from test WHERE \$CONDITIONS" \
    --split-by a

After importing, the data is stored in HDFS correctly:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711001

As mentioned above, this is a bug in the Teradata connector, so we have to wait for it to be fixed in TDCH. At the time of writing, the issue still exists in CDH 5.8.x.

Beeline Failed To Start With OOM Error When Calling getConsoleReader Method

If you get the following error when trying to start up Beeline from the command line:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
at java.util.Arrays.copyOf(Arrays.java:2271) 
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) 
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122) 
at org.apache.hive.beeline.BeeLine.getConsoleReader(BeeLine.java:854) 
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:766) 
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) 
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:606) 
at org.apache.hadoop.util.RunJar.run(RunJar.java:221) 
at org.apache.hadoop.util.RunJar.main(RunJar.java:136) 

Based on the stack trace, we can see that Beeline was in its startup phase and was trying to initialize through the getConsoleReader method, which reads data from Beeline’s history file:

    try {
      // now load in the previous history
      if (hist != null) {
        History h = consoleReader.getHistory();
        if (h instanceof FileHistory) {
          ((FileHistory) consoleReader.getHistory()).load(new ByteArrayInputStream(hist
              .toByteArray()));
        } else {
          consoleReader.getHistory().add(hist.toString());
        }
      }
    } catch (Exception e) {
        handleException(e);
    }

By default, the history file is located under ~/.beeline/history, and Beeline will load the latest 500 rows into memory. If those queries are very big, containing lots of characters, the history file can grow to a few GBs. When Beeline tries to load such a big history file into memory, it will eventually fail with an OutOfMemoryError.

I have reported this issue in the upstream Hive JIRA: HIVE-15166, and I am in the middle of submitting a patch for it.

For the time being, the best workaround is to remove the ~/.beeline/history file before you fire up Beeline.
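For example (a simple sketch; move the file aside instead of deleting it if you still want the old queries):

# Check how large the history file has grown
ls -lh ~/.beeline/history

# Remove it (or move it out of the way) before starting beeline
rm ~/.beeline/history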

Spark jobs failed with delegation token renewal error

An Oozie Spark job failed with the following error:

Job aborted due to stage failure: Task 103 in stage 194576.0 failed 4 times, most recent failure: Lost task 103.3 in stage 194576.0 
(TID 119674041, ): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): 
token (token for sparkpse: HDFS_DELEGATION_TOKEN owner=@HADOOP.CHARTER.COM, renewer=yarn, realUser=, issueDate=1482494610879, 
maxDate=1483099410879, sequenceNumber=274718, masterKeyId=166) can't be found in cache 
at org.apache.hadoop.ipc.Client.call(Client.java:1471) 
at org.apache.hadoop.ipc.Client.call(Client.java:1408) 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) 
at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)

This is caused by a long-running Spark job in a kerberized environment: the checkpointing fails because the delegation token is not renewed properly.

The workaround is to add “--conf spark.hadoop.fs.hdfs.impl.disable.cache=true” to the Spark job’s command line parameters to disable the token cache on the Spark side.
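For example, if the job were submitted directly through spark-submit, the extra flag would be added like below (an illustrative sketch only; the class and JAR names are placeholders, and for an Oozie Spark action the same flag can go into the action’s <spark-opts> element):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
    --class com.example.YourSparkApp \
    your-spark-app.jar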

How to load different version of Spark into Oozie

This article explains the steps needed to load Spark 2 into Oozie under CDH 5.9.x, which ships with Spark 1.6. Although this was tested under CDH 5.9.0, the steps should be similar for earlier releases.

Please follow the steps below:

  1. Locate the current shared-lib directory by running:
    oozie admin -oozie http://<oozie-server-host>:11000/oozie -sharelibupdate
    

    you will get something like below:

    [ShareLib update status]
    host = http://<oozie-server-host>:11000/oozie
    status = Successful
    sharelibDirOld = hdfs://<oozie-server-host>:8020/user/oozie/share/lib/lib_20161202183044
    sharelibDirNew = hdfs://<oozie-server-host>:8020/user/oozie/share/lib/lib_20161202183044
    

    This tells me that the current sharelib directory is /user/oozie/share/lib/lib_20161202183044

  2. Create a new directory for Spark 2 under this directory:
    hadoop fs -mkdir /user/oozie/share/lib/lib_20161202183044/spark2
    
  3. Put all your Spark 2 JARs under this directory (see the sketch after this list for an example); please also make sure that oozie-sharelib-spark-4.1.0-cdh5.9.0.jar is there too
  4. Update the sharelib by running:
    oozie admin -oozie http://<oozie-server-host>:11000/oozie -sharelibupdate
    
  5. Confirm that spark2 has been added to the sharelib path:
    oozie admin -oozie http://<oozie-server-host>:11000/oozie -shareliblist
    

    you should get something like below:

    [Available ShareLib]
    spark2
    oozie
    hive
    distcp
    hcatalog
    sqoop
    mapreduce-streaming
    spark
    pig
    
  6. Go back to the Spark workflow and add the following configuration under the Spark action:
    <property>
        <name>oozie.action.sharelib.for.spark</name>
        <value>spark2</value>
    </property>
    
  7. Save the workflow and run it to test whether it picks up the correct JARs now.
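As referenced in step 3, below is a sketch of how the Spark 2 JARs could be copied into the new sharelib directory. It assumes Spark 2 is installed locally under /opt/spark2 and that the Oozie Spark sharelib JAR is available on the local filesystem; adjust the paths to match your environment:

# Copy the Spark 2 runtime JARs into the new sharelib directory (local paths are assumptions)
hadoop fs -put /opt/spark2/jars/*.jar /user/oozie/share/lib/lib_20161202183044/spark2/

# Make sure the Oozie Spark sharelib JAR is present as well
hadoop fs -put /path/to/oozie-sharelib-spark-4.1.0-cdh5.9.0.jar \
    /user/oozie/share/lib/lib_20161202183044/spark2/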

Please be advised that although this can work, it puts the Oozie Spark action into a configuration that is not supported by Cloudera, because it has not been tested and is not recommended. But if you are still willing to go ahead, the steps above should help.