Alternative Timestamp Support in Hive (ISO-8601)

By default, Hive does not support the ISO-8601 timestamp format, for example “2017-02-16T11:24:29.000Z”.

Check the following test case:

1. Create a file named test.txt with the following content:

2017-02-16T11:24:29.000Z
2017-02-16 11:24:29

2. Put the file in HDFS:

hadoop fs -put test.txt /tmp/test/data

3. Create an external table that points to it:

CREATE EXTERNAL TABLE ts_test (a timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED by ',' LOCATION '/tmp/test/data';

4. When you select from the table, the first record is NULL:

+------------------------+--+
|       ts_test.a        |
+------------------------+--+
| NULL                   |
| 2017-02-16 11:24:29.0  |
+------------------------+--+

This is because Hive cannot recognise the timestamp format “2017-02-16T11:24:29.000Z”.

As of CDH5.7.x or Hive 1.2, Hive supports reading alternative timestamp formats; see HIVE-9298.

To make it work, run the following Hive query:

ALTER TABLE ts_test SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSSZ");

The data can then be read correctly by Hive:

+------------------------+--+
|       ts_test.a        |
+------------------------+--+
| 2017-02-16 03:24:29.0  |
| 2017-02-16 11:24:29.0  |
+------------------------+--+

The two values differ because of timezone conversion (Z denotes UTC). Hive treats “2017-02-16T11:24:29.000Z” as UTC and converts it to the server’s local time. The second value, “2017-02-16 11:24:29”, carries no timezone information, so no conversion is done and the original value is returned.
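The 8-hour shift can be reproduced outside Hive with a minimal Python sketch. It assumes the server's local time is UTC-8, which is consistent with the output above but is only an inference. The function name to_server_local is hypothetical, and the sketch illustrates the conversion rather than Hive's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server runs at UTC-8, matching the 8-hour shift seen above.
SERVER_TZ = timezone(timedelta(hours=-8))


def to_server_local(value: str) -> str:
    """Return the value as a server-local 'yyyy-MM-dd HH:mm:ss' string."""
    try:
        # ISO-8601 with milliseconds and a 'Z' designator or numeric offset.
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
    except ValueError:
        # No timezone information: return the wall-clock value unchanged.
        parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    # Timezone-aware value: shift it to the assumed server timezone.
    return parsed.astimezone(SERVER_TZ).strftime("%Y-%m-%d %H:%M:%S")
```

Calling to_server_local on the first record shifts it by eight hours, while the second record comes back unchanged because it has no timezone to convert from.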

Hope this helps.

How to enable HiveServer2 audit log through Cloudera Manager

This article explains the steps required to enable the audit log for HiveServer2, so that all queries run through HiveServer2 are audited into a central log file.

Please follow the steps below:

  1. Go to Cloudera Manager home page > Hive > Configuration
  2. Tick “Enable Audit Collection”
  3. Ensure the “Audit Log Directory” location points to a path that has enough disk space
  4. Go to Cloudera Manager home page > click on “Cloudera Management Service” > Instances
  5. Click on “Add Role Instances” button on the top right corner of the page
  6. Choose a host for Navigator Audit Server & Navigator Metadata Server
  7. Then follow the on-screen instructions to finish adding the new roles
  8. Once the roles are added successfully, Cloudera Manager will ask you to restart a few services, including Hive
  9. Go ahead and restart Hive

After the restart, Hive’s audit log is enabled and written to the /var/log/hive/audit directory by default.

Please note that you are not required to start the Navigator services. If you don’t need them running, you can leave them in the stopped state and Hive’s audit logs should still function as normal. However, Navigator must be installed for the audit log to work, because auditing depends on some libraries that ship with Navigator.

Sqoop Teradata import truncates timestamp microseconds information

Last week, while I was working on Sqoop with Teradata, I noticed a bug where the microseconds part of a Timestamp field got truncated after importing into HDFS. The following are the steps to reproduce the issue:

1. Create a table in Teradata:

CREATE TABLE vmtest.test (a integer, b timestamp(6) 
FORMAT 'yyyy-mm-ddbhh:mi:ss.s(6)') PRIMARY INDEX (a);

INSERT INTO vmtest.test VALUES (1, '2017-03-14 15:20:20.711001');

2. Run the following Sqoop import command:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test --delete-target-dir \
    --as-textfile --fields-terminated-by "," --table test

3. The data is stored in HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711

Notice that the microseconds part is truncated from 711001 to 711.

This is caused by a bug in TDCH (Teradata Connector for Hadoop) from Teradata, which is used by Cloudera Connector Powered by Teradata.

The workaround is to make sure that the timestamp value is cast to a string before it is passed to Sqoop, so that no conversion happens. The following Sqoop command is an example:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test \
    --delete-target-dir --as-textfile --fields-terminated-by "," \
    --query "SELECT a, cast(cast(b as format 'YYYY-MM-DD HH:MI:SS.s(6)') as char(40)) from test WHERE \$CONDITIONS" \
    --split-by a

After importing, the data is stored in HDFS correctly:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711001
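To check a larger import for the same truncation, the imported lines can be scanned for timestamps with too few fractional digits. The following Python sketch is only an illustration: the function name truncated_rows and the regular expression are assumptions made for this example, and it expects the comma-delimited text layout shown above.

```python
import re

# Matches 'yyyy-MM-dd HH:mm:ss' followed by an optional fractional part.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.(\d+))?")


def truncated_rows(lines, expected_digits=6):
    """Return the lines whose timestamp has fewer fractional digits than expected."""
    bad = []
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue  # no timestamp found in this line
        digits = len(match.group(1) or "")
        if digits < expected_digits:
            bad.append(line)
    return bad
```

The input can be the lines printed by hadoop fs -cat /tmp/test/part*. Any line returned has lost precision compared with the timestamp(6) source column.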

As mentioned above, this is a bug in the Teradata connector, so we have to wait for it to be fixed in TDCH. At the time of writing, the issue still exists in CDH5.8.x.

Beeline Failed To Start With OOM Error When Calling getConsoleReader Method

If you get the following error when trying to start beeline from the command line:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
at java.util.Arrays.copyOf(Arrays.java:2271) 
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) 
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122) 
at org.apache.hive.beeline.BeeLine.getConsoleReader(BeeLine.java:854) 
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:766) 
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) 
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:606) 
at org.apache.hadoop.util.RunJar.run(RunJar.java:221) 
at org.apache.hadoop.util.RunJar.main(RunJar.java:136) 

Based on the stack trace, we can see that Beeline was in its startup phase and was initializing through the getConsoleReader method, which reads data from beeline’s history file:

    try {
      // now load in the previous history
      if (hist != null) {
        History h = consoleReader.getHistory();
        if (h instanceof FileHistory) {
          ((FileHistory) consoleReader.getHistory()).load(new ByteArrayInputStream(hist
              .toByteArray()));
        } else {
          consoleReader.getHistory().add(hist.toString());
        }
      }
    } catch (Exception e) {
        handleException(e);
    }

By default, the history file is located at ~/.beeline/history, and beeline loads the latest 500 rows into memory. If those queries are very large, the history file can grow to a few GB. When beeline tries to load such a large history file into memory, it eventually fails with an OutOfMemoryError.

I have reported this issue in the Hive upstream JIRA as HIVE-15166, and I am in the middle of submitting a patch for it.

For the time being, the best workaround is to remove the ~/.beeline/history file before you start beeline.
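If you prefer not to lose the history outright, the file can be set aside only when it has grown too large. The following Python sketch is one way to do that. The function name set_aside_large_history and the 10 MB threshold are arbitrary assumptions for illustration, not values taken from Beeline.

```python
import os
import time


def set_aside_large_history(path, limit_bytes=10 * 1024 * 1024):
    """Rename the history file if it is larger than limit_bytes.

    Returns the backup path if the file was renamed, otherwise None.
    The 10 MB default is an arbitrary threshold chosen for this sketch.
    """
    if not os.path.isfile(path):
        return None  # nothing to do: no history file yet
    if os.path.getsize(path) <= limit_bytes:
        return None  # small enough for beeline to load safely
    backup = f"{path}.{int(time.time())}.bak"
    os.rename(path, backup)
    return backup
```

It could be called with os.path.expanduser("~/.beeline/history") before launching beeline. Renaming instead of deleting keeps the old queries available for inspection.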