Oracle Number(1,0) field maps to Boolean in Spark

Recently I was working on an issue where, when importing data from Oracle into a Hive table using Spark, data of type Number(1,0) in Oracle was implicitly converted into the Boolean data type. On CDH5.5.x it worked correctly; however, after upgrading to CDH5.10.x, the issue appeared. See the Hive table output below after the import:

Before upgrade:

SELECT column1 FROM test_table limit 2;
0
1

After upgrade:

SELECT column1 FROM test_table limit 2;
False
True

After digging further, I discovered that this change was introduced by SPARK-16625, due to the integration required for Spark to work correctly with Oracle.

Since the change was intentional, the following workarounds are suggested:

  1. Cast the Boolean to a type of your choosing in the Spark code before writing it to the Hive table (see the sketch after this list)
  2. Make sure that the mapped column in Hive is also of a compatible data type, for example TinyInt rather than String, so that the values True and False will be stored as 1 and 0 respectively, rather than the string values “True” and “False” (the reason the column showed “False” and “True” above is that it was of String data type)
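
A minimal sketch of workaround 1, assuming the Oracle data has been registered in Spark as a temporary view (the view name oracle_staging below is made up for illustration) and the target Hive table test_table already exists:

-- oracle_staging is a hypothetical temporary view over the data read from Oracle;
-- cast the auto-mapped Boolean column back to an integer type before the write
INSERT OVERWRITE TABLE test_table
SELECT CAST(column1 AS INT) AS column1
FROM oracle_staging;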

Hope the above helps.

How to Use Beeline to connect to Impala

You can certainly connect to Impala using the Hive driver from beeline, with a command like the one below:

beeline -u 'jdbc:hive2://<impala-daemon-host>:21050/default;auth=noSasl'

However, the result output is not formatted properly:

> show tables;
customers
dim_prod
mansi
sample_07
sample_08
small
web_logs
+-------+--+
| name  |
+-------+--+
+-------+--+

Notice that the table names are not rendered inside the column layout?

A better approach is to use the Cloudera Impala JDBC Driver. Please follow the steps below:

1. Download the Cloudera Impala JDBC Driver
2. Extract the files and put them somewhere on the host where you need to run the beeline command; in my case they are under /root/impala-jdbc/jdbc
3. Run the following command to update HADOOP_CLASSPATH to include the Impala JDBC Driver JARs:

export HADOOP_CLASSPATH=`hadoop classpath`:/root/impala-jdbc/jdbc/*

4. Finally, you are ready to start beeline:

beeline -d "com.cloudera.impala.jdbc41.Driver" -u 'jdbc:impala://<impala-daemon-host>:21050;AuthMech=0'

You need to tell beeline the driver class name using the “-d” option; in my case the driver class is com.cloudera.impala.jdbc41.Driver.

The output looks much better:

> show tables;
+------------+--+
|    name    |
+------------+--+
| customers  |
| dim_prod   |
| mansi      |
| sample_07  |
| sample_08  |
| small      |
| web_logs   |
+------------+--+
7 rows selected (0.219 seconds)

However, the best way is to connect from impala-shell, which is designed natively for Impala.

Hope this helps.

Alternative Timestamp Support in Hive (ISO-8601)

By default, Hive does not support the ISO-8601 timestamp format, such as “2017-02-16T11:24:29.000Z”.

Check the following test case:

1. Create a file with the following content:

2017-02-16T11:24:29.000Z
2017-02-16 11:24:29

2. Put the file in HDFS:

hadoop fs -put test.txt /tmp/test/data

3. Create an external table that points to it:

CREATE EXTERNAL TABLE ts_test (a timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED by ',' LOCATION '/tmp/test/data';

4. When you select from the table, the first record will be NULL:

+------------------------+--+
|       ts_test.a        |
+------------------------+--+
| NULL                   |
| 2017-02-16 11:24:29.0  |
+------------------------+--+

This is because Hive is not able to recognise the timestamp format “2017-02-16T11:24:29.000Z”.

As of CDH5.7.x, or Hive 1.2 upstream, Hive supports reading alternative timestamp formats; see HIVE-9298.

To make it work, run the following Hive query:

ALTER TABLE ts_test SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSSZ");

Then data can be read correctly by Hive:

+------------------------+--+
|       ts_test.a        |
+------------------------+--+
| 2017-02-16 03:24:29.0  |
| 2017-02-16 11:24:29.0  |
+------------------------+--+

The difference between the two values is due to timezone conversion (Z stands for UTC). Hive treats “2017-02-16T11:24:29.000Z” as UTC and converts it to the server’s local time; for the second value, “2017-02-16 11:24:29”, no conversion is done, so the original value is returned.
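
If you need to see the stored values back in UTC, Hive’s built-in to_utc_timestamp() function can help. A minimal sketch, assuming the server runs in a UTC-8 timezone such as America/Los_Angeles (inferred from the 8-hour shift above, not confirmed):

-- interprets column a as America/Los_Angeles local time and converts it to UTC;
-- note this shifts every row, so it only makes sense if all source values were UTC
SELECT a, to_utc_timestamp(a, 'America/Los_Angeles') AS a_utc FROM ts_test;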

Hope this helps.

How to enable HiveServer2 audit log through Cloudera Manager

This article explains the steps required to enable the audit log for HiveServer2, so that all queries run through HiveServer2 are audited into a central log file.

Please follow the steps below:

  1. Go to Cloudera Manager home page > Hive > Configuration
  2. Tick “Enable Audit Collection”
  3. Ensure the “Audit Log Directory” location points to a path that has enough disk space
  4. Go to Cloudera Manager home page > click on “Cloudera Management Service” > Instances
  5. Click on “Add Role Instances” button on the top right corner of the page
  6. Choose a host for Navigator Audit Server & Navigator Metadata Server
  7. Then follow the on-screen instructions to finish adding the new roles
  8. Once the roles are added successfully, Cloudera Manager will ask you to restart a few services, including Hive
  9. Go ahead and restart Hive

After restarting, Hive’s audit log will be enabled and written to the /var/log/hive/audit directory by default.

Please note that you are not required to start the Navigator services, so if you don’t need them running, you can leave them in the STOPPED state and Hive’s audit logs should still function as normal. However, Navigator does need to be installed for the audit log to function properly, as some libraries from Navigator are required for auditing to work.

Sqoop Teradata import truncates timestamp microseconds information

Last week, while I was working on Sqoop with Teradata, I noticed a bug where the microseconds part of a Timestamp field got truncated after importing into HDFS. The following are the steps to reproduce the issue:

1. Create a table in Teradata:

CREATE TABLE vmtest.test (a integer, b timestamp(6) 
FORMAT 'yyyy-mm-ddbhh:mi:ss.s(6)') PRIMARY INDEX (a);

INSERT INTO vmtest.test VALUES (1, '2017-03-14 15:20:20.711001');

2. Run the following sqoop import command:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test --delete-target-dir \
    --as-textfile --fields-terminated-by "," --table test

3. The data is stored in HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711

Notice that the microseconds part is truncated from 711001 to 711.

This is caused by a bug in TDCH (Teradata Connector for Hadoop) from Teradata, which is used by the Cloudera Connector Powered by Teradata.

The workaround is to make sure that the timestamp value is in String format before passing it to Sqoop, so that no conversion happens. The Sqoop command below is an example:

sqoop import --connect jdbc:teradata://<teradata-host>/database=vmtest \
    --username dbc --password dbc --target-dir /tmp/test \
    --delete-target-dir --as-textfile --fields-terminated-by "," \
    --query "SELECT a, cast(cast(b as format 'YYYY-MM-DD HH:MI:SS.s(6)') as char(40)) from test WHERE \$CONDITIONS" \
    --split-by a

After importing, the data is stored in HDFS correctly:

[cloudera@quickstart ~]$ hadoop fs -cat /tmp/test/part*
1,2017-03-14 15:20:20.711001

As mentioned above, this is a bug in the Teradata connector, and we have to wait for it to be fixed in TDCH. At the time of writing, the issue still exists in CDH5.8.x.