Hive Lateral View to Flatten Array Data

In this blog post, I will show how to transform the following data in Hive

[1,2,3],[4,5,6],[7,8,9]

and turn it into a table with the following output:

1,4,7
2,5,8
3,6,9

1. To prepare the table, firstly need to create a dummy table with one record:

CREATE TABLE dummy (a int);
INSERT INTO TABLE dummy VALUES (1);

This is to allow us to insert array data into our testing table by SELECTing from our 1 row dummy table.

2. Create a table with array data type:

CREATE TABLE array_table (a array<int>, b array<int>, c array<int>);
INSERT INTO array_test SELECT array(1,2,3), array(3,4,5), array(6,7,8) FROM dummy;

This will get us ready with a table having the following data:

+-------------+-------------+-------------+--+
| a           | b           | c           |
+-------------+-------------+-------------+--+
| [1,2,3]     | [3,4,5]     | [6,7,8]     |
+-------------+-------------+-------------+--+

3. Now run the following query using LATERAL VIEW will give the output we want:

SELECT key1, key2, key3 FROM array_test 
LATERAL VIEW posexplode(a) t1 AS q1, key1 
LATERAL VIEW posexplode(b) t2 AS q2, key2 
LATERAL VIEW posexplode(c) t3 AS q3, key3 
WHERE q1 = q2 AND q1 = q3;

+-------+-------+-------+--+
| key1  | key2  | key3  |
+-------+-------+-------+--+
| 1     | 3     | 6     |
| 2     | 4     | 7     |
| 3     | 5     | 8     |
+-------+-------+-------+--+

This might not be the best solution, but at least it works. However, how well the performance goes on real Big Data set, you have to test out and confirm yourself. Hope above helps.

Enabling Kerberos Debug for Hive

From time to time, we need to do troubleshooting steps for locating the root cause of Kerberos failure in Hive. I will outline below steps in order to turn on debugging message from both Client and HiveServer2 server side.

  1. To enable on Hive Client side (beeline), simply add the following export commands before you run beeline command:
    export HADOOP_JAAS_DEBUG=true;
    export HADOOP_OPTS='-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true'
    

    Then the debug message will be printed on the shell when you run beeline.

  2. To enable kerberos debug on HiveServer2 side (assuming you are using Cloudera Manager)
    1. To to CM > Hive > Configuration
    2. locate “HiveServer2 Environment Advanced Configuration Snippet (Safety Valve)”
    3. add following to the textarea:
      HADOOP_OPTS='-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true'
      
    4. Save and restart Hive service

    Once restarted, you will be able to locate the kerberos debug message from HiveServer2’s process directory on the server host, which is located under /var/run/cloudera-scm-agent/process/XXX-hive-HIVESERVER2/logs/stdout.log, where XXX is the largest number under the directory for HiveServer2

The sample debug message for kerberos looks like below:

Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
[UnixLoginModule]: succeeded importing info:
uid = 0
gid = 0
supp gid = 0
Debug is true storeKey false useTicketCache true useKeyTab false doNotPrompt true ticketCache is null isInitiator true KeyTab is null refreshKrb5Config is false principal is null tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Acquire TGT from Cache
>>>KinitOptions cache name is /tmp/krb5cc_0
>>>DEBUG client principal is impala/{host-name}@REAL.COM
>>>DEBUG server principal is krbtgt/REAL.COM@REAL.COM
>>>DEBUG key type: 23
>>>DEBUG auth time: Sun Aug 13 21:07:46 PDT 2017
>>>DEBUG start time: Sun Aug 13 21:07:46 PDT 2017
>>>DEBUG end time: Mon Aug 14 07:07:46 PDT 2017
>>>DEBUG renew_till time: Sun Aug 20 21:07:46 PDT 2017
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG client principal is impala/{host-name}@REAL.COM
>>>DEBUG server principal is X-CACHECONF:/krb5_ccache_conf_data/pa_type/krbtgt/REAL.COM@REAL.COM
>>>DEBUG key type: 0
>>>DEBUG auth time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG start time: null
>>>DEBUG end time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG renew_till time: null
>>> CCacheInputStream: readFlags()
Principal is impala/{host-name}@REAL.COM
[UnixLoginModule]: added UnixPrincipal,
UnixNumericUserPrincipal,
UnixNumericGroupPrincipal(s),
to Subject
Commit Succeeded

Search Subject for Kerberos V5 INIT cred (<>, sun.security.jgss.krb5.Krb5InitCredential)
Found ticket for impala/{host-name}@REAL.COM to go to krbtgt/REAL.COM@REAL.COM expiring on Mon Aug 14 07:07:46 PDT 2017
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for impala/{host-name}@REAL.COM to go to krbtgt/REAL.COM@REAL.COM expiring on Mon Aug 14 07:07:46 PDT 2017
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
default etypes for default_tgs_enctypes: 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
>>> KdcAccessibility: reset
>>> KrbKdcReq send: kdc=kdc-host.com TCP:88, timeout=3000, number of retries =3, #bytes=1607
>>> KDCCommunication: kdc=kdc-host.com TCP:88, timeout=3000,Attempt =1, #bytes=1607
>>>DEBUG: TCPClient reading 1581 bytes
>>> KrbKdcReq send: #bytes read=1581
>>> KdcAccessibility: remove kdc-host.com
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
Krb5Context setting mySeqNumber to: 789412608
Created InitSecContextToken:

From above message, you can see at least below info:

  • Client config file for kerberos /etc/krb5.conf
  • Ticket case file: /tmp/krb5cc_0
  • Client principal name: impala/{host-name}@REAL.COM
  • KDC server host: kdc=kdc-host.com and using TCP connection via port 88 (TCP:88)
  • and a lot more others that might be useful for your troubleshooting

Hope above helps.

Unable to Import Data as Parquet into Encrypted HDFS Zone

Recently I have discovered an issue in Sqoop that when it is importing data into Hive table, whose location is in an encrypted HDFS zone, the Sqoop command will fail with the following errors:

Command:

sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable --as-parquetfile

Errors:

2017-05-24 13:38:51,539 INFO [Thread-84] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 
Setting job diagnostics to Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://nameservice1/tmp/staging/.
temp/job_1495453174050_1035/mr/job_1495453174050_1035 to 
hdfs://nameservice1/user/hive/warehouse/staging.db/hiveTable
        at org.kitesdk.data.spi.filesystem.FileSystemUtil.stageMove(FileSystemUtil.java:117)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:406)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:62)
        at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:387)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
/tmp/staging/.temp/job_1495453174050_1035/mr/job_1495453174050_1035/964f7b5e-2f55-421d-bfb6-7613cc4bf26e.parquet 
can't be moved into an encryption zone.
        at org.apache.hadoop.hdfs.server.namenode.EncryptionZoneManager.checkMoveValidity(EncryptionZoneManager.java:284)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedRenameTo(FSDirectory.java:564)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.renameTo(FSDirectory.java:478)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3929)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3891)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3856)

After some research, I have found out that it is caused by a known Sqoop bug: SQOOP-2943. This happens because Sqoop currently uses the Kite SDK to generate Parquet file, and the Kite SDK uses the /tmp directory to generate the parquet file on the fly. Because the /tmp directory is not encrypted and the hive warehouse directory is encrypted, the final move command to move the parquet file from the /tmp directory to hive warehouse will fail due to the encryption.

The import only fails with parquet format, the text file format currently works as expected.

As SQOOP-2943 is not fixed at this stage, and there is no direct workarounds, I would suggest the following two methods for importing the data into a Hive parquet table, inside the encrypted warehouse:

  • Import the data as text file format into Hive temporary table inside the Hive warehouse (encrypted), and then use Hive query to copy data into destination parquet table
  • Import the data as parquet file into non-encrypted temporary directory outside of Hive warehouse, and then again use Hive to copy data into destination parquet table inside the Hive warehouse (encrypted)

Hope above can help with anyone who noticed the similar issues.

Hive Query Failed with Token Renewer Error | Hive on Spark

If you run Hive on Spark on some CDH versions, you might run into issues when Hive is trying to renew HDFS delegation tokens. See below error message (that you can find from HiveServer2 server logs):

2017-06-27 17:04:08,836 INFO org.apache.hive.spark.client.SparkClientImpl: [stderr-redir-1]: 17/06/27 17:04:08 
WARN security.UserGroupInformation: PriviledgedActionException as:testuser (auth:PROXY) via 
hive/example.hadoop.com@REALM.COM (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: 
testuser tries to renew a token with renewer hive

If you do some googling, you should be able to locate the corresponding upstream Hive JIRA for this issue: HIVE-15485. And from this JIRA, you should also be able to identify that the issue was introduced by HIVE-14383. This is due to the fact that Spark needs the principal/keytab passed in via –principal and –keytab options, and does the renewal by copying the keytab to the cluster and handling login to kerberos inside the application. But the option –principal and –keytab could not work with –proxy-user in spark-submit.sh, so at this moment we could support either the token renewal or the impersonation, but not both.

The only way to avoid such issue is to upgrade CDH to the version that has the fix for HIVE-15485, which has been fixed in the following releases:

CDH5.8.5
CDH5.9.2
CDH5.10.1, CDH5.10.2
CDH5.11.0, CDH5.11.1

Since HIVE-14383 was introduced in the following CDH:

CDH5.8.3, CDH5.8.4, CDH5.8.5
CDH5.9.1, CDH5.9.2
CDH5.10.0, CDH5.10.1, CDH5.10.2 
CDH5.11.0, CDH5.11.1

This makes the following CDH currently will have such issues:

CDH5.8.3, CDH5.8.4, CDH5.9.1, CDH5.10.0

Please deploy the latest maintenance release for your major version to avoid such issue in Hive on Spark.

Hope above helps.

Read Files under Sub-Directories for Hive and Impala

Sometimes you might want to store data under sub-directories in HDFS and then you want Hive or Impala to read from those sub-directories. For example, you have the following directory structure:

root hdfs     231206 2017-06-30 02:45 /test/table1/000000_0
root hdfs          0 2017-06-30 02:45 /test/table1/child_directory
root hdfs     231206 2017-06-30 02:45 /test/table1/child_directory/000000_0

By default, Hive will only look for files in the root of directory specified, in my test case is /test/table1. However, Hive supports to read all data under the root table’s sub-directories as well. This can be achieved by updating the following settings:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

Impala however, on the other side, currently does not support reading files from table’s sub-directories. This has been reported in the upstream JIRA of IMPALA-1944. Currently there is no immediate plan to support such feature, but it might be in the future release of Impala.

Hope above information is useful.