Hive Lateral View to Flatten Array Data

In this blog post, I will show how to transform the following data in Hive

[1,2,3],[4,5,6],[7,8,9]

and turn it into a table with the following output:

1,4,7
2,5,8
3,6,9

1. To prepare the data, we first need to create a dummy table with one record:

CREATE TABLE dummy (a int);
INSERT INTO TABLE dummy VALUES (1);

This allows us to insert array data into our test table by SELECTing from the one-row dummy table.

2. Create a table with array data type:

CREATE TABLE array_test (a array<int>, b array<int>, c array<int>);
INSERT INTO array_test SELECT array(1,2,3), array(4,5,6), array(7,8,9) FROM dummy;

This gives us a table with the following data:

+-------------+-------------+-------------+--+
| a           | b           | c           |
+-------------+-------------+-------------+--+
| [1,2,3]     | [4,5,6]     | [7,8,9]     |
+-------------+-------------+-------------+--+

3. Now, running the following query using LATERAL VIEW will give us the output we want:

SELECT key1, key2, key3 FROM array_test 
LATERAL VIEW posexplode(a) t1 AS q1, key1 
LATERAL VIEW posexplode(b) t2 AS q2, key2 
LATERAL VIEW posexplode(c) t3 AS q3, key3 
WHERE q1 = q2 AND q1 = q3;

+-------+-------+-------+--+
| key1  | key2  | key3  |
+-------+-------+-------+--+
| 1     | 4     | 7     |
| 2     | 5     | 8     |
| 3     | 6     | 9     |
+-------+-------+-------+--+
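
To see why the WHERE clause is needed: posexplode() returns both the position and the value of each array element, so each LATERAL VIEW contributes a position column (q1, q2, q3) and a value column (key1, key2, key3). Without the filter, the three lateral views would produce a full 3 x 3 x 3 cross product of 27 rows; the WHERE clause keeps only the rows where all three positions line up. As a quick sketch against the same test table, exploding a single column shows the (position, value) pairs that posexplode() emits:

-- positions are 0-based, so for column a we expect rows (0,1), (1,2), (2,3)
SELECT q1, key1 FROM array_test
LATERAL VIEW posexplode(a) t1 AS q1, key1;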

This might not be the best solution, but at least it works. How well it performs on a real Big Data set, however, is something you will have to test and confirm for yourself. Hope the above helps.

Enabling Kerberos Debug for Hive

From time to time, we need to troubleshoot to locate the root cause of a Kerberos failure in Hive. Below I will outline the steps to turn on debug messages on both the client and the HiveServer2 server side.

  1. To enable debugging on the Hive client side (beeline), simply add the following export commands before you run the beeline command:
    export HADOOP_JAAS_DEBUG=true;
    export HADOOP_OPTS='-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true'
    

    The debug messages will then be printed to the shell when you run beeline.
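
    For example, a full client-side session might look like the below (the JDBC connection string is just a placeholder for your own HiveServer2 host and Kerberos principal):

    # hs2-host.example.com and the hive/... principal below are placeholders
    export HADOOP_JAAS_DEBUG=true;
    export HADOOP_OPTS='-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true'
    beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/hs2-host.example.com@REAL.COM"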

  2. To enable Kerberos debug on the HiveServer2 side (assuming you are using Cloudera Manager):
    1. Go to CM > Hive > Configuration
    2. Locate “HiveServer2 Environment Advanced Configuration Snippet (Safety Valve)”
    3. Add the following to the text area:
      HADOOP_OPTS='-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true'
      
    4. Save and restart the Hive service

    Once restarted, you will be able to locate the Kerberos debug messages in HiveServer2’s process directory on the server host, under /var/run/cloudera-scm-agent/process/XXX-hive-HIVESERVER2/logs/stdout.log, where XXX is the largest number under the process directory for HiveServer2.
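
    As a rough sketch (paths follow the Cloudera Manager defaults mentioned above), something like the below can be used to find the current process directory and pull out the Kerberos debug lines:

    # pick the HiveServer2 process directory with the largest number
    ls -d /var/run/cloudera-scm-agent/process/*-hive-HIVESERVER2 | sort -V | tail -1
    # then search its stdout.log for the Kerberos debug output
    grep -i "krb" "$(ls -d /var/run/cloudera-scm-agent/process/*-hive-HIVESERVER2 | sort -V | tail -1)/logs/stdout.log"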

A sample Kerberos debug message looks like the below:

Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
[UnixLoginModule]: succeeded importing info:
uid = 0
gid = 0
supp gid = 0
Debug is true storeKey false useTicketCache true useKeyTab false doNotPrompt true ticketCache is null isInitiator true KeyTab is null refreshKrb5Config is false principal is null tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Acquire TGT from Cache
>>>KinitOptions cache name is /tmp/krb5cc_0
>>>DEBUG client principal is impala/{host-name}@REAL.COM
>>>DEBUG server principal is krbtgt/REAL.COM@REAL.COM
>>>DEBUG key type: 23
>>>DEBUG auth time: Sun Aug 13 21:07:46 PDT 2017
>>>DEBUG start time: Sun Aug 13 21:07:46 PDT 2017
>>>DEBUG end time: Mon Aug 14 07:07:46 PDT 2017
>>>DEBUG renew_till time: Sun Aug 20 21:07:46 PDT 2017
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG client principal is impala/{host-name}@REAL.COM
>>>DEBUG server principal is X-CACHECONF:/krb5_ccache_conf_data/pa_type/krbtgt/REAL.COM@REAL.COM
>>>DEBUG key type: 0
>>>DEBUG auth time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG start time: null
>>>DEBUG end time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG renew_till time: null
>>> CCacheInputStream: readFlags()
Principal is impala/{host-name}@REAL.COM
[UnixLoginModule]: added UnixPrincipal,
UnixNumericUserPrincipal,
UnixNumericGroupPrincipal(s),
to Subject
Commit Succeeded

Search Subject for Kerberos V5 INIT cred (<>, sun.security.jgss.krb5.Krb5InitCredential)
Found ticket for impala/{host-name}@REAL.COM to go to krbtgt/REAL.COM@REAL.COM expiring on Mon Aug 14 07:07:46 PDT 2017
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for impala/{host-name}@REAL.COM to go to krbtgt/REAL.COM@REAL.COM expiring on Mon Aug 14 07:07:46 PDT 2017
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
default etypes for default_tgs_enctypes: 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
>>> KdcAccessibility: reset
>>> KrbKdcReq send: kdc=kdc-host.com TCP:88, timeout=3000, number of retries =3, #bytes=1607
>>> KDCCommunication: kdc=kdc-host.com TCP:88, timeout=3000,Attempt =1, #bytes=1607
>>>DEBUG: TCPClient reading 1581 bytes
>>> KrbKdcReq send: #bytes read=1581
>>> KdcAccessibility: remove kdc-host.com
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
Krb5Context setting mySeqNumber to: 789412608
Created InitSecContextToken:

From the above messages, you can see at least the following information:

  • Client config file for Kerberos: /etc/krb5.conf
  • Ticket cache file: /tmp/krb5cc_0
  • Client principal name: impala/{host-name}@REAL.COM
  • KDC server host: kdc-host.com, connected over TCP via port 88 (TCP:88)
  • and a lot more that might be useful for your troubleshooting

Hope the above helps.

Impala query failed with error: “Incompatible Parquet Schema”

Yesterday, I was dealing with an issue where a very simple Impala SELECT query failed with an “Incompatible Parquet schema” error. I have confirmed the following workflow that triggered the error:

  1. A Parquet file is created by an external library
  2. The Parquet file is loaded into a Hive/Impala table
  3. Querying the table through Impala fails with the below error message:
    incompatible Parquet schema for column 'db_name.tbl_name.col_name'. 
    Column type: DECIMAL(19, 0), Parquet schema:\noptional byte_array col_name [i:2 d:1 r:0]
  4. The same query works well in Hive

This is because Impala currently does not support all of the DECIMAL specs that are supported by Parquet. Currently, Parquet supports the following specs:

  • int32: for 1 <= precision <= 9
  • int64: for 1 <= precision <= 18; precision < 10 will produce a warning
  • fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
  • binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
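
As a worked example of the fixed_len_byte_array rule: the DECIMAL(19, 0) column from the error above would need an array of at least 9 bytes, since 8 bytes only allows floor(log_10(2^(8*8 - 1) - 1)) = 18 base-10 digits, while 9 bytes allows floor(log_10(2^(8*9 - 1) - 1)) = 21 digits, which covers precision 19.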

Please refer to the Parquet Logical Type Definitions page for details.

However, Impala only supports fixed_len_byte_array and none of the others. This has been reported in the upstream JIRA: IMPALA-2494.

The only workaround for now is to create Parquet files that use the supported spec for DECIMAL columns, or simply to create the Parquet files through either Hive or Impala, as sketched below.
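
As a minimal sketch of the second approach (the new table name is hypothetical, and it assumes Hive can still read the original table, as noted above), rebuilding the data through a Hive CTAS lets Hive rewrite the DECIMAL values with a Parquet encoding that Impala understands:

-- run in Hive, which can still read the original Parquet file
CREATE TABLE db_name.tbl_name_rebuilt STORED AS PARQUET
AS SELECT * FROM db_name.tbl_name;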

Impala Query fails with NoSuchObjectException error

In the last few months, I have seen CDH users hitting Impala queries returning a NoSuchObjectException error very often. This happens when running a query against a particular table with INT partition columns, and it fails with the below message:

WARNINGS:
ImpalaRuntimeException: Error making 'alter_partitions' RPC to Hive Metastore:
CAUSED BY: InvalidOperationException: Alter partition operation failed: NoSuchObjectException(message:partition values=[2017, 6, 1, 8])

We have confirmed that the table has four partition columns with an integer data type, and that selecting individual partitions works.

The following scenario will trigger such an error:

  • Partitions use the INT data type
  • Partition data was inserted from Hive with leading zeros, using a query like the below:
    INSERT OVERWRITE TABLE test_tbl PARTITION (year = '2017', month = '06'....) .....
    
  • The partition data is then stored under an HDFS location like the below:
    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=06/day=01/hour=08
    
  • When querying through Impala, since the data type is INT, Impala converts the values from “06” to 6, “01” to 1, etc., and will be looking for the location:
    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=6/day=1/hour=8
    

    instead of:

    hdfs://nameservice1/user/hive/warehouse/test_tbl/year=2017/month=06/day=01/hour=08
    

    hence the NoSuchObjectException error.

To fix the issue, there are two options:

  1. Convert the data type of the partition columns to STRING, instead of INT:
    ALTER TABLE test_tbl PARTITION COLUMN (year string);
    ALTER TABLE test_tbl PARTITION COLUMN (month string);
    ALTER TABLE test_tbl PARTITION COLUMN (day string);
    ALTER TABLE test_tbl PARTITION COLUMN (hour string);
    
  2. If the integer type needs to be kept, then we will need to rebuild the table into a new one so that the data is stored in locations without leading zeros. This can be done by running the following queries from Impala:
    CREATE TABLE new_test_tbl LIKE test_tbl;
    
    INSERT OVERWRITE TABLE new_test_tbl PARTITION (year, month, day, hour) SELECT * FROM test_tbl;
    

    The new table will have the leading zeros in its partitions removed, and then we can switch over to use the new table. When writing more data into the new table through Hive, please be sure to remove all leading zeros to prevent the issue from happening again. A quick verification is sketched below.
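
    As a quick verification sketch (using the same table name as above), listing the partitions after the rebuild should show values without leading zeros, e.g. month=6/day=1/hour=8 rather than month=06/day=01/hour=08:

    -- works in both Impala and Hive
    SHOW PARTITIONS new_test_tbl;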

The above steps should help resolve the issue. Hope they help.

Unable to Import Data as Parquet into Encrypted HDFS Zone

Recently I discovered an issue in Sqoop: when it is importing data into a Hive table whose location is in an encrypted HDFS zone, the Sqoop command will fail with the following errors:

Command:

sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable --as-parquetfile

Errors:

2017-05-24 13:38:51,539 INFO [Thread-84] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 
Setting job diagnostics to Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://nameservice1/tmp/staging/.
temp/job_1495453174050_1035/mr/job_1495453174050_1035 to 
hdfs://nameservice1/user/hive/warehouse/staging.db/hiveTable
        at org.kitesdk.data.spi.filesystem.FileSystemUtil.stageMove(FileSystemUtil.java:117)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:406)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:62)
        at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:387)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
/tmp/staging/.temp/job_1495453174050_1035/mr/job_1495453174050_1035/964f7b5e-2f55-421d-bfb6-7613cc4bf26e.parquet 
can't be moved into an encryption zone.
        at org.apache.hadoop.hdfs.server.namenode.EncryptionZoneManager.checkMoveValidity(EncryptionZoneManager.java:284)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedRenameTo(FSDirectory.java:564)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.renameTo(FSDirectory.java:478)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3929)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3891)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3856)

After some research, I found out that it is caused by a known Sqoop bug: SQOOP-2943. This happens because Sqoop currently uses the Kite SDK to generate the Parquet file, and the Kite SDK writes the Parquet file under the /tmp directory on the fly. Because the /tmp directory is not encrypted while the Hive warehouse directory is, the final move of the Parquet file from the /tmp directory into the Hive warehouse fails due to the encryption zone restriction.

The import only fails with the Parquet format; the text file format currently works as expected.
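
To confirm the layout for yourself, you can list the HDFS encryption zones (this normally requires HDFS administrator privileges); in the scenario described above, the Hive warehouse path shows up as a zone while /tmp does not:

hdfs crypto -listZones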

As SQOOP-2943 is not fixed at this stage and there is no direct workaround, I would suggest the following two methods for importing the data into a Hive Parquet table inside the encrypted warehouse:

  • Import the data in text file format into a temporary Hive table inside the Hive warehouse (encrypted), and then use a Hive query to copy the data into the destination Parquet table (a sketch of this option follows after this list)
  • Import the data as Parquet files into a non-encrypted temporary directory outside of the Hive warehouse, and then again use Hive to copy the data into the destination Parquet table inside the Hive warehouse (encrypted)
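
As a sketch of the first option (the staging table name is a placeholder for whatever text-format table a Sqoop import without --as-parquetfile creates; the destination reuses the names from the Sqoop command above), the final copy into the encrypted warehouse is just a Hive query:

-- run in Hive: copy from the text-format staging table into the Parquet
-- destination table inside the encrypted warehouse
CREATE TABLE staging.hiveTable STORED AS PARQUET
AS SELECT * FROM staging.hiveTable_text;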

Hope the above can help anyone who has run into similar issues.