Locate Hive Tables with Same HDFS Location

When you create a Hive table, you can specify the LOCATION of the table, regardless of whether the table is managed or external. The syntax is like below:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  ....
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]

If the LOCATION clause is omitted, Hive will use the default location under /user/hive/warehouse, with a subdirectory derived from the database and table names specified in the CREATE TABLE statement.

Hive, however, will not stop you from creating different tables that point to the same HDFS location. This can cause issues if users make mistakes.
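
For example, the two hypothetical statements below are both accepted, even though the second one simply reuses the first table's directory (the host, database and table names are made up for illustration):

# A managed table created at its default location...
hive -e "CREATE TABLE db1.test (id INT);"
# ...and an external table pointing at the exact same HDFS directory.
hive -e "CREATE EXTERNAL TABLE default.mytest (id INT)
         LOCATION 'hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test';"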

Currently, Hive provides no functionality to report which tables share the same HDFS location, which would help if you want to clean up and verify that tables were created correctly.

The only way I can see is to query the backend database used by the Hive Metastore, typically MySQL, Oracle or PostgreSQL.

Below is the query that you can run to retrieve the list of tables in Hive that have the same HDFS location:

SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE, s.LOCATION 
    FROM TBLS t 
    JOIN SDS s ON (t.SD_ID = s.SD_ID) 
    JOIN DBS d on (t.DB_ID = d.DB_ID) 
    JOIN (
        SELECT LOCATION 
        FROM SDS 
        GROUP BY LOCATION 
        HAVING COUNT(*) > 1 AND LOCATION IS NOT NULL
    ) l ON (l.LOCATION = s.LOCATION);

Please note that this query is based on MySQL; you might need to adjust it accordingly if you use another database.
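
For reference, below is a minimal sketch of running the query against a MySQL metastore; the database name “metastore”, the “hive” user and the file name are assumptions, so adjust them to your deployment:

# Save the query above into a file (the name is just an example) and run
# it against the Hive Metastore database.
mysql -u hive -p metastore < duplicate_locations.sql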

I have created some tables and my test result looks like below:

+-------------+----------+----------------+---------------------------------------------------------+
| NAME        | TBL_NAME | TBL_TYPE       | LOCATION                                                |
+-------------+----------+----------------+---------------------------------------------------------+
| db1         | test     | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| default     | mytest   | EXTERNAL_TABLE | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| db2         | test3    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
| db2         | test4    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
+-------------+----------+----------------+---------------------------------------------------------+
4 rows in set (0.13 sec)

Hope above query can help!

Access S3 in Hive through hadoop.security.credential.provider.path

A couple of years ago, I wrote a blog post about how to securely manage passwords in Sqoop, so that the RDBMS password is not exposed to end users when running Sqoop jobs. The details can be found here.

Recently, one of our customers tested the same approach for Hive accessing S3: they set hadoop.security.credential.provider.path in Hive to the location of the jceks file on HDFS, but were not able to get it working. I spent quite a few days researching this topic and concluded that Hive currently does not support reading S3A credentials from a JCEKS credential file.

I have the following findings:

1. When you create a credential store file using the below command:

hadoop credential create fs.s3a.access.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks

Hadoop will produce the below warning:

WARNING: You have accepted the use of the default provider password
by not configuring a password in one of the two following locations:
    * In the environment variable HADOOP_CREDSTORE_PASSWORD
    * In a file referred to by the configuration entry
      hadoop.security.credstore.java-keystore-provider.password-file.
Please review the documentation regarding provider passwords in
the keystore passwords section of the Credential Provider API
Continuing with the default provider password.

I believe that the default password for HADOOP_CREDSTORE_PASSWORD is “none” if you do not set it before running the command.
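
For reference, the keystore password used when the jceks file is written is taken from that environment variable at creation time. Below is a hedged sketch, reusing the alias and provider path from above:

# Create the credential store with an explicit keystore password instead
# of relying on the default one. Note: as explained below, this alone does
# not make Hive able to read the file.
export HADOOP_CREDSTORE_PASSWORD='my-keystore-password'
hadoop credential create fs.s3a.access.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks

# List the aliases stored in the provider to confirm it was written.
hadoop credential list -provider jceks://hdfs/keystore/s3-key.jceks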

2. Cloudera Manager generates a new HADOOP_CREDSTORE_PASSWORD for both HiveServer2 and HiveMetaStore after each restart. To get their values, you can run the below commands on their hosts:

strings /proc/$(pgrep -f HiveServer2)/environ | grep HADOOP_CREDSTORE_PASSWORD

strings /proc/$(pgrep -f HiveMetaStore)/environ | grep HADOOP_CREDSTORE_PASSWORD

In my case, they returned the below output:

strings /proc/$(pgrep -f HiveServer2)/environ | grep HADOOP_CREDSTORE_PASSWORD
HADOOP_CREDSTORE_PASSWORD=dq440oxl9rgopo9c593k86vch

strings /proc/$(pgrep -f HiveMetaStore)/environ | grep HADOOP_CREDSTORE_PASSWORD
HADOOP_CREDSTORE_PASSWORD=8sttjj54e9ffdo6ovqcufxmp4

You can see that the HADOOP_CREDSTORE_PASSWORD is different between them, so neither HiveServer2 nor HiveMetaStore will be able to read the manually generated jceks file, because the password does not match.
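
You can verify the mismatch yourself. Below is a hedged sketch that tries to open the jceks file created in step 1 using the keystore password Cloudera Manager generated for HiveServer2; if the file was created with a different password, the listing fails with a keystore password error:

# Read HiveServer2's generated keystore password from its process
# environment and try to list the manually created provider with it.
export HADOOP_CREDSTORE_PASSWORD=$(strings /proc/$(pgrep -f HiveServer2)/environ \
    | grep HADOOP_CREDSTORE_PASSWORD | cut -d= -f2)
hadoop credential list -provider jceks://hdfs/keystore/s3-key.jceks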

3. I also tried to override the environment variable HADOOP_CREDSTORE_PASSWORD for both HiveServer2 and HiveMetaStore, via “Hive Metastore Server Environment Advanced Configuration Snippet (Safety Valve)” and “HiveServer2 Environment Advanced Configuration Snippet (Safety Valve)” under the Cloudera Manager > Hive > Configuration page. However, this broke HiveMetaStore with the below errors:

2018-11-03 02:44:39,569 ERROR org.apache.hadoop.hive.metastore.HiveMetaStore: [main]: Metastore Thrift Server threw an exception...
java.lang.RuntimeException: Error getting metastore password: null
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:403)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:279)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:682)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:660)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:709)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:508)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6475)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6470)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:6720)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:6647)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException
	at org.apache.hadoop.hive.shims.Hadoop23Shims.getPassword(Hadoop23Shims.java:1144)
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:397)
	... 21 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.hive.shims.Hadoop23Shims.getPassword(Hadoop23Shims.java:1138)
	... 22 more
Caused by: java.io.IOException: Configuration problem with provider path.
	at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2118)
	at org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2037)
	... 27 more
Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
	at com.sun.crypto.provider.JceKeyStore.engineLoad(JceKeyStore.java:865)
	at java.security.KeyStore.load(KeyStore.java:1445)
	at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.locateKeystore(AbstractJavaKeyStoreProvider.java:335)
	at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.<init>(AbstractJavaKeyStoreProvider.java:88)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider.<init>(LocalJavaKeyStoreProvider.java:58)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider.<init>(LocalJavaKeyStoreProvider.java:50)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider$Factory.createProvider(LocalJavaKeyStoreProvider.java:177)
	at org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:63)
	at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2098)
	... 28 more

I have not had time to investigate why HiveMetaStore fails while HiveServer2 is OK. However, this confirms that such a feature is currently not supported in Hive.

I think this would be a good feature to have, because currently you have to keep the S3 access and secret keys in Hive’s configuration file, which means everyone is able to access S3. With the keys stored in credential files, only users who have access to those files would be able to access S3, which should be more secure. So I filed a Cloudera internal JIRA to request this feature, but at the time of writing it has not been implemented yet.

For now, if you want Hive tables to access S3 locations, the access and secret keys need to be configured in the configuration XML file at the global level.
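
For illustration, the hedged sketch below shows what the global setting implies; the bucket name and keys are placeholders:

# With fs.s3a.access.key and fs.s3a.secret.key set in the cluster-wide
# configuration (e.g. core-site.xml), any user can reach the bucket.
hadoop fs -ls s3a://my-example-bucket/

# The keys can also be passed per command for ad-hoc testing, but this
# does not help Hive, which reads them from the global configuration.
hadoop fs -D fs.s3a.access.key=XXXXXXXX -D fs.s3a.secret.key=YYYYYYYY \
    -ls s3a://my-example-bucket/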

WebHCat Request Failed With Error “id: HTTP: no such user”

WebHCat, previously known as Templeton, is the REST API for HCatalog, a table and storage management layer for Hadoop. Users can use WebHCat to access metadata from HCatalog, as well as submit MapReduce, Hive and Pig jobs.

Below is an example of how to retrieve the list of databases via the WebHCat API:

curl --negotiate -u: http://{webhcat-hostname}:50111/templeton/v1/ddl/database/

Please note that port 50111 is the default port number for WebHCat. The sample output looks like below:

{"databases":["default","mytest","s3_test","udf_db"]}

However, I recently faced an issue where a WebHCat request failed with the below error:

2018-08-08 17:18:44,413 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hive (auth:PROXY) via HTTP/{webhcat-hostname}@CDH511.COM (auth:KERBEROS) cause:org.apache.thrift.transport.TTrans
portException: java.net.SocketException: Connection reset
2018-08-08 17:18:44,414 ERROR org.apache.hive.hcatalog.templeton.CatchallExceptionMapper: java.lang.reflect.UndeclaredThrowableException
java.io.IOException: java.lang.reflect.UndeclaredThrowableException
...
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
....
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1882)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1872)
....

From the stack trace, we can see that WebHCat failed when trying to obtain a delegation token from HiveMetaStore. Checking the HMS server log, I found the below error:

2018-08-08 17:18:19,675 WARN  org.apache.hadoop.security.ShellBasedUnixGroupsMapping: [pool-7-thread-2]: unable to return groups for user HTTP
PartialGroupNameException The user name 'HTTP' is not found. id: HTTP: no such user
id: HTTP: no such user

It is pretty clear that HMS failed because the “HTTP” user was missing. Adding the “HTTP” user on the HMS server host resolved the issue.

Researching further, I realized that this was because, in the Hive configuration, hadoop.proxyuser.hive.groups was set to a list of groups rather than “*”, and “HTTP” was one of the groups in that list. You will not get this error if hadoop.proxyuser.hive.groups is set to “*”; it only fails when “HTTP” is added manually (it is required to be on this list if the value is not “*”, because the “hive” user needs to be able to impersonate the “HTTP” user for the request to work).

The reason for the failure is that when hadoop.proxyuser.hive.groups is set to “*”, Hive does not bother checking for the user’s existence, since every user is allowed. However, when a list of groups is defined, Hive needs to verify that the impersonated user belongs to one of those groups, which requires resolving the user’s groups on the host that Hive runs on. In our case, the “HTTP” user did not exist on the HMS host, so the lookup failed with the error we saw earlier, and adding the user resolved the issue, as sketched below.
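
The fix itself is a one-liner on the HMS host. Below is a minimal sketch; the useradd options are just an example, so adapt them to your environment:

# Check whether the "HTTP" user resolves on the HiveMetaStore host and
# create it if it does not.
id HTTP || sudo useradd --no-create-home --shell /sbin/nologin HTTP

# Optionally confirm how Hadoop resolves the user's groups on this host.
hdfs groups HTTP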

Hope the above helps anyone who hits the same issue.

Oozie Hive2 Action Failed with Error: “HiveSQLException: Failed to execute session hooks”

If you have an Oozie Hive2 action that randomly fails with the below error message, which can be found in Oozie’s server log, located by default under /var/log/oozie:

2018-06-02 09:00:01,103 WARN org.apache.oozie.action.hadoop.Hive2Credentials: SERVER[hlp3058p.oocl.com] 
USER[dmsa_appln] GROUP[-] TOKEN[] APP[DMSA_CMTX_PCON_ETL_ONLY] JOB[0010548-180302135253124-oozie-oozi-W] 
ACTION[0010548-180302135253124-oozie-oozi-W@spark-6799] Exception in addtoJobConf
org.apache.hive.service.cli.HiveSQLException: Failed to execute session hooks
        at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:241)
        at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:232)
        at org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:491)
        at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:181)
        at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
        at java.sql.DriverManager.getConnection(DriverManager.java:571)
        at java.sql.DriverManager.getConnection(DriverManager.java:233)
        at org.apache.oozie.action.hadoop.Hive2Credentials.addtoJobConf(Hive2Credentials.java:66)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.setCredentialTokens(JavaActionExecutor.java:1213)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1063)
        at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1295)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
        at org.apache.oozie.command.XCommand.call(XCommand.java:286)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:332)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:261)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:179)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hive.service.cli.HiveSQLException: Failed to execute session hooks
        at org.apache.hive.service.cli.session.SessionManager.openSession(SessionManager.java:308)
        at org.apache.hive.service.cli.CLIService.openSession(CLIService.java:178)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.getSessionHandle(ThriftCLIService.java:422)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.OpenSession(ThriftCLIService.java:316)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$OpenSession.getResult(TCLIService.java:1253)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$OpenSession.getResult(TCLIService.java:1238)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:746)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        ... 3 more
Caused by: java.lang.IllegalStateException: zip file closed
        at java.util.zip.ZipFile.ensureOpen(ZipFile.java:634)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:305)
        at java.util.jar.JarFile.getEntry(JarFile.java:227)
        at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128)
        at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:132)
        at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:150)
        at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:233)
        at javax.xml.parsers.SecuritySupport$4.run(SecuritySupport.java:94)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.xml.parsers.SecuritySupport.getResourceAsStream(SecuritySupport.java:87)
        at javax.xml.parsers.FactoryFinder.findJarServiceProvider(FactoryFinder.java:283)
        at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:255)
        at javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:121)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2526)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2513)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:982)
        at org.apache.sentry.binding.hive.conf.HiveAuthzConf.<init>(HiveAuthzConf.java:162)
        at org.apache.sentry.binding.hive.HiveAuthzBindingHook.loadAuthzConf(HiveAuthzBindingHook.java:131)
        at org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook.run(HiveAuthzBindingSessionHook.java:108)
        at org.apache.hive.service.cli.session.SessionManager.executeSessionHooks(SessionManager.java:420)
        at org.apache.hive.service.cli.session.SessionManager.openSession(SessionManager.java:300)
        ... 12 more

It is likely that you are hitting a possible JDK issue. Please refer to HADOOP-13809 for details. There is no proof at this stage that it is a JDK bug, but the workaround is at the JDK level. As mentioned in the JIRA, you can add the below parameter to HiveServer2’s Java options:

-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl

If you are using Cloudera Manager, you can go to:

CM > Hive > Configuration > Search “Java configuration options for HiveServer2”

and add the above parameter to the end of the string; don’t forget an extra space before it.

Then restart HiveServer2 through CM. This should help to avoid the issue.
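
A hedged way to confirm the option is active after the restart is to inspect the running HiveServer2 JVM command line, for example:

# Print the HiveServer2 process command line and look for the new flag.
tr '\0' ' ' < /proc/$(pgrep -f HiveServer2)/cmdline \
    | grep -o 'javax.xml.parsers.DocumentBuilderFactory=[^ ]*'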

Hive CLI Prints SLF4J Error to Standard Output

If you have both Hive and Spark running on the same cluster, chances are that Hive CLI will produce the following WARNING messages upon exit of each session:

WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked. 
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

Sample full output looks like below:

[root@localhost ~]# hive -e "show tables;"

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/jars/hive-common-1.1.0-cdh5.12.1.jar!/hive-log4j.properties
OK
sample_07
sample_08
web_logs
Time taken: 1.281 seconds, Fetched: 3 row(s)
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

Even though Hive CLI has been deprecated in CDH, lots of enterprise users are still stuck with it for legacy reasons, and it is not easy for them to migrate to Beeline or to query Hive through ODBC or JDBC in their applications.

This is not an issue if you just run Hive CLI from the command line and view the output. However, if you want to capture the result set from Hive CLI’s stdout, it becomes a problem. See the test case below:

[root@localhost ~]# output=`hive -e "show tables;"`

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/jars/hive-common-1.1.0-cdh5.12.1.jar!/hive-log4j.properties
OK
Time taken: 1.281 seconds, Fetched: 3 row(s)

And when you echo out the variable $output, it will contain the WARNING message:

[root@localhost ~]# echo $output
sample_07
sample_08
web_logs
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

And when you want to use this output as input elsewhere in your application, things will go wrong.

This happens because of the below code in Hive’s launcher script (/opt/cloudera/parcels/CDH/lib/hive/bin/hive), which loads Spark’s assembly JAR into Hive’s CLASSPATH:

# add Spark assembly jar to the classpath
if [[ -n "$SPARK_HOME" && !("$HIVE_SKIP_SPARK_ASSEMBLY" = "true") ]]
then
  sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`
  CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
fi

Luckily, starting from CDH 5.12.0, Cloudera has backported the upstream JIRA HIVE-12179, which added a check for an environment variable called “HIVE_SKIP_SPARK_ASSEMBLY”. We can use this variable to disable the loading of the Spark assembly JAR for Hive CLI if you do not need to use Hive on Spark.

So the workaround is as simple as setting “HIVE_SKIP_SPARK_ASSEMBLY” to “true” so that the “if” block is skipped. See the below example:

[root@localhost ~]# output=`export HIVE_SKIP_SPARK_ASSEMBLY=true; hive -e "show tables;"`

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/jars/hive-common-1.1.0-cdh5.12.1.jar!/hive-log4j.properties
OK
Time taken: 1.281 seconds, Fetched: 3 row(s)

And the output will be clean like below:

[root@localhost ~]# echo $output
sample_07
sample_08
web_logs

This workaround will not work if you need to use Hive on Spark from Hive CLI, because it essentially disables the loading of the Spark JARs. And of course, using Hive CLI is strongly NOT recommended; migrating to Beeline or using ODBC/JDBC to connect to HiveServer2 is the right way to go in the long run. If you do need Hive on Spark, a possible fallback is sketched below.
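
As a hedged fallback in that case, you can filter the WARN lines out of the captured output instead, since they all start with the same prefix:

# Strip the SLF4J WARN lines from Hive CLI's stdout before using it.
output=$(hive -e "show tables;" | grep -v '^WARN: ')
echo "$output"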

Hope above information helps.