Locate Hive Tables with Same HDFS Location

When you create a Hive table, you can specify the LOCATION of the table, regardless of whether it is a managed or an external table. The syntax looks like below:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  ....
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]

If the LOCATION clause is omitted, Hive will use the default location under /user/hive/warehouse, with the database and table names from the CREATE TABLE statement appended to the path.
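
To see where a particular table actually points to, you can check its Location field with DESCRIBE FORMATTED. Below is a minimal sketch from the shell, assuming beeline is available; the JDBC URL is a placeholder for your own HiveServer2 host:

# Show the table's metadata and pull out the Location field
# (the JDBC URL is a placeholder -- replace it with your HiveServer2 host)
beeline -u jdbc:hive2://hs2-host:10000/default \
    -e "DESCRIBE FORMATTED db1.test;" | grep -i location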

Hive, however, will not stop you from creating different tables that point to the same HDFS location. This can cause issues if users make mistakes.

Currently, Hive provides no built-in way to report how many tables share the same HDFS location, in case you want to clean up and confirm that tables were created correctly.

The only way I can see is to query the backend database that the Hive metastore uses, typically MySQL, Oracle or PostgreSQL.

Below is the query that you can run to retrieve the list of tables in Hive that have the same HDFS location:

SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE, s.LOCATION 
    FROM TBLS t 
    JOIN SDS s ON (t.SD_ID = s.SD_ID) 
    JOIN DBS d on (t.DB_ID = d.DB_ID) 
    JOIN (
        SELECT LOCATION 
        FROM SDS 
        GROUP BY LOCATION 
        HAVING COUNT(*) > 1 AND LOCATION IS NOT NULL
    ) l ON (l.LOCATION = s.LOCATION);

Please note that this query is based on MySQL; you might need to adjust it accordingly if you use a different database.
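
If your metastore is on MySQL, you can save the query above into a file and run it directly with the mysql client. A minimal sketch; the host, user and database names below are placeholders for your own metastore connection details:

# Run the duplicate-location report against the Hive metastore database
# (host, user and database name are placeholders; you will be prompted for the password)
mysql -h metastore-db-host -u hive -p metastore < duplicate_locations.sql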

I have created some tables and my test result looks like below:

+-------------+----------+----------------+---------------------------------------------------------+
| NAME        | TBL_NAME | TBL_TYPE       | LOCATION                                                |
+-------------+----------+----------------+---------------------------------------------------------+
| db1         | test     | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| default     | mytest   | EXTERNAL_TABLE | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| db2         | test3    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
| db2         | test4    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
+-------------+----------+----------------+---------------------------------------------------------+
4 rows in set (0.13 sec)

Hope the above query helps!

Access S3 in Hive through hadoop.security.credential.provider.path

A couple of years ago, I wrote a blog about how to securely manage passwords in Sqoop, so that the RDBMS password is not exposed to end users when running Sqoop jobs. The details can be found here.

Recently, one of our customers tested this feature with Hive accessing S3 credentials: they set hadoop.security.credential.provider.path in Hive to the location of the jceks file on HDFS, but could not get it working. I spent quite a few days researching this topic and concluded that Hive currently does not support reading S3A passwords from a JCEKS credential file.

I have the following findings:

1. When you create credential store files using the below command:

hadoop credential create fs.s3a.access.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks

Hadoop will produce the below warning:

WARNING: You have accepted the use of the default provider password
by not configuring a password in one of the two following locations:
    * In the environment variable HADOOP_CREDSTORE_PASSWORD
    * In a file referred to by the configuration entry
      hadoop.security.credstore.java-keystore-provider.password-file.
Please review the documentation regarding provider passwords in
the keystore passwords section of the Credential Provider API
Continuing with the default provider password.

I believe that the default keystore password is “none” if you do not set HADOOP_CREDSTORE_PASSWORD before running the command.
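
If you prefer to use a password of your own when creating the keystore, you can export HADOOP_CREDSTORE_PASSWORD first and then list the stored aliases to confirm the entry. A minimal sketch; the password value is a placeholder, and keep in mind that whichever process later reads the keystore needs the same password in its environment:

# Use an explicit keystore password instead of the default one
# (the password value is a placeholder)
export HADOOP_CREDSTORE_PASSWORD='my-keystore-password'

# Create the S3A access key entry (the secret key would be added the same way)
hadoop credential create fs.s3a.access.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks

# Confirm which aliases are stored in the keystore
hadoop credential list -provider jceks://hdfs/keystore/s3-key.jceks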

2. Cloudera Manager generates a new HADOOP_CREDSTORE_PASSWORD for both HiveServer2 and HiveMetaStore after each restart. To get their current values, you can run the below commands on their hosts:

strings /proc/$(pgrep -f HiveServer2)/environ | grep HADOOP_CREDSTORE_PASSWORD

strings /proc/$(pgrep -f HiveMetaStore)/environ | grep HADOOP_CREDSTORE_PASSWORD

In my case, it returned the below output:

strings /proc/$(pgrep -f HiveServer2)/environ | grep HADOOP_CREDSTORE_PASSWORD
HADOOP_CREDSTORE_PASSWORD=dq440oxl9rgopo9c593k86vch

strings /proc/$(pgrep -f HiveMetaStore)/environ | grep HADOOP_CREDSTORE_PASSWORD
HADOOP_CREDSTORE_PASSWORD=8sttjj54e9ffdo6ovqcufxmp4

You can see that the HADOOP_CREDSTORE_PASSWORD values are different, so neither HiveServer2 nor HiveMetaStore will be able to read the manually generated jceks file, because the password does not match.

3. I also tried to override the environment variable HADOOP_CREDSTORE_PASSWORD for both HiveServer2 and HiveMetaStore, via “Hive Metastore Server Environment Advanced Configuration Snippet (Safety Valve)” and “HiveServer2 Environment Advanced Configuration Snippet (Safety Valve)” under Cloudera Manager > Hive > Configuration. However, this broke HiveMetaStore with the below errors:

2018-11-03 02:44:39,569 ERROR org.apache.hadoop.hive.metastore.HiveMetaStore: [main]: Metastore Thrift Server threw an exception...
java.lang.RuntimeException: Error getting metastore password: null
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:403)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:279)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:682)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:660)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:709)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:508)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6475)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6470)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:6720)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:6647)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException
	at org.apache.hadoop.hive.shims.Hadoop23Shims.getPassword(Hadoop23Shims.java:1144)
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:397)
	... 21 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.hive.shims.Hadoop23Shims.getPassword(Hadoop23Shims.java:1138)
	... 22 more
Caused by: java.io.IOException: Configuration problem with provider path.
	at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2118)
	at org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2037)
	... 27 more
Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
	at com.sun.crypto.provider.JceKeyStore.engineLoad(JceKeyStore.java:865)
	at java.security.KeyStore.load(KeyStore.java:1445)
	at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.locateKeystore(AbstractJavaKeyStoreProvider.java:335)
	at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.<init>(AbstractJavaKeyStoreProvider.java:88)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider.<init>(LocalJavaKeyStoreProvider.java:58)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider.<init>(LocalJavaKeyStoreProvider.java:50)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider$Factory.createProvider(LocalJavaKeyStoreProvider.java:177)
	at org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:63)
	at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2098)
	... 28 more

I have not had time to investigate why HiveMetaStore fails while HiveServer2 is OK, but this confirms that the feature is currently not supported in Hive.

I think this would be a good feature to have, because currently the S3 access and secret keys have to be saved in Hive’s configuration file, which means everyone has access to S3. With the keys saved in credential files, only users who have access to those files can access S3, which is more secure. I have filed a Cloudera internal JIRA to request this feature, but at the time of writing it has not been implemented yet.

For now, if you want a Hive table to access an S3 location, the access and secret keys need to be configured in the configuration XML file at the global level.
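
The properties involved are the standard S3A ones, fs.s3a.access.key and fs.s3a.secret.key. Before adding them to the global configuration, you can quickly sanity-check the keys from the command line; a minimal sketch, with the bucket name and key values as placeholders:

# Verify that the S3 keys work before putting them into the global configuration
# (bucket name and key values are placeholders)
hadoop fs -D fs.s3a.access.key=MY_ACCESS_KEY \
          -D fs.s3a.secret.key=MY_SECRET_KEY \
          -ls s3a://my-bucket/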

What an experience applying for an Indian Visa

I need to travel to India at the end of November this year to train up our new colleagues in the new Chennai office, so I started applying for an Indian visa online through their official website, IndiaVisaOnline. After several attempts, I finally got through the final stage; however, I would like to share my extreme experience of the last couple of weeks.

1. I crashed their website at least 5-6 times on the same page, and each time it took around 5 minutes to come back up again. Based on my testing, I believe entering a website address without the “http://” prefix can crash it. I am really thinking about testing it again..

2. I had to squeeze phone numbers and addresses into the same field, which has an 80-character limit. This forced me to truncate random characters to make them fit, and the address became totally unrecognisable. So what’s the point here??

3. The passport and business card need to be uploaded in PDF format. Fair enough, but why the 300K limit? I had to keep shrinking the image as PNG, converting it and repeating the loop, until I could barely see the text on my passport!! THIS ISN’T THE 1980s folks!!! GO TO OFFICEWORKS AND BUY A HARD DRIVE!!!

4. Now, after several hours of trying, I finally got to the last step. And what a surprise in the success message: every piece of information was missing from the confirmation page. I am wondering, should I call them to confirm whether I made it through or not?

In my manager’s words, “A monkey with no arms could code better than whatever el-cheapo graduate they fished out of the fail pond to write this website”!!

I am sure there are others out there who have had the same experience as mine. Has anyone complained to the officials?

How to Disable Actions in Oozie

Oozie is an orchestration system for managing Hadoop jobs. Currently it supports various actions, including, but not limited to, Spark1, Spark2, Shell, SSH, Hive, Sqoop and Java. However, due to certain business requirements, sometimes we want to disable some of those actions so that we can control how users run jobs through Oozie.

For example, currently when you set up an SSH action, you are required to configure passwordless login for a certain user from the Oozie server host to the target host. This also allows any user to set up a job that runs on the remote machine, so long as they know the username and the remote host's domain name. This is a security concern. There are thoughts on how to improve it, but no solution exists at this stage.

So if the business has such concerns, we can disable the SSH action easily. Please follow the below steps (assuming that you are using Cloudera Manager to manage CDH Hadoop):

1. Go to Cloudera Manager home page > Oozie > Configuration
2. Locate configuration called “Oozie Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml”
3. Click on the “Add” button, enter “oozie.service.ActionService.executor.classes” as the name, and set the value as below (this is the default list with the SSH executor removed):

org.apache.oozie.action.decision.DecisionActionExecutor,org.apache.oozie.action.hadoop.JavaActionExecutor,org.apache.oozie.action.hadoop.FsActionExecutor,org.apache.oozie.action.hadoop.MapReduceActionExecutor,org.apache.oozie.action.hadoop.PigActionExecutor,org.apache.oozie.action.hadoop.HiveActionExecutor,org.apache.oozie.action.hadoop.ShellActionExecutor,org.apache.oozie.action.hadoop.SqoopActionExecutor,org.apache.oozie.action.hadoop.DistcpActionExecutor,org.apache.oozie.action.hadoop.Hive2ActionExecutor,org.apache.oozie.action.oozie.SubWorkflowActionExecutor,org.apache.oozie.action.email.EmailActionExecutor,org.apache.oozie.action.hadoop.SparkActionExecutor

The full list is:

org.apache.oozie.action.decision.DecisionActionExecutor,org.apache.oozie.action.hadoop.JavaActionExecutor,org.apache.oozie.action.hadoop.FsActionExecutor,org.apache.oozie.action.hadoop.MapReduceActionExecutor,org.apache.oozie.action.hadoop.PigActionExecutor,org.apache.oozie.action.hadoop.HiveActionExecutor,org.apache.oozie.action.hadoop.ShellActionExecutor,org.apache.oozie.action.hadoop.SqoopActionExecutor,org.apache.oozie.action.hadoop.DistcpActionExecutor,org.apache.oozie.action.hadoop.Hive2ActionExecutor,org.apache.oozie.action.ssh.SshActionExecutor,org.apache.oozie.action.oozie.SubWorkflowActionExecutor,org.apache.oozie.action.email.EmailActionExecutor,org.apache.oozie.action.hadoop.SparkActionExecutor

so we just need to remove the org.apache.oozie.action.ssh.SshActionExecutor class from it. In general, just remove the action classes that you do not want Oozie to support.

4. Save and restart Oozie
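
After the restart, you can optionally confirm the active executor list from the Oozie server. A minimal sketch, assuming the Oozie CLI is available and that your Oozie version supports the admin -configuration option; the Oozie URL is a placeholder:

# Dump the live Oozie configuration and check the executor class list
# (the Oozie URL is a placeholder -- point it at your own Oozie server)
oozie admin -oozie http://oozie-host:11000/oozie -configuration \
    | grep oozie.service.ActionService.executor.classes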

After that, if you try to run an SSH action through Oozie, it will fail, and Hue will show an error for the workflow.

Please keep in mind that if you do make this change, remember to check the version of Oozie that you upgrade to in the future, to make sure that any newly supported classes are added to this list, otherwise other jobs will fail. For example, currently Oozie only supports the Spark1 action and Spark2 is not supported; in the latest CDH 6.x, however, Spark2 is supported, so the list will need to be updated.

How to Compress and Extract Multipart Zip Files on Linux

This blog post explains how to create multipart zip files and then extract them on another Linux host, in case a single zip file is too big to transfer from one host to another. I will demonstrate this on a CentOS host; other distributions will be similar, apart from the installation command.

1. Firstly, you will need to install the p7zip utility:

sudo yum -y install p7zip

2. Then, you can create the archive using the below command:

7za -v100m a test.zip test

The above command tells 7za to create split volumes of 100MB each; the archive file name is test.zip and the source directory is “test”, which contains the files that you want to archive.

My sample output looks like below:

[root@localhost ~]# 7za -v100m a test.zip test

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (406F1),ASM,AES-NI)

Scanning the drive:
1500 folders, 7363 files, 1347676548 bytes (1286 MiB)

Creating archive: test.zip

Items to compress: 8863


Files read from disk: 7363
Archive size: 473633025 bytes (452 MiB)
Everything is Ok

3. The following files will be created under the current directory:

-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.001
-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.002
-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.003
-rw-r--r-- 1 root root 104857600 Sep 16 20:27 test.zip.004
-rw-r--r-- 1 root root  54202625 Sep 16 20:27 test.zip.005
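
To move the parts to the destination host, any file transfer tool will do. Below is a minimal sketch using scp plus checksums to confirm nothing was corrupted in transit; the destination host name and path are placeholders:

# Generate checksums for the parts, then copy everything across
# (destination host and path are placeholders)
sha256sum test.zip.0* > test.zip.sha256
scp test.zip.0* test.zip.sha256 user@dest-host:/tmp/

# On the destination host, verify the parts before extracting
cd /tmp && sha256sum -c test.zip.sha256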

4. After all the files have been transferred to the destination host, you can run the below command to unzip them:

7za x test.zip.001

All you need to specify is the first split file with the “.001” extension, and 7za will find the rest. My sample output looks like below:

[root@localhost ~]# 7za x test.zip.001

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (406F1),ASM,AES-NI)

Scanning the drive for archives:
1 file, 209715200 bytes (200 MiB)

Extracting archive: test.zip.001
--
Path = test.zip.001
Type = Split
Physical Size = 209715200
Volumes = 3
Total Physical Size = 473633025
----
Path = test.zip
Size = 473633025
--
Path = test.zip
Type = zip
Physical Size = 473633025

Everything is Ok

Folders: 1500
Files: 7363
Size:       1347676548
Compressed: 473633025
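
If you only want to verify the archive without extracting it, 7za also has a test command; a minimal sketch:

# Test the integrity of the multipart archive without extracting it
7za t test.zip.001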

Hope the above helps.