Another Bad Experience with Singapore Visa and Flights

I have had many bad experiences with visa applications before, including Canadian and Indian visas, which I have blogged about here before. This time, it is Singapore's turn.

I have a trip planned to Cloudera's Chennai office next week. Since I have never been to Singapore and my flight transits there, I decided to bring my wife along for a short holiday before heading to Chennai. As I hold a Chinese passport, I need a visa to enter Singapore. After checking the official website, I learned that we had to wait until we were within 30 days of travel before we could make an appointment with VFS Global, the third-party vendor that processes Singapore visa applications.

Our planned travel date was 21st of Nov, so I tried to make a booking on 22nd of Oct, only to find that the earliest appointment available for us was 15th of Nov, just 6 days before our flight, and only 4 working days. I emailed them to complain that it was too rushed to apply for a visa less than a week before our flight: the visa takes 4 working days to process, so we would only get it one day before departure, and if anything went wrong, we would have no second chance. They rejected my request for an earlier appointment; actually, they ignored me without a reply!

Same deal for my parents: their flight to Singapore is planned for 11th of Dec, and the earliest appointment I could make was 6th of Dec, just 5 days earlier, including the weekend.

I understand the 30-day rule for applying for a visa, but I do NOT understand why it also applies to making appointments. This is nonsense.

On the day we took our documents to the interview at VFS Global's Melbourne office, what was even more nonsensical was that they asked every one of us to change the Date field next to our signature on the main application form to that day's date, and then sign again because we had changed the field. But why? No one takes a blank application form and fills it in on the spot! And if you fill in the form beforehand, you naturally write the date you filled it in, not the interview date, which is still in the future. Why would this date matter anyway? Pretty much everyone there that day had to change this field and sign again. What's the point! If this is a requirement, make it clear on the form, PLEASE!

OK! The visa part ended well; we got it just the day before our flight, phew! Then this morning, on the day of our flight, while checking my email at 6AM, I noticed that Jetstar had notified us that our flight JQ007, scheduled to fly out of Melbourne at 12PM, had been CANCELLED. What the hell? Following the recovery options, I was not able to book an alternative flight, because all the others on the same day were fully booked, possibly due to this cancellation, and I had to move the flight to the next day, 22nd of Nov.

I called the hotel I had booked in Singapore and confirmed that the hotel fee is non-refundable.

Hoping to get another flight the same day so that we would not waste a day, I searched on Skyscanner and found that Scoot had flights available at a very cheap price. So I called Jetstar again to cancel our Melbourne-to-Singapore leg and made a new booking straight away on Scoot's website. The flight was scheduled for 1:20PM that afternoon.

We packed everything up, booked a ride through DiDi and reached Melbourne Airport at around 10:30AM. However, when we checked in, we realised that this flight had also been delayed, by 8 hours, and rescheduled to 9:15PM instead of 1:20PM! We had bought the tickets online only an hour earlier, yet were never notified of this huge delay. They just do not care about passengers who travel all the way to the airport only to find out they are stuck there for more than 8 hours.

In the end, Scoot provided us with an AUD$20 meal voucher to use at the airport, but we had to get through the long wait without knowing what to do!

Budget airlines? Think twice in the future before booking; Jetstar and Scoot are certainly on the list to watch out for!

Locate Hive Tables with Same HDFS Location

When you create a Hive table, you can specify the LOCATION of the table, regardless of whether the table is managed or external. The syntax is as below:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  ....
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]

If the LOCATION clause is omitted, Hive will use the default location under /user/hive/warehouse, derived from the database and table names specified in the CREATE TABLE statement.

Hive, however, will not stop you from creating different tables that point to the same HDFS location. This can cause issues if users make mistakes.
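
For example, the sketch below (column definitions are just placeholders) reproduces the first two rows of the test result shown later: a managed table db1.test created without a LOCATION clause, whose data therefore lands under the default warehouse path, and an external table default.mytest that explicitly points at that same path:

CREATE TABLE db1.test (id INT);
-- data goes to the default warehouse location:
-- /user/hive/warehouse/db1.db/test

CREATE EXTERNAL TABLE mytest (id INT)
LOCATION '/user/hive/warehouse/db1.db/test';
-- default.mytest now shares db1.test's HDFS location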

Currently, Hive provides no built-in way to report which tables share the same HDFS location, in case you want to clean up and verify that tables were created correctly.

The only way I can see is to query the backend database that Hive uses, typically MySQL, Oracle or PostgreSQL.

Below is a query you can run to retrieve the list of tables in Hive that share the same HDFS location:

SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE, s.LOCATION 
    FROM TBLS t 
    JOIN SDS s ON (t.SD_ID = s.SD_ID) 
    JOIN DBS d on (t.DB_ID = d.DB_ID) 
    JOIN (
        SELECT LOCATION 
        FROM SDS 
        GROUP BY LOCATION 
        HAVING COUNT(*) > 1 AND LOCATION IS NOT NULL
    ) l ON (l.LOCATION = s.LOCATION);

Please note that this query was written against MySQL; you might need to adjust it accordingly if you use a different database.
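
For example, a PostgreSQL-backed metastore creates its schema with quoted upper-case identifiers, so a rough equivalent (a sketch; adjust to your schema version) would be:

SELECT d."NAME", t."TBL_NAME", t."TBL_TYPE", s."LOCATION"
    FROM "TBLS" t
    JOIN "SDS" s ON (t."SD_ID" = s."SD_ID")
    JOIN "DBS" d ON (t."DB_ID" = d."DB_ID")
    JOIN (
        SELECT "LOCATION"
        FROM "SDS"
        GROUP BY "LOCATION"
        HAVING COUNT(*) > 1 AND "LOCATION" IS NOT NULL
    ) l ON (l."LOCATION" = s."LOCATION");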

I created some tables to test, and the result looks like below:

+-------------+----------+----------------+---------------------------------------------------------+
| NAME        | TBL_NAME | TBL_TYPE       | LOCATION                                                |
+-------------+----------+----------------+---------------------------------------------------------+
| db1         | test     | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| default     | mytest   | EXTERNAL_TABLE | hdfs://NN-HOST:8020/user/hive/warehouse/db1.db/test     |
| db2         | test3    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
| db2         | test4    | MANAGED_TABLE  | hdfs://NN-HOST:8020/user/hive/warehouse/db2.db/test2    |
+-------------+----------+----------------+---------------------------------------------------------+
4 rows in set (0.13 sec)

Hope the above query helps!

Access S3 in Hive through hadoop.security.credential.provider.path

A couple of years ago, I wrote a blog about how to securely manage passwords in Sqoop, so that the RDBMS password is not exposed to end users when running Sqoop jobs. The details can be found here.

Recently, one of our customers tested the same idea with Hive accessing S3: they set hadoop.security.credential.provider.path in Hive to the location of a jceks file on HDFS, but were not able to get it working. I spent quite a few days researching this topic and concluded that Hive currently does not support reading S3A credentials from a JCEKS credential file.
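
For reference, the setup being tested was roughly the below (a sketch; the jceks path is the same example path used further down), added to hive-site.xml so that Hive would look up the S3A keys from the credential file instead of plain-text properties:

<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/keystore/s3-key.jceks</value>
</property>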

I have the following findings:

1. When you create a credential store file using the command below:

hadoop credential create fs.s3a.access.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks

Hadoop will produce the warning below:

WARNING: You have accepted the use of the default provider password
by not configuring a password in one of the two following locations:
    * In the environment variable HADOOP_CREDSTORE_PASSWORD
    * In a file referred to by the configuration entry
      hadoop.security.credstore.java-keystore-provider.password-file.
Please review the documentation regarding provider passwords in
the keystore passwords section of the Credential Provider API
Continuing with the default provider password.

I believe that the default keystore password is “none” if you do not set HADOOP_CREDSTORE_PASSWORD before running the command.
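
In other words, to control the keystore password yourself, you would export HADOOP_CREDSTORE_PASSWORD before creating the credential file (a sketch; the password value is made up), and only processes that have the same value in their environment can then read the file:

# choose an explicit keystore password instead of the default "none"
export HADOOP_CREDSTORE_PASSWORD='my-keystore-password'

# store both S3A keys in the credential file
hadoop credential create fs.s3a.access.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks
hadoop credential create fs.s3a.secret.key -value XXXXXXXXXXX \
    -provider jceks://hdfs/keystore/s3-key.jceks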

2. Cloudera Manager generates a new HADOOP_CREDSTORE_PASSWORD for both HiveServer2 and HiveMetaStore after each restart. To get their values, you can run the commands below on their respective hosts:

strings /proc/$(pgrep -f HiveServer2)/environ | grep HADOOP_CREDSTORE_PASSWORD

strings /proc/$(pgrep -f HiveMetaStore)/environ | grep HADOOP_CREDSTORE_PASSWORD

In my case, they returned the output below:

strings /proc/$(pgrep -f HiveServer2)/environ | grep HADOOP_CREDSTORE_PASSWORD
HADOOP_CREDSTORE_PASSWORD=dq440oxl9rgopo9c593k86vch

strings /proc/$(pgrep -f HiveMetaStore)/environ | grep HADOOP_CREDSTORE_PASSWORD
HADOOP_CREDSTORE_PASSWORD=8sttjj54e9ffdo6ovqcufxmp4

You can see that HADOOP_CREDSTORE_PASSWORD differs between the two, so neither HiveServer2 nor HiveMetaStore will be able to read the manually generated jceks file, because the passwords do not match.

3. I also tried overriding the environment variable HADOOP_CREDSTORE_PASSWORD for both HiveServer2 and HiveMetaStore, via “Hive Metastore Server Environment Advanced Configuration Snippet (Safety Valve)” and “HiveServer2 Environment Advanced Configuration Snippet (Safety Valve)” under Cloudera Manager > Hive > Configuration. However, this broke HiveMetaStore with the errors below:

2018-11-03 02:44:39,569 ERROR org.apache.hadoop.hive.metastore.HiveMetaStore: [main]: Metastore Thrift Server threw an exception...
java.lang.RuntimeException: Error getting metastore password: null
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:403)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:279)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:682)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:660)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:709)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:508)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6475)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6470)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:6720)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:6647)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException
	at org.apache.hadoop.hive.shims.Hadoop23Shims.getPassword(Hadoop23Shims.java:1144)
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:397)
	... 21 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.hive.shims.Hadoop23Shims.getPassword(Hadoop23Shims.java:1138)
	... 22 more
Caused by: java.io.IOException: Configuration problem with provider path.
	at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2118)
	at org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2037)
	... 27 more
Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
	at com.sun.crypto.provider.JceKeyStore.engineLoad(JceKeyStore.java:865)
	at java.security.KeyStore.load(KeyStore.java:1445)
	at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.locateKeystore(AbstractJavaKeyStoreProvider.java:335)
	at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.<init>(AbstractJavaKeyStoreProvider.java:88)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider.<init>(LocalJavaKeyStoreProvider.java:58)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider.<init>(LocalJavaKeyStoreProvider.java:50)
	at org.apache.hadoop.security.alias.LocalJavaKeyStoreProvider$Factory.createProvider(LocalJavaKeyStoreProvider.java:177)
	at org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:63)
	at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2098)
	... 28 more

I have not had time to investigate why HiveMetaStore fails while HiveServer2 is fine. In any case, this confirms that the feature is currently not supported in Hive.

I think this would be a good feature to have, because currently the S3 access and secret keys have to be saved in Hive's configuration file, which means everyone can access S3. With the keys stored in a credential file, only users with access to that file could reach S3, which is more secure. So I filed an internal Cloudera JIRA to request this feature, but at the time of writing it has not been implemented yet.

For now, if you want a Hive table to access an S3 location, the access and secret keys need to be configured in the configuration XML file at the global level.
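
As a rough sketch (the property values are placeholders), that global configuration in core-site.xml or hive-site.xml looks like the below, which is exactly why reading the keys from a credential file would be the nicer option:

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>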