What an Experience Applying for an Indian Visa

I need to travel to India at the end of November this year to train our new colleagues in the new Chennai office, so I started applying for an Indian visa online through the official website, IndiaVisaOnline. After several attempts I finally got through the final stage, but I would like to share my extreme experience of the last couple of weeks.

1. I crashed their website at least 5-6 times on the same page, and each time it took around 5 minutes to come back up and running again. Based on my testing, I believe that entering the site's URL without the “http://” prefix can crash it. I am really tempted to test it again..

2. I had to squeeze phone numbers and addresses into a single field with an 80-character limit, which forced me to truncate random characters to make them fit, leaving a totally unrecognisable address. So what’s the point here??

3. The passport and business card need to be uploaded in PDF format. Well, fair enough, but why a 300KB limit? I had to keep shrinking the image as a PNG, converting to PDF and repeating the loop, until I could barely read the text on my own passport!! THIS ISN’T THE 1980s, folks!!! GO TO OFFICEWORKS AND BUY A HARD DRIVE!!!

4. Now, after several hours of trying, I finally got to the last step. And what a surprise the success message was, with every piece of information missing from the confirmation page. I am wondering: should I call them to confirm whether I made it through or not?

In my manager’s words, “A monkey with no arms could code better than whatever el-cheapo graduate they fished out of the fail pond to write this website”!!

I am sure there are others out there who have had the same experience as mine. Has anyone complained to the officials?

How to Disable Actions in Oozie

Oozie is an orchestration system for managing Hadoop jobs. It currently supports various actions, including, but not limited to, Spark1, Spark2, Shell, SSH, Hive, Sqoop and Java. However, due to certain business requirements, sometimes we want to disable some of those actions so that we can control how users run jobs through Oozie.

For example, when you set up an SSH action, you currently need to configure passwordless login for a certain user from the Oozie server host to the target host. This also allows any user to set up a job that runs on the remote machine, so long as they know the username and the remote host’s domain name. This is a security concern. There are thoughts on how to improve it, but no solution exists at this stage.

So if the business has such concerns, we can disable the SSH action easily. Please follow the steps below (assuming that you are using Cloudera Manager to manage a CDH Hadoop cluster):

1. Go to Cloudera Manager home page > Oozie > Configuration
2. Locate configuration called “Oozie Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml”
3. Click on the “Add” button, enter “oozie.service.ActionService.executor.classes” as the name, and use the value below:

org.apache.oozie.action.decision.DecisionActionExecutor,org.apache.oozie.action.hadoop.JavaActionExecutor,org.apache.oozie.action.hadoop.FsActionExecutor,org.apache.oozie.action.hadoop.MapReduceActionExecutor,org.apache.oozie.action.hadoop.PigActionExecutor,org.apache.oozie.action.hadoop.HiveActionExecutor,org.apache.oozie.action.hadoop.ShellActionExecutor,org.apache.oozie.action.hadoop.SqoopActionExecutor,org.apache.oozie.action.hadoop.DistcpActionExecutor,org.apache.oozie.action.hadoop.Hive2ActionExecutor,org.apache.oozie.action.oozie.SubWorkflowActionExecutor,org.apache.oozie.action.email.EmailActionExecutor,org.apache.oozie.action.hadoop.SparkActionExecutor

For reference, the full default list is:

org.apache.oozie.action.decision.DecisionActionExecutor,org.apache.oozie.action.hadoop.JavaActionExecutor,org.apache.oozie.action.hadoop.FsActionExecutor,org.apache.oozie.action.hadoop.MapReduceActionExecutor,org.apache.oozie.action.hadoop.PigActionExecutor,org.apache.oozie.action.hadoop.HiveActionExecutor,org.apache.oozie.action.hadoop.ShellActionExecutor,org.apache.oozie.action.hadoop.SqoopActionExecutor,org.apache.oozie.action.hadoop.DistcpActionExecutor,org.apache.oozie.action.hadoop.Hive2ActionExecutor,org.apache.oozie.action.ssh.SshActionExecutor,org.apache.oozie.action.oozie.SubWorkflowActionExecutor,org.apache.oozie.action.email.EmailActionExecutor,org.apache.oozie.action.hadoop.SparkActionExecutor

so the value in step 3 is simply this list with the org.apache.oozie.action.ssh.SshActionExecutor action class removed. In general, just remove the action classes that you do not want Oozie to support.

4. Save and restart Oozie
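For reference, the safety valve entry above ends up as a property in oozie-site.xml, along the lines of the sketch below (the value is truncated here for readability; use the full comma-separated list from step 3):

```xml
<property>
  <name>oozie.service.ActionService.executor.classes</name>
  <!-- Full list from step 3, with org.apache.oozie.action.ssh.SshActionExecutor removed -->
  <value>org.apache.oozie.action.decision.DecisionActionExecutor,org.apache.oozie.action.hadoop.JavaActionExecutor,...</value>
</property>
```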

After that, if you try to run an SSH action through Oozie, it will fail, and the error will be visible in Hue.

Please keep in mind that if you do make this change, remember to review the list whenever you upgrade Oozie in the future, to make sure that any newly supported classes are added to it; otherwise jobs using them will fail. For example, Oozie currently supports only the Spark1 action, not Spark2. However, in the latest CDH 6.x releases Spark2 is supported, and the list will need to be updated.

How to Compress and Extract Multipart Zip Files on Linux

This blog post explains how to create multipart zip files and then extract them on another Linux host, in case a single zip file is too big to transfer from one host to another. I will demonstrate this on a CentOS host; other distributions will be similar, apart from the installation command.

1. Firstly, you will need to install the p7zip utility:

sudo yum -y install p7zip

2. Then, you can create the archive using the command below:

7za -v100m a test.zip test

The above command tells 7za to create volumes of 100MB each; the archive file name is test.zip and the source directory is “test”, which contains the files that you want to archive.

My sample output looks like below:

[root@localhost ~]# 7za -v100m a test.zip test

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (406F1),ASM,AES-NI)

Scanning the drive:
1500 folders, 7363 files, 1347676548 bytes (1286 MiB)

Creating archive: test.zip

Items to compress: 8863


Files read from disk: 7363
Archive size: 473633025 bytes (452 MiB)
Everything is Ok

3. The following files will be created under the current directory:

-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.001
-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.002
-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.003
-rw-r--r-- 1 root root 104857600 Sep 16 20:27 test.zip.004
-rw-r--r-- 1 root root  54202625 Sep 16 20:27 test.zip.005

4. After all the files have been transferred to the destination host, you can run the command below to unzip them:

7za x test.zip.001

All you need to specify is the first split file, with the “.001” extension, and 7za will find the rest automatically. My sample output looks like below:

[root@localhost ~]# 7za x test.zip.001

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (406F1),ASM,AES-NI)

Scanning the drive for archives:
1 file, 209715200 bytes (200 MiB)

Extracting archive: test.zip.001
--
Path = test.zip.001
Type = Split
Physical Size = 209715200
Volumes = 3
Total Physical Size = 473633025
----
Path = test.zip
Size = 473633025
--
Path = test.zip
Type = zip
Physical Size = 473633025

Everything is Ok

Folders: 1500
Files: 7363
Size:       1347676548
Compressed: 473633025
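As an aside, the same split-and-reassemble idea works with the standard `split` and `cat` utilities, which is handy when p7zip is not available on both hosts. Note that `split` only chunks the file, it does not compress, so it suits archives that are already compressed. A sketch (all file names below are examples):

```shell
# Create a dummy 5MB file to stand in for a real zip archive
dd if=/dev/urandom of=demo.zip bs=1M count=5 2>/dev/null

# Split it into 2MB parts: demo.zip.part.aa, .ab, .ac
split -b 2M demo.zip demo.zip.part.

# On the destination host, reassemble the parts and verify the result
cat demo.zip.part.* > demo_reassembled.zip
cmp demo.zip demo_reassembled.zip && echo "reassembly OK"
```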

Hope the above helps.

Impala Query Profile Explained – Part 1

If you work with Impala but have no idea how to interpret Impala query PROFILEs, it is very hard to understand what’s going on and how to make your queries run at their full potential. I think this is the case for lots of Impala users, so I would like to write a simple blog post to share my experience and hope that it can help anyone who would like to learn more.

This is Part 1 of the series, so I will start with the basics and just cover the main things to look out for when examining a PROFILE.

So, first things first: how do you collect an Impala query PROFILE? There are a couple of ways. The simplest is to just run “PROFILE” after your query in impala-shell, like below:

[impala-daemon-host.com:21000] > SELECT COUNT(*) FROM sample_07;
Query: SELECT COUNT(*) FROM sample_07
Query submitted at: 2018-09-14 15:57:35 (Coordinator: https://impala-daemon-host.com:25000)
Query progress can be monitored at: https://impala-daemon-host.com:25000/query_plan?query_id=36433472787e1cab:29c30e7800000000
+----------+
| count(*) |
+----------+
| 823      |
+----------+
Fetched 1 row(s) in 6.68s

[impala-daemon-host.com:21000] > PROFILE; <-- Simply run "PROFILE" as a query
Query Runtime Profile:
Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
.....

You can also collect it from the Cloudera Manager web UI, by navigating to CM > Impala > Queries, locating the query you just ran and clicking on “Query Details”.

Then scroll down a bit to locate the “Download Profile” button:

Last, but not least, you can navigate to the Impala daemon’s web UI and download it from there. Go to the Impala daemon that was used as the coordinator to run the query:

https://{impala-daemon-url}:25000/queries

The list of queries will be displayed:

Click through the “Details” link and then the “Profile” tab:

All right, now that we have the PROFILE, let’s dive into the details.

Below is the snippet of the query PROFILE that we will go through today: the Summary section at the top of the PROFILE:

Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)
User: hive@VPC.CLOUDERA.COM
Connected User: hive@VPC.CLOUDERA.COM
Delegated User:
Network Address: ::ffff:172.26.26.117:58834
Default Db: default
Sql Statement: SELECT COUNT(*) FROM sample_07
Coordinator: impala-daemon-url.com:22000
Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0
Plan:
----------------

Let’s break it into sections and walk through them one by one. There are a few pieces of important information here that are used most often:

a. Query ID:

Query (id=36433472787e1cab:29c30e7800000000):

This is useful for identifying relevant query-related information in the Impala daemon logs. Simply search for this query ID and you can find out what the query was doing behind the scenes; this is especially useful for finding related error messages.
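To sketch what that looks like, below is a grep against some hypothetical daemon log lines; on a real CDH cluster the file would typically be /var/log/impalad/impalad.INFO on the coordinator host, and the log lines here are made up for illustration:

```shell
# Hypothetical log lines standing in for the real impalad.INFO file
cat > impalad.INFO.sample <<'EOF'
I0914 15:57:35.883111 12345 impala-server.cc] Registered query query_id=36433472787e1cab:29c30e7800000000
I0914 15:57:40.123456 12346 impala-server.cc] Some unrelated message for another query
I0914 15:57:42.565042 12345 impala-server.cc] UnregisterQuery(): query_id=36433472787e1cab:29c30e7800000000
EOF

# Pull out everything logged for this query ID
grep "36433472787e1cab:29c30e7800000000" impalad.INFO.sample
```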

b. Session Type:

Session Type: BEESWAX

This tells us where the connection came from. BEESWAX means that the query was run from the impala-shell client. If you run it from Hue, the type will be “HIVESERVER2”, since Hue connects via the HiveServer2 Thrift interface.

c. Start and End time:

Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000

This is useful for telling how long the query ran. Please keep in mind that this time includes session idle time. If you run a simple query that returns in a few seconds from Hue, the time here might appear longer than expected, because Hue keeps the query handle open until the session is closed or the user runs another query. If the query is run through impala-shell, however, the start and end times should match the actual run time, since impala-shell closes the query handle straight away when the query finishes.

d. Query status:

Query Status: OK

This tells you whether the query finished successfully or not. OK means good. If there are errors, they will normally show here; for example: cancelled by user, session timeout, exceptions, etc.

e. Impala version:

Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)

This confirms the version of Impala that was used to run the query. If it does not match your installation, then something is not set up properly.

f. User information:

User: hive@XXX.XXXXXX.COM
Connected User: hive@XXX.XXXXXX.COM
Delegated User:

You can find out who ran the query from this session, so you know who to blame :).

g. DB selected on connection:

Default Db: default

Not used a lot, but good to know.

h. The query used to generate this PROFILE:

Sql Statement: SELECT COUNT(*) FROM sample_07

You will need this info if you are helping others troubleshoot, as you need to know how the query was constructed and which tables are involved. In lots of cases a simple rewrite of the query will resolve issues or boost query performance.

i. The Impala daemon that was used to run the query, which we call the Coordinator:

Coordinator: impala-daemon-host.com:22000

This is an important piece of information, as it determines which host to get the Impala daemon log from, should you wish to check the INFO, WARNING and ERROR level logs.

j. Query Options used for this query:

Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0

This section tells you what QUERY OPTIONS were applied to the current query, if any. It is useful for seeing whether there are any user-level or pool-level overrides that affect the query. For example, if the Impala daemon’s memory limit is set at, say, 120GB, but a small query still fails with an OutOfMemory error, this is where you would check whether the user accidentally set MEM_LIMIT in their session to a lower value that could result in the OutOfMemory error.

This concludes Part 1 of the series, which explained the basic information in the Summary section of the PROFILE. In the next part of the series, I will explain the Query Plan as well as the Execution Summary sections of the PROFILE in detail.

If you have any comments or suggestions, please let me know in the comments section below. Thanks!

Simple Tool to Enable SSL/TLS for CM/CDH Cluster

Since earlier this year, Cloudera has run a new program that allows each Support Engineer to take a full week of offline self-learning. Topics can be chosen by each engineer so long as the outcome has value to the business: it can be the engineer skilling up with a certification that helps with day-to-day work, or a presentation sharing with the rest of the team what he/she learnt during the week of self-learning. Last week, from the 27th to the 31st of August, was my turn.

After careful consideration, I decided that my knowledge in the SSL/TLS area needed improvement, so I planned to find some SSL/TLS-related courses on either SafariOnline or Lynda, and then see if I could enable SSL/TLS for Cloudera Manager as well as most of the CDH services, ideally putting everything into a script so that the process could be automated. I discussed this with my manager and we agreed on the plan.

On the first two days, I found a couple of very useful video courses on Lynda.com; see the links below:

SSL Certificates For Web Developers
Learning Secure Sockets Layer

They were very useful in helping me get a better understanding of the fundamentals of SSL/TLS and of how to generate keys and sign certificates yourself.
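As a taste of what those courses cover, generating a self-signed certificate and inspecting it takes only a couple of openssl commands (the file and host names below are just examples):

```shell
# Generate a 2048-bit RSA key and a self-signed certificate valid for one year
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout example.key -out example.crt -days 365 \
  -subj "/CN=host1.example.com"

# Inspect the subject and validity period of the new certificate
openssl x509 -in example.crt -noout -subject -dates
```

A certificate like this is fine for a lab cluster, but browsers and clients will not trust it until it is signed by a CA they know about.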

After that, I reviewed Cloudera’s official online documentation on how to enable SSL/TLS for Cloudera Manager as well as the rest of the CDH services, and built a little tool, written in shell script, that allows anyone to generate certificates on the fly and enable SSL/TLS for his/her cluster with a couple of simple commands.

The documentation links can be found below:

Configuring TLS Encryption for Cloudera Manager
Configuring TLS/SSL Encryption for CDH Services

I have published this little tool on GitHub, and it is available here. Currently it supports enabling SSL/TLS for the following services:

Cloudera Manager (from Level 1 to Level 3 security)
HDFS
YARN
Hive
Impala
Oozie
HBase
Hue

With this tool, users can enable SSL/TLS for any of the above services with ease, in a few minutes.

If you have any suggestions or comments, please leave them in the comments section below. Thanks!