How to Compress and Extract Multipart Zip Files on Linux

This blog post explains how to create multipart zip files and then extract them in another host which runs on Linux, in case that the single zip file is too big to transport from one host to another. I will demonstrate this on CentOS host, other distributions will be similar, apart from installation command.

1. Firstly, you will need to install p7zip utility:

sudo yum -y install p7zip

2. Then, you can create archive file using below command:

7za -v100m a test.zip test

The above command tells 7za to create files with volume of 100MB each, the archive file name is test.zip and the source of the directory is “test”, which contains the files that you need to create archive from.

My sample output looks like below:

[root@localhost ~]# 7za -v100m a test.zip test

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (406F1),ASM,AES-NI)

Scanning the drive:
1500 folders, 7363 files, 1347676548 bytes (1286 MiB)

Creating archive: test.zip

Items to compress: 8863


Files read from disk: 7363
Archive size: 473633025 bytes (452 MiB)
Everything is Ok

3. The following files will be created under the current directory:

-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.001
-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.002
-rw-r--r-- 1 root root 104857600 Sep 16 20:26 test.zip.003
-rw-r--r-- 1 root root 104857600 Sep 16 20:27 test.zip.004
-rw-r--r-- 1 root root  54202625 Sep 16 20:27 test.zip.005

4. After all the files being transported to the destination host, you can run below command to unzip those files:

7za x test.zip.001

All you need to specify is the first splitted file with “.001” extension and 7za will manage to find the rest. My sample output looks like below:

[root@localhost ~]# 7za x test.zip.001

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (406F1),ASM,AES-NI)

Scanning the drive for archives:
1 file, 209715200 bytes (200 MiB)

Extracting archive: test.zip.001
--
Path = test.zip.001
Type = Split
Physical Size = 209715200
Volumes = 3
Total Physical Size = 473633025
----
Path = test.zip
Size = 473633025
--
Path = test.zip
Type = zip
Physical Size = 473633025

Everything is Ok

Folders: 1500
Files: 7363
Size:       1347676548
Compressed: 473633025

Hope above helps.

Impala Query Profile Explained – Part 1

If you work with Impala, but have no idea how to interpret the Impala query PROFILEs, it would be very hard to understand what’s going on and how to make your query run at its full potential. I think this is the case for lots of Impala users, so I would like to write a simple blog post to share my experience and hope that it can help with anyone who like to learn more.

This is the Part 1 of the series, so I will go with the basics and just cover the main things to look out for when examining the PROFILE.

So first thing first, how do you collect Impala query PROFILE? Well, there are a couple of ways. The simplest way is to just run “PROFILE” after your query in impala-shell, like below:

[impala-daemon-host.com:21000] > SELECT COUNT(*) FROM sample_07;
Query: SELECT COUNT(*) FROM sample_07
Query submitted at: 2018-09-14 15:57:35 (Coordinator: https://impala-daemon-host.com:25000)
dQuery progress can be monitored at: https://impala-daemon-host.com:25000/query_plan?query_id=36433472787e1cab:29c30e7800000000
+----------+
| count(*) |
+----------+
| 823      |
+----------+
Fetched 1 row(s) in 6.68s

[impala-daemon-host.com:21000] > PROFILE; <-- Simply run "PROFILE" as a query
Query Runtime Profile:
Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
.....

You can also collect from Cloudera Manager Web UI, by navigating to CM > Impala > Queries, locate the query you just ran and click on “Query Details”

Then scroll down a bit to locate “Download Profile” button:

Last, but not least, you can navigate to Impala Daemon’s web UI and download from there. Go to the Impala Daemon that is used as the coordinator to run the query:

https://{impala-daemon-url}:25000/queries

The list of queries will be displayed:

Click through the “Details” link and then to “Profile” tab:

All right, so we have the PROFILE now, let’s dive into the details.

Below is the snippet of Query PROFILE we will go through today, which is the Summary section at the top of the PROFILE:

Query (id=36433472787e1cab:29c30e7800000000):
Summary:
Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
Session Type: BEESWAX
Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)
User: hive@VPC.CLOUDERA.COM
Connected User: hive@VPC.CLOUDERA.COM
Delegated User:
Network Address: ::ffff:172.26.26.117:58834
Default Db: default
Sql Statement: SELECT COUNT(*) FROM sample_07
Coordinator: impala-daemon-url.com:22000
Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0
Plan:
----------------

Let’s break it into sections and walk through one by one. There are a few important information here that used more often:

a. Query ID:

Query (id=36433472787e1cab:29c30e7800000000):

This is useful to identify relevant Query related information from Impala Daemon logs. Simply search this query ID and you can find out what it was doing behind the scene, especially useful for finding out related error messages.

b. Session Type:

Session Type: BEESWAX

This can tell us where the connection is from. BEESWAX means that the query ran from impala-shell client. If you run from Hue, the type will be “HIVESERVER2” since Hue connects via HiveServer2 thrift.

c. Start and End time:

Start Time: 2018-09-14 15:57:35.883111000
End Time: 2018-09-14 15:57:42.565042000

This is useful to tell how long the query ran for. Please keep it in mind that this time includes session idle time. So if you run a simple query that returns in a few seconds in Hue, since Hue keeps session open until session is closed or user runs another query, so the time here might show longer time than normal. The start and end time should match exactly the run time if run through impala-shell however, since impala-shell closes query handler straightaway after query finishes.

d. Query status:

Query Status: OK

This tells if the query finished successfully or not. OK means good. If there are errors, normally will show here, for example, cancelled by user, session timeout, Exceptions etc.

e. Impala version:

Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build 50eddf4550faa6200f51e98413de785bf1bf0de1)

This confirms the version that is used to run the query, if you see this is not matching with your installation, then something is not setup properly.

f. User information:

User: hive@XXX.XXXXXX.COM
Connected User: hive@XXX.XXXXXX.COM
Delegated User:

You can find out who ran the query from this session, so you know who to blame :).

g. DB selected on connection:

Default Db: default

Not used a lot, but good to know.

h. The query that used to return this PROFILE:

Sql Statement: SELECT COUNT(*) FROM sample_07

You will need this info if you are helping others to troubleshoot, as you need to know how query was constructed and what tables are involved. In lots of cases that a simple rewrite of the query will help to resolve issues or boost query performance.

i. The impala daemon that is used to run the query, what we called the Coordinator:

Coordinator: impala-daemon-host.com:22000

This is important piece of information, as you will determine which host to get the impala daemon log should you wish to check for INFO, WARNING and ERROR level logs.

j. Query Options used for this query:

Query Options (set by configuration):
Query Options (set by configuration and planner): MT_DOP=0

This section tells you what kind of QUERY OPTIONS being applied to the current query, if there are any. This is useful to see if there is any user level, or pool level overrides that will affect this query. One example would be if Impala Daemon’s memory is set at, say 120GB, but a small query still fails with OutOfMemory error. This is the place you will check if user accidentally set MEM_LIMIT in their session to a lower value that could results in OutOfMemory error.

This concludes the part 1 of the series to explain the Summary section of the query to understand the basic information. In the next part of the series, I will explain in detail on Query Plan as well as the Execution Summary of the PROFILE.

Any comments or suggestions, please let me know from the comments section below. Thanks

Simple Tool to Enable SSL/TLS for CM/CDH Cluster

Since earlier this year, Cloudera has started a new program that allows each Support Engineer to do a full week offline self-learning. Topics can be chosen by each individual engineer so long as the outcome has a value to the business, It can be either the engineer skilled up with a certification that helps with day to day work, or a presentation to share with the rest of the team what he/she had learnt from the week doing self-learning. Last week, from 27th of August to 31st of August was my turn.

After a careful consideration, I thought that my knowledge on SSL/TLS area needed to be skilled up, so I had decided to find some SSL/TLS related courses on either SafariOnline or Lynda, and then see if I could try to enable Cloudera Manager as well as most of the CDH services with SSL/TLS, ideally to put everything into a script so that this process can be automated. I discussed this with my manager and we agreed on my plan.

On the first two days, I found a couple of very useful video courses from Lynda.com, see below link:

SSL Certificates For Web Developers
Learning Secure Sockets Layer

They were very useful in helping me getting a better understanding of the fundamental of SSL/TLS and how to generate keys and sign the cerficate all by yourself.

After that I reviewed Cloudera’s official online documentation on how to enable SSL/TLS for Cloudera Manager as well as the rest of CDH services and built a little tool that is written in shell script to allow anyone to generate certificates on the fly and enable SSL/TLS for his/her cluster with a simple couple of commands.

The documentation links can be found below:

Configuring TLS Encryption for Cloudera Manager
Configuring TLS/SSL Encryption for CDH Services

I have published this little tool on github and is available here. Currently it supports enabling SSL/TLS for the following services:

Cloudera Manager (from Level 1 to Level 3 security)
HDFS
YARN
Hive
Impala
Oozie
HBase
Hue

With this tool, user can enable SSL/TLS for any of the above services with ease in a few minutes.

If you have any suggestions or comments, please leave them in the comment section below, thanks.

How to Clean Up Deleted Projects in Cloudera Data Science Workbench

We have got a few customer complains about the fact that currently Cloudera Data Science Workbench (CDSW) does not release the underlining project files on disk after the project is deleted from within the CDSW web console.

This can be re-produced easily by creating a dummy project in CDSW, check the project directory created under /var/lib/cdsw/current/projects/projects/0, and then delete the project again, you will see that the newly created project directory is not removed on the file system.

This is a known bug and reported internally in Cloudera, however, it has not been fixed as yet.

To workaround the issue, you can setup a simple shell script to detect the orphaned project and remove it automatically.

Steps as below:

1. Get the list of Project IDs from the directory /var/lib/cdsw/current/projects/projects/0/ on the master host, in my example, it returned below:

ls /var/lib/cdsw/current/projects/projects/0/
1  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48

2. Run ‘cdsw status’ command on the master host to capture the DB pod id:

Sending detailed logs to [/tmp/cdsw_status_T8jRig.log] ...
CDSW Version: [1.3.0:9bb84f6]
OK: Application running as root check
OK: Sysctl params check
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                 NAME                |   STATUS   |           CREATED-AT          |   VERSION   |   EXTERNAL-IP   |          OS-IMAGE         |         KERNEL-VERSION         |   GPU   |   STATEFUL   |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   ericlin-xxx.xxx-1.com   |    True    |   2018-07-11 03:30:45+00:00   |   v1.6.11   |       None      |   CentOS Linux 7 (Core)   |   3.10.0-514.26.2.el7.x86_64   |    0    |    False     |
|   ericlin-xxx.xxx-2.com   |    True    |   2018-07-11 03:30:32+00:00   |   v1.6.11   |       None      |   CentOS Linux 7 (Core)   |   3.10.0-514.26.2.el7.x86_64   |    0    |     True     |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2/2 nodes are ready.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                             NAME                            |   READY   |    STATUS   |   RESTARTS   |           CREATED-AT          |       POD-IP      |      HOST-IP      |   ROLE   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|             etcd-ericlin-xxx.xxx-1.com                      |    1/1    |   Running   |      5       |   2018-07-11 03:31:44+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|        kube-apiserver-ericlin-xxx.xxx-1.com                 |    1/1    |   Running   |      5       |   2018-07-11 03:30:31+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|   kube-controller-manager-ericlin-xxx.xxx-1.com             |    1/1    |   Running   |      5       |   2018-07-11 03:31:54+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|                  kube-dns-3911048160-30l05                  |    3/3    |   Running   |      15      |   2018-07-11 03:30:45+00:00   |    100.66.128.1   |   172.26.12.157   |   None   |
|                       kube-proxy-c4xk7                      |    1/1    |   Running   |      4       |   2018-07-11 03:30:45+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
|                       kube-proxy-k95s2                      |    1/1    |   Running   |      5       |   2018-07-11 03:30:45+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|        kube-scheduler-ericlin-xxx.xxx-1          .com       |    1/1    |   Running   |      5       |   2018-07-11 03:31:57+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|               node-problem-detector-v0.1-0624z              |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|               node-problem-detector-v0.1-b80tt              |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
|                       weave-net-469fb                       |    2/2    |   Running   |      12      |   2018-07-11 03:30:45+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|                       weave-net-8dzx6                       |    2/2    |   Running   |      10      |   2018-07-11 03:30:45+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster kube-system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                  NAME                  |   READY   |    STATUS   |   RESTARTS   |           CREATED-AT          |       POD-IP      |      HOST-IP      |           ROLE           |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|         cron-1906902965-wzkp5          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.9   |   172.26.12.157   |           cron           |
|          db-1165222207-dg98q           |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.6   |   172.26.12.157   |            db            |
|           engine-deps-1rvcl            |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.4   |   172.26.12.157   |       engine-deps        |
|           engine-deps-njwlc            |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |     100.66.0.5    |    172.26.14.58   |       engine-deps        |
|   ingress-controller-684706958-6fzh3   |    1/1    |   Running   |      5       |   2018-07-11 03:32:14+00:00   |   172.26.12.157   |   172.26.12.157   |    ingress-controller    |
|        livelog-2502658797-kmq4l        |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.3   |   172.26.12.157   |         livelog          |
|      reconciler-2738760185-1nnsp       |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.2   |   172.26.12.157   |        reconciler        |
|       spark-port-forwarder-krtw6       |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |   172.26.12.157   |   172.26.12.157   |   spark-port-forwarder   |
|       spark-port-forwarder-rbhc6       |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |    172.26.14.58   |    172.26.14.58   |   spark-port-forwarder   |
|          web-3320989329-7php0          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.7   |   172.26.12.157   |           web            |
|          web-3320989329-ms63k          |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.5   |   172.26.12.157   |           web            |
|          web-3320989329-zdpcj          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.8   |   172.26.12.157   |           web            |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
Checking web at url: http://ericlin-xxx.xxx-1.com
OK: HTTP port check
Cloudera Data Science Workbench is ready!

You can see that from above example, the DB pod ID is: db-1165222207-dg98q

3. Run below command to connect to CDSW database:

kubectl exec db-1165222207-dg98q -ti -- psql -U sense

4. On prompt, run PostgreSQL to check which project has been delete (the one not in the DB)

sense=# SELECT id, user_id, name, slug FROM projects;
 id | user_id |      name      |      slug
----+---------+----------------+----------------
 41 |       1 | Impala Project | impala-project
 34 |       3 | Test           | test
  1 |       1 | Test           | test
 47 |      10 | tensortest     | tensortest
 44 |       9 | TestEnvVar     | testenvvar
 40 |       4 | hbase          | hbase
 36 |       2 | Scala Test     | scala-test
 46 |       1 | R Project      | r-project
 45 |       4 | spackshell     | spackshell
 37 |       4 | tim            | tim
 48 |      10 | rtest          | rtest
 35 |       1 | Scala Project  | scala-project
 39 |       5 | salim          | salim
 38 |       4 | timtest        | timtest
(14 rows)

To put all above together, I have below quick shell script that can do the job:

for project_id in `ls /var/lib/cdsw/current/projects/projects/0/`
do
  echo "Processing $project_id"
  rows=`kubectl exec $(cdsw status | grep 'db-' | cut -d '|' -f 2 | sed 's/ //g') -ti -- psql -U sense -c "SELECT * FROM projects WHERE ID = $project_id" | grep '0 row' | wc -l`
  if [ $rows -gt 0 ]; then
    echo "Project $project_id has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/$proejct_id"
  fi
done

The output looks like something below:

Processing project 1
Processing project 34
Processing project 35
Processing project 36
Processing project 37
Processing project 38
Processing project 39
Processing project 40
Processing project 41
Processing project 42
Project 42 has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/
Processing project 43
Project 43 has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/
Processing project 44
Processing project 45
Processing project 46
Processing project 47
Processing project 48

So, until there is a fix for this issue, I hope above simple shell script can help.

Any suggestions or ideas, please let me know in the comments section below, thanks a lot in advance.