How to Clean Up Deleted Projects in Cloudera Data Science Workbench

We have received a few customer complaints that Cloudera Data Science Workbench (CDSW) currently does not release the underlying project files on disk after a project is deleted from within the CDSW web console.

This can be reproduced easily: create a dummy project in CDSW, check the project directory created under /var/lib/cdsw/current/projects/projects/0, and then delete the project again. You will see that the newly created project directory is not removed from the file system.

This is a known bug that has been reported internally at Cloudera; however, it has not been fixed yet.

To work around the issue, you can set up a simple shell script to detect orphaned project directories and remove them automatically.

The steps are as below:

1. Get the list of project IDs from the directory /var/lib/cdsw/current/projects/projects/0/ on the master host. In my example, it returned the below:

ls /var/lib/cdsw/current/projects/projects/0/
1  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48

2. Run the ‘cdsw status’ command on the master host to capture the DB pod ID:

Sending detailed logs to [/tmp/cdsw_status_T8jRig.log] ...
CDSW Version: [1.3.0:9bb84f6]
OK: Application running as root check
OK: Sysctl params check
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                 NAME                |   STATUS   |           CREATED-AT          |   VERSION   |   EXTERNAL-IP   |          OS-IMAGE         |         KERNEL-VERSION         |   GPU   |   STATEFUL   |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   ericlin-xxx.xxx-1.com   |    True    |   2018-07-11 03:30:45+00:00   |   v1.6.11   |       None      |   CentOS Linux 7 (Core)   |   3.10.0-514.26.2.el7.x86_64   |    0    |    False     |
|   ericlin-xxx.xxx-2.com   |    True    |   2018-07-11 03:30:32+00:00   |   v1.6.11   |       None      |   CentOS Linux 7 (Core)   |   3.10.0-514.26.2.el7.x86_64   |    0    |     True     |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2/2 nodes are ready.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                             NAME                            |   READY   |    STATUS   |   RESTARTS   |           CREATED-AT          |       POD-IP      |      HOST-IP      |   ROLE   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|             etcd-ericlin-xxx.xxx-1.com                      |    1/1    |   Running   |      5       |   2018-07-11 03:31:44+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|        kube-apiserver-ericlin-xxx.xxx-1.com                 |    1/1    |   Running   |      5       |   2018-07-11 03:30:31+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|   kube-controller-manager-ericlin-xxx.xxx-1.com             |    1/1    |   Running   |      5       |   2018-07-11 03:31:54+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|                  kube-dns-3911048160-30l05                  |    3/3    |   Running   |      15      |   2018-07-11 03:30:45+00:00   |    100.66.128.1   |   172.26.12.157   |   None   |
|                       kube-proxy-c4xk7                      |    1/1    |   Running   |      4       |   2018-07-11 03:30:45+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
|                       kube-proxy-k95s2                      |    1/1    |   Running   |      5       |   2018-07-11 03:30:45+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|        kube-scheduler-ericlin-xxx.xxx-1.com                 |    1/1    |   Running   |      5       |   2018-07-11 03:31:57+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|               node-problem-detector-v0.1-0624z              |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|               node-problem-detector-v0.1-b80tt              |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
|                       weave-net-469fb                       |    2/2    |   Running   |      12      |   2018-07-11 03:30:45+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|                       weave-net-8dzx6                       |    2/2    |   Running   |      10      |   2018-07-11 03:30:45+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster kube-system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                  NAME                  |   READY   |    STATUS   |   RESTARTS   |           CREATED-AT          |       POD-IP      |      HOST-IP      |           ROLE           |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|         cron-1906902965-wzkp5          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.9   |   172.26.12.157   |           cron           |
|          db-1165222207-dg98q           |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.6   |   172.26.12.157   |            db            |
|           engine-deps-1rvcl            |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.4   |   172.26.12.157   |       engine-deps        |
|           engine-deps-njwlc            |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |     100.66.0.5    |    172.26.14.58   |       engine-deps        |
|   ingress-controller-684706958-6fzh3   |    1/1    |   Running   |      5       |   2018-07-11 03:32:14+00:00   |   172.26.12.157   |   172.26.12.157   |    ingress-controller    |
|        livelog-2502658797-kmq4l        |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.3   |   172.26.12.157   |         livelog          |
|      reconciler-2738760185-1nnsp       |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.2   |   172.26.12.157   |        reconciler        |
|       spark-port-forwarder-krtw6       |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |   172.26.12.157   |   172.26.12.157   |   spark-port-forwarder   |
|       spark-port-forwarder-rbhc6       |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |    172.26.14.58   |    172.26.14.58   |   spark-port-forwarder   |
|          web-3320989329-7php0          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.7   |   172.26.12.157   |           web            |
|          web-3320989329-ms63k          |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.5   |   172.26.12.157   |           web            |
|          web-3320989329-zdpcj          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.8   |   172.26.12.157   |           web            |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
Checking web at url: http://ericlin-xxx.xxx-1.com
OK: HTTP port check
Cloudera Data Science Workbench is ready!

From the above output, you can see that the DB pod ID is db-1165222207-dg98q.
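
If you prefer to capture the DB pod name programmatically rather than reading it off the table, a quick one-liner (the same pipeline the script at the end of this post uses; it assumes the pod name always starts with "db-", as in the output above) is:

DB_POD=$(cdsw status | grep 'db-' | cut -d '|' -f 2 | sed 's/ //g')   # extract the pod NAME column
echo "$DB_POD"                                                        # e.g. db-1165222207-dg98q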

3. Run the below command to connect to the CDSW database:

kubectl exec db-1165222207-dg98q -ti -- psql -U sense

4. At the psql prompt, run the below query to check which projects have been deleted (the ones whose IDs exist on disk but not in the database):

sense=# SELECT id, user_id, name, slug FROM projects;
 id | user_id |      name      |      slug
----+---------+----------------+----------------
 41 |       1 | Impala Project | impala-project
 34 |       3 | Test           | test
  1 |       1 | Test           | test
 47 |      10 | tensortest     | tensortest
 44 |       9 | TestEnvVar     | testenvvar
 40 |       4 | hbase          | hbase
 36 |       2 | Scala Test     | scala-test
 46 |       1 | R Project      | r-project
 45 |       4 | spackshell     | spackshell
 37 |       4 | tim            | tim
 48 |      10 | rtest          | rtest
 35 |       1 | Scala Project  | scala-project
 39 |       5 | salim          | salim
 38 |       4 | timtest        | timtest
(14 rows)
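
If you only want to check a single project and prefer not to open an interactive psql session, the same query can be run through psql -c (the pod name is the one from my example above, and the project ID 42 is just an illustration):

# "(0 rows)" in the output means the project no longer exists in the CDSW database
kubectl exec db-1165222207-dg98q -- psql -U sense -c "SELECT id, name FROM projects WHERE id = 42"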

To put all of the above together, I have the below quick shell script that does the job:

# capture the DB pod name once (this assumes the pod name always starts with "db-")
db_pod=$(cdsw status | grep 'db-' | cut -d '|' -f 2 | sed 's/ //g')

for project_id in $(ls /var/lib/cdsw/current/projects/projects/0/)
do
  echo "Processing project $project_id"
  # "(0 rows)" from psql means this project no longer exists in the CDSW database
  rows=$(kubectl exec $db_pod -- psql -U sense -c "SELECT * FROM projects WHERE id = $project_id" | grep '0 row' | wc -l)
  if [ $rows -gt 0 ]; then
    echo "Project $project_id has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/$project_id"
  fi
done

The output looks something like the below:

Processing project 1
Processing project 34
Processing project 35
Processing project 36
Processing project 37
Processing project 38
Processing project 39
Processing project 40
Processing project 41
Processing project 42
Project 42 has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/42
Processing project 43
Project 43 has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/43
Processing project 44
Processing project 45
Processing project 46
Processing project 47
Processing project 48

So, until there is a fix for this issue, I hope the above simple shell script can help.

If you have any suggestions or ideas, please let me know in the comments section below. Thanks a lot in advance.

ericlin.me Secured with SSL Signed by “Let’s Encrypt”

Cloudera recently introduced a new way of self-learning for employees, especially in the Support organization: every 4-6 months, each engineer can spend one week away from normal duties on full-time, offline self-learning. There are lots of topics to choose from, ranging from publicly available courses to anything we are interested in that is also useful for our day-to-day work. This week is my turn, and I have chosen SSL/TLS to enhance my experience with CDH, because I have handled lots of cases from customers facing TLS-related issues across a wide range of CDH components, including Hive, Impala and Oozie, so I need to skill up in this area.

I am now two days into the week and have finished two courses about TLS on Lynda.com:

Learning Secure Sockets Layer
SSL Certificate for Web Developers

As part of the learning, I have deepened my knowledge of SSL/TLS and now understand the process of enabling SSL/TLS for a website, from generating the private key all the way to getting the certificate signed and finally enabled on an Apache/Nginx web server. It also seemed like a great time to enable SSL/TLS for my own blog: traffic to my blog has grown over the last couple of years, and I regularly receive comments on my various blog posts, so securing my blog is a logical next step.

In the SSL Certificate for Web Developers course by Kevin Skoglund, Kevin points out that since 2016 Let’s Encrypt has been offering free, automated signed certificates to the general public. So why not use it to get my blog secured? Even though the certificate needs to be renewed every 90 days, Certbot, the tool provided by Let’s Encrypt, not only installs the certificate for you with ease, but can also set up a cron job to renew the certificate automatically. All you need to do is install Certbot from here, select the web server and OS that match your site, and follow the instructions; your site can be secured by SSL/TLS in a few minutes.
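
As an illustration only (assuming an Ubuntu host running Nginx; package and plugin names vary by OS and web server, and the domain below is a placeholder), the whole process boils down to a few commands:

# install Certbot and its Nginx plugin
sudo apt-get install certbot python3-certbot-nginx

# request a certificate and let Certbot update the Nginx configuration for you
sudo certbot --nginx -d example.com -d www.example.com

# confirm that the automatic renewal Certbot sets up will work
sudo certbot renew --dry-run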

Don’t forget that you also need to open port 443 on your cloud service if you are using AWS or Google Cloud, as by default port 443 is blocked. The instructions differ depending on where your host is, so please Google around, as this topic goes beyond the scope of my post here.
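
Once the port is open and the certificate is installed, a couple of quick sanity checks (my own habit, with the domain as a placeholder) will confirm that everything is wired up:

# should return an HTTP status line (200 or a redirect) over HTTPS
curl -I https://example.com

# prints the start of the TLS handshake details for port 443, including the Let's Encrypt issuer
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null | head -n 15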

I highly suggest you do this, because the trend in all modern browsers is to warn visitors when the site they are visiting is not secured. An unsecured site simply looks unprofessional and insecure, and visitors will think twice before entering any details on it, even submitting a simple comment. The number of secured sites keeps increasing and HTTPS will become the standard; Let’s Encrypt has published stats showing this growth since it launched in 2016.

So, with this free service, it is time to secure your site without hesitation.

Oozie SSH Action Does Not Support Chained Commands – OOZIE-1974

I have seen quite a few CDH users who try to run chained Linux commands via Oozie’s SSH action. An example is like below:

<action name="sshTest">
  <ssh xmlns="uri:oozie:ssh-action:0.1">
    <host>${sshUserHost}</host>
    <command>kinit test.keytab test@TEST.COM ; python ....</command>
    <capture-output/>
  </ssh>
  <ok to="nextAction"/>
  <error to="kill"/>
</action>

We can see that the command to run on the remote host is as below:

kinit test.keytab test@TEST.COM ; python ....

This is OK if both commands finish successfully very quickly. However, it will cause the SSH action to fail if the python command needs to run for a while, say more than 5-10 minutes. Below are example log messages produced in Oozie’s server log while the SSH action is running:

2018-08-13 10:01:48,215 WARN org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[{oozie-host}] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000234-172707121423674-oozie-oozi-W] ACTION[0000234-172707121423674-oozie-oozi-W@sshTest] Received early callback for action still in PREP state; will wait [10,000]ms and requeue up to [5] more times
2018-08-13 10:01:48,216 WARN org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[{oozie-host}] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000234-172707121423674-oozie-oozi-W] ACTION[0000234-172707121423674-oozie-oozi-W@sshTest] Received early callback for action still in PREP state; will wait [10,000]ms and requeue up to [5] more times

....

2018-08-13 10:02:38,243 ERROR org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[{oozie-host}] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000234-172707121423674-oozie-oozi-W] ACTION[0000234-172707121423674-oozie-oozi-W@sshTest] XException, 
org.apache.oozie.command.CommandException: E0822: Received early callback for action [0000234-172707121423674-oozie-oozi-W@sshTest] while still in PREP state and exhausted all requeues
 at org.apache.oozie.command.wf.CompletedActionXCommand.execute(CompletedActionXCommand.java:114)
 at org.apache.oozie.command.wf.CompletedActionXCommand.execute(CompletedActionXCommand.java:39)
 at org.apache.oozie.command.XCommand.call(XCommand.java:286)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:179)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

The reason for the failure is that Oozie currently does not support chained Linux commands in the SSH action, which is tracked in the upstream JIRA OOZIE-1974.

Below is what happens behind the scenes:

1. The ssh-base.sh and ssh-wrapper.sh files are copied to the target host (both can be found at https://github.com/apache/oozie/blob/master/core/src/main/resources).

2. Oozie will run the below command from the Oozie server, via ssh, directly against the target host:

sh ssh-base.sh FLATTEN_ARGS curl "http://{oozie-host}:11000/oozie/callback?id=0000234-172707121423674-oozie-oozi-W@sshTest&status=#status" \
"--data-binary%%%@#stdout%%%--request%%%POST%%%--header%%%\"content-type:text/plain\"" \
0000234-172707121423674-oozie-oozi-W@sshTest@3 kinit test.keytab test@TEST.COM ; python ....

3. Based on the command above, we can see that once the command is rebuilt, the “;” causes it to be split into two separate commands:

sh ssh-base.sh FLATTEN_ARGS curl "http://{oozie-host}:11000/oozie/callback?id=0000234-172707121423674-oozie-oozi-W@sshTest&status=#status" \
"--data-binary%%%@#stdout%%%--request%%%POST%%%--header%%%\"content-type:text/plain\"" \
0000234-172707121423674-oozie-oozi-W@sshTest@3 kinit test.keytab test@TEST.COM

and

python ....

This is no longer the original pair of commands, “kinit test.keytab test@TEST.COM” and “python ….”.
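
To see why the “;” has this effect, here is a tiny standalone illustration (it has nothing to do with Oozie’s real scripts; it just mimics a wrapper that prints its arguments). The shell splits the line at the unquoted “;”, so the wrapper only ever receives the arguments before it, and whatever follows runs as a completely separate command:

# create a dummy wrapper that simply prints the arguments it receives
cat > /tmp/wrapper.sh <<'EOF'
#!/bin/bash
echo "wrapper received: $*"
EOF

# invoke it the way Oozie effectively does, with an unquoted ';' in the middle
sh /tmp/wrapper.sh kinit test.keytab test@TEST.COM ; echo "the python part runs as a separate command"
# output:
#   wrapper received: kinit test.keytab test@TEST.COM
#   the python part runs as a separate command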

4. The ssh-base.sh script will in turn run the below command:

sh ssh-wrapper.sh FLATTEN_ARGS curl "http://{oozie-host}:11000/oozie/callback?id=0000234-172707121423674-oozie-oozi-W@sshTest&status=#status" \
"--data-binary%%%@#stdout%%%--request%%%POST%%%--header%%%\"content-type:text/plain\"" \
0000234-172707121423674-oozie-oozi-W@sshTest@3 kinit test.keytab test@TEST.COM

This command finishes very quickly and triggers the callback curl call immediately; however, the “python” command keeps running on the remote host, so the SSH process does not exit until it completes. This leaves the Oozie action still in the PREP/pending state when the callback arrives, and because the Oozie action state and the SSH job state are no longer consistent, the callback eventually fails after exhausting its requeues.

So, until OOZIE-1974 is fixed, the workaround is to put both commands inside a single script file and make that script available to run on the remote host.
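
As an illustration only (the script path and the python command below are placeholders, not something taken from the original workflow), the wrapper on the remote host can be as simple as:

#!/bin/bash
# /home/test/run_job.sh on the remote host (path and python command are placeholders)
set -e                                   # abort if any command in the script fails
kinit test.keytab test@TEST.COM          # same kinit invocation as in the original action
python /path/to/long_running_job.py      # placeholder for the long-running python command

The SSH action then only needs <command>/home/test/run_job.sh</command>, so Oozie only ever sees a single command and the callback fires after the whole script has finished.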

Hope the above helps.

WebHCat Request Failed With Error “id: HTTP: no such user”

WebHCat, previously known as Templeton, is the REST API for HCatalog, a table and storage management layer for Hadoop. Users can use WebHCat to access metadata information from HCatalog, as well as to submit MapReduce, Hive and Pig jobs.

Below is an example of how to retrieve the list of databases via the WebHCat API:

curl --negotiate -u: http://{webhcat-hostname}:50111/templeton/v1/ddl/database/

Please note that port 50111 is the default port number for WebHCat. The sample output looks like below:

{"databases":["default","mytest","s3_test","udf_db"]}

However, I recently faced an issue where WebHCat requests failed with the below error:

2018-08-08 17:18:44,413 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hive (auth:PROXY) via HTTP/{webhcat-hostname}@CDH511.COM (auth:KERBEROS) cause:org.apache.thrift.transport.TTrans
portException: java.net.SocketException: Connection reset
2018-08-08 17:18:44,414 ERROR org.apache.hive.hcatalog.templeton.CatchallExceptionMapper: java.lang.reflect.UndeclaredThrowableException
java.io.IOException: java.lang.reflect.UndeclaredThrowableException
...
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
....
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1882)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1872)
....

From the stack trace, we can see that WebHCat failed when trying to obtain a delegation token from the Hive Metastore (HMS). Checking the HMS server log, I found the below error:

2018-08-08 17:18:19,675 WARN  org.apache.hadoop.security.ShellBasedUnixGroupsMapping: [pool-7-thread-2]: unable to return groups for user HTTP
PartialGroupNameException The user name 'HTTP' is not found. id: HTTP: no such user
id: HTTP: no such user

It is pretty clear that HMS failed because the “HTTP” user was missing. Adding the “HTTP” user on the HMS server host resolved the issue.
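
For completeness, a minimal sketch of the fix (assuming a local Linux account is acceptable in your environment; in many clusters the user would instead come from LDAP/SSSD, and some distributions may complain about the upper-case name):

# on the HMS host: create a local "HTTP" user so the group lookup can succeed
sudo useradd HTTP

# verify that the lookup HMS performs now resolves
id HTTP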

Researching further, I realized that this was because, in the Hive configuration, hadoop.proxyuser.hive.groups was set to a list of groups rather than “*”, and “HTTP” was one of the entries in that list. You will not get such an error if hadoop.proxyuser.hive.groups is set to “*”; it only fails when “HTTP” is added manually (it is required to be on this list when the value is not “*”, because the “hive” user needs to be able to impersonate the “HTTP” user for the request to work).

The reason for the failure is that when hadoop.proxyuser.hive.groups is set to “*”, Hive does not bother checking the proxied user at all, since every user is allowed. However, when a list of groups is defined, the groups of the proxied user have to be resolved to check whether it belongs to one of them, and with ShellBasedUnixGroupsMapping (as shown in the log above) that lookup only works if the user exists on the host where HMS runs. In our case, the “HTTP” user did not exist on the HMS host, so HMS failed with the error we saw earlier, and adding this user resolved the issue.

Hope the above helps anyone who hits the same issue.

My Patch for SQOOP-3330 Committed

Over the last few weeks, I have been working on an issue in Sqoop where the “--append” option does not work well with the “-Dmapreduce.output.basename” parameter. The goal of adding “-Dmapreduce.output.basename” is to ask Sqoop to generate custom file names for the target files in HDFS, rather than using the default “part” prefix for all files.

However, Sqoop has a bug: when moving files from the temporary folder to the target folder in HDFS, it does not respect the mapreduce.output.basename setting, which results in no data being moved to the final destination.

This can be confirmed by turning on DEBUG logging for Sqoop (using --verbose):

sqoop import -Dmapreduce.output.basename="eric-test" --connect jdbc:mysql://mysql-host.com/test --username root --password 'root' --table test --target-dir /tmp/ericlin-test/sqoop/test --fields-terminated-by '\t' --verbose --append

18/05/28 22:24:44 INFO util.AppendUtils: Appending to directory test
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: _SUCCESS ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00000 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00001 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00002 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Deleting temporary folder 14935e396acc4ea7b9a6236c66064c9b_test

From the output, you can see that Sqoop ignored all the files generated with the “eric-test” prefix.

I submitted an upstream JIRA, SQOOP-3330, and after a few review cycles my patch was finally accepted and committed to Sqoop’s trunk code base. I am looking forward to the fix being backported into CDH in the near future.