How to Clean Up Deleted Projects in Cloudera Data Science Workbench

We have received a few customer complaints about the fact that Cloudera Data Science Workbench (CDSW) currently does not release the underlying project files on disk after a project is deleted from within the CDSW web console.

This can be reproduced easily: create a dummy project in CDSW, check the project directory created under /var/lib/cdsw/current/projects/projects/0, then delete the project again, and you will see that the newly created project directory is not removed from the file system.

This is a known bug that has been reported internally at Cloudera; however, it has not been fixed yet.

To work around the issue, you can set up a simple shell script to detect orphaned projects and remove them automatically.

The steps are as follows:

1. Get the list of project IDs from the directory /var/lib/cdsw/current/projects/projects/0/ on the master host. In my example, it returned the following:

ls /var/lib/cdsw/current/projects/projects/0/
1  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48

2. Run the ‘cdsw status’ command on the master host to capture the DB pod ID:

Sending detailed logs to [/tmp/cdsw_status_T8jRig.log] ...
CDSW Version: [1.3.0:9bb84f6]
OK: Application running as root check
OK: Sysctl params check
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                 NAME                |   STATUS   |           CREATED-AT          |   VERSION   |   EXTERNAL-IP   |          OS-IMAGE         |         KERNEL-VERSION         |   GPU   |   STATEFUL   |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   ericlin-xxx.xxx-1.com   |    True    |   2018-07-11 03:30:45+00:00   |   v1.6.11   |       None      |   CentOS Linux 7 (Core)   |   3.10.0-514.26.2.el7.x86_64   |    0    |    False     |
|   ericlin-xxx.xxx-2.com   |    True    |   2018-07-11 03:30:32+00:00   |   v1.6.11   |       None      |   CentOS Linux 7 (Core)   |   3.10.0-514.26.2.el7.x86_64   |    0    |     True     |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2/2 nodes are ready.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                             NAME                            |   READY   |    STATUS   |   RESTARTS   |           CREATED-AT          |       POD-IP      |      HOST-IP      |   ROLE   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|             etcd-ericlin-xxx.xxx-1.com                      |    1/1    |   Running   |      5       |   2018-07-11 03:31:44+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|        kube-apiserver-ericlin-xxx.xxx-1.com                 |    1/1    |   Running   |      5       |   2018-07-11 03:30:31+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|   kube-controller-manager-ericlin-xxx.xxx-1.com             |    1/1    |   Running   |      5       |   2018-07-11 03:31:54+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|                  kube-dns-3911048160-30l05                  |    3/3    |   Running   |      15      |   2018-07-11 03:30:45+00:00   |    100.66.128.1   |   172.26.12.157   |   None   |
|                       kube-proxy-c4xk7                      |    1/1    |   Running   |      4       |   2018-07-11 03:30:45+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
|                       kube-proxy-k95s2                      |    1/1    |   Running   |      5       |   2018-07-11 03:30:45+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|        kube-scheduler-ericlin-xxx.xxx-1.com                 |    1/1    |   Running   |      5       |   2018-07-11 03:31:57+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|               node-problem-detector-v0.1-0624z              |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|               node-problem-detector-v0.1-b80tt              |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
|                       weave-net-469fb                       |    2/2    |   Running   |      12      |   2018-07-11 03:30:45+00:00   |   172.26.12.157   |   172.26.12.157   |   None   |
|                       weave-net-8dzx6                       |    2/2    |   Running   |      10      |   2018-07-11 03:30:45+00:00   |    172.26.14.58   |    172.26.14.58   |   None   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster kube-system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                  NAME                  |   READY   |    STATUS   |   RESTARTS   |           CREATED-AT          |       POD-IP      |      HOST-IP      |           ROLE           |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|         cron-1906902965-wzkp5          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.9   |   172.26.12.157   |           cron           |
|          db-1165222207-dg98q           |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.6   |   172.26.12.157   |            db            |
|           engine-deps-1rvcl            |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.4   |   172.26.12.157   |       engine-deps        |
|           engine-deps-njwlc            |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |     100.66.0.5    |    172.26.14.58   |       engine-deps        |
|   ingress-controller-684706958-6fzh3   |    1/1    |   Running   |      5       |   2018-07-11 03:32:14+00:00   |   172.26.12.157   |   172.26.12.157   |    ingress-controller    |
|        livelog-2502658797-kmq4l        |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.3   |   172.26.12.157   |         livelog          |
|      reconciler-2738760185-1nnsp       |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.2   |   172.26.12.157   |        reconciler        |
|       spark-port-forwarder-krtw6       |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |   172.26.12.157   |   172.26.12.157   |   spark-port-forwarder   |
|       spark-port-forwarder-rbhc6       |    1/1    |   Running   |      4       |   2018-07-11 03:32:15+00:00   |    172.26.14.58   |    172.26.14.58   |   spark-port-forwarder   |
|          web-3320989329-7php0          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.7   |   172.26.12.157   |           web            |
|          web-3320989329-ms63k          |    1/1    |   Running   |      5       |   2018-07-11 03:32:15+00:00   |    100.66.128.5   |   172.26.12.157   |           web            |
|          web-3320989329-zdpcj          |    1/1    |   Running   |      2       |   2018-08-10 08:13:55+00:00   |    100.66.128.8   |   172.26.12.157   |           web            |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
Checking web at url: http://ericlin-xxx.xxx-1.com
OK: HTTP port check
Cloudera Data Science Workbench is ready!

From the example above, you can see that the DB pod ID is: db-1165222207-dg98q

3. Run the command below to connect to the CDSW database:

kubectl exec db-1165222207-dg98q -ti -- psql -U sense
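If you prefer not to read the pod ID off the status output manually, you can also capture it directly with kubectl. A minimal sketch, assuming kubectl is configured on the master host (which it is whenever ‘cdsw status’ works):

# Grab the name of the CDSW database pod (the only pod whose name starts with "db-")
DB_POD=$(kubectl get pods | grep '^db-' | awk '{print $1}')
echo "Using DB pod: $DB_POD"

# Connect to the CDSW PostgreSQL database inside that pod
kubectl exec "$DB_POD" -ti -- psql -U sense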

4. At the psql prompt, run the query below to check which projects still exist in the database (any project ID that is present on disk but not in the database has been deleted):

sense=# SELECT id, user_id, name, slug FROM projects;
 id | user_id |      name      |      slug
----+---------+----------------+----------------
 41 |       1 | Impala Project | impala-project
 34 |       3 | Test           | test
  1 |       1 | Test           | test
 47 |      10 | tensortest     | tensortest
 44 |       9 | TestEnvVar     | testenvvar
 40 |       4 | hbase          | hbase
 36 |       2 | Scala Test     | scala-test
 46 |       1 | R Project      | r-project
 45 |       4 | spackshell     | spackshell
 37 |       4 | tim            | tim
 48 |      10 | rtest          | rtest
 35 |       1 | Scala Project  | scala-project
 39 |       5 | salim          | salim
 38 |       4 | timtest        | timtest
(14 rows)
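If you just want a quick list of the orphaned IDs, you can also pull all project IDs out of the database in a single query and compare them with the directory listing. A minimal sketch, assuming the pod ID from step 2 and the default projects path:

PROJECTS_DIR=/var/lib/cdsw/current/projects/projects/0
DB_POD=db-1165222207-dg98q   # substitute the pod ID from 'cdsw status'

# Project IDs present on disk
ls "$PROJECTS_DIR" | sort > /tmp/disk_ids.txt

# Project IDs still present in the CDSW database (-t: tuples only, -A: unaligned output)
kubectl exec "$DB_POD" -- psql -U sense -t -A -c "SELECT id FROM projects;" | sort > /tmp/db_ids.txt

# IDs that are on disk but not in the database are orphaned
comm -23 /tmp/disk_ids.txt /tmp/db_ids.txt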

Putting all of the above together, I wrote the quick shell script below to do the job:

for project_id in `ls /var/lib/cdsw/current/projects/projects/0/`
do
  echo "Processing project $project_id"
  # Query the CDSW database for this project ID; "0 rows" means it no longer exists
  rows=`kubectl exec $(cdsw status | grep 'db-' | cut -d '|' -f 2 | sed 's/ //g') -ti -- psql -U sense -c "SELECT * FROM projects WHERE ID = $project_id" | grep '0 row' | wc -l`
  if [ $rows -gt 0 ]; then
    echo "Project $project_id has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/$project_id"
  fi
done

The output looks something like below:

Processing project 1
Processing project 34
Processing project 35
Processing project 36
Processing project 37
Processing project 38
Processing project 39
Processing project 40
Processing project 41
Processing project 42
Project 42 has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/42
Processing project 43
Project 43 has been deleted, you can archive directory /var/lib/cdsw/current/projects/projects/0/43
Processing project 44
Processing project 45
Processing project 46
Processing project 47
Processing project 48
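
If you want the script to go one step further and actually archive the orphaned directories instead of only reporting them, below is a minimal sketch of one way to do it. The archive location is just an example, so adjust it to your environment, and double-check the list of orphaned IDs before removing anything:

PROJECTS_DIR=/var/lib/cdsw/current/projects/projects/0
ARCHIVE_DIR=/var/lib/cdsw/archived_projects   # example location only
mkdir -p "$ARCHIVE_DIR"

DB_POD=$(cdsw status | grep 'db-' | cut -d '|' -f 2 | sed 's/ //g')

for project_id in $(ls "$PROJECTS_DIR"); do
  rows=$(kubectl exec "$DB_POD" -- psql -U sense -c "SELECT * FROM projects WHERE id = $project_id" | grep -c '0 row')
  if [ "$rows" -gt 0 ]; then
    echo "Project $project_id has been deleted, archiving its directory"
    # Create a tarball first, and only remove the directory if the tarball succeeded
    tar -czf "$ARCHIVE_DIR/project_${project_id}.tar.gz" -C "$PROJECTS_DIR" "$project_id" \
      && rm -rf "${PROJECTS_DIR:?}/$project_id"
  fi
done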

So, until there is a fix for this issue, I hope the simple shell script above can help.

If you have any suggestions or ideas, please let me know in the comments section below. Thanks a lot in advance.

Oozie SSH Action Does Not Support Chained Commands – OOZIE-1974

I have seen quite a few CDH users who try to run chained Linux commands via Oozie’s SSH Action. An example looks like below:

<action name="sshTest">
  <ssh xmlns="uri:oozie:ssh-action:0.1">
    <host>${sshUserHost}</host>
    <command>kinit test.keytab test@TEST.COM ; python ....</command>
    <capture-output/>
  </ssh>
  <ok to="nextActino"/>
  <error to="kill"/>
</action>

We can see that the command to run on the remote host is as below:

kinit test.keytab test@TEST.COM ; python ....

This is OK if both commands finish successfully very quickly. However, it will cause the SSH action to fail if the python command needs to run for a while, say more than 5-10 minutes. Below are the example log messages produced in Oozie’s server log while the SSH action is running:

2018-08-13 10:01:48,215 WARN org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[{oozie-host}] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000234-172707121423674-oozie-oozi-W] ACTION[0000234-172707121423674-oozie-oozi-W@sshTest] Received early callback for action still in PREP state; will wait [10,000]ms and requeue up to [5] more times
2018-08-13 10:01:48,216 WARN org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[{oozie-host}] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000234-172707121423674-oozie-oozi-W] ACTION[0000234-172707121423674-oozie-oozi-W@sshTest] Received early callback for action still in PREP state; will wait [10,000]ms and requeue up to [5] more times

....

2018-08-13 10:02:38,243 ERROR org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[{oozie-host}] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000234-172707121423674-oozie-oozi-W] ACTION[0000234-172707121423674-oozie-oozi-W@sshTest] XException, 
org.apache.oozie.command.CommandException: E0822: Received early callback for action [0000234-172707121423674-oozie-oozi-W@sshTest] while still in PREP state and exhausted all requeues
 at org.apache.oozie.command.wf.CompletedActionXCommand.execute(CompletedActionXCommand.java:114)
 at org.apache.oozie.command.wf.CompletedActionXCommand.execute(CompletedActionXCommand.java:39)
 at org.apache.oozie.command.XCommand.call(XCommand.java:286)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:179)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

The reason for the failure is that Oozie currently does not support chained Linux commands in the SSH Action, which is tracked via the upstream JIRA OOZIE-1974.

Below is what happens behind the scenes:

1. The ssh-base.sh and ssh-wrapper.sh files are copied to the target host; their source can be found at:
https://github.com/apache/oozie/blob/master/core/src/main/resources

2. Oozie then runs the command below from the Oozie server, via ssh, directly against the target host:

sh ssh-base.sh FLATTEN_ARGS curl "http://{oozie-host}:11000/oozie/callback?id=0000234-172707121423674-oozie-oozi-W@sshTest&status=#status" \
"--data-binary%%%@#stdout%%%--request%%%POST%%%--header%%%\"content-type:text/plain\"" \
0000234-172707121423674-oozie-oozi-W@sshTest@3 kinit test.keytab test@TEST.COM ; python ....

3. Based on the command above, we can see that the original command gets rebuilt and, because the shell treats the semicolon as a command separator, the full command is broken into two commands:

sh ssh-base.sh FLATTEN_ARGS curl "http://{oozie-host}:11000/oozie/callback?id=0000234-172707121423674-oozie-oozi-W@sshTest&status=#status" \
"--data-binary%%%@#stdout%%%--request%%%POST%%%--header%%%\"content-type:text/plain\"" \
0000234-172707121423674-oozie-oozi-W@sshTest@3 kinit test.keytab test@TEST.COM

and

python ....

rather than the original “kinit test.keytab test@TEST.COM” followed by “python ….”

4. The ssh-base.sh script will in turn run the command below:

sh ssh-wrapper.sh FLATTEN_ARGS curl "http://{oozie-host}:11000/oozie/callback?id=0000234-172707121423674-oozie-oozi-W@sshTest&status=#status" \
"--data-binary%%%@#stdout%%%--request%%%POST%%%--header%%%\"content-type:text/plain\"" \
0000234-172707121423674-oozie-oozi-W@sshTest@3 kinit test.keytab test@TEST.COM

This command finishes very quickly and triggers the callback curl call immediately; however, the “python” command keeps the SSH job running until it completes. This leaves the Oozie action still in the PREP state when the callback arrives, and the callback eventually fails after exhausting its requeues because the Oozie action state and the SSH job state are no longer consistent.

So, until OOZIE-1974 is fixed, the workaround is to put both commands inside a single script file and make that script available to run on the remote host.
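
For example, a wrapper script like the sketch below could be deployed on the target host (the path and script name are just placeholders), made executable, and then referenced from the action as <command>/home/test/run_job.sh</command> so that Oozie only ever sees a single command:

#!/bin/bash
# /home/test/run_job.sh -- placeholder path; copy this to the SSH action's target host
# and make it executable with: chmod +x /home/test/run_job.sh

# Run the commands that used to be chained in the <command> element
kinit test.keytab test@TEST.COM
python ....   # the long-running python command from the original action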

I hope the above helps.

WebHCat Request Failed With Error “id: HTTP: no such user”

WebHCat, previously known as Templeton, is the REST API for HCatalog, a table and storage management layer for Hadoop. Users can use WebHCat to access metadata information from HCatalog, as well as to submit MapReduce, Hive and Pig jobs.

Below is an example of how to retrieve a list of databases via WebHCat API:

curl --negotiate -u: http://{webhcat-hostname}:50111/templeton/v1/ddl/database/

Please note that 50111 is the default port number for WebHCat. Sample output looks like below:

{"databases":["default","mytest","s3_test","udf_db"]}

However, recently I faced an issue where a WebHCat request failed with the error below:

2018-08-08 17:18:44,413 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hive (auth:PROXY) via HTTP/{webhcat-hostname}@CDH511.COM (auth:KERBEROS) cause:org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
2018-08-08 17:18:44,414 ERROR org.apache.hive.hcatalog.templeton.CatchallExceptionMapper: java.lang.reflect.UndeclaredThrowableException
java.io.IOException: java.lang.reflect.UndeclaredThrowableException
...
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
....
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1882)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1872)
....

From the stack trace, we can see that WebHCat failed when trying to collect a delegation token from the Hive Metastore (HMS). So, checking the HMS server log, I found the error below:

2018-08-08 17:18:19,675 WARN  org.apache.hadoop.security.ShellBasedUnixGroupsMapping: [pool-7-thread-2]: unable to return groups for user HTTP
PartialGroupNameException The user name 'HTTP' is not found. id: HTTP: no such user
id: HTTP: no such user

It is pretty clear that HMS failed because the “HTTP” user was missing. Adding the “HTTP” user on the HMS server host resolved the issue.

Researching further, I realized that this was because, in the Hive configuration, hadoop.proxyuser.hive.groups was set to a list of groups rather than “*”, and “HTTP” was one of the groups in that list. You will not get this error if hadoop.proxyuser.hive.groups is set to “*”; it only fails when “HTTP” has been added to the list manually (it has to be in the list when the value is not “*”, because the “hive” user needs to be able to impersonate the “HTTP” user for the request to work).

The reason for the failure is that when hadoop.proxyuser.hive.groups is set to “*”, Hive does not bother to check whether the user exists, since every user is allowed. However, when a specific list is defined, Hive verifies that the users it impersonates actually exist on the host it runs on. In our case, the “HTTP” user did not exist on the HMS host, so HMS failed with the error we saw earlier, and adding the user resolved the issue.
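
For reference, a quick way to verify and fix this on the HMS host is shown below. The exact useradd options depend on your environment (the user may also need to come from LDAP/SSSD instead), so treat this as a sketch:

# Check whether the HTTP user can be resolved on the HMS host
id HTTP

# If it returns "no such user", create it; a system account with no login shell is usually enough
useradd -r -s /sbin/nologin HTTP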

I hope the above helps anyone who runs into the same issue.

My Patch for SQOOP-3330 Committed

In the last few weeks, I have been working on an issue in Sqoop where the “--append” option does not work well with the “-Dmapreduce.output.basename” parameter. The goal of adding “-Dmapreduce.output.basename” is to ask Sqoop to generate custom file names for the target files in HDFS, rather than using the default “part” prefix for all files.

However, Sqoop has a bug where, when moving the files from the temporary folder to the target folder in HDFS, it does not respect the mapreduce.output.basename setting, which results in no data being moved to the final destination.

This can be confirmed by turning on DEBUG logging for Sqoop (using --verbose):

sqoop import -Dmapreduce.output.basename="eric-test" --connect jdbc:mysql://mysql-host.com/test --username root --password 'root' --table test --target-dir /tmp/ericlin-test/sqoop/test --fields-terminated-by '\t' --verbose --append

18/05/28 22:24:44 INFO util.AppendUtils: Appending to directory test
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: _SUCCESS ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00000 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00001 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00002 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Deleting temporary folder 14935e396acc4ea7b9a6236c66064c9b_test

From the output, you can see that Sqoop ignored all of the files generated with the “eric-test” prefix.

I submitted an upstream JIRA, SQOOP-3330, and after a few review cycles my patch finally got accepted and committed to Sqoop’s trunk code base. I am looking forward to the fix being backported into CDH in the near future.

My Patch for SQOOP-3042 Committed

I have received a lot of complaints from Cloudera customers that after a Sqoop job finishes, the generated table class and jar files are not cleaned up. By default, they are saved under /tmp/sqoop-{username}/compile, to be used by the currently running job. They are no longer needed after the job finishes, so they should be cleaned up.

The content of the directory looks like below:

[root@localhost ~]# ll /tmp/sqoop-hadoop/compile/
total 16
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:56 1496d8f8400052af2a7d3ede2cfe496d
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:45 6360b964ea0c1fdf6bf6aaed7a35b986
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:45 d4ccb83934494ba2874b5c6d1b51d2ac
drwxrwxr-x. 2 hadoop hadoop 4096 Jun  6 08:50 df37a566defbfac477f6f309cf227dec

[root@localhost ~]# ll /tmp/sqoop-hadoop/compile/1496d8f8400052af2a7d3ede2cfe496d
total 56
-rw-rw-r--. 1 hadoop hadoop   620 Jun  6 08:56 SQOOP_3042$1.class
-rw-rw-r--. 1 hadoop hadoop   617 Jun  6 08:56 SQOOP_3042$2.class
-rw-rw-r--. 1 hadoop hadoop   620 Jun  6 08:56 SQOOP_3042$3.class
-rw-rw-r--. 1 hadoop hadoop   516 Jun  6 08:56 SQOOP_3042.avsc
-rw-rw-r--. 1 hadoop hadoop 10389 Jun  6 08:56 SQOOP_3042.class
-rw-rw-r--. 1 hadoop hadoop   237 Jun  6 08:56 SQOOP_3042$FieldSetterCommand.class
-rw-rw-r--. 1 hadoop hadoop  6063 Jun  6 08:56 SQOOP_3042.jar
-rw-rw-r--. 1 hadoop hadoop 12847 Jun  6 08:56 SQOOP_3042.java
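
Until you are running a release that contains the fix, a common workaround is to clean these directories up yourself. Below is a minimal sketch, assuming the default /tmp/sqoop-hadoop/compile location and that directories untouched for more than a day are no longer used by any running job:

# Remove Sqoop compile directories that have not been modified for more than a day
find /tmp/sqoop-hadoop/compile -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} +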

I created an upstream JIRA, SQOOP-3042, back in November 2016 to track and fix this. I provided a patch at the time, but it never got reviewed due to a lack of reviewers.

After getting help from the Cloudera Sqoop engineers on our Budapest team, I finally got the JIRA moving in the last few weeks, and it was committed to Sqoop trunk yesterday. Details can be seen here: https://github.com/apache/sqoop/commit/0cfbf56713f7574568ea3754f6854e82f5540954

The fix involves adding a new command line option, “--delete-compile-dir”, so that users can instruct Sqoop to remove those temporary directories after the job finishes. The reason for adding a new option, rather than always deleting the directories, is to avoid changing Sqoop’s default behaviour while still letting users opt in to the cleanup.

An example command would look like below:

sqoop import --connect jdbc:mysql://localhost/test --username root --password pass --table SQOOP_3042 --target-dir /tmp/erictest --delete-target-dir --verbose --delete-compile-dir

And you can see the message below in --verbose mode, verifying that the directory and its files are removed:

....
18/06/06 17:39:27 INFO mapreduce.ImportJobBase: Transferred 52 bytes in 29.6139 seconds (1.7559 bytes/sec)
18/06/06 17:39:27 INFO mapreduce.ImportJobBase: Retrieved 4 records.
18/06/06 17:39:27 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@6f1fba17
18/06/06 17:39:28 DEBUG util.DirCleanupHook: Removing directory: /tmp/sqoop-hadoop/compile/a9d8a87bc02a5f823a82014c49516736 in the clean up hook.