Access to WebHCat with error “User: HTTP/full-domain@REALM is not allowed to impersonate username”

Last week I was dealing with an issue where, when connecting to WebHCat using the following command:

curl -i -u : --negotiate 'http://<webhcat-domain>:50111/templeton/v1/ddl/database'

the user got the following error:

{"error":"User: HTTP/<domain-name>@<REALM> is not allowed to impersonate <username>"}

After some research, it turned out to be caused by the auth_to_local rules the user had defined in the cluster. See the config below in the core-site.xml for HDFS:


The first two rules translate the principal to lowercase (this is what the /L at the end does). This turns the principal “HTTP/<domain-name>@<REALM>” into the short name “http” instead of “HTTP”, so the impersonation check looks for proxyuser settings for “http”, whereas only the following proxyuser entries are defined in the same XML:


To fix the issue, I did the following:

  1. go to Cloudera Manager > HDFS > Configuration
  2. search for “Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml”
  3. enter the following XML into the textarea:


    please note the lower case “http”

  4. save and restart related services (indicated by the restart icon in Cloudera Manager)

After this change, the issue should be resolved.
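
As a rough sketch of the mapping and the fix described above, the following shell snippet emulates the effect of the /L flag on a principal and prints proxyuser properties for the resulting lower-case short name. It is not Hadoop's own rule engine: the principal and host name are hypothetical, the short-name extraction is a simplification, and the “*” values are placeholder assumptions that should be narrowed to the hosts and groups your cluster actually needs.

```shell
# Hypothetical principal, used only for illustration
principal="HTTP/webhcat.example.com@EXAMPLE.COM"

# Short name before lowercasing: everything up to the first "/" or "@"
short_name="${principal%%[/@]*}"

# Emulate the effect of the /L flag on the short name
lower_name="$(printf '%s' "$short_name" | tr '[:upper:]' '[:lower:]')"
echo "short name without /L: $short_name"
echo "short name with /L:    $lower_name"

# Build proxyuser properties for the lower-cased name.
# The "*" values are placeholders (assumption), not values from the original post.
snippet="$(cat <<EOF
<property>
  <name>hadoop.proxyuser.${lower_name}.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.${lower_name}.groups</name>
  <value>*</value>
</property>
EOF
)"
printf '%s\n' "$snippet"
```

On a cluster node, running "hadoop org.apache.hadoop.security.HadoopKerberosName <principal>" should show how the real auth_to_local rules map a principal, if that class is available in your Hadoop version.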

How to use “filters” to exclude files in DistCp

This article explains how to use the DistCp filters feature to skip files that do not need to be copied.

Hadoop 2.8.0 added support for excluding files whose paths match given regular expressions, so that they are not copied to the destination when the DistCp command is issued. This feature was introduced by HADOOP-1540.

However, it is not obvious how to define the regular expressions. We have customers who tried to define the following regexp in the filter file:


hoping that all files under .Trash and .staging would be skipped. However, it does not work.

By checking the code in class src/main/java/org/apache/hadoop/tools/

public boolean shouldCopy(Path path) {
  for (Pattern filter : filters) {
    // matches() succeeds only if the whole path string matches the pattern
    if (filter.matcher(path.toString()).matches()) {
      return false;
    }
  }
  return true;
}

We can see that the code uses the Matcher.matches() method, which attempts to match the entire region against the pattern.

So the above example of “Trash” will not match “/path/to/.Trash/filename”, because the pattern only matches part of the string, not the entire string.
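
To make the difference concrete, here is a small shell sketch that uses "grep -x" as a stand-in for Matcher.matches(), since both require the whole string to match. This is only an analogy: grep uses POSIX extended regular expressions rather than Java regex, and both patterns below are hypothetical examples rather than the customer's actual filter.

```shell
sample_path="/path/to/.Trash/filename"

# Hypothetical pattern that only describes part of the path
partial_pattern='\.Trash'
# Pattern that describes the whole path, with ".*" on both sides
full_pattern='.*\.Trash.*'

# grep -x succeeds only when the entire line matches the pattern
if printf '%s\n' "$sample_path" | grep -qxE "$partial_pattern"; then
  partial_result="skipped"
else
  partial_result="copied"
fi

if printf '%s\n' "$sample_path" | grep -qxE "$full_pattern"; then
  full_result="skipped"
else
  full_result="copied"
fi

echo "partial pattern: file would be $partial_result"
echo "full pattern:    file would be $full_result"
```

The partial pattern leaves the file in the copy list, while the pattern wrapped in ".*" excludes it.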

To fix it, use the following:


The full command looks like this:

hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path
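
A filter file can be checked locally before it is handed to DistCp. The sketch below writes a filter file with one regular expression per line and tests a few sample paths against it with "grep -x", which only approximates Java's whole-string matching. The two patterns and the sample paths are assumptions chosen for illustration, not the exact contents of the original filter file.

```shell
# Write the filter file, one regular expression per line (assumed patterns)
filter_file="$(mktemp)"
cat > "$filter_file" <<'EOF'
.*\.Trash.*
.*\.staging.*
EOF

# Check hypothetical sample paths against every pattern in the file
skipped=0
for sample_path in \
  "/user/alice/.Trash/old.txt" \
  "/user/alice/.staging/job_1/part-0" \
  "/user/alice/data/part-0"
do
  if printf '%s\n' "$sample_path" | grep -qxE -f "$filter_file"; then
    echo "skip: $sample_path"
    skipped=$((skipped + 1))
  else
    echo "copy: $sample_path"
  fi
done

echo "filter file written to: $filter_file"
```

The resulting file path is what would be passed to the -filters option in the command above.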

Hope this helps.

Hive Metastore Server takes a long time to start up

This article explains a possible cause when the Hive Metastore Server (HMS) takes a long time to start up (more than 10 minutes).

Every time Hive is restarted through Cloudera Manager (CM), it takes more than 10 minutes for the Hive services to turn green and for users to be able to use the beeline CLI.

One possible cause is the combination of the following two conditions:

  • Sentry HDFS sync is enabled
  • There are lots of tables or tables with lots of partitions (hundreds of thousands of partitions)

When both conditions are met, HMS needs to scan through all the tables and partitions in the HMS database at startup and then sync with the HDFS directories one by one. If there are too many tables or partitions, there are a lot of HDFS directories to sync, which takes time.

To confirm whether this is the cause, simply disable the HDFS sync, restart Hive and see how long HMS takes to start up again. If the symptom disappears, then we have confirmed the cause.

The fix is to simply keep the number of tables and partitions per table down:

  • If possible, drop the tables that you do not need
  • If you need to keep tables that have lots of partitions, try to merge those partitions if possible, by copying the data into a new table with merged partitions
  • If there are hundreds of thousands of partitions that you cannot merge, then it is time to redesign your tables so that fewer partitions are used


Securely Managing Passwords In Sqoop

Apache Sqoop became a top-level Apache project in March 2012. Since then, Sqoop has developed a lot and become very popular in the Hadoop ecosystem. In this post, I will cover the ways to specify database passwords to Sqoop in a secure way.

The following are two common ways to pass database passwords to Sqoop:

sqoop import --connect jdbc:mysql:// \
             --username myuser -P \
             --table mytable

sqoop import --connect jdbc:mysql:// \
             --username myuser \
             --password mypassword \
             --table mytable

The first one is secure because other people can’t see the password (the -P option prompts for it on the console). However, it is only practical for interactive use on the command line.

The second one is insecure, as anyone who can see the command can see the password used to access the database.

A more secure way of passing the password is through a so-called password file. The commands are as follows:

echo -n "password" > /home/ericlin/.mysql.password
chmod 400 /home/ericlin/.mysql.password
sqoop import --connect jdbc:mysql:// \
             --username myuser \
             --password-file /home/ericlin/.mysql.password \
             --table mytable

Please note that we need the “-n” option for the “echo” command so that no newline is added to the end of the password. Also, please do not use “vim” to create the file, as “vim” automatically adds a newline to the end of the file, which will cause Sqoop to fail because the password then contains a newline character.
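
One way to confirm that the password file has no trailing newline is to check its size and last byte. The sketch below uses printf instead of "echo -n" because printf behaves the same across shells, and it writes to a temporary file rather than a real home directory. The password value and the temporary path are placeholders.

```shell
# Write the password with no trailing newline (placeholder password)
password_file="$(mktemp)"
printf '%s' "password" > "$password_file"
chmod 400 "$password_file"

# The byte count should equal the password length (8 for "password")
byte_count="$(wc -c < "$password_file" | tr -d ' ')"
echo "bytes written: $byte_count"

# Show the last byte; a trailing newline would appear here as \n
last_byte="$(tail -c 1 "$password_file" | od -An -c | tr -d ' ')"
echo "last byte: $last_byte"
```

If the last byte is reported as \n, recreate the file before pointing --password-file at it.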

However, storing the password in a text file is still not considered secure, even though we have set the permissions. As of Sqoop 1.4.5, Sqoop supports using a Java KeyStore to store passwords, so that you do not need to store passwords in clear text in a file.

To generate the key:

[ericlin@localhost ~] $ hadoop credential create mydb.password.alias -provider jceks://hdfs/user/ericlin/mysql.password.jceks
Enter password: 
Enter password again: 
mysql.password has been successfully created. has been updated.

On prompt, enter the password that will be used to access the database.

“mydb.password.alias” is the alias that we pass to Sqoop when running the command, so that no password needs to be typed on the command line.

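Then you can run the following Sqoop command. The -D option tells Sqoop where the credential provider lives; it can be omitted if hadoop.security.credential.provider.path is already set in the cluster configuration:

sqoop import \
             -Dhadoop.security.credential.provider.path=jceks://hdfs/user/ericlin/mysql.password.jceks \
             --connect 'jdbc:mysql://' \
             --table mytable \
             --username myuser \
             --password-alias mydb.password.alias

This way the password is stored inside jceks://hdfs/user/ericlin/mysql.password.jceks rather than appearing on the command line or in a plain-text file.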

Hope this helps.