How to use “filters” to exclude files in DistCp

This article explains how to use the feature introduced in Apache Hadoop 2.8.0 to filter out files that don’t need to be DistCp-ed.

Hadoop 2.8.0 added support for filtering out files that match given regular expressions, so that they are not copied to the destination when a DistCp command is issued. This feature was introduced by HADOOP-1540.

However, it is not obvious how the regular expressions should be defined. We have seen customers define the following patterns in the filter file:

Trash
staging
\/\.Trash\/
\/\staging\/

hoping that all files under .Trash and .staging would be skipped. However, it does not work.

Checking the code in src/main/java/org/apache/hadoop/tools/RegexCopyFilter.java:

@Override
public boolean shouldCopy(Path path) {
  for (Pattern filter : filters) {
    if (filter.matcher(path.toString()).matches()) {
      return false;
    }
  }
  
  return true;
}

We can see that the code uses the Matcher.matches() method, which attempts to match the entire input string against the pattern.

So the above example of “Trash” will not match “/path/to/.Trash/filename”, because it matches only part of the string, not the entire string.
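This behavior can be demonstrated with a small standalone sketch (the class and method names below are hypothetical, not part of DistCp) that mirrors the shouldCopy logic from RegexCopyFilter: a file is copied only if no pattern matches the entire path.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows why an unanchored pattern like "Trash"
// has no effect when tested with Matcher.matches(), which requires the
// WHOLE string to match.
public class FilterMatchDemo {
    // Mirrors RegexCopyFilter.shouldCopy for a single pattern:
    // copy the file only if the pattern does NOT match the entire path.
    public static boolean shouldCopy(String path, String regex) {
        return !Pattern.compile(regex).matcher(path).matches();
    }

    public static void main(String[] args) {
        String path = "/path/to/.Trash/filename";
        // "Trash" matches only a substring, so matches() returns false
        // and the file is still copied -- the filter is silently ignored.
        System.out.println(shouldCopy(path, "Trash"));        // true (copied)
        // Anchoring with .* on both sides covers the whole path.
        System.out.println(shouldCopy(path, ".*\\.Trash.*")); // false (skipped)
    }
}
```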

To fix it, use the following:

.*\.Trash.*
.*\.staging.*
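To sanity-check a filter file before running a large copy, the anchored patterns can be tested against sample paths with the same list-of-patterns loop DistCp uses. This is a sketch; the paths and class name are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo: applies a list of filter patterns the same way
// RegexCopyFilter.shouldCopy does -- skip a path if ANY pattern
// matches the entire path string.
public class FilterFileDemo {
    public static boolean shouldCopy(String path, List<Pattern> filters) {
        for (Pattern filter : filters) {
            if (filter.matcher(path).matches()) {
                return false; // skipped
            }
        }
        return true; // copied
    }

    public static void main(String[] args) {
        // The two corrected patterns from the filter file
        List<Pattern> filters = Arrays.asList(
            Pattern.compile(".*\\.Trash.*"),
            Pattern.compile(".*\\.staging.*"));

        System.out.println(shouldCopy("/user/hive/.Trash/part-0", filters));    // false
        System.out.println(shouldCopy("/user/hive/.staging/job.xml", filters)); // false
        System.out.println(shouldCopy("/user/hive/data/part-0", filters));      // true
    }
}
```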

The full command looks like this:

hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path
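Putting it together, the workflow can be sketched as below. The file location and HDFS paths are placeholders; the actual distcp invocation is commented out since it requires a running cluster.

```shell
# Write the anchored patterns to a filter file (one regex per line).
cat > /tmp/filterfile.txt <<'EOF'
.*\.Trash.*
.*\.staging.*
EOF

# Inspect the file before kicking off the copy.
cat /tmp/filterfile.txt

# Then run DistCp with the -filters option (placeholder paths):
# hadoop distcp -filters /tmp/filterfile.txt \
#     hdfs://source/path hdfs://destination/path
```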

Hope this helps.

2 Comments

  1. Hi,
    I am using Hadoop 2.7.0 and want to skip some files through the DistCp command. I saw the -filters option on the Apache website, but it doesn’t help me.
    Is there a way to achieve the above scenario? Please suggest.


    1. Hi Prakash,

      Sorry about the delay, I am just back from holiday.

      As I explained in the article, you can use the -filters option and put the regular expressions in a text file to define the files that you need to filter.

      If it does not work for you, can you please explain which part is not working? What does your filter file look like?

      Thanks

