Cloudera Manager has provided the BDR (Backup and Disaster Recovery) feature for a long time. Many enterprises rely on it to back up their production data to a DR cluster, so that if the production cluster goes down or loses data, the backup cluster can be used to recover important information.
Snapshots are an HDFS feature that lets users capture the state of a directory at a particular point in time, so that they can view the contents of HDFS as of different times and roll back to older data if required. On top of that, HDFS can generate a snapshot diff report between two given snapshots, which shows what changes were made during that period. BDR makes use of this cool feature to speed up the replication process, so that only the modified files are copied.
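To make the idea concrete, here is a minimal toy model of a snapshot diff in Python. It is not how HDFS implements it: each "snapshot" is just a hypothetical dictionary mapping a file path to its modification time, and the function names are made up for illustration. The real HDFS report also tracks renames, which this sketch ignores.

```python
def snapshot_diff(old, new):
    """Compare two {path: modification_time} mappings (a toy stand-in for snapshots)."""
    created = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"created": created, "deleted": deleted, "modified": modified}


def files_to_copy(diff):
    """Only created and modified files need to be copied to the target."""
    return sorted(diff["created"] + diff["modified"])


# Hypothetical directory contents at two points in time.
old_snapshot = {"/prod/data/test/a.csv": 100, "/prod/data/test/b.csv": 100}
new_snapshot = {
    "/prod/data/test/a.csv": 100,  # unchanged, so it is skipped
    "/prod/data/test/b.csv": 250,  # modified
    "/prod/data/test/c.csv": 300,  # newly created
}

diff = snapshot_diff(old_snapshot, new_snapshot)
print(files_to_copy(diff))
```

The unchanged file never appears in the copy list, which is where the time saving comes from on large directories.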
In this post, I will briefly explain how to enable this feature for BDR via Cloudera Manager, and how to confirm that snapshot diff was used after a BDR job has finished, to make sure that we have set it up correctly.
First, in order to make use of the snapshot diff feature, we have to enable Immutable Snapshots at the HDFS level on both the source and target clusters. To do so, if you are using CDH 5.14 or higher, simply go to CM > HDFS > Configuration > Enable Immutable Snapshots, tick the box, and restart HDFS.
It should be on by default, but just double check to confirm.
If you are using CDH 5.13.x or lower, you have to enable it in a Safety Valve, as Cloudera Manager won’t expose this setting natively even if you are using the latest CM. Go to CM > HDFS > Configuration > “HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml” and enter the setting below:
Then save and restart HDFS.
Next, we need to make sure that the directories you are about to copy are snapshottable on both the source and destination clusters. This is simple enough to do: just run the command below on both clusters to enable it:
hdfs dfsadmin -allowSnapshot /prod/data/test
Allowing snaphot on /prod/data/test succeeded
At this point, we are ready to use the snapshot diff feature in BDR.
To demonstrate, I have set up one HDFS replication job. Hive replication is similar, though the steps are slightly different, so I will skip it in this post. The job settings look like below, pretty straightforward:
After running the job, we can see that there are 4 steps to replicate HDFS from the source to the target cluster.
- Run Pre-Filelisting Check
- This step simply checks if we have snapshot enabled in the target cluster or not, and whether we are able to generate a snapshot diff report
- Run File Listing on Peer cluster
- This step runs a job in the source cluster to retrieve the list of files that need to be copied, and saves the output to HDFS under /user/hdfs/.cm/distcp-staging/2020-04-22-22-02-50-03545d8c/fileList.seq, where the date and time in the directory name will be different for your job.
- Transfer Listing File from Peer Cluster
- The third step simply copies the output from step 2 from source to target
- Trigger a HDFS replication job on one of the available HDFS roles
- The last step does the actual work: it triggers the distcp command to copy data based on the file list generated on the source cluster
To check whether snapshot diff has been applied, we simply need to check steps 1 and 4.
Expand the “Run Pre-Filelisting Check” step and click on “stderr” tab in the command view for the BDR job that we just ran in CM:
Then check the log at the bottom, where you will see the messages below:
20/04/22 15:31:46 INFO distcp.SnapshotDiffGenerator: Target: Checking changes since last run under /prod/data/test
20/04/22 15:31:46 INFO distcp.SnapshotDiffGenerator: Found a snapshottable ancestor (or self): /prod/data/test
20/04/22 15:31:46 WARN distcp.SnapshotDiffGenerator: distcp-309--952841939-old is not present. It's normal if this is the first run of the replication or previous replication run was failed
20/04/22 15:31:46 INFO distcp.SnapshotDiffGenerator: Target: Failure to get Diff for path /prod/data/test
20/04/22 15:31:46 INFO util.DistCpUtils: Failed to use diff for path, falling back to regular distcp
20/04/22 15:31:46 INFO util.DistCpUtils: Replicate ACL: false
20/04/22 15:31:46 INFO util.DistCpUtils: Replicate XAttr: false
20/04/22 15:31:46 INFO distcp.PreCopyListingCheck: Done in: (seconds) 0.955
We can see that getting the snapshot diff report failed because the old snapshot was missing: “distcp-309--952841939-old is not present”. Every time a BDR job runs, it creates a “new” snapshot at the start and renames it to “old” just before it finishes, so that the next run can pick up that snapshot and perform the diff. In our case, since this was the first time I ran the job, the “old” snapshot was not there, so BDR was not able to generate a snapshot diff report, as indicated by the message: “Failed to use diff for path, falling back to regular distcp”.
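The new-to-old rotation described above can be sketched as a small simulation. This is only a toy model based on the behaviour seen in the logs, not BDR's actual code; the class and method names are hypothetical, and only the snapshot name prefix is taken from the log output.

```python
class SnapshotRotation:
    """Toy model of how a replication job rotates its "new" and "old" snapshots."""

    def __init__(self, prefix):
        self.old_name = prefix + "-old"
        self.new_name = prefix + "-new"
        self.snapshots = set()  # snapshot names that currently exist

    def run_job(self):
        """Simulate one replication run; return True if a diff could be used."""
        # A "new" snapshot is taken at the start of every run.
        self.snapshots.add(self.new_name)
        # A diff needs the "old" snapshot left behind by a previous successful run.
        used_diff = self.old_name in self.snapshots
        # (The copy itself would happen here: diff-based or a regular full distcp.)
        # Just before finishing: delete "old", then rename "new" to "old".
        self.snapshots.discard(self.old_name)
        self.snapshots.remove(self.new_name)
        self.snapshots.add(self.old_name)
        return used_diff


rotation = SnapshotRotation("distcp-309--952841939")
first_run_used_diff = rotation.run_job()
second_run_used_diff = rotation.run_job()
print(first_run_used_diff, second_run_used_diff)
```

In this model the first run can never use a diff, because nothing has created the “old” snapshot yet. The same would apply after a failed run that never reached the rename step.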
If you check the log messages from the last step, you can see that the distcp job indeed did not use snapshot diff:
20/04/22 15:33:13 INFO distcp.DistCp: Used diff: false
20/04/22 15:33:13 INFO distcp.DistCp: Deleting old snapshot and renaming new to old on Source
20/04/22 15:33:13 INFO distcp.DistCp: Deleting old snapshot and renaming new to old on Target
Now, let’s run it again to see if snapshot diff can be used, since the job has already run once. Checking the logs from the same steps, we can see that something has changed:
20/04/22 15:45:39 INFO distcp.SnapshotDiffGenerator: Target: Checking changes since last run under /prod/data/test
20/04/22 15:45:39 INFO distcp.SnapshotDiffGenerator: Found a snapshottable ancestor (or self): /prod/data/test
20/04/22 15:45:39 INFO distcp.SnapshotDiffGenerator: snapshotDiffReport with subdir of the snapshot root is successful.
20/04/22 15:45:39 INFO distcp.SnapshotDiffGenerator: Target: No changes since last run under /prod/data/test
20/04/22 15:45:39 INFO util.DistCpUtils: Replicate ACL: false
20/04/22 15:45:39 INFO util.DistCpUtils: Replicate XAttr: false
20/04/22 15:45:39 INFO distcp.PreCopyListingCheck: Done in: (seconds) 0.943
This time, the BDR job found that the directory is snapshottable (“Found a snapshottable ancestor (or self): /prod/data/test”), the snapshot diff report at the sub-directory level succeeded (“snapshotDiffReport with subdir of the snapshot root is successful”), and it also recognized that nothing had changed in the target cluster (“Target: No changes since last run under /prod/data/test”).
Checking the logs from the last step confirms the same:
20/04/22 15:46:57 INFO distcp.DistCp: Used diff: true
20/04/22 15:46:57 INFO distcp.DistCp: Deleting old snapshot and renaming new to old on Source
20/04/22 15:46:57 INFO distcp.SnapshotMgr: Deleted old snapshot: distcp-309--952841939-old
20/04/22 15:46:57 INFO distcp.SnapshotMgr: Renamed new snapshot: distcp-309--952841939-new to old: distcp-309--952841939-old
20/04/22 15:46:57 INFO distcp.DistCp: Deleting old snapshot and renaming new to old on Target
20/04/22 15:46:57 INFO distcp.SnapshotMgr: Deleted old snapshot: distcp-309--952841939-old
20/04/22 15:46:57 INFO distcp.SnapshotMgr: Renamed new snapshot: distcp-309--952841939-new to old: distcp-309--952841939-old
You can see that this time “Used diff” was true, and BDR had managed to rename snapshots created during the process from “new” to “old”.
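If you check these logs often, the manual inspection can be scripted. The sketch below assumes you have saved the stderr text of the relevant steps to a string; the function names are my own, and the patterns are based only on the log lines shown in this post, so they may need adjusting for other CM/CDH versions.

```python
import re

# Patterns taken from the log lines quoted in this post (an assumption for other versions).
USED_DIFF = re.compile(r"distcp\.DistCp: Used diff: (true|false)")
FALLBACK = "Failed to use diff for path, falling back to regular distcp"


def used_snapshot_diff(stderr_text):
    """Return True/False from the last "Used diff" line, or None if there is none."""
    matches = USED_DIFF.findall(stderr_text)
    return (matches[-1] == "true") if matches else None


def fell_back_to_regular_distcp(stderr_text):
    """Return True if the pre-filelisting check reported a fallback."""
    return FALLBACK in stderr_text


# Excerpts from the two runs above (steps 1 and 4 concatenated for brevity).
first_run = """\
20/04/22 15:31:46 INFO util.DistCpUtils: Failed to use diff for path, falling back to regular distcp
20/04/22 15:33:13 INFO distcp.DistCp: Used diff: false
"""
second_run = """\
20/04/22 15:45:39 INFO distcp.SnapshotDiffGenerator: Target: No changes since last run under /prod/data/test
20/04/22 15:46:57 INFO distcp.DistCp: Used diff: true
"""

print(used_snapshot_diff(first_run), fell_back_to_regular_distcp(first_run))
print(used_snapshot_diff(second_run), fell_back_to_regular_distcp(second_run))
```

Returning None when no “Used diff” line is present keeps "the job did not use a diff" separate from "this is not the right log".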
Lastly, I would like to list all the conditions required for BDR to be able to use the snapshot diff feature, as they are not always obvious. Below is the full list, taken from Cloudera’s official documentation:
- The source and target clusters must be managed by Cloudera Manager 5.15.0 or higher. If the destination is Amazon S3 or Microsoft ADLS, the source cluster must be managed by Cloudera Manager 5.15.0 or higher. Snapshot diff-based restore to S3 or ADLS is not supported.
- The source and target clusters run CDH 5.15.0 or higher, 5.14.2 or higher, or 5.13.3 or higher.
- Verify that HDFS snapshots are immutable.
In the Cloudera Manager Admin Console, go to Clusters > <HDFS service> > Configuration and search for Enable Immutable Snapshots.
- Do not use snapshot diff for globbed paths. It is not optimized for globbed paths.
- Set the snapshot root directory as low in the hierarchy as possible.
- To use the snapshot diff feature, the user who is configured to run the job needs to be either a super user or the owner of the snapshottable root, because the run-as user must have permission to list the snapshots.
- Decide if you want BDR to abort on a snapshot diff failure or continue the replication. If you choose to configure BDR to continue the replication when it encounters an error, BDR performs a complete replication. Note that continuing the replication can result in a longer duration since a complete replication is performed.
- BDR performs a complete replication when one or more of the following change: Delete Policy, Preserve Policy, Target Path, or Exclusion Path.
- Paths from both source and destination clusters in the replication schedule must be under a snapshottable root or should be snapshottable for the schedule to run using snapshot diff.
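The last condition matches the “Found a snapshottable ancestor (or self)” message seen in the logs earlier. As a rough illustration of what "under a snapshottable root" means, here is a small sketch; the function name is hypothetical, the list of snapshottable roots is assumed to be known in advance, and this is not BDR's actual lookup code.

```python
from pathlib import PurePosixPath


def find_snapshottable_ancestor(path, snapshottable_roots):
    """Return the nearest snapshottable ancestor (or the path itself), else None."""
    target = PurePosixPath(path)
    roots = {PurePosixPath(root) for root in snapshottable_roots}
    # Walk from the path itself up towards "/", so the closest root wins.
    for candidate in [target, *target.parents]:
        if candidate in roots:
            return str(candidate)
    return None


roots = ["/prod/data/test"]  # directories on which -allowSnapshot has been run

print(find_snapshottable_ancestor("/prod/data/test", roots))
print(find_snapshottable_ancestor("/prod/data/test/2020/part-0", roots))
print(find_snapshottable_ancestor("/prod/data/testing", roots))
```

Comparing whole path components rather than string prefixes matters here: /prod/data/testing is not under /prod/data/test, even though the strings share a prefix.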
This concludes the post on how BDR jobs make use of the HDFS snapshot diff feature.