Mastering Hadoop 3

HDFS Snapshots

Today, data is the backbone of businesses, and users do not want to lose it to a machine failure or a disaster. A File System user therefore needs a way to back up and restore important data, and HDFS introduced Snapshots for this purpose. A snapshot is a point-in-time image of a part of the File System or of the entire File System. In other words, an HDFS Snapshot captures a subtree of HDFS, such as a directory and its sub-directories, or the entire HDFS namespace. Let's look into a few of the common use cases of HDFS Snapshots:

  • Backup: The admin may want to back up the entire File System, a subtree in the File System, or just a file (by taking a snapshot of the directory that contains it). Depending on the requirements, the admin can take a read-only snapshot. This snapshot can be used to recover data or to send data to remote storage. 
  • Protection: The user may accidentally delete files or an entire directory on HDFS. Files deleted from the command line normally go into the trash and can be recovered, but files deleted through the File System API bypass the trash and are not recoverable. The admin can set up a job that takes an HDFS Snapshot on a regular basis so that any deleted file can be restored from a snapshot. 
  • Application Testing: Testing an application against the original dataset is a common requirement of application developers and application users. The application may not behave as expected, which may lead to data loss or corrupt the production data. In such cases, the admin can take an HDFS Snapshot of the original data and give the user a copy made from that snapshot for testing (HDFS snapshots themselves are read-only). Any changes made to this copy are not reflected in the original dataset. 
  • Distributed Copy (distcp): distcp is used to copy data from one cluster to another. If someone deletes a source file or moves data to another location while the copy is running, distcp is left in an inconsistent state. An HDFS Snapshot addresses this problem: distcp can copy from the snapshot, which does not change, instead of from the live directory. 
  • Legal and Auditing: Organizations may want to retain data for a certain period for legal or internal processes, to see what has changed over time, or to produce aggregated reports. They may also want to audit the File System. Snapshots that are taken regularly record the state of the data at each point in time and can be used for auditing or legal purposes. 
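
The distcp use case above can be sketched as follows. HDFS exposes each snapshot as a read-only path of the form <directory>/.snapshot/<snapshotName>, so distcp can be pointed at that path instead of the live directory. This is a minimal sketch, not a definitive procedure: the NameNode addresses, the directory, the snapshot name, and the snapshot_path helper are all hypothetical, and the command is only printed so that the sketch can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: build the read-only path under which HDFS exposes
# a snapshot, that is <snapshottable-dir>/.snapshot/<snapshot-name>.
snapshot_path() {
    dir="${1%/}"          # drop a trailing slash, if any
    printf '%s/.snapshot/%s\n' "$dir" "$2"
}

# Hypothetical values, for illustration only.
SRC_NN="hdfs://nn1.example.com:8020"
DST_NN="hdfs://nn2.example.com:8020"
SRC_DIR="/data/sales"
SNAPSHOT="s20240131"

# Copy from the frozen snapshot rather than the live directory, so that
# deletes or renames made during the copy do not affect the source.
# "echo" only prints the command; remove it to run on a real cluster.
echo hadoop distcp "${SRC_NN}$(snapshot_path "$SRC_DIR" "$SNAPSHOT")" "${DST_NN}/backup/sales"
```

The snapshot must already exist on the source directory before distcp is started.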

Let's see how we can take a snapshot of an HDFS tree, sub-tree, or sub-directory. Before you can take a snapshot, you have to allow snapshots on the directory at the root of that tree or sub-tree. This can be done by using the following command:

hdfs dfsadmin -allowSnapshot <path>
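
As a minimal sketch of this step, the following script allows snapshots on a directory and then lists the snapshottable directories to confirm the change. The directory /data/sales and the run helper are hypothetical and exist only for illustration. By default the script prints the commands rather than executing them, so it can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: print commands by default; set DRY_RUN=0 to
# execute them against a real cluster.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical directory, for illustration only.
DIR="/data/sales"

# Mark the directory as snapshottable (requires superuser privileges).
run hdfs dfsadmin -allowSnapshot "$DIR"

# List the snapshottable directories visible to the current user,
# to confirm that the directory now appears.
run hdfs lsSnapshottableDir
```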

Once the directory is snapshottable, you can take a snapshot of it using the following command:

hdfs dfs -createSnapshot <path> [<snapshotName>]

Here, path is the path of the snapshottable directory that you want to take a snapshot of; it cannot be an individual file. Until you allow snapshots on the directory, the preceding command will not succeed. snapshotName is an optional name that you can assign to the snapshot; if you omit it, HDFS generates a default name from the current timestamp. It is good practice to include the date in the snapshot's name for identification.
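
The naming advice above can be sketched as a small script that builds a date-stamped snapshot name, creates the snapshot, and shows where it can be read back. This is an illustrative sketch under assumptions: the directory /data/sales, the file report.csv, the "daily" prefix, and the snapshot_name helper are hypothetical. The commands are only printed so that the sketch can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: build a date-stamped snapshot name from a prefix
# and today's date in YYYYMMDD form.
snapshot_name() {
    printf '%s-%s\n' "$1" "$(date +%Y%m%d)"
}

DIR="/data/sales"                 # hypothetical snapshottable directory
NAME="$(snapshot_name daily)"

# "echo" only prints each command; remove it to run on a real cluster.
echo hdfs dfs -createSnapshot "$DIR" "$NAME"

# Snapshots are exposed under the reserved .snapshot directory of the path.
echo hdfs dfs -ls "$DIR/.snapshot"

# Restore a file (hypothetical name) by copying it out of the snapshot.
echo hdfs dfs -cp "$DIR/.snapshot/$NAME/report.csv" "$DIR/"
```

A job scheduler such as cron could call a script like this on a regular basis to implement the periodic snapshots described under Protection.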