This post shows how to find duplicate files in Rust by comparing their digests. The code lists the duplicate files grouped under each shared digest.
Personal Use Case
I have a lot of files stored in Google Drive. I plan to back them up to external hard disks to minimize my online storage usage costs. To efficiently use the local disks, I need to find duplicate files and get rid of them. Doing so manually is not easy with tons of files. Therefore, I need a small Rust application that only lists all duplicate files.
Find Duplicate Files In Rust
Before we get to the logic, we need to decide on the data structure first. The easiest one is a HashMap whose key type and value type are String and Vec<String>, respectively.
```rust
use std::collections::HashMap;
let mut filenames: HashMap<String, Vec<String>> = HashMap::new();
```
The key is the hash (digest) of a file's content, and the value is the list of files that share that digest.
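To make the shape of this map concrete, here is a minimal sketch that groups a few made-up file names under made-up digest strings. The digests and paths are placeholders rather than real hashes, and `group_by_digest` is a hypothetical helper name used only for this illustration.

```rust
use std::collections::HashMap;

/// Groups (digest, path) pairs so that every digest maps to all paths sharing it.
/// `group_by_digest` is a hypothetical helper used only for this illustration.
fn group_by_digest(pairs: &[(&str, &str)]) -> HashMap<String, Vec<String>> {
    let mut filenames: HashMap<String, Vec<String>> = HashMap::new();
    for (digest, path) in pairs {
        filenames
            .entry(digest.to_string())
            .or_default()
            .push(path.to_string());
    }
    filenames
}

fn main() {
    // Placeholder digests and paths, not real hashes.
    let pairs = [
        ("AAAA", "./photos/cat.jpg"),
        ("BBBB", "./notes.txt"),
        ("AAAA", "./backup/cat-copy.jpg"),
    ];
    let filenames = group_by_digest(&pairs);
    // Two paths share the digest "AAAA", so they count as duplicates of each other.
    println!("{:?}", filenames["AAAA"]);
}
```

Any key whose list holds more than one path marks a set of duplicates.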
Recursively Traverse Directories
We need to walk the directory tree recursively and visit the files one by one, skipping the directories themselves.
```rust
// ...
for entry in WalkDir::new(".")
    .into_iter()
    .filter_map(Result::ok)
    .filter(|e| !e.file_type().is_dir())
{
    // The per-file logic goes here
}
// ...
```
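The `WalkDir` iterator in the loop above comes from the walkdir crate. For readers curious about what such a traversal involves, here is a rough sketch of the same idea using only the standard library. It is not how walkdir is implemented, `collect_files` is a hypothetical helper name, and unlike the loop above it stops at the first unreadable entry instead of skipping it.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every non-directory entry below `dir`.
/// `collect_files` is a hypothetical helper, not part of the walkdir crate.
fn collect_files(dir: &Path, files: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if entry.file_type()?.is_dir() {
            // Descend into the subdirectory.
            collect_files(&path, files)?;
        } else {
            files.push(path);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut files = Vec::new();
    collect_files(Path::new("."), &mut files)?;
    for file in &files {
        println!("{}", file.display());
    }
    Ok(())
}
```

For a real tool, the walkdir crate is probably the more convenient choice, since it handles the iteration and error reporting for us.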
Build Up List Of Duplicate Files
As the code goes through the files, it gathers the list of duplicate files using our data structure and the following logic.
```rust
// ...
let f_name = entry.path().to_string_lossy().into_owned();
// Open the file through its real path; the lossy string is only for display.
let input = File::open(entry.path())?;
let reader = BufReader::new(input);
// sha256_digest (defined elsewhere in the program) hashes the file's content.
let digest = sha256_digest(reader)?;
let hash_code = HEXUPPER.encode(digest.as_ref());

// Create an empty list the first time a hash is seen, then add the file name.
filenames.entry(hash_code).or_default().push(f_name);
// ...
```
First, get the complete file name. Then, generate the hash using the content of the file. If a hash is not available in the HashMap as a key, create an entry using that key and the file name as its value. Otherwise, retrieve the entry and add the file name to the list of duplicate files.
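The hashing helper itself is not shown in this post, so the following is only a self-contained sketch of the same steps under a loud assumption: it swaps in the standard library's `DefaultHasher` as a stand-in for SHA-256 so that it runs without extra crates. `DefaultHasher` is not a cryptographic hash and should not be relied on to tell files apart in a real tool. The names `content_hash` and `group_by_content` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, BufReader, Read};
use std::path::PathBuf;

/// Hashes everything readable from `reader` and returns an uppercase hex string.
/// Stand-in only: DefaultHasher is NOT SHA-256 and is not collision resistant.
fn content_hash<R: Read>(mut reader: R) -> io::Result<String> {
    let mut hasher = DefaultHasher::new();
    let mut buffer = [0u8; 8192];
    loop {
        let count = reader.read(&mut buffer)?;
        if count == 0 {
            break; // end of input
        }
        hasher.write(&buffer[..count]);
    }
    Ok(format!("{:016X}", hasher.finish()))
}

/// Groups the given paths by the hash of their content.
fn group_by_content(paths: &[PathBuf]) -> io::Result<HashMap<String, Vec<String>>> {
    let mut filenames: HashMap<String, Vec<String>> = HashMap::new();
    for path in paths {
        let reader = BufReader::new(File::open(path)?);
        let hash_code = content_hash(reader)?;
        // Create the list on first sight of this hash, then append the file name.
        filenames
            .entry(hash_code)
            .or_default()
            .push(path.to_string_lossy().into_owned());
    }
    Ok(filenames)
}

fn main() -> io::Result<()> {
    // Hypothetical usage: hash the files named on the command line.
    let paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
    for (hash, files) in group_by_content(&paths)? {
        println!("{} {:?}", hash, files);
    }
    Ok(())
}
```

Reading in fixed-size chunks keeps memory use flat even for large files, which matters when scanning a whole backup.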
Display Duplicate Files After Finding Them
The remaining code is straightforward. The map now holds every file grouped by hash, and that includes files that have no duplicates. Therefore, we only display the hashes that have more than one file.
```rust
// ...
    for (hash, files) in &filenames {
        if files.len() > 1 {
            println!("{}", hash);
            for file in files.iter() {
                println!("---- {}", file);
            }
        }
    }

    Ok(())
}
```
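One thing to keep in mind is that a HashMap does not iterate in any particular order, so the groups can print in a different order from run to run. As a small sketch, the filtering step could be pulled into a helper that also sorts the groups by hash. `duplicate_groups` is a hypothetical name, and the digests below are placeholders.

```rust
use std::collections::HashMap;

/// Returns only the groups with more than one file, sorted by hash.
/// `duplicate_groups` is a hypothetical helper, not part of the original program.
fn duplicate_groups(filenames: &HashMap<String, Vec<String>>) -> Vec<(&String, &Vec<String>)> {
    let mut groups: Vec<_> = filenames
        .iter()
        .filter(|(_, files)| files.len() > 1)
        .collect();
    groups.sort_by(|a, b| a.0.cmp(b.0));
    groups
}

fn main() {
    let mut filenames: HashMap<String, Vec<String>> = HashMap::new();
    // Placeholder digests and file names for illustration.
    filenames.insert("AAAA".into(), vec!["a.txt".into(), "copy-of-a.txt".into()]);
    filenames.insert("BBBB".into(), vec!["unique.txt".into()]);

    for (hash, files) in duplicate_groups(&filenames) {
        println!("{}", hash);
        for file in files {
            println!("---- {}", file);
        }
    }
}
```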
The code is basic, and we can still improve it. For example, the program could take the root directory as a command-line argument. It could even delete or move the duplicate files, retaining only one copy of each.
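As a starting point for those improvements, here is a hedged sketch of two small building blocks: choosing the directory from the command line and splitting a group of duplicates into the copy to keep and the rest. Both `root_from_args` and `split_keep_first` are hypothetical helpers, keeping the first entry is an arbitrary choice, and the sketch only prints what it would remove rather than deleting anything.

```rust
use std::path::PathBuf;

/// Picks the directory to scan: the first argument if given, otherwise ".".
/// `root_from_args` is a hypothetical helper for illustration.
fn root_from_args<I: Iterator<Item = String>>(mut args: I) -> PathBuf {
    // Skip the program name, then take the first real argument.
    args.next();
    args.next()
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("."))
}

/// Splits one group of duplicates into the copy to keep and the remaining copies.
/// Keeping the first entry is an arbitrary choice made for this sketch.
fn split_keep_first(files: &[String]) -> Option<(&String, &[String])> {
    files.split_first()
}

fn main() {
    let root = root_from_args(std::env::args());
    println!("Scanning {}", root.display());

    // Placeholder group of duplicate file names.
    let group = vec!["a.txt".to_string(), "copy-of-a.txt".to_string()];
    if let Some((keep, extras)) = split_keep_first(&group) {
        println!("keep   {}", keep);
        for extra in extras {
            // Only print here; actual deletion could use std::fs::remove_file.
            println!("remove {}", extra);
        }
    }
}
```

Printing the plan first and deleting only after a manual review is a safer default when the files are the only backup.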
This post is part of the Rust Programming Language For Beginners Tutorial.