Sunday, May 26, 2013

Finding Duplicate Folders

There isn't one specific Linux utility that does this.  As is very often the case with Linux, a combination of tools must be used together to meet your goal.

The first challenge here is: what's your definition of "duplicated"?

Almost certainly, the two folders don't have the same name.  Almost certainly, the two folders weren't created at exactly the same time.  The files within the two will most probably have different time stamps - access times, modification times - as well.  What about ownership?  Permissions?  As you can see, there's a LOT about a directory that might differ.
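(If you want to see just how much can differ, run stat on any two folders you suspect are copies - it shows ownership, permissions, and three separate timestamps side by side.  The paths below are just placeholders.)

% stat /path/to/one_folder /path/to/the_other_folder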

I'll focus on whether the content of the folders appears to be duplicated - i.e. same number of files, with same filenames, all duplicates.  

You can achieve this as follows.

First, in a shell, change directory to the top of whatever tree you want to search for duplicated directories.  Then run the following.

for d in `find $PWD -type d -print`; do \
echo `cd $d;ls -1ARs .|cksum|sed 's/ /_/'` $d;done| \
awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'

Pretty, isn't it?  ...pretty ugly! Let's break it down.

for d in `find $PWD -type d -print`; do something; done

find gets run with $PWD as an argument, and asked to tell us every directory within the current directory.  Because we used $PWD - a fully-qualified pathname - we get back fully qualified pathnames, rather than relative pathnames.  This makes it possible for us to tell our script to change directory into each directory in turn - which is needed for a reason that'll become evident shortly.
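To make that concrete: if you happened to start from /home/joe/stuff (a made-up path, matching the example later on), find would hand the loop something like this - note that every line is an absolute path, and that the starting directory itself is included:

% find $PWD -type d -print
/home/joe/stuff
/home/joe/stuff/directory_one
/home/joe/stuff/directory_two
/home/joe/stuff/directory_three
/home/joe/stuff/duplicate_of_directory_two
...

(Having the top directory in the list is harmless - it can only be reported if some other directory's listing happens to checksum identically.)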

With the above, in each iteration of the for loop, "$d" will have a directory name.

Let's look at the "something" that gets done with each directory name.

echo `cd $d;ls -1ARs .|cksum|sed 's/ /_/'` $d

That says to output a line, with the results of a command pipeline, followed by a space character, then the directory name.

cd $d;ls -1ARs .|cksum|sed 's/ /_/'

We "cd" into $d  - not caring where we were before, relative to $d, which is only possible because the directories are fully qualified.  I.e. we're not saying "cd ../otherdir" - we're saying "cd /home/joe/stuff/otherdir"

Now that we're in that directory, we can refer to it as "." - thus, working around the fact that our two otherwise "identical" directories probably don't have the same name.

Once we're in that directory, we do an "ls -1ARs ." - one entry per line (-1), hidden files included (-A), recursively (-R), with sizes in blocks (-s).  That will produce an identical listing for any two directories which have the same files taking up the same number of blocks.

We don't really care what the output has to say - we care whether it's the same as the output from some other directory.  'cksum' to the rescue.  

cksum calculates a CRC checksum and outputs it along with a byte count.  It's quite happy to checksum "stdin", which is what we do - we send the output from "ls" into cksum.  We then change the space between the resulting checksum and the byte count to an "_", making the pair a single "word".
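If you want to see the shape of what that produces, run the pieces by hand inside any directory.  The numbers below are made-up placeholders - only the format matters:

% ls -1ARs . | cksum
3678923678 2290
% ls -1ARs . | cksum | sed 's/ /_/'
3678923678_2290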

So... that "echo" creates output that looks kind of like this:

237894789783_1823 ./directory_one
367892367829_2290 ./directory_two
378927829789_4430 ./directory_three
367892367829_2290 ./duplicate_of_directory_two

Awk comes in REALLY handy for the next step - figuring out, across all of those checksums, which directories are duplicates.

awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'

We create two associative arrays - "c" and "s".  They get created automagically just by being referenced, so that works out great.

For each line of input, we run the first {block}:  c[$1]++; s[$1]=s[$1] " " $2

"c" (count) at index $1 (the checksum_size string) gets incremented.  It defaults to zero, so that works great.

s[$1] gets the directory name appended to whatever it currently holds.  It defaults to an empty string, so that works great as well.

Then, once we're out of input lines... we run the "END" {block} - for (i in c) {if (c[i]>1) print s[i]}

That loops over every entry in the c array, looking for checksums that were seen more than once, then prints out the corresponding entry from the s array.
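To make that concrete, here's what the two arrays would hold just before the END block runs, given the four sample lines shown earlier (this is just a notation for the array contents, not awk syntax):

c["2378947897_1823"] = 1   s["2378947897_1823"] = " /home/joe/stuff/directory_one"
c["3678923678_2290"] = 2   s["3678923678_2290"] = " /home/joe/stuff/directory_two /home/joe/stuff/duplicate_of_directory_two"
c["3789278297_4430"] = 1   s["3789278297_4430"] = " /home/joe/stuff/directory_three"

Only the middle entry has a count greater than 1, so only its s[] string gets printed - leading space and all.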

The result, as you might expect if you've been following this all the way through (in which case... congratulations!) will be something like this:

% for d in `find $PWD -type d -print`; do echo `cd $d;ls -1ARs .|cksum|sed 's/ /_/'` $d;done|awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'
 /home/joe/stuff/directory_two /home/joe/stuff/duplicate_of_directory_two

... each line of output will identify two or more directories that have the exact same content.  Almost.

I say "almost" because, in fact, the "size" option on "ls -1ARs" is in blocks. 

A file could have some extra content compared to its counterpart, and this approach would miss that until the extra content was enough to make the file grow by a whole block.

That means the above approach might have "false positives" - it might identify two directories as being duplicates, when there's actually a slight difference in one or more files within those directories.
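You can demonstrate the false positive directly.  The directory and file names below are just for illustration, and the exact block counts depend on your filesystem, but the two listings - and therefore the two checksums - come out identical even though the file contents differ:

% mkdir demo_one demo_two
% echo "some content" > demo_one/data.txt
% echo "some content, plus a little bit extra" > demo_two/data.txt
% (cd demo_one; ls -1ARs . | cksum)
% (cd demo_two; ls -1ARs . | cksum)

Both files fit in the same number of blocks, so both directories produce the same "ls -1ARs" listing, and cksum can't tell them apart.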

Many people would consider this a pretty awesome feature, rather than a problem.  Indeed... this can identify "almost duplicate" directories.

The 100% accurate way - or as close to it as a CRC will get you - would be to cksum each file.  This is HUGELY costly. It works, but it takes FOREVER to run.  If you have "forever" to wait, here's that approach:

for d in `find $PWD -type d -print`; do echo `cd $d;cksum $(find . -type f -print|sort)|cksum|sed 's/ /_/'` $d;done|awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'

Instead of using "ls" we cksum all of the files in ".", then cksum the OUTPUT of all those cksums.

Since that involves reading every single byte of every single file in every single directory being scanned... it's very expensive to run.  That's why I presented the cheaper approach - the one that might have a few "false positives" - first.

Wrapping this up (as if it wasn't already way too long)... you could indeed use the first approach, grab the output listing all the directories that might be duplicates, then feed THAT list to a routine that used checksums to be 100% sure before identifying the directories as duplicates.  I'll leave that as an exercise in leveraging all of the techniques I've shared above.
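(If you'd like to check your work on that exercise, here's one possible sketch - untested, and built only from the pieces above: run the cheap pass, then re-check each group of candidates with the per-file cksum approach before declaring them duplicates.)

for d in `find $PWD -type d -print`; do \
echo `cd $d;ls -1ARs .|cksum|sed 's/ /_/'` $d;done| \
awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'| \
while read dirs; do \
for d in $dirs; do \
echo `cd $d;cksum $(find . -type f -print|sort)|cksum|sed 's/ /_/'` $d;done| \
awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}';done

Only the directories that survive both passes get printed as duplicates.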
