Some Techniques to Detect Duplicate Files

I had been using DoubleKiller and Beyond Compare to compare drives and detect duplicate files. Of course, there were many free and commercial programs to compare files, do backups, and detect duplicates.

DoubleKiller and Beyond Compare had been useful in sorting out a tangled mess of files that I thought might be duplicates: their exact-contents (CRC) and similar-filename comparison options did identify many duplicates. For example, I could pretty easily find cases where I had converted Filename.doc to Filename.pdf, and also cases where File1.doc and File2.doc were just identical copies of the same file.
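To illustrate the idea behind an exact-contents comparison, here is a minimal Python sketch that groups files by size and CRC-32. It is not how DoubleKiller or Beyond Compare work internally; the function names are hypothetical, and a matching CRC is only a strong hint, so a byte-by-byte check would still be needed before deleting anything.

```python
import os
import zlib
from collections import defaultdict


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file's contents, reading it in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc


def group_by_crc(root):
    """Map (size, CRC-32) to the sorted list of paths under root that share both."""
    groups = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            key = (os.path.getsize(path), crc32_of_file(path))
            groups[key].append(path)
    # Only groups with two or more members are candidate duplicates.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

Calling group_by_crc on a folder containing identical File1.doc and File2.doc would put both paths in one group, while a file with different contents would be left out.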

With some effort, I could stretch those comparisons: for example, I might use a combination of a batch file and Bulk Rename Utility to move certain files to a single folder and then rename them en masse before doing the PDF conversion. So, for example, “Letter dated 8-26-13.doc” might get moved to D:\Workspace and then converted to “2013-08-26 Letter.pdf”, at which point it might be identified as either an exact or a filename duplicate.
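The renaming step in that example can be sketched in Python. This is only an illustration of the name transformation, not a replacement for Bulk Rename Utility or the PDF conversion itself. The pattern, the function name, and the assumption that two-digit years belong to the 2000s are all hypothetical.

```python
import re

# Hypothetical pattern: "<title> dated M-D-YY.<ext>", as in the example above.
DATED_NAME = re.compile(
    r"^(?P<title>.+?) dated (?P<m>\d{1,2})-(?P<d>\d{1,2})-(?P<y>\d{2})\.(?P<ext>[^.]+)$"
)


def normalized_name(filename, new_ext="pdf", century=2000):
    """Return 'YYYY-MM-DD <title>.<new_ext>', or None if the name does not fit."""
    match = DATED_NAME.match(filename)
    if match is None:
        return None
    year = century + int(match["y"])  # assumption: all years are 20xx
    month = int(match["m"])
    day = int(match["d"])
    return f"{year:04d}-{month:02d}-{day:02d} {match['title']}.{new_ext}"


print(normalized_name("Letter dated 8-26-13.doc"))  # prints: 2013-08-26 Letter.pdf
```

Names that do not fit the pattern return None, so they can be set aside for manual review rather than renamed incorrectly.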

These steps left cases where files had been renamed in a more scattered or mangled way. For example, I might have previously looked at a file to identify its date, and then renamed it accordingly: the file formerly known as “Unspecified Smith Letter to X” might now be “2013-08-26 Smith Letter to Jones.” The letter might be substantially the same, but a slight change of contents would defeat an exact file contents comparison, and there would not be a readily identified duplication of names. There were potentially ways to do a less-than-exact contents comparison, but I was not presently seeing how to do that effectively.
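One possible form of a less-than-exact contents comparison, offered only as a sketch and not as something this post used, is a similarity ratio between two pieces of text. The example below uses difflib from the Python standard library. It assumes the text has already been extracted from the documents, since comparing raw .doc or .pdf bytes this way would not be meaningful, and the sample sentences are made up.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Return a ratio between 0.0 and 1.0; 1.0 means the texts are identical."""
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


original = "Dear Mr. Jones, thank you for your letter of August 26."
revised = "Dear Mr. Jones, thank you for your letter of August 27."
unrelated = "Invoice 4471: twelve boxes of paper."

print(similarity(original, original))   # prints: 1.0
print(similarity(original, revised) > similarity(original, unrelated))  # prints: True
```

Any threshold for treating two files as probable duplicates would have to be chosen by experiment, and comparing every pair of long documents this way would be slow on a large collection.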

One technique was to use a batch file to identify potential duplicates. My first step was to create a file listing all files in the folder(s) being examined. For that, I ran this command at a Windows 7 command prompt, from the top-level folder I wanted to examine:

DIR /s /a-d /b > FileList.txt
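For readers without a Windows command prompt, roughly the same listing can be produced in Python. This is a sketch under assumptions: the function names are hypothetical, and the ordering of the output will not necessarily match what DIR produces.

```python
import os


def list_files(root):
    """Return the full paths of all files (not folders) under root, sorted."""
    paths = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            paths.append(os.path.abspath(os.path.join(folder, name)))
    return sorted(paths)


def write_file_list(root, out_path="FileList.txt"):
    """Write one full path per line to out_path and return the number of files."""
    paths = list_files(root)  # collected before out_path is created or rewritten
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)
```

Like the /s and /b switches, this walks subfolders and writes bare full paths; like /a-d, it leaves folders themselves out of the list.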

I opened the resulting FileList.txt in Notepad and pasted its contents into an Excel spreadsheet. (I could also have used OpenOffice or LibreOffice Calc.) Using Excel functions (e.g., FIND, LEN, MID, and a REVERSE macro), I was able to create batch files that would help me rename some sets of files and search for possible matches. One Excel formula that I found useful for this purpose went like this:

="echo ------ Looking for "&CHAR(34)&"*"&C11&"*.*"&CHAR(34)&" >> FoundFiles.txt"&" & dir /s /a-d /b "&CHAR(34)&"*"&C11&"*.*"&CHAR(34)&" >> FoundFiles.txt"

This formula produced a batch command pertaining to one file among the many that I thought might be duplicates. Briefly, within that formula, CHAR(34) produced a double quotation mark ("), and cell C11 contained the name (or part of the name) of the potentially duplicate file. With “01--Intro” in cell C11, the formula produced this batch command:

echo ------ Looking for "*01--Intro*.*" >> FoundFiles.txt & dir /s /a-d /b "*01--Intro*.*" >> FoundFiles.txt
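The same command string can be built outside Excel. The following Python sketch mirrors the formula; the function name and the out_file parameter are hypothetical. It does no escaping, so a name containing characters that are special in batch files (for example %, which must be doubled inside a batch file) would need extra handling.

```python
def search_command(name_fragment, out_file="FoundFiles.txt"):
    """Build the echo-plus-dir batch command for one filename fragment."""
    pattern = f'"*{name_fragment}*.*"'
    return (
        f"echo ------ Looking for {pattern} >> {out_file}"
        f" & dir /s /a-d /b {pattern} >> {out_file}"
    )


print(search_command("01--Intro"))
# prints: echo ------ Looking for "*01--Intro*.*" >> FoundFiles.txt & dir /s /a-d /b "*01--Intro*.*" >> FoundFiles.txt
```

Writing one such line per suspected duplicate into a .bat file gives the same result as copying the formula down a column in Excel.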

The batch command above, run on the Windows 7 command line, would search for files whose names contained “01--Intro”, starting from the folder where I ran it and continuing through all of its subfolders. The command would append, to FoundFiles.txt, a line stating what I was looking for, followed by a list of the full paths of the files whose names contained that text string. (The “------” part of the “Looking for” line was there just to make it easier to see where one list of output files ended and a new search began.) In Excel, the output looked like this:


(The Index column was an addition to preserve the original order, in case I sorted the worksheet by other columns.)

In that example, the search found files beginning with Balderrama and also with Anthony Balderrama. I would then have to do a manual comparison to see whether the “Balderrama” file that had triggered this search was one of the other two, or something else entirely. Depending on my patience and the time available for manual comparison, I might add columns at the right to count the matches for each search, so that I could exclude searches that turned up more than one, or five, or a hundred potential matches. For example, a line that said it was looking for “*the*.*” might be best approached, not by looking at all potential filename matches, but rather by opening the “the” file and seeing what it should have been named.
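The counting step described above can also be sketched in Python instead of spreadsheet columns. The sketch assumes FoundFiles.txt has the layout produced by the batch commands in this post: a “------ Looking for” line followed by zero or more paths. The function names, the max_matches threshold, and the sample paths are hypothetical.

```python
def matches_per_search(lines, marker="------ Looking for "):
    """Group each result line under the 'Looking for' header that precedes it."""
    results = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(marker):
            # Drop the marker and the surrounding double quotation marks.
            current = line[len(marker):].strip().strip('"')
            results.setdefault(current, [])
        elif current is not None:
            results[current].append(line)
    return results


def manageable(results, max_matches=5):
    """Keep only searches with at least one and at most max_matches hits."""
    return {p: hits for p, hits in results.items() if 1 <= len(hits) <= max_matches}


# Made-up sample in the FoundFiles.txt layout.
sample = [
    '------ Looking for "*Balderrama*.*" ',
    r"D:\Workspace\Balderrama.pdf",
    r"D:\Workspace\Anthony Balderrama.pdf",
    '------ Looking for "*the*.*" ',
] + [rf"D:\Workspace\the file {n}.doc" for n in range(100)]

counts = {p: len(hits) for p, hits in matches_per_search(sample).items()}
print(counts)  # prints: {'*Balderrama*.*': 2, '*the*.*': 100}
```

With the default threshold, manageable would keep the Balderrama search and drop the “*the*.*” search, which is the kind of case better handled by opening the file and renaming it.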
