Finding the Union and Intersection of Multiple Sets of Files

I had several sets of files. I wanted to find their union and their intersection. The Math Lab defined the union of two sets as “the set of elements which are in either set,” and the intersection of two sets as “the set of elements which are in both sets.” So if set 1 consisted of A and B, and set 2 consisted of B and C, then the union of sets 1 and 2 would be A, B, and C (typically written without duplication — that is, not A, B, B, C), and the intersection of sets 1 and 2 would be B. In the case of filesets, the union would be the complete set of all files existing in at least one fileset, after eliminating duplicates, while the intersection would be the list of files that existed in all filesets.

Among other things, this post explores tools for calculating hash values for files. A later post provides an updated and more thorough look at some such tools.

Union of Multiple Filesets

To find the union of multiple filesets, I could use a duplicate file detection program. A search led to various lists of recommended duplicate detectors. I had been using DoubleKiller Pro for some years, and was familiar with its options, so that’s what I used in this case. I believed but wasn’t certain that the options described here were present in the free version.

If my duplication detection program found file X in fileset 1 and also in fileset 2, I could use that program to delete the copy of X that existed in fileset 2. For some purposes, I might have to remember that I had thus distorted fileset 2: it might now contain fewer files than it previously contained, but that would not necessarily mean that it was originally inferior to fileset 1. As with file operations generally, it would be a good idea to make backups at the beginning and also intermittently along the way, so as to be able to check back and see how things used to be, prior to one’s intervention.

Where the files had the same contents but not the same names, it would be advisable to examine the list of duplicates before deleting anything. As detailed below, one set could have names that did not accurately summarize file contents, or that would provide little information, while the duplicates in another set could have more accurate or informative names or extensions, or could be arranged in folders that the user would be unwilling or unable to reconstruct afterwards. Some of the techniques discussed below could be useful in deciding which duplicates to keep and which to delete, so as to preserve a union of files that had informative names without duplicates.

Intersection of Multiple Filesets

It seemed easier to obtain the union than the intersection of multiple filesets. It appeared that neither dupeGuru nor DoubleKiller had any built-in intersection capability, nor had I noticed it elsewhere. My searches didn’t lead to an easy technique. Evidently I would have to come up with my own way of doing that. The method I used was to produce a list of files, and then use a spreadsheet to examine the contents of that list. This section is mostly focused on the technicalities of the spreadsheet I used.

Producing the list was the easy part. There were at least two ways to do so. One way, detailed in a prior post, was to run a Windows DIR command, and capture its output. That method would be most useful where duplicate files were known to have the same filenames. Another way was to run a DoubleKiller comparison, and export its results. DoubleKiller had the advantage of being able to conduct a CRC32 and/or byte-for-byte comparison among files, so as to identify exact duplicates among those whose names were not identical, and to disregard files whose names and reported filesizes were identical but whose contents were not. If I wanted to achieve something like that on the command line — that is, if I wanted to devise a command that would produce a list of files with something like the CRC value for each file — my command would apparently have to use a tool capable of calculating those values and including them in the file list. For instance, it appeared that NirSoft’s HashMyFiles (below) might have that capability.
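
As an aside on the DIR route: a command along these lines (the path and output filename here are just illustrative) would capture a bare, recursive file listing that Excel could then compare across filesets:

dir /a-d /s /b "D:\Fileset1" > fileset1.txt

Here, /a-d excludes folder entries, /s recurses through subfolders, and /b produces bare full pathnames, one per line.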

I used the DoubleKiller approach. Specifically, as I learned the hard way, I would want to run a DoubleKiller search that would identify exact duplicates among my filesets, not only byte-for-byte, but also using their CRC32 values. Obtaining both would ensure that I was seeing exact duplicates and would also produce CRC values in the output, helping (as we are about to see) to establish which files were duplicates of which others.

Then I proceeded to examine the resulting list of files. Specifically, when the DoubleKiller search finished, I right-clicked on one of the listed files and chose Export all entries. That produced a text file listing the search results. I opened that text file in Excel 2016. The Text Import Wizard came up automatically. I indicated that the file was Delimited with tabs. Although the resulting spreadsheet was probably already sorted with duplicates next to each other, I sorted it on the CRC values, just to be sure. I added a column to count the number of duplicates in each set: basically, if the CRC from the previous row is the same as the CRC from this row, add 1 to the number immediately above; if not, enter 1 (i.e., restart the count). I used Quick Filter to verify that 3 was the maximum number of members in any set. I added another column to detect the sets with 3 members: if the value to the left is 3, put an X; if the value to the left and one row below is a 1, put a 1, otherwise copy the value of the cell below. I fixed those values so they wouldn’t change (i.e., select the column > Alt-E C > Alt-E S V > Enter > Enter). I sorted by that column and deleted all rows lacking an X. That gave me 1926 duplicates (presumably, 1926 / 3 = 642 sets). I could then use similar procedures to identify, say, the first member of each set of duplicates, and mark the second and third members for deletion, giving me a list of (in this case, 642) unique files. Each of these unique files was duplicated (though perhaps not with the same name, folder, or date) in each of my source filesets.
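As a rough sketch of those two helper columns (assuming a header row in row 1, with the CRC values in column D, the count in column E, and the set marker in column F — the actual column letters would depend on the DoubleKiller export), the formulas, copied down the length of the list, would look something like this:

=IF(D2=D1, E1+1, 1)

=IF(E2=3, "X", IF(E3=1, 1, F3))

The first restarts the count at 1 whenever the CRC changes; the second propagates an X (or a 1) upward through each set of duplicates, so that every member of a three-member set ends up marked. The last row of the list may need a manual check, since there is no row below it to copy from.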

The remaining step would be to produce a set of commands that would extract these 642 unique files from the source filesets. But before doing that, I wanted to make one modification to the procedure just described. I had noticed that, within some sets of duplicates, some files had better names than others. For example, in one set of three identical files (i.e., one from each of my source datasets), I saw that two of the files were named FLARE4.BMP, while the third was named f1578175.bmp. It was possible that this .bmp image file really did display a flare, in which case the FLARE4.BMP name would be informative. But there was no obvious meaning to the f1578175.bmp name. Indeed, I was pursuing this topic as part of a larger project of recovering files from a trashed hard drive, and along the way I had noticed that the PhotoRec file recovery tool had recovered many files with meaningless names, beginning with f, that resembled this f1578175.bmp filename. So, if possible, I would prefer that my resulting list of files — the list representing the intersection of these several filesets — would feature informative names, like FLARE4.BMP, rather than generic names like f1578175.bmp.

To identify the best filenames in each set, I started by sorting the intersection list (of 1926 files, in this case) and looking for big groups of undesirable (e.g., uninformative) filenames, like f1578175.bmp or LostFile_dll_1383439.dll (among many LostFile entries created by the Lazesoft file recovery program). I created an Inferior column and put an X next to each of those. I added a column identifying each file’s extension (i.e., =RIGHT([previous column], 4)), sorted on that column, and added another Inferior column, with an X for dubious extensions. In this particular project, I had noticed that .rbf was one such extension, accompanying many of the files retrieved by some file recovery tools. It was also possible to mark files for deletion according to whether they were found in undesirable folders. For example, a file called mscorier.dll was found in a subfolder named Unknown folder\1031, and also in one named Windows\system32. Guess which I found more informative. There could be other reasons to prefer one copy of a file over another.
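A marking formula along these lines could flag many of the undesirable entries automatically. This is just a sketch, assuming the filename sat in column A and the four-character extension column was H; the patterns shown are the two mentioned above:

=IF(OR(ISNUMBER(SEARCH("LostFile_", A2)), H2=".rbf"), "X", "")

SEARCH is case-insensitive, so it would catch LostFile_ entries regardless of capitalization; other patterns (such as the PhotoRec-style f-plus-digits names) would still call for a visual check after sorting.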

When I had finished marking some relatively undesirable files, I re-sorted by CRC and added a column to mark, with an X, those instances where the present row and the two preceding rows all had the same CRC and also all had an X in the Inferior Name column, to make sure I wasn’t about to wipe out an entire set of duplicates. I ran a Quick Filter and saw that, indeed, I did have some entries in that column, indicating that all of the recovered duplicates with that CRC had inferior names. I unmarked one file in each such set, so that one duplicate would survive. Then I deleted, from the Excel list, all the nonessential duplicates with inferior names.
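As a sketch of that safety check (again assuming the CRC in column D and the Inferior marks in, say, column I, with the formula starting in the third data row so that two rows exist above it):

=IF(AND(D4=D3, D3=D2, I2="X", I3="X", I4="X"), "X", "")

Filtering on that column then showed which sets of three were entirely marked as inferior and therefore needed one member rescued.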

That left me with a list of preferred duplicates. For those, I could use the spreadsheet to construct Windows commands that would copy or move the designated files to some other location. As detailed in another post, the Excel formula would look something like this:

="xcopy /s /h /i /-y "&CHAR(34)&E2&A2&CHAR(34)&" "&CHAR(34)&"D"&MID(E2,2,LEN(E2))&CHAR(34)&" > errors.txt"

In other words, for each desired file, the spreadsheet would construct a command that would say, Copy the file (whose full pathname consisted of the path in column E plus the filename in column A) to a folder with the same name on drive T. Presumably drive T would not already have such a folder: the command would create it. The user could specify a subfolder after T if desired. The command could also be altered to steer all files into a single folder, though that might not work too well unless the user first sorted and checked the spreadsheet to verify that there were not multiple files with the same name. The purpose of CHAR(34) was to insert a quotation mark, to surround file or path names that might contain spaces. The actual quotation marks shown in the formula (i.e., not CHAR(34)) denoted text portions of this command. The 2>> at the end appended any error messages to errors.txt, which could be checked afterwards to make sure there were no problems.
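For a hypothetical list entry (the paths here are made up), the formula would therefore generate a command like this:

xcopy /s /h /i /-y "C:\Recovered\Set1\FLARE4.BMP" "T:\Recovered\Set1\" 2>> errors.txt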

A formula like that, copied onto each row listing a desired file, would produce a list of commands that could be run from a batch file. Each such command would instruct Windows to copy that particular file. To create a batch file to run those commands, I would simply have to copy the list of commands (as displayed in Excel) to a Notepad file, save it with a .bat extension (e.g., copylist.bat), and run it by selecting and hitting Enter (or by double-clicking) in Windows File Manager. After running that batch file, the target folder(s) should contain copies of the source files, constituting the intersection of the source filesets.

Finding a Batch Hashing Solution

As indicated earlier, it seemed that it might be helpful to use a third-party tool like NirSoft’s HashMyFiles to produce a directory listing that included the hash value for each file. With such a listing, I could use Excel to identify and examine features of duplicative files, potentially informing a decision on which files to save, move, or delete. At a later point, I returned to this post and added this section, with information on my use of such tools.

I had expected to run a command-line version of HashMyFiles, but for some reason I wasn’t finding one. So I started by downloading and running the 64-bit GUI version. I clicked on its folder icon and navigated to the folder containing the filesets I wanted to compare. They were large, and it looked like it was going to take a long time to process these files. I let it run, just to see what I’d get, but finally aborted after 36 hours. It seemed that HashMyFiles was calculating each kind of checksum it could, whereas I might need only one. The available options (in HashMyFiles > Options > Hash Types) were MD5, SHA1, CRC32, SHA-256, SHA-384, and SHA-512.

A Stack Overflow discussion offered a number of perspectives on those options. There appeared to be a difference between the demands of ordinary file verification and those of security. The latter, taking greater pains to verify that a file had not been deliberately altered, would apparently call for at least SHA-256 (e.g., How-To Geek, 2018). But for purposes of simple file verification, CRC32 was common and apparently adequate. So I chose CRC32: in HashMyFiles, I went to View > Choose Columns and chose only these columns: Filename, CRC32, Full Path, Modified Time, File Size, Identical, and Extension. (The Modified Time column seemed to match the date and time shown in Windows File Manager.) I would soon learn that HashMyFiles did not remember these settings; I had to reset them each time I used the program. Then I re-ran the HashMyFiles calculation. Unfortunately, it seemed that HashMyFiles was still determined to calculate all possible values, even if it wasn’t displaying them. After another 12 hours, I canceled it again, and looked elsewhere.

A brief check suggested that the HashMyFiles values were probably accurate. Specifically, I ran HashMyFiles on a set of 79,113 files, and accepted its option of .csv output. That gave me a nice-looking spreadsheet showing the columns I had selected, including CRC32 values in hexadecimal. For some reason, a few files were left out: the spreadsheet listed only 79,102 files, and provided CRC32 values for 79,101. Using Excel’s VLOOKUP, I compared CRCs for files (sorted by full pathname) against the output of a tool called ExactFile, whose webpage said it was still in beta. ExactFile calculated CRC32 values very quickly — within a minute or two — but produced a report (in SFV format) listing only 54,611 files in that set, even though I did specify “Include subfolders” in its Create Digest tab. Of those, almost all (54,577) appeared in the HashMyFiles list, and most (52,943) had the same CRC32 values. The reasons for the discrepancies were not immediately obvious.
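The comparison itself was just a lookup. As a sketch (the sheet name and column layout here are assumptions): with the ExactFile pathnames and CRCs in columns A and B of a sheet named ExactFile, and the HashMyFiles full pathnames in column C of the other sheet, a formula like the following would pull in ExactFile’s CRC for each file, returning #N/A wherever ExactFile had no entry for that path:

=VLOOKUP(C2, ExactFile!A:B, 2, FALSE)

Comparing that result against the HashMyFiles CRC column then showed which files matched, which differed, and which were missing from one list or the other.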

Raymond (2017) listed other hashing tools that I could try. At the top of his list: the portable IgorWare Hasher (4.9 stars from 15 raters on Softpedia). I tried version 1.7.1 (x64). Unfortunately, it did not seem able to process more than one file at a time. That was also the case with Hash Generator 7.0 (3.3 stars from 21 raters on Softpedia; no explanatory comments from users) and, apparently, with the other tools on Raymond’s list. Gizmo (2017) recommended Hashing (4.5 stars from 10 raters on Softpedia), but screenshots suggested that it was going to output its results across multiple lines, resulting in considerable hassle to move them all to a single line (below). It also did not offer a CRC32 option.

I tried running Checksum Calculator (4.0 stars from only 3 raters on Softpedia) on a fileset that I called WINXPFILES. It began loading files as soon as I selected the folder, at a rate of, I’d guess, maybe 50 per second. But by the time a few hours had passed, it had slowed down to more like five per second. Then three. The speed didn’t seem to be related to file size: it was constant and equally slow for really small files. It seemed that the large number of files might be clogging the calculator’s memory. It took maybe 3.5 hours to load 79,113 files (i.e., ~22K files/hour) totaling 20.1GB (i.e., roughly 5.7GB/hour, averaging ~250KB/file). Let us be clear: this was not the CRC32 calculation. This was just the file loading process. When Checksum Calculator was ready, its buttons stopped being greyed out, and I was able to select CRC32 and click Calculate. I was surprised to see that it calculated CRC32 values for all those files in about 15 minutes. I clicked Save Results. It saved a comma-delimited text file that I could open in Notepad or Excel, containing each file’s path and name and the calculated value. Unfortunately, the supposed CRC32 values did not look like CRC32 values: they were all numerals, with no letters, and they were nine to ten digits long. Later, I realized that perhaps this was what HashMyFiles would have produced if I had chosen its Options > CRC32 Display Mode > Decimal (rather than Hexadecimal) option.
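If those all-numeral values really were CRC32 checksums in decimal form, Excel could convert them to the usual eight-digit hexadecimal form for comparison against other tools’ output. A sketch, assuming the decimal value sat in column B:

=DEC2HEX(B2, 8)

DEC2HEX pads the result with leading zeros to eight hex digits; CRC32 values (at most 4,294,967,295 in decimal) fall comfortably within the function’s range.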

I had another, much larger fileset that I also wanted to get CRC32 values for. I wondered if I would get faster and better results by using a command that would write its CRC32 calculations directly to a file. For that, a SuperUser discussion suggested using the certutil command:

certutil -hashfile "D:\Path Name\File to Hash.ext" MD5 >> output.txt

which offered options such as MD5 and SHA256 but not CRC32. I would have to use something like my Excel approach (above) to generate a specific command for each file that I wanted to hash, and would then have to include all those commands in a batch file. (I would shortly be getting into the details of that.) The situation was similar for a PowerShell command, which, again, could not calculate CRC32 values:

Get-FileHash -Path "D:\Path Name\File to Hash.ext" -Algorithm MD5 >> output.txt

The output.txt files, in those examples, would contain unwanted additional information that could be filtered, at least to some extent, by a more refined command or by opening output.txt in Excel and sorting to identify and remove those lines. That appeared to be less of an issue in Ashley’s CRC32.exe utility, which I hesitated to use merely because it was relatively untested: it looked like it would produce the desired CRC32 output with minimal extraneous information. When I did test it (below), it didn’t work.
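In hindsight, the PowerShell route could also avoid generating a command for every file: Get-FileHash can take piped file objects from Get-ChildItem, so a single pipeline could cover a whole directory tree. A sketch, assuming MD5 was acceptable and an Excel-friendly CSV was the desired output (the paths are hypothetical):

Get-ChildItem -Path "D:\Top Folder" -Recurse -File | Get-FileHash -Algorithm MD5 | Export-Csv -Path "D:\hashes.csv" -NoTypeInformation

The resulting CSV would have Algorithm, Hash, and Path columns, ready for sorting or lookups in Excel.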

I would be able to avoid the need to generate file-by-file commands by using the generally reliable SysInternals — specifically, Russinovich’s Sigcheck, from which I got nice output for all files in a directory (including its subdirectories) in an Excel-friendly .csv file using this:

sigcheck64 -ct -h -s -w output.csv "D:\[top directory name]"

but it was slow, because it apparently could not be persuaded not to calculate MD5, SHA1, SHA256, PESHA1, and PESHA256 values for each file — and it did not offer a CRC32 option. I put in a feature request on that. It also gave no progress indication, so the user would have to remember to leave its command window alone for hours, if not days, when it appeared nothing was happening.

There were many other checksum utilities, but few seemed to have the kind of multi-file capability I was seeking. So far, for purposes of rapidly calculating checksums on tens or hundreds of thousands of files, it seemed my best solution would be to give up on CRC32 and go with MD5 instead. In that case, I could start with Hashing, if I really needed a GUI tool, but it seemed that the output would be easier to work with if I chose the certutil approach instead.

The specific certutil procedure was as follows. In the top level of another directory tree containing the files I wanted checksums for, I ran dir /a-d /s /b > dirlist.txt. In this case, the dirlist.txt file contained information on more than a million files, so trying to open it in Excel produced a “File not loaded completely” error. There was a way to use Excel with millions of data records, but I was not confident that would work for my purposes. I chose instead to split dirlist.txt in two. Among tools for this purpose discussed at SuperUser, I chose GSplit (4.0 stars from 17 raters at Softpedia). In GSplit, I specified dirlist.txt as the file to split, and then went into left sidebar > Pieces > Type and Size > Blocked Pieces > Blocked Piece Properties > drop-down arrow > I want to split after the nth occurrence of a specified pattern > Split after the occurrence number > 1048570 (i.e., just short of the 1,048,576 rows Excel could handle) > of this pattern: 0x0D0x0A (i.e., the default, representing a line break in Notepad, so that I would not be splitting the file in the middle of a line) > Split button (at top). That gave me disk1.EXE, disk1.gsd, and disk2.gsd. I renamed the last two to be Dirlist1.txt and Dirlist2.txt. (I had to delete one or two junk lines added by GSplit at the end of each of those two files.) I opened those two files in Excel, one at a time. Doing so opened the Text Import Wizard, to which I responded by selecting Delimited > Next > verify that Tab is selected > Finish. There were no tabs in dirlist.txt, so this put the full path information for each file in a single cell in column A. In column B, I used the same Excel technique as above to generate the certutil command for each file, like so:

="certutil -hashfile "&CHAR(34)&A1&CHAR(34)&" MD5 | find /i /v "&CHAR(34)&"certutil"&CHAR(34)&" >> output.txt"

The new material at the end of that command, involving “find” and another reference to “certutil,” just said that I wanted to exclude, from the output, the line in which certutil would close each MD5 calculation by saying, “CertUtil: -hashfile command completed successfully.” I copied that formula down to each row in the Excel versions of Dirlist1.txt and Dirlist2.txt. Each occurrence of that formula produced a command that could run in a batch file. I copied the full contents of column B from each of those two spreadsheets into a single file (using Notepad, though some other plain text editor would be necessary for files above a certain size) called MD5Calc.bat. I saved that file. I deleted the output.txt file produced by my testing, so that the output.txt file produced by my commands wouldn’t contain that scrap information. In Windows File Manager, I right-clicked on MD5Calc.bat > Run as administrator (or I could have run it in an elevated command window). That seemed to run at a consistent pace that I estimated at ~500 files per minute. It ran for 30 hours, and I could see what it was doing the whole time, as it processed one file after another. Unfortunately, that visibility did not help me detect that, somehow, either I or one of the processes just described had gotten the commands wrong, so that my list of files being processed was out of order. After 30 hours of doing it wrong, however, I did begin to catch on. I was not confident that the result was going to be correct, so I canceled it.

Rather than restart what would evidently be a 60-hour process, I decided to try Ashley’s CRC32.exe utility. As I explored that approach, however, I encountered some complications. Among other things, the CRC32.exe utility would put its output on two lines. This was not a problem with the smaller file set; but with the larger set I was trying to resolve, this would double its already very long file listing (i.e., 1.5 million files) — which, as demonstrated above, would make the Excel approach even more unwieldy. Seeking a solution on that front, I posted a question on SuperUser.

Pending a good answer to that question, I tried another search for CRC32 tools. This search seemed more successful. Several additional tools came to light, each seemingly capable of batch CRC32 processing. I started with Fsum Frontend (4.5 stars from 2 raters on Softpedia) — which, as its name suggests, was a front end or GUI for Fsum. I downloaded and ran its portable version (4.0 stars from 3 raters on Softpedia). Its ReadMe file said it contained code from Jacksum, another possibility that had emerged in my search, as well as from BouncyCastle and FlexiProvider. To use it, I went to its Tools > Generate checksum file > navigate to and select the folder containing the files to be checked > check the Recurse Subdirectories box > Format = SFV 512 > Mode: 1 file in any place > Save As > navigate to the desired output folder and enter output.txt. To run it, I clicked the icon next to the Mode box. That seemed to run for a few seconds, but then it produced an error, naming one of the files and saying, “The system cannot find the path specified.” That seemed to kill the process: clicking OK left me back where I started. There did not seem to be an option to allow it to skip that problematic file. I wasn’t about to go digging manually through tens of thousands of files to fix any that this program might not like. It seemed wiser to try another tool.

The context menu (i.e., right-click) option installed in Windows File Manager by HashCheck Shell Extension (HSE) (4.4 stars from 46 raters on Softpedia) was simply to “Create Checksum File.” When I went ahead with that option, it gave me an opportunity to choose a name and location for the checksum file, as well as a choice of filetype. I was interested in the CRC-32 option, which would create an .sfv file; this option would be much faster, and would append a much shorter checksum to each filename, than some of the other options. Then it said, “Creating [the specified filename].” It worked through the set of 79,113 files in about four minutes. As above, I imported that .sfv file into Excel and compared its results against those produced by HashMyFiles. With a few dozen exceptions, due primarily to an apparent inability of Excel’s lookup functions to handle a tilde (~) in filenames and secondarily to a handful of files reported as UNREADABLE, HSE produced the same file listing and CRC32 values as HashMyFiles. It appeared that HSE could do the job.
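
The tilde problem, at least, had a standard workaround: in Excel’s lookup functions, a tilde acts as an escape character for wildcards, so a literal tilde in a filename has to be doubled before it will match. Wrapping the lookup value in SUBSTITUTE handles that automatically. As a sketch, assuming the same sort of lookup arrangement as before, with the filename being matched in cell A2 and the HashMyFiles pathnames and CRCs in columns A and B of a sheet named HMF:

=VLOOKUP(SUBSTITUTE(A2, "~", "~~"), HMF!A:B, 2, FALSE)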