
Excel: Calculating a Spreadsheet That’s Too Large to Calculate

As described in another post, I had created a large spreadsheet (formally speaking, an Excel workbook) to study the performance of various musical artists on the Billboard Hot 100. The file size was ~800MB. It had about 3,300 columns – one for each (in most cases weekly) Hot 100 report between 1958 and 2021 – and about 30,000 rows (one for each artist-song pairing). That gave it almost 100 million cells in the main tab, each containing a lookup (SUMIFS) formula looking at the data in a reference tab. This size made the spreadsheet very slow to calculate. This post describes how I worked around that.
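
To give a rough idea of the structure, each cell in the main tab contained a lookup along these lines. This is only a sketch with hypothetical sheet and column names, not the exact formula from my workbook:

=SUMIFS(Hot100Data!$D:$D, Hot100Data!$A:$A, C$1, Hot100Data!$B:$B, $A2, Hot100Data!$C:$C, $B2)

Here the hypothetical Hot100Data reference tab would hold the chart date in column A, the artist in column B, the song title in column C, and the chart position in column D, while the main tab would have the chart dates in row 1 and the artist and song in columns A and B of each row. Multiply a lookup like that by nearly 100 million cells, and the calculation burden becomes clear.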

As I had recently learned, when you ask for help with something like this in an Excel forum, the first thing people will tell you is to optimize the spreadsheet. There were many sources of advice on that. As far as I could tell, I had already optimized the spreadsheet as much as my objectives allowed, though there may have been additional optimizations that I was not aware of.

But for purposes of this post, that is immaterial. The question here was not how to make this spreadsheet more efficient; it was how to help the computer complete certain desired steps in a very large spreadsheet. I wanted to know whether there were techniques that would help the hardware handle the various steps in my computing process.

The Problem

For this purpose, I had two machines to work with. Both were running Excel 2016 on Windows 10. One was a desktop machine with an Intel Core i7-4790 CPU and 32GB DDR3 RAM. The other was an Acer Aspire laptop with an Intel Core i5-7200U and 20GB DDR3 RAM. According to PassMark, the desktop machine’s Core i7-4790 (CPU Mark score: 7228) was about 2.1 times faster than the Acer’s Core i5-7200U (CPU Mark score: 3397).

For reference, at this writing, the Core i7-4790 was only about one-tenth as fast as the fastest sub-$1000 CPU listed on PassMark’s high-end CPU chart, namely, the AMD Ryzen 9 7950X ($699, CPU Mark score 64727). With the right hardware, the problems in this particular spreadsheet could be significantly ameliorated. Of course, one could always find a way to build a bigger or more complicated spreadsheet, to such an extent that even a mighty CPU could be humbled – in which case one might still need to deal with issues discussed here.

On my machines, CPU speed was an issue at certain points in the process. I found it convenient to use Moo0 System Monitor to keep me informed of system bottlenecks, of CPU temperature, and of the load on CPU, RAM, and SSD. The desktop machine seemed to become sluggish at just the time when Moo0 reported 100% CPU usage. To mitigate that, I used Task Manager (taskmgr.exe in Windows) > default view: Processes tab > right-click on Excel > Go to details (i.e., Task Manager Details tab) > right-click on Excel > Set affinity > uncheck one “processor” (i.e., thread). On my quad-core i7-4790 CPU (i.e., with eight threads), Moo0 indicated that that step reduced CPU load to ~89%. On the dual-core, four-thread laptop, that step reduced CPU load to ~74%.

Unchecking one processor in Task Manager would thus slow processing. The tradeoff was that this step would make the machine more responsive for other purposes. I hoped it would also eliminate the freezeups that I experienced on the desktop machine, where I was performing multiple tasks simultaneously, mostly taxing RAM but not CPU. I didn’t multitask on the laptop during this process, didn’t experience any freezeups there, and also didn’t notice that unchecking a processor in Task Manager reduced the speed at which the laptop’s internal fan was running – so I set the laptop to use all processors.

The tasks that I needed to perform on this monster spreadsheet were as follows:

  • Run a calculation once. I had the spreadsheet set to manual calculation (Excel > menu > File > Options > Formulas > Calculation options > Workbook Calculation > Manual > uncheck the option to Recalculate workbook before saving). I had created the spreadsheet by copying a formula from one cell to many others, but I had not yet allowed the spreadsheet to begin the very long process of calculating the values of those formulas in those many cells.
  • Save the spreadsheet after calculating as version A. Then save again (F12) as version B.
  • In version B, freeze values (i.e., convert formulas to fixed values). Then I could switch back to Automatic calculation: with no formulas left to calculate, that setting would cost nothing, and it would prevent the mistake of assuming that values displayed under Manual calculation were current when they might not yet have been recalculated. Autocalc would also retry automatically if a previous calculation attempt had been interrupted.
  • Delete tabs (i.e., worksheets within this spreadsheet) whose values were used to produce the values calculated in the main tab. Now that I had frozen the final values in the main tab, I didn’t need those subsidiary worksheets anymore.

The reason why I mention the puny laptop is that the desktop machine was not turning out to be an ideal partner for this project. For one thing, experience taught that Microsoft would be perfectly willing to induce system instability in order to compel users to reboot, so as to finish installing updates that Microsoft deemed essential – more accurately, updates that Microsoft made practically unavoidable. I was not sure whether that was the cause of the intermittent crashes or other instances of instability that plagued the desktop computer. Possibly in this case the instability was due to interactions among installed software and/or my own tweaks to the system.

Whatever the reason, I had to reboot this system periodically – perhaps every week or so, on average; more often when I stressed the system by demanding too much of it, as in the case of this large spreadsheet. After a few crashes, I realized that I wasn’t going to be able to just forge straight through this project on the desktop computer.

The desktop machine was also not ideal because, in some of the tasks just listed, Excel would abandon its calculations, taking me back to the starting point, whenever I turned to some other task, or when another spreadsheet did something, or basically whenever I breathed. Multitasking was essential: the desktop machine was my main tool for getting work done. Some of the steps just listed would require many hours, if not days, even on the desktop machine. I would not be able to devote that machine solely to this project for any such period of time.

Excel did not normally seem hostile to multitasking. I wasn’t sure why it was acting like that in this case. I posted a question on it. Unfortunately, Excel Forum lost the discussion – not that we were reaching any solution to the problem I had posed. It seemed that I would be most likely to finish this project if I attempted it on some computer other than the desktop. A virtual machine (VM) was a possibility. A VM would allow me to restrict the system resources being allocated to this task. Unfortunately, the VM would still be vulnerable to desktop system instability.

The whole spreadsheet was about 800MB. But evidently the formulas in that 800MB spreadsheet would operate most efficiently if they could fill 20GB of RAM, or more. Unfortunately, Excel and my Internet browsers did not seem to coexist peacefully. I had many tabs open in my browsers. They needed RAM. I did not sort out the exact nature of the problem, but the nature and frequency of crashes appeared to depend on whether I started Excel before starting the browsers or opening new browser tabs. At any rate, I was not able to work smoothly with that 800MB spreadsheet while doing my other work on the desktop.

Thus, I turned to the laptop. It had less RAM than the desktop, but apparently that wasn’t the problem. It seemed that Excel just needed to start by filling all available RAM. Once it finished doing that, it would be able to settle down and get to work – as long as I was not running browsers or anything else that might infringe upon Excel’s jealous seizure of resources. Excel might actually get to that point more quickly if there was less RAM to fill, though presumably the extra RAM would enable Excel to finish its calculations faster.

I was pleased and surprised to see that the laptop was able to open this monster spreadsheet. At that point, I had not yet switched the spreadsheet to manual calculation, so Excel immediately launched into an attempt to update all of those 100 million cells. I let it have its head and run for 24 hours, by which point it was 6% finished. So it appeared that Excel would need about 16 days to complete one calculation step on the laptop.

There was the option to let it go ahead. The problem there was that Acer designed its laptops so that the battery would eventually die if you left the laptop plugged in constantly. The battery on this Acer, which I usually left plugged in, had died long ago. I did have it plugged into an uninterruptible power supply (UPS), which it shared with the desktop – which meant that the laptop would have only a few minutes of uptime, once the power went out. So if there was any appreciable power outage during those 16 days – and we had been known to have power interruptions from time to time – I would lose all of the unsaved calculations to that point. Since Excel was very touchy during some of its operations, I would also lose all unsaved work if I hit the wrong key at the wrong time. Excel also stopped calculating if I closed the lid on the laptop, even though I had adjusted the Windows power settings to prevent any actual system sleep, hibernation, or shutdown. There was also the question of whether it was good for the laptop to run full-bore for 16 days without a break.

In short, I didn’t want to gamble that the laptop would make it through 16 days’ worth of calculations without having to start over, or worse. It seemed I might have to complete this project on the laptop, and that doing so would require me to adjust the size of the project.

The Solution

After the abortive efforts described above, my first successful step was to break the spreadsheet into pieces. This particular spreadsheet was organized around dates, from 1958 to 2021, so I wound up breaking it into six pieces, one for each decade in that period (i.e., 1958-1969, 1970-79, 1980-89 …). This breakup took some time. I divided the big spreadsheet into two parts, each containing three decades, and then divided each of those two parts into three smaller parts. Each of these pieces was about 135MB and contained the raw data worksheet plus an average of about 16 million formula cells.

The actual procedure I used to break up those pieces was to make a copy of the original spreadsheet in Windows, which was faster than making the copy in Excel. Then I opened the copy, renamed it (as e.g., 1960s.xlsm), and deleted all columns that did not pertain to the specified decade. Each of these steps was simple for the user, but deleting columns took many hours of machine time. In hindsight, the fastest way through this step would probably have been to subdivide first into three two-decade chunks. That is, the 1960s spreadsheet would contain columns for both the 1960s and the 1970s; likewise, the 1980s spreadsheet would also include columns for the 1990s; and the 2000s spreadsheet would include columns for the 2010s. Then make copies of those three, rename them appropriately (e.g., the contents of 1970s.xlsm would initially be identical to the contents of 1960s.xlsm), and delete the extraneous columns from each (so that e.g., 1960s.xlsm would contain only columns for the 1960s).

If I had figured out those things at the start, I might have been able to do that subdividing on the desktop. What actually happened was that, once I had converted the one large spreadsheet into six smaller spreadsheets on the laptop, it occurred to me that the desktop might be able to work with those, without conflicting with other tasks underway. That seemed to be true. With a smaller chunk, Moo0 might still report that Excel was hogging the CPU (depending on the task), but now Excel seemed to be using only about half of the desktop system’s RAM. So I would guess that, depending on formula complexity, if you have 20+GB of RAM, you might be best advised to keep each of your chunks to somewhere around 15-25 million calculated cells.

The first step in each of these six manageable chunks was to hit F9 to run the calculation once, or (assuming no further calculations) just switch back to automatic calculation. This calculation could take a long time. The laptop seemed to process formula cells at a rate of roughly a half-million per hour. In this case, each of the chunks required ~36 hours for the calculation step. Consistent with its PassMark score, the desktop’s CPU processed formula cells about twice as fast – roughly one million per hour.

Once that calculation finished, I saved the spreadsheet as version A. So, for example, 1960s.xlsm now became 1960sA.xlsm. Then I saved it again as version B. The latter was where I would finally start cutting this spreadsheet down to its essential size.

The first size-reduction step was to freeze the spreadsheet’s formulas into fixed values. Now that I had calculated them, I wouldn’t need to calculate them again. Freezing meant replacing each formula with its resulting value: a cell that previously contained =3+2 would now simply contain 5, requiring no recalculation. In this case, that could make things a lot simpler. The performance of The Beatles’ “Penny Lane” on the Billboard Hot 100 for March 18, 1967 was a settled fact. Once Excel had looked up that ranking from the raw data tab, the calculation was finished: the main tab merely needed to contain the number 1, specifying that song’s Hot 100 rank on that date.

Freezing values was easy enough. My customary solution, almost second nature, was to select the range to be frozen and then hit Edit > Copy, Edit > Paste Special > Values. Since I still used Alt-key combinations from back in the day, for me it was Alt-E C, Alt-E S V > Enter.

Once the values were frozen, I no longer needed the raw data tab, so I could delete it. These two steps – freezing values and deleting the raw data tab – significantly reduced the file size and loading time of each of my six decadal chunks. For instance, 1960sA.xlsm was 136MB, whereas 1960sB.xlsm was only 42MB.

When I finished those steps with each of the six decadal spreadsheets, the final task was to merge them back into one larger spreadsheet, so that I could conduct further calculations using this established data. With the aid of Excel Forum, that larger spreadsheet became the foundation for the Billboard Hot 100 analysis in another post.

Finding the Union and Intersection of Multiple Sets of Files

I had several sets of files. I wanted to find their union and their intersection. The Math Lab defined the union of two sets as “the set of elements which are in either set,” and the intersection of two sets as “the set of elements which are in both sets.” So if set 1 consisted of A and B, and set 2 consisted of B and C, then the union of sets 1 and 2 would be A, B, and C (typically written without duplication — that is, not A, B, B, C), and the intersection of sets 1 and 2 would be B. In the case of filesets, the union would be the complete set of all files existing in at least one fileset, after eliminating duplicates, while the intersection would be the list of files that existed in all filesets.

Among other things, this post explores tools for calculating hash values for files. A later post provides an updated and more thorough look at some such tools.

Union of Multiple Filesets

To find the union of multiple filesets, I could use a duplicate file detection program. A search led to various lists of recommended duplicate detectors. I had been using DoubleKiller Pro for some years, and was familiar with its options, so that’s what I used in this case. I believed but wasn’t certain that the options described here were present in the free version.

If my duplicate detection program found file X in fileset 1 and also in fileset 2, I could use that program to delete the copy of X that existed in fileset 2. For some purposes, I might have to remember that I had thus distorted fileset 2: it might now contain fewer files than it previously contained, but that would not necessarily mean that it was originally inferior to fileset 1. As with file operations generally, it would be a good idea to make backups at the beginning and also intermittently along the way, so as to be able to check back and see how things used to be, prior to one’s intervention.

Where the files had the same contents but not the same names, it would be advisable to examine the list of duplicates before deleting anything. As detailed below, one set could have names that did not accurately summarize file contents, or that would provide little information, while the duplicates in another set could have more accurate or informative names or extensions, or could be arranged in folders that the user would be unwilling or unable to reconstruct afterwards. Some of the techniques discussed below could be useful in deciding which duplicates to keep and which to delete, so as to preserve a union of files that had informative names without duplicates.

Intersection of Multiple Filesets

It seemed easier to obtain the union than the intersection of multiple filesets. It appeared that neither dupeGuru nor DoubleKiller had any built-in intersection capability, nor had I noticed it elsewhere. My searches didn’t lead to an easy technique. Evidently I would have to come up with my own way of doing that. The method I used was to produce a list of files, and then use a spreadsheet to examine the contents of that list. This section is mostly focused on the technicalities of the spreadsheet I used.

Producing the list was the easy part. There were at least two ways to do so. One way, detailed in a prior post, was to run a Windows DIR command and capture its output. That method would be most useful where duplicate files were known to have the same filenames. Another way was to run a DoubleKiller comparison and export its results. DoubleKiller had the advantage of being able to conduct a CRC32 and/or byte-for-byte comparison among files, so as to identify exact duplicates among files whose names were not identical, and to disregard files whose names and reported filesizes were identical but whose contents were not. If I wanted to achieve something like that on the command line — that is, if I wanted to devise a command that would produce a list of files with something like the CRC value for each file — my command would apparently have to use a tool capable of calculating those values and including them in the file list. For instance, it appeared that NirSoft’s HashMyFiles (below) might have that capability.

I used the DoubleKiller approach. Specifically, as I learned the hard way, I would want to run a DoubleKiller search that identified exact duplicates among my filesets not only byte-for-byte but also by their CRC32 values. Obtaining both would ensure that I was seeing exact duplicates, and would also produce CRC values in the output, which (as we are about to see) helped in establishing which files were duplicates of which others.

Then I proceeded to examine the resulting list of files. Specifically, when the DoubleKiller search finished, I right-clicked on one of the listed files and chose Export all entries. That produced a text file listing the search results. I opened that text file in Excel 2016. The Text Import Wizard came up automatically. I indicated that the file was Delimited with tabs. Although the resulting spreadsheet was probably already sorted with duplicates next to each other, I sorted it on the CRC values, just to be sure. I added a column to count the number of duplicates in each set: basically, if the CRC from the previous row is the same as the CRC from this row, add 1 to the number immediately above; if not, enter 1 (i.e., restart the count). I used Quick Filter to verify that 3 was the maximum number of members in any set. I added another column to detect the sets with 3 members: if the value to the left is 3, put an X; if the value to the left and one row below is a 1, put a 1, otherwise copy the value of the cell below. I fixed those values so they wouldn’t change (i.e., select the column > Alt-E C > Alt-E S V > Enter > Enter). I sorted by that column and deleted all rows lacking an X. That gave me 1926 duplicates (presumably, 1926 / 3 = 642 sets). I could then use similar procedures to identify, say, the first member of each set of duplicates, and mark the second and third members for deletion, giving me a list of (in this case, 642) unique files. Each of these unique files was duplicated (though perhaps not with the same name, folder, or date) in each of my source filesets.
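
For anyone wanting to reproduce that counting logic, the formulas were along the following lines. The column letters here are hypothetical (assume the sorted CRC values are in column B, the running count in column C, and the set-detection flag in column D, with data starting in row 2); the actual layout of my spreadsheet may have differed:

C2: =IF(B2=B1, C1+1, 1)
D2: =IF(C2=3, "X", IF(C3=1, 1, D3))

The first formula restarts the count at 1 whenever a new CRC value begins and otherwise adds 1 to the count above. The second marks the last row of each three-member set with an X and lets the rows above it inherit that X by copying the value of the cell below, while rows belonging to smaller sets end up with a 1 instead.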

The remaining step would be to produce a set of commands that would extract these 642 unique files from the source filesets. But before doing that, I wanted to make one modification to the procedure just described. I had noticed that, within some sets of duplicates, some files had better names than others. For example, in one set of three identical files (i.e., one from each of my source datasets), I saw that two of the files were named FLARE4.BMP, while the third was named f1578175.bmp. It was possible that this .bmp image file really did display a flare, in which case the FLARE4.BMP name would be informative. But there was no obvious meaning to the f1578175.bmp name. Indeed, I was pursuing this topic as part of a larger project of recovering files from a trashed hard drive, and along the way I had noticed that the PhotoRec file recovery tool had recovered many files with meaningless names, beginning with f, that resembled this f1578175.bmp filename. So, if possible, I would prefer that my resulting list of files — the list representing the intersection of these several filesets — would feature informative names, like FLARE4.BMP, rather than generic names like f1578175.bmp.

To identify the best filenames in each set, I started by sorting the intersection list (of 1926 files, in this case) and looking for big groups of undesirable (e.g., uninformative) filenames, like f1578175.bmp or LostFile_dll_1383439.dll (among many LostFile entries created by the Lazesoft file recovery program). I created an Inferior column and put an X next to each of those. I added a column identifying each file’s extension (i.e., =RIGHT([previous column], 4)), sorted on that column, and added another Inferior column, with an X for dubious extensions. In this particular project, I had noticed that .rbf was one such extension, accompanying many of the files retrieved by some file recovery tools. It was also possible to mark files for deletion according to whether they were found in undesirable folders. For example, a file called mscorier.dll was found in a subfolder named Unknown folder\1031, and also in one named Windows\system32. Guess which I found more informative. There could be other reasons to prefer one copy of a file over another.
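
Marking those names by hand was workable after sorting, but the same sort of flag could also be generated with a formula. A rough sketch, assuming the filename is in column A and using hypothetical wildcard patterns matched to the PhotoRec- and Lazesoft-style names described above:

=IF(OR(COUNTIF(A2, "f???????.*") > 0, LEFT(A2, 9) = "LostFile_"), "X", "")

COUNTIF against a single cell with a wildcard pattern serves as a crude pattern test here; the patterns would need adjusting to whatever naming scheme a given recovery tool actually produced.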

When I had finished marking some relatively undesirable files, I resorted by CRC and added a column to mark, with an X, those instances where the present and the two preceding rows all had the same CRC and also all had an X in the Inferior Name column, to make sure I wasn’t about to wipe out an entire set of duplicates. I ran a Quick Filter and saw that, indeed, I did have some entries in that column, indicating that all of the recovered duplicates with that CRC had inferior names. I unmarked one file in each such set, so that one duplicate would survive. Then I deleted, from the Excel list, all the nonessential duplicates with inferior names.
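
A sketch of that safety-check formula, again with hypothetical column assignments (CRC in column B, the Inferior marks in column D, the check in column E), entered from the third data row downward:

E4: =IF(AND(B4=B3, B3=B2, D4="X", D3="X", D2="X"), "X", "")

Any X in that column meant that the current row and the two rows above it were duplicates of one another and all flagged as inferior, meaning that deleting everything marked inferior would have wiped out that set entirely.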

That left me with a list of preferred duplicates. For those, I could use the spreadsheet to construct Windows commands that would copy or move the designated files to some other location. As detailed in another post, the Excel formula would use something like this:

="xcopy /s /h /i /-y "&CHAR(34)&E2&A2&CHAR(34)&" "&CHAR(34)&"D"&MID(E2,2,LEN(E2))&CHAR(34)&" > errors.txt"

In other words, for each desired file, the spreadsheet would construct a command saying, in effect: copy the file (whose full pathname consisted of the path in column E plus the filename in column A) to a folder with the same path on drive D (the drive letter specified in the formula). Presumably drive D would not already have such a folder: the command would create it. The formula could be adjusted to add a subfolder after the drive letter if desired. The command could also be altered to steer all files into a single folder, though that might not work too well unless the user first sorted and checked the spreadsheet to verify that there were not multiple files with the same name. The purpose of CHAR(34) was to insert a quotation mark, to surround file or path names that might contain spaces. The actual quotation marks shown in the formula (i.e., not CHAR(34)) denoted text portions of this command. Errors.txt could be checked afterwards to make sure there were no problems.
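
As an illustration, with made-up paths rather than ones from my actual filesets: if E2 contained E:\Recovered\Pics\ and A2 contained FLARE4.BMP, the formula would generate

xcopy /s /h /i /-y "E:\Recovered\Pics\FLARE4.BMP" "D:\Recovered\Pics\" > errors.txt

One refinement worth considering: with a single >, each command overwrites errors.txt, so only the last command’s output survives; using >> instead would append, preserving the output of every command in the batch.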

A formula like that, copied onto each row listing a desired file, would produce a list of commands that could be run from a batch file. Each such command would instruct Windows to copy that particular file. To create a batch file to run those commands, I would simply have to copy the list of commands (as displayed in Excel) to a Notepad file, save it with a .bat extension (e.g., copylist.bat), and run it by selecting and hitting Enter (or by double-clicking) in Windows File Manager. After running that batch file, the target folder(s) should contain copies of the source files, constituting the intersection of the source filesets.

Finding a Batch Hashing Solution

As indicated earlier, it seemed that it might be helpful to use a third-party tool like NirSoft’s HashMyFiles to produce a directory listing that included the hash value for each file. With such a listing, I could use Excel to identify and examine features of duplicative files, potentially informing a decision on which files to save, move, or delete. At a later point, I returned to this post and added this section, with information on my use of such tools.

I had expected to run a command-line version of HashMyFiles, but for some reason I wasn’t finding one. So I started by downloading and running the 64-bit GUI version. I clicked on its folder icon and navigated to the folder containing the filesets I wanted to compare. They were large, and it looked like it was going to take a long time to process these files. I let it run, just to see what I’d get, but finally aborted after 36 hours. It seemed that HashMyFiles was calculating each kind of checksum it could, whereas I might need only one. The available options (in HashMyFiles > Options > Hash Types) were MD5, SHA1, CRC32, SHA-256, SHA-384, and SHA-512.

A Stack Overflow discussion offered a number of perspectives on those options. There appeared to be a distinction between simple file verification and security. The latter, taking greater pains to verify that a file had not been deliberately altered, would apparently call for at least SHA-256 (e.g., How-To Geek, 2018). But for purposes of simple file verification, CRC32 was common and apparently adequate. So I chose CRC32. In HashMyFiles, I went to View > Choose Columns and chose only these columns: Filename, CRC32, Full Path, Modified Time, File Size, Identical, and Extension. (The Modified Time column seemed to match the date and time shown in Windows File Manager.) I would soon learn that HashMyFiles did not remember these settings; I had to reset them each time I used the program. Then I re-ran the HashMyFiles calculation. Unfortunately, it seemed that HashMyFiles was still determined to calculate all possible values, even if it wasn’t displaying them. After another 12 hours, I canceled it again, and looked elsewhere.

A brief check suggested that the HashMyFiles values were probably accurate. Specifically, I ran HashMyFiles on a set of 79,113 files, and accepted its option of .csv output. That gave me a nice-looking spreadsheet showing the columns I had selected, including CRC32 values in hexadecimal. For some reason, a few files were left out: the spreadsheet listed only 79,102 files, and provided CRC32 values for 79,101. Using Excel’s VLOOKUP, I compared CRCs for files (sorted by full pathname) against the output of a tool called ExactFile, whose webpage said it was still in beta. ExactFile calculated CRC32 values very quickly — within a minute or two — but produced a report (in SFV format) listing only 54,611 files in that set, even though I did specify “Include subfolders” in its Create Digest tab. Of those, almost all (54,577) appeared in the HashMyFiles list, and most (52,943) had the same CRC32 values. The reasons for the discrepancies were not immediately obvious.
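
One way to set up that VLOOKUP comparison, with a hypothetical layout (HashMyFiles full path in column C and its CRC32 in column B, and the ExactFile report imported into a sheet named ExactFile with the full path in its column A and the CRC in its column B), would be:

=IFERROR(IF(B2=VLOOKUP(C2, ExactFile!$A:$B, 2, FALSE), "same", "different"), "not in ExactFile list")

That would label each HashMyFiles row as matching the ExactFile value, differing from it, or missing from the ExactFile report altogether.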

Raymond (2017) listed other hashing tools that I could try. At the top of his list: the portable IgorWare Hasher (4.9 stars from 15 raters on Softpedia). I tried version 1.7.1 (x64). Unfortunately, it did not seem able to process more than one file at a time. That was also the case with Hash Generator 7.0 (3.3 stars from 21 raters on Softpedia; no explanatory comments from users) and, apparently, with the other tools on Raymond’s list. Gizmo (2017) recommended Hashing (4.5 stars from 10 raters on Softpedia), but screenshots suggested that it was going to output its results across multiple lines, resulting in considerable hassle to move them all to a single line (below). It also did not offer a CRC32 option.

I tried running Checksum Calculator (4.0 stars from only 3 raters on Softpedia) on a fileset that I called WINXPFILES. It began loading files as soon as I selected the folder, at a rate of, I’d guess, maybe 50 per second. But by the time a few hours had passed, it had slowed down to more like five per second. Then three. The slowdown didn’t seem to be related to file size: the pace was equally slow for really small files. It seemed that the sheer number of files might be clogging the calculator’s memory. It took maybe 3.5 hours to load 79,113 files (i.e., ~22K files/hour) totaling 20.1GB (i.e., roughly 5.7GB/hour, averaging ~250KB/file). To be clear: this was not the CRC32 calculation. This was just the file loading process. When Checksum Calculator was ready, its buttons stopped being greyed out, and I was able to select CRC32 and click Calculate. I was surprised to see that it calculated CRC32 values for all those files in about 15 minutes. I clicked Save Results. It saved a comma-delimited text file that I could open in Notepad or Excel, containing each file’s path and name and the calculated value. Unfortunately, the supposed CRC32 values did not look like CRC32 values: they were all numerals, with no letters, and they were nine to ten digits long. Later, I realized that this was perhaps what HashMyFiles would have produced if I had chosen its Options > CRC32 Display Mode > Decimal (rather than Hexadecimal) option.
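
If one did want to reconcile decimal output like that against hexadecimal values from another tool, Excel’s DEC2HEX function could do the conversion. With a decimal CRC32 in B2, for example, something like

=DEC2HEX(B2, 8)

would return the same value as an eight-character, zero-padded hexadecimal string, matching the format used by HashMyFiles (in its default display mode) and by SFV files.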

I had another, much larger fileset that I also wanted to get CRC32 values for. I wondered if I would get faster and better results by using a command that would write its CRC32 calculations directly to a file. For that, a SuperUser discussion suggested using the certutil command:

certutil -hashfile "D:\Path Name\File to Hash.ext" MD5 >> output.txt

which offered options such as MD5 and SHA256 but not CRC32. I would have to use something like my Excel approach (above) to generate a specific command for each file that I wanted to hash, and would then have to include all those commands in a batch file. (I would shortly be getting into the details of that.) The situation was similar for a PowerShell command — that, again, could not calculate CRC32 values:

Get-FileHash -Path "D:\Path Name\File to Hash.ext" -Algorithm MD5 >> output.txt

The output.txt files, in those examples, would contain unwanted additional information that could be filtered, at least to some extent, by a more refined command or by opening output.txt in Excel and sorting to identify and remove those lines. That appeared to be less of an issue in Ashley’s CRC32.exe utility, which I hesitated to use merely because it was relatively untested: it looked like it would produce the desired CRC32 output with minimal extraneous information. When I did test it (below), it didn’t work.

I would be able to avoid the need to generate file-by-file commands by using the generally reliable SysInternals — specifically, Russinovich’s Sigcheck, from which I got nice output for all files in a directory (including its subdirectories) in an Excel-friendly .csv file using this:

sigcheck64 -ct -h -s -w output.csv "D:\[top directory name]"

but it was slow, because it apparently could not be persuaded not to calculate MD5, SHA1, SHA256, PESHA1, and PESHA256 values for each file — and it did not offer a CRC32 option. I put in a feature request on that. It also gave no progress indication, so the user would have to remember to leave its command window alone for hours, if not days, when it appeared nothing was happening.

There were many other checksum utilities, but few seemed to have the kind of multi-file capability I was seeking. So far, for purposes of rapidly calculating checksums on tens or hundreds of thousands of files, it seemed my best solution would be to give up on CRC32 and go with MD5 instead. In that case, I could start with Hashing, if I really needed a GUI tool, but it seemed that the output would be easier to work with if I chose the certutil approach instead.

In that case, the specific procedure was as follows. In the top level of another directory tree containing the files I wanted checksums for, I ran dir /a-d /s /b > dirlist.txt. In this case, the dirlist.txt file contained information on more than a million files, so trying to open it in Excel produced a “File not loaded completely” error. There was a way to use Excel with millions of data records, but I was not confident that would work for my purposes. I chose instead to split dirlist.txt in two. Among tools for this purpose discussed at SuperUser, I chose GSplit (4.0 stars from 17 raters at Softpedia). In GSplit, I specified dirlist.txt as the file to split, and then went into left sidebar > Pieces > Type and Size > Blocked Pieces > Blocked Piece Properties > drop-down arrow > I want to split after the nth occurrence of a specified pattern > Split after the occurrence number > 1048570 (i.e., just short of the 1,048,576 rows Excel could handle) > of this pattern: 0x0D0x0A (i.e., the default, representing a line break in Notepad, so that I would not be splitting the file in the middle of a line) > Split button (at top). That gave me disk1.EXE, disk1.gsd, and disk2.gsd. I renamed the last two to be Dirlist1.txt and Dirlist2.txt. (I had to delete one or two junk lines added by GSplit at the end of each of those two files.) I opened those two files in Excel, one at a time. Doing so opened the Text Import Wizard, to which I responded by selecting Delimited > Next > verify that Tab is selected > Finish. There were no tabs in dirlist.txt, so this put the full path information for each file in a single cell in column A. In column B, I entered the Excel equivalent (above) of the certutil command (above), like so:

="certutil -hashfile "&CHAR(34)&A1&CHAR(34)&" MD5 | find /i /v "&CHAR(34)&"certutil"&CHAR(34)&" >> output.txt"

The new material at the end of that command, involving “find” and another reference to “certutil,” just said that I wanted to exclude, from the output, the line in which certutil would close each MD5 calculation by saying, “CertUtil: -hashfile command completed successfully.” I copied that formula down to each row in the Excel versions of Dirlist1.txt and Dirlist2.txt. Each occurrence of that formula produced a command that could run in a batch file. I copied the full contents of column B from each of those two spreadsheets into a single file (using Notepad, though some other plain text editor would be necessary for files above a certain size) called MD5Calc.bat. I saved that file. I deleted the output.txt file produced by my testing, so that the output.txt file produced by my commands wouldn’t contain that scrap information. In Windows File Manager, I right-clicked on MD5Calc.bat > Run as administrator (or I could have run it in an elevated command window). That seemed to run at a consistent pace that I estimated at ~500 files per minute. It ran for 30 hours, and I could see what it was doing the whole time, as it processed one file after another. Unfortunately, that visibility did not help me detect that, somehow, either I or one of the processes just described had gotten the commands wrong, so that my list of files being processed was out of order. After 30 hours of doing it wrong, however, I did begin to catch on. I was not confident that the result was going to be correct, so I canceled it.
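
For clarity, each line of that batch file was a self-contained command. With a made-up path, a generated line would look something like this:

certutil -hashfile "D:\Recovered\Docs\report.doc" MD5 | find /i /v "certutil" >> output.txt

Each such command would append to output.txt a line identifying the file and a line containing its MD5 hash, with certutil’s “command completed successfully” line filtered out by the find portion.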

Rather than restart what would evidently be a 60-hour process, I decided to try Ashley’s CRC32.exe utility. As I explored that approach, however, I encountered some complications. Among other things, the CRC32.exe utility would put its output on two lines. This was not a problem with the smaller file set; but with the larger set I was trying to resolve, this would double its already very long file listing (i.e., 1.5 million files) — which, as demonstrated above, would make the Excel approach even more unwieldy. Seeking a solution on that front, I posted a question on SuperUser.

Pending a good answer to that question, I tried another search for CRC32 tools. This search seemed more successful. Several additional tools came to light, each seemingly capable of batch CRC32 processing. I started with Fsum Frontend (4.5 stars from 2 raters on Softpedia) — which, as its name suggests, was a front end or GUI for Fsum. I downloaded it and ran it as a portable (4.0 stars from 3 raters on Softpedia). Its ReadMe file said it contained code from Jacksum, another possibility that had emerged in my search, as well as from BouncyCastle and FlexiProvider. To use it, I went to its Tools > Generate checksum file > navigate to and select the folder containing the files to be checked > check the Recurse Subdirectories box > Format = SFV 512 > Mode: 1 file in any place > Save As > navigate to the desired output folder and enter output.txt. To run it, I clicked the icon next to the Mode box. That seemed to run for a few seconds, but then it produced an error, naming one of the files and saying, “The system cannot find the path specified.” That seemed to kill the process: clicking OK left me back where I started. There did not seem to be an option to allow it to skip that problematic file. I wasn’t about to go digging manually through tens of thousands of files to fix any that this program might not like. It seemed wiser to try another tool.

The context menu (i.e., right-click) option installed in Windows File Manager by HashCheck Shell Extension (HSE) (4.4 stars from 46 raters on Softpedia) was simply to “Create Checksum File.” When I went ahead with that option, it gave me an opportunity to choose a name and location for the checksum file, as well as a choice of filetype. I was interested in the CRC-32 option, which would create an .sfv file; this option would be much faster and would produce much shorter filename addenda than some of the other options. Then it said, “Creating [the specified filename].” It worked through the set of 79,113 files in about four minutes. As above, I imported that .sfv file into Excel and compared its results against those produced by HashMyFiles. With a few dozen exceptions, due primarily to an apparent Excel inability to deal with the presence of a tilde (~) in filenames (the tilde serves as a wildcard escape character in Excel’s lookup functions, so lookups on such names can fail unless it is doubled to ~~) and secondarily to a handful of files reported as UNREADABLE, HSE produced the same file listing and CRC32 values as HashMyFiles. It appeared that HSE could do the job.