Tag Archives: SHA1

Comparing Backup Drive Hashes to Detect Ransomware

Summary

This post seeks a solution to the problem that the most commonly used type of ransomware at present encrypts files gradually and decrypts them on the fly, so that files remain available whenever the user wants them and s/he does not realize they have been encrypted.

The proposed solution involves comparing the present SHA-512 hash values of data files against hash values calculated earlier for copies of those files, such as copies on a backup drive. The latter hashes can be calculated on another machine, so that if the working data drive is corrupted, that corruption will not spread to the backup.

The final section of this post links to the downloadable spreadsheet that I have been using to compare hash values. A more widely usable solution might involve PowerShell, which is reportedly able to detect file changes and calculate hashes in real time. A PowerShell script could ideally compare changes against a hash list saved (password-protected, perhaps in the cloud) at the time of the most recent backup, and could notify the user of any excessive number of file changes.

I posted questions about the solution suggested in this post. Answers to those questions (in the Stack Exchange forums titled Cryptography and Information Security) seemed to confirm that the solution developed here, involving comparisons on two different operating systems, should work as long as the file-encryption ransomware is not running in a synchronized manner on both Linux and Windows platforms. A subsequent post refines this one on certain points.

Contents

The Problem
Hash Could Be the Answer
GUI Solutions
Command-Line Interface (CLI) Tools
Source Drive Hash Comparisons
Comparing Against Old Backups
Implementation


The Problem

As discussed in another post, I was developing methods of improving my system’s ransomware resistance. As part of that effort, I wanted a way to compare the contents of two drives. One might be the source drive and the other its backup, or perhaps they would both be backup drives, made at different times. The idea was that, if ransomware was encrypting the files on the data drive on my Windows computer (where I did most of my work), these comparisons would tend to make that corruption evident.

Rather than trust automated backup software, I had developed a cautious arrangement that I found useful for checking my backups. This cautious approach tended to make me aware that I might have unknowingly deleted or relocated files, whereas automated backup software would happily relocate or delete those files on the backup as well. This arrangement used Beyond Compare (BC); no doubt other drive comparison tools would have worked too.

But I didn’t want to use that drive comparison approach in this situation. If the Windows computer was infected with ransomware, it would not be wise to connect the backup drive to it. (The assumption, here, is that I would be dealing with file-encryption ransomware, which presently was the dominant type. This form of ransomware would quietly encrypt files, but would then decrypt them on the fly, whenever a program attempted to access them. That way, the user would not notice anything amiss: files would continue to be available as usual, until the ransomware suddenly flipped the switch, locking most if not all files at once.)

BC’s more exacting (especially its byte-for-byte) comparisons would notice differences between source and backup copies of a file, even if the ransomware was smart enough to reset filesize and timestamp (and possibly even CRC32) to their pre-encryption values. But it could take many hours to run those more exacting comparisons in BC. During those hours, the ransomware could be encrypting files on the backup as well – conceivably the same files as on the source drive, so as to minimize noticeable differences between the two drives.

Hash Could Be the Answer

Compared to my usual BC-based backup arrangement, it seemed that it could be safer and more thorough to conduct separate file verification procedures on two different computers. My source data drive was in my primary Windows desktop machine; I could connect a backup drive to another computer. For possible cross-platform ransomware resistance, I might even boot the secondary computer in Linux.

(As discussed in more detail in the other post and in posts that it links to, malware would be less likely to travel from the Windows machine to the Linux machine if my Linux installation wasn’t using anything like WINE, and if the malware wasn’t one of the relatively infrequent cross-platform types. Ransomware would also find it difficult to relay information to and receive instructions from its home server if, as envisioned, the Linux computer was disabled from going online.)

Wikipedia defined “file verification” as “the process of using an algorithm for verifying the integrity of a computer file.” The verification process would calculate a value, based on the contents of the file. If the calculation was sufficiently stringent, virtually any change to the contents of the file would change that value. So if ransomware encrypted a file on my source drive, recalculation of the algorithm would result in a new value.
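
For example, using PowerShell’s Get-FileHash (one of the tools discussed below) on a hypothetical throwaway file, one can watch the value change the moment the contents change:

 Set-Content -Path 'C:\Temp\sample.txt' -Value 'original contents'
 (Get-FileHash -Algorithm SHA512 -Path 'C:\Temp\sample.txt').Hash
 Add-Content -Path 'C:\Temp\sample.txt' -Value 'one more line'
 (Get-FileHash -Algorithm SHA512 -Path 'C:\Temp\sample.txt').Hash    # an entirely different value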

To detect that the source file had changed, I wouldn’t have to compare it directly to its backup, and risk infecting the backup as well. Instead, I could just compare the values computed by the algorithm. The backup drive could remain air-gapped from the infected Windows machine, especially if I took sufficient precautions to prevent the malware from hitching a ride on the USB drive that I used to convey the algorithm’s results from one computer to the other. (One possible psychological benefit of using Linux on the backup computer could be that its mere presence – that is, the use of Linux – would remind me that the secondary machine was supposed to remain separated.)

In my situation, it may have been helpful that I often ran DoubleKiller to verify that the source drive did not have any exact duplicate files. If it did, I would usually zip or alter one, so as to eliminate the duplication. But perhaps exact duplicates would not be a problem. Regardless of the number of copies of a file, I would want to be notified of unexpected changes.

For present purposes, some file verification tools would be better than others. It seemed, in particular, that I would want to use a cryptographic hash function rather than a simpler checksum. Wikipedia said that relatively simple checksums (e.g., CRC32) “are not suitable for protecting against intentional alteration of data” (by e.g., ransomware) – because, according to a StackExchange answer, “[I]t is pretty easy to create a file with a particular checksum.” That is, conceivably ransomware could encrypt a file in such a way that its CRC32 checksum would remain unchanged, even though the contents of the file were now very different. By contrast, that answer said, “A cryptographic hash function … is secure against malicious changes. It is pretty hard to create a file with a specific cryptographic hash.”

Of course, not all hashes were created equal. This was, after all, an attempt to represent an entire file’s contents with a relatively short string of letters and numbers. As just indicated, CRC32 was not designed to resist deliberate tampering, and the MD5 cryptographic hash was no longer considered secure. Instead, we had the Secure Hash Algorithms (SHA) family. Wikipedia explained that this family consisted of several sets of cryptographic hash functions: SHA-0 (1993), SHA-1 (1995), SHA-2 (2001), and SHA-3 (2012), each intended to improve upon its predecessor. Each of these sets offered a choice among hash value sizes. In SHA-2, those sizes ranged from 224 to 512 bits. (A Stack Exchange discussion supported my impression that, for example, 256-bit SHA-2 could have been written as SHA2-256, anticipating the later SHA3-256. But that wasn’t standard usage. Instead, it was as though SHA-2 would never be superseded: everyone wrote 256-bit SHA-2 as simply SHA-256.)

Unfortunately, each improved level of security would require longer to compute. For some purposes, more advanced hardware could help with that. Various sources (e.g., Cryptopedia, 2021; BitcoinWiki, n.d.; Medium, 2019) explained that cryptographic hashing rates had increased by orders of magnitude with the introduction of successive generations of specialized calculating chips, from the computer’s general-purpose central processing unit (CPU) to graphics processing units (GPUs) to field-programmable gate arrays (FPGAs) to application-specific integrated circuits (ASICs) to cloud computing services (e.g., Elcomsoft) that would rent you as many fast calculators as you could afford. Not that computation speed was the only factor: there were indications (e.g., FreeCodeCamp, 2020; SysTutorials, 2015) that, on a local drive, data transfer speed could be the primary bottleneck.

For my purposes, advanced hardware might not be helpful. I wasn’t sure I wanted to give ransomware access to faster hardware. To the contrary, I wanted its encryption process to take as long as possible, to extend the timeframe in which I might become aware that something was happening to my files. There was also a budget factor. I wasn’t inclined to set up an expensive secondary computer to calculate hashes. At least to start, I was planning to use my old (2012) Lenovo E430 laptop as my Linux machine, producing sets of hash values for my backup drives.

For consumers without GPUs, whose 1TB and larger data drives were increasingly likely to fill up with video and other space-consuming materials, there could be a difference of hours between using a relatively fast checksum and a relatively slow cryptographic hash to check for file changes. Until I felt like investing in faster hardware, there would be limits on my willingness to drag things out in hopes of giving the bad guys a hard time. At some point, I thought I might like to actually complete the hash-checking project, so as to continue with my backup process, and then maybe have some time left over for a life. In that spirit, it seemed I should aim for the least demanding cryptographic hash that would still be relatively secure.

In my reading, SHA-256 seemed to be a reasonable compromise: secure, but not taking so long to compute. But then I discovered there was a wrinkle. As it turned out, SHA-256 was significantly faster to calculate than SHA-512 only on late-model CPUs that supported the Intel SHA extensions. The CPUs in my computers weren’t among those specific recent models. Test runs using MD5 & SHA Checksum Utility on both the desktop and the Lenovo laptop illustrated what participants in a Cryptography discussion were saying: on machines without those extensions, SHA-512 file hashing would, perhaps unexpectedly, be much faster than SHA-256 calculation. At least until I upgraded my hardware, it seemed that SHA2-512 was the hash value I would want to compute and compare.
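
Anyone wanting to check this on their own hardware could time the two algorithms against a single large test file; a minimal sketch (hypothetical path), using PowerShell’s Measure-Command:

 $file = 'D:\Test\largefile.iso'    # any large file will do
 (Measure-Command { Get-FileHash -Algorithm SHA256 -LiteralPath $file }).TotalSeconds
 (Measure-Command { Get-FileHash -Algorithm SHA512 -LiteralPath $file }).TotalSeconds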

GUI Solutions

Since I wasn’t seeing any tools designed specifically to compare file hashes from drives on two different computers, I figured I would use a spreadsheet. One tab of the spreadsheet would list the filenames and their hash values from the first computer (e.g., the source drive in my Windows 10 desktop), and the other tab of the spreadsheet would list the filenames and hashes from the second computer (e.g., a backup drive connected to the Lenovo laptop). Using spreadsheet techniques discussed in another post, it wouldn’t be hard to divide the path, filename, and hash values in each tab, and then sort the lists and use lookups to identify inconsistencies between the two lists. The Implementation section (below) provides more detail on the spreadsheet I developed in this case.

With that plan in mind, what I needed was a tool that would produce a single line of output for each compared file. That line would contain nothing other than the complete path and file name of the file being compared, and its corresponding SHA-512 value. (Spreadsheeting would have been easier if I had found a tool that would also produce the file’s timestamp on that same line.)
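
In the absence of such a tool, a rough sketch using PowerShell (below) could produce that kind of one-line-per-file output, with the timestamp included (paths hypothetical):

 Get-ChildItem -LiteralPath 'D:\Data' -Recurse -File -Force |
   ForEach-Object {
     $h = Get-FileHash -Algorithm SHA512 -LiteralPath $_.FullName -ErrorAction SilentlyContinue
     if ($h) { "{0}`t{1}`t{2}" -f $h.Hash, $_.LastWriteTime.ToString('yyyy-MM-dd HH:mm:ss'), $_.FullName }
   } |
   Out-File -FilePath 'D:\Current\HashList.txt' -Encoding ASCII

The tab-delimited lines would paste cleanly into separate spreadsheet columns.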

There seemed to be quite a few GUI hash calculators (e.g., Raymond.cc, 2019; Next Of Windows, 2017). Most seemed oriented toward providing multiple hashes (e.g., SHA1, MD5, CRC32) for a single file, or for a folder as a single unit (i.e., not for each file within a folder). Those that would calculate hashes for each file within a folder individually seemed to focus on providing mostly weaker (e.g., SHA1, MD5, CRC32) hashes. For example, Nirsoft’s HashMyFiles was user-friendly, and provided a lot of useful information; but it did not calculate SHA512, nor did its developer seem interested in my suggestion that he should develop a command-line version.

HashTools and MultiHasher appeared to be exceptions, capable of calculating SHA512 hashes for multiple files at once. Between the two, MultiHasher seemed more popular at Softpedia (4.7 stars from 41 raters). I personally found its display cluttered and unintuitive, and its output undesirable for my purposes: filename plus hash were not provided on one line. HashTools was much less cluttered but still unfortunately lacking in basic instructions. Through trial and error, I gathered that you click Add Folder and wait for it to populate the list of files; that you do not check “Automatically save hash files with each file checked,” because apparently that would make the program write something onto, or adjacent to, the files being hashed; and that you then click your preferred hash button (in my case, SHA512).

I ran HashTools’s SHA512 hash on a Test Folder containing 1,994 files (total folder size: 9.1GB; average file size: 4.9MB; median file size: 27KB). HashTools had the nice feature of showing its progress. It did calculate a SHA512 value for every file in the folder. For some reason, it seemed incapable of copying both filenames and hashes to the clipboard simultaneously. I had to copy first one, then the other, yielding the risk that what I pasted to the spreadsheet would not match up. HashTools finished analyzing that folder in 7:19 (m:ss). That was less than half as fast as MultiHasher (3:12). The difference was so great that I would have to re-run that comparison to be sure. But I didn’t like HashTools enough to bother. These results were sufficient to remind me that a GUI was inevitably a front end for programming decisions that could vary from one product to the next.

As another GUI possibility, I ran QuickHash 3.3 on that Test Folder. Its interface was superior to those of HashTools and MultiHasher, as were its options and progress report (at least potentially – see below). In a brief test, its speed seemed to be only slightly below that of PowerShell (below). There were, however, some problems. For example:

  • I ran into difficulty getting QuickHash to export in the desired format. In response to my emails, the developer kept telling me to check the User’s Manual. He oddly insisted that the manual was easy to find – even though there seemed to be no links to it anywhere on his website or in the program itself (which also had no Help feature). He felt it was logical that users would find it in the Downloads area – even though the drop-down menu for that section of the website did not list it. He said it was also included “inside the zip of the download for ultimate convenience” – but, obviously, there was no zip in the Linux .deb that I installed. He said it was an excellent manual, and that I should be able to find answers readily in it – even though it had no table of contents or index. He felt that users should just search for relevant terms, and then examine each occurrence of such terms until they found the right one.
  • I belabor those communications to convey an impression of stubbornness that could impair quality. That is, the belief that your product is perfect may inhibit you from looking for its possible flaws. This was definitely not the kind of correspondence I had with developers who were working hard to ensure that their software got it right every time. Thus, in my first use of QuickHash, in the very brief test of the Test Folder mentioned above, I found that the output misstated some filenames. In the actual folder, those names contained double spaces; but in the QuickHash output, those had been changed to single spaces. This would mean that the program’s output would not match up with the actual filenames. The developer was not aware of this bug. If I could find it this easily, I had to wonder whether he was carefully testing to identify possible imperfections in his homemade hashing script, the handling of filenames, or other aspects of QuickHash.
  • I ran QuickHash for five hours, to obtain hashes on every file on an entire hard disk drive (HDD). When I sought to save the results, the program crashed. I ran it again; it crashed again. The developer said that the data presented in the QuickHash output grid was also held in an SQLite file. The user manual added, however, that the SQLite file was deleted when QuickHash was closed. I did not find the SQLite file on either Linux or Windows. The developer did not seem to have a solution for my impression that the program had a hard time with large numbers of files. At present, this behavior rendered QuickHash useless for my purposes.
  • In that five-hour run, the QuickHash progress report signaled rapid progress at first, but then slowed sharply. It seemed to be counting progress in terms of numbers of files rather than quantity of data. So it reported 80% completion at the two-hour mark, and then kept grinding away at that remaining 20% for the next three hours. Maybe the folders it tackled first happened to have many small files. Whatever the reason, the progress report would have been more accurate if it had been based on quantity of data rather than on numbers of files. I mention this, among other glitches not detailed here, because it now seemed that the progress report could be, but wasn’t yet, one of the program’s strong points.

I felt that QuickHash was a worthy program, and I recognized that the developer had done a lot of work on it over a period of years. I did not mean to bring difficulty into his life – but I did get the feeling that I was bothering him, when I presented seemingly reasonable issues arising from real-world attempts to use his software. I was disappointed, because I did prefer its interface over the other GUI tools that I tried.

Command-Line Interface (CLI) Tools

While QuickHash seemed useful for calculating hashes of individual files and folders, my experience (above) suggested that I might be wise to stick with more established tools for full-drive hashing.

There appeared to be a number of relatively solid CLI options. In Linux, it seemed that rhash or possibly sha512sum might be adequate. For rhash, the command could be something like

 rhash --sha512 --uppercase -r /path/to/folder/filename.txt | pv > hashlist.txt

That formulation drew on the use of pv in another post. It wasn’t as informative here, but at least it stated the size of the output file and the elapsed time, either of which would be meaningful if the user already had a sense of how long the full job should take and/or how large its output file would tend to be. If I wanted to specify all files in a folder, rather than just one file named filename.txt, I would simply omit filename.txt from that command. If I meant the rhash command to refer to the current folder, I could replace the entire input path and filename with just a dot. (Note the irregular behavior I obtained with rhash in one situation. Note also that rhash might appear to hang when hashing large files. Be patient.)

There were also quite a few CLI hashing tools in Windows, most of which gave at least an initial impression of being well-developed:

  • A possibility I did not explore: run rhash in a Linux virtual machine, to hash files on a physical drive. I assumed this would be slower.
  • I tried Windows’ built-in certutil -hashfile "D:\Path\to\filename.ext" sha512. For recursively hashing all files in a specified folder plus its subfolders, answers at SuperUser pointed toward something like for /r "D:\Current" %F in (*) do certutil -hashfile "%F" SHA512 | find /i /v "certutil:" >> "D:\Current\CertUtilOutput.txt". I revised that command into a batch file that I posted as an answer at Stack Overflow. At least its optional find component would eliminate one line of clutter from the output. That command did produce results for all files in the folder and its subfolders. It seemed to run much more slowly than PowerShell (below). I didn’t like that this certutil command didn’t provide any kind of progress report. It looked like adding a progress bar to a batch file could be equally problematic. I also didn’t like that certutil’s output came in two lines: one for the path and filename, and then one for the hash. That wouldn’t be hard to work around in a spreadsheet. It would just be one more hassle, repeated every time I wanted to compare a drive to some other drive.
  • Installing 7-Zip gave me the option of using its command: 7z h -scrcsha512 "D:\Path\to\filename.ext". I could run that command in any directory, after I put a copy of 7z.exe in a folder on my system’s path (e.g., C:\Windows). I could omit filename.ext to run the command recursively (i.e., on all files in the specified folder and its subfolders). Unfortunately, what I just wrote was incorrect: a webpage presenting 7-Zip’s hash (h) option indicated that SHA512 was not supported. SHA256 was as high as it went. When I tried the command just shown, it said, “System ERROR: Not implemented.” It did seem to run OK when I changed it to SHA256.
  • I tried DirHash by Mounir Idrassi, the developer of VeraCrypt. Its instructions led me to put a copy of DirHash.exe in C:\Windows, and then to add a bunch of parameters. The command that worked for me was dirhash D:\ SHA512 -progress -threads -skipError -nofollow -sum -nowait -overwrite -t "D:\Path\DirHashOutput.txt". For me, the primary drawback of this tool – indeed, of all these CLI tools – was that it displayed no progress indicator, though it sounded like I might have persuaded Mounir to add one. Without its -quiet parameter, at least I could sometimes get a glimpse of how far it had progressed. DirHash apparently locked the entire top-level folder while it was hashing its contents – or anyway, for some reason, when I tried to move a file, Windows said, “The file is open in Dirhash.exe,” and that problem persisted for a long time before I was finally able to move the file.
  • There were multiple Linux emulators that would bring Linux tools over to Windows. Various sources seemed to agree that, in the words of a SUNY Binghamton webpage, Cygwin would require “more time and effort up front to install and maintain” whereas Git-Bash (a/k/a Git for Windows) was “the quickest and easiest solution.” While there were enthusiasts for the newish Windows Subsystem for Linux (now WSL2) (at e.g., Reddit), for present purposes the more important observation (in SuperUser) was that “Git-Bash is a typical Windows program” and, as such, “sees C:\ as its root directory” whereas WSL2 “is definitively not a typical Windows program” but instead “sees itself as running on Linux” and therefore uses “Linux’s directory structure.” Until I needed one of the others, Git-Bash seemed like the option for me. Geeks for Geeks (2021) walked me through the installation steps, starting with downloading and running the installer from the Git for Windows website. I went with the defaults almost entirely. This included not checking the new option to add a Git Bash profile to Windows Terminal. The Notepad++ option was grayed out, so I chose WordPad as Git’s default editor, not for any strong reason other than that WordPad was always included in Windows. When installation was done, I checked the box to launch Git Bash. That gave me what looked like a Linux prompt. I tried sha512sum "D:\Path\to\folder" for my Test Folder (above). That produced an error message informing me that the designated folder was a directory. Evidently sha512sum only worked on individual files. That was good as an option, but I didn’t want to have to work up a list of complete commands for every file in every drive or folder. I quickly found that rhash was not installed in Git-Bash, and sudo apt install rhash produced an error. So: I ran into Git-Bash limitations sooner than expected. Cygwin and WSL2 were still possibilities, but with more of a learning curve and investment of time and resources.
  • A few years earlier, I had posted a similar question, seeking one-line output for CRC32 values. In that case, as detailed in the post leading up to that question, I relied on Ashley’s (2008) crc32.exe executable. Answers to my posted question offered two different batch scripts incorporating crc32.exe. This time around, I had those batch solutions – but now I was missing the SHA-512 executable that they could use in place of crc32.exe. I did find that assorted older and/or unverified sources had developed similarly named Windows CLI tools (e.g., md5.exe, sha512.exe; see also CRC32 Calculator); but there did not seem to be any widely recognized source (e.g., Microsoft) for such utilities, and I was running low on enthusiasm for experimentation with potentially undercooked fare.
  • My posted inquiry at ComputerHope led at first to hashsum.bat. Unfortunately, as I reported in that discussion, hashsum failed to report hashes for about 1% of the files hashed, and also failed to specify those files (I had to identify them manually) or the reason(s) for its seeming failure.
  • To make sure I was not overlooking anything obvious, I posted a question at Software Recommendations. In that forum, I might have expected a GUI suggestion. Instead, the suggestion was to use Windows PowerShell. Among the ways to start PowerShell, I used Win-R > powershell. There ensued some struggling to figure out PowerShell’s Get-FileHash for SHA-512. I found that the PowerShell solution calculated SHA-512 hashes, for the files in my Test Folder (above), as quickly as any other solution – as much as three times faster than some. The PowerShell hashes matched up perfectly with the results from certutil (above). My PowerShell solution incorporated several findings: I needed to output to text, so as to avoid PowerShell’s Export-CSV tendency to interpret commas in pathnames as delimiters; to specify ASCII encoding, because apparently this would reduce problems when viewing the results in Excel; to use -LiteralPath for success with square brackets in filenames (with no guarantees for square brackets in folder names); to include Format-Table -AutoSize to prevent truncation (which, for some reason, only happened when I was hashing an entire drive); and to set the width to 400, so as to allow for the 128 characters produced by SHA-512 plus the default maximum Windows path length of 260 characters. With those adjustments, at first this command seemed to work:
Get-FileHash -Algorithm SHA512 -LiteralPath (Get-ChildItem 'D:\folder\*.*' -Force -Recurse) | Select-Object Hash,Path | Format-Table -AutoSize | Out-File -filePath 'D:\Current\PSOut.txt' -Encoding ASCII -Width 400

There was no progress report feature in this PowerShell command, and various sources (e.g., Microsoft, SuperUser, Adam the Automator) conveyed an impression that it would be difficult at best to make a progress indicator work well in this context. Stack Overflow offered ways of including a progress report, but those appeared likely to involve scripting that, at present, was beyond my abilities. I was not sure whether they would also slow down the hashing.
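
For what it may be worth, those scripted approaches seemed to run roughly along these lines (a sketch only, untested at scale, with hypothetical paths; it counts files rather than bytes and adds some per-file overhead):

 $files = Get-ChildItem -LiteralPath 'D:\folder' -Recurse -File -Force
 $total = $files.Count
 $i = 0
 $results = foreach ($f in $files) {
     $i++
     Write-Progress -Activity 'Hashing files' -Status $f.FullName -PercentComplete (100 * $i / $total)
     Get-FileHash -Algorithm SHA512 -LiteralPath $f.FullName -ErrorAction SilentlyContinue
 }
 $results | Select-Object Hash, Path | Format-Table -AutoSize | Out-File 'D:\Current\PSOut.txt' -Encoding ASCII -Width 400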

Unfortunately, that PowerShell command was a work in progress. As suggested on various sites (including those linked in the last bullet point, above), PowerShell was complex to the point of seeming flaky: what worked in one setting did not work in another, nor did it necessarily work as expected. For example, as I would soon see, the PowerShell command just shown would terminate when it failed to find a file that it expected to find. I did hope that PowerShell (or something like it) would point the way to a superior solution, and therefore I posted a question about this. In the meantime, however, I didn’t have such problems with the rhash or DirHash commands shown above. Those were the tools I used, then, for the following explorations.

Source Drive Hash Comparisons

The more frequently the user could compare hash values of source and target drives, the more quickly s/he could detect that ransomware was corrupting files. The principal barrier to more frequent hash comparisons was that each full set of hash values would take a substantial amount of time to calculate and compare. Calculation time would depend upon machine speed and the number and size of files. As a rule of thumb, hashing seemed to proceed at something like a half-terabyte per hour on a fast machine, and maybe a third of a terabyte per hour on a slow machine. For users with a lot of data and/or a slow machine, hashing could take many hours.

To illustrate, the Linux machine that I used to calculate hashes on the backup drive was an old (2012) Lenovo ThinkPad E430. It was the only computer I could set aside for hours of drive analysis. It had a dual-core Intel Core i3-2350M with a PassMark score of 1254, arguably adequate (8GB DDR3) RAM, and two USB 3.0 connectors. As discussed in another post, disabling this laptop’s ability to go online would presumably prevent it from coordinating effectively with a ransomware server. Thus, even if the Lenovo was infected, the data drives it interacted with would presumably differ significantly from drives on a Windows computer that it did not interact with, even if the same ransomware was at work on both. This laptop also had no internal drives installed; thus, it could only corrupt data files when an external drive was connected. The Lenovo’s minimal Ubuntu installation was running from a Samsung FIT USB 3.0 drive. The Samsung was relatively fast, as USB drives went, but this was surely a slow way to run an operating system. The backup HDD that I was hashing with rhash was mounted in an external drive dock, connected to the laptop via USB 3.0 cable – which was, again, much slower than an internal SATA drive. A brief test suggested that running Ubuntu from a Samsung T5 external solid state drive (SSD) didn’t help much. Possibly the explanation was that most of the relevant operating system software was loaded into RAM. In that case, if the Lenovo had lacked sufficient USB 3.0 ports, it might not have been terrible to boot the Samsung FIT from a USB 2.0 port. Putting data files on the external SSD did increase hashing speed by about 20%, compared with putting them on the external HDD; presumably, then, hashing speed was mostly determined by the speed of the CPU. But these hashing tasks did not seem to burden the laptop too much: its fan rarely sped up much beyond its normal, barely audible sound.

At least the process of computing hash values did not take much user time. S/he would just enter a command and then, hours later, the process would be finished. By contrast, the process of comparing source and target drive hash values could take a chunk of user time, depending again upon machine speed and files compared. Occasionally I saw forum comments from users who were working with millions of files. Although Microsoft Excel had a limit of about a million rows, it was possible to use it to analyze multiple millions of rows of data. But even with just a few hundred thousand rows, a computer that was not really fast could take a long time to perform the calculations needed to compare before-and-after hash values. On some systems, that could mean crashes of Excel or even of the operating system. (Calculations would be even slower in a virtual machine, but possibly the system as a whole would be more stable that way.) Some users might be able to mitigate such loads by setting up separate encrypted partitions for folders whose files were less or more frequently modified. In that approach, partitions might be hashed at different rates, depending on how often and/or how long they were mounted.

The large amounts of machine and user time required to compare hash sets meant that the user was unlikely to run such comparisons often. I, myself, was only willing to run them every five days, when I updated my backups (below). That was unfortunate. If I could have run them daily, I would have had faster warning of ransomware infection, as mentioned above. There could also be another benefit: I could compare today’s source drive file hashes against yesterday’s, without involving a backup drive. If I knew that there had been no unwanted changes to files on the source drive, I would have greater insight into the cause of a sudden, major difference between source and backup drives: I would suspect that, if there was an infection, it may have been on the Linux side. Without this separate check on the source drive, I could only assume (incorrectly, perhaps) that the infection probably came from the Windows side.

The right software could significantly decrease time burdens on the Windows (source) drive side, opening the way to faster ransomware detection. Consider, for example, the impression that PowerShell was capable of detecting whenever a file had changed within a folder. With suitable exclusions (e.g., for user-approved changes in a smallish number of files by a file-editing program presently in use), that sort of functionality might be able to notify users of ransomware file encryption in real time – without requiring my method’s full rehash of every file in a drive. Such software might also be able to log a history of hash changes efficiently, so as to help users identify the last good date of a now-corrupted file. With the right tool, same-drive hash comparisons (e.g., today vs. yesterday) could become the dominant form.
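
As a rough illustration of that impression (a sketch only, with a hypothetical path, no exclusion logic, and no handling of the multiple change events that a single save can generate), PowerShell can register for file-change notifications and hash each changed file as it occurs:

 $watcher = New-Object System.IO.FileSystemWatcher 'D:\Data'
 $watcher.IncludeSubdirectories = $true
 $watcher.EnableRaisingEvents = $true
 Register-ObjectEvent -InputObject $watcher -EventName Changed -Action {
     $path = $Event.SourceEventArgs.FullPath
     try {
         $hash = (Get-FileHash -Algorithm SHA512 -LiteralPath $path).Hash
         Add-Content -Path 'D:\Current\ChangeLog.txt' -Value ("{0}`t{1}`t{2}" -f (Get-Date), $hash, $path)
     } catch { }    # the file may still be locked by the program writing it
 } | Out-Null

A real tool would also have to count or rate-limit those notifications; an unusually large number of changed files in a short period would be the signal of interest here.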

A different way to reduce the hashing burden would be to set red flags – to salt various folders, throughout my source drive, with specific files that I would never change, and to use those as canaries in the coal mine – as leading indicators, that is, that entire folders were being corrupted. So, for instance, if Red Flag File No. 3 was one that I would never change, and if one day I got an indication that its hash value had changed in the past 24 hours, I would know that something was wrong with that file and/or its folder. Relying on hash values for a few hundred red flag files would be much faster than hashing every file on the entire drive, every time I wanted to do a backup. Unfortunately, browsing revealed that I was not the only one who had contemplated this solution. I had to assume that creators of ransomware – spending all day, every day, thinking about this stuff, in pursuit of potential wealth – would be a step ahead of me in tricks of this sort. It was possible that ransomware, in touch with its home base through my Windows computer’s Internet connection, would relay information that would reveal something unusual about my red flag files (e.g., that they all shared the same creation time, similar names, or similar patterns of being listed or otherwise accessed around the times when I made backups), triggering instructions from the home base to refrain from encrypting those files. So then we would be back at the starting point: files were being encrypted while my anti-malware solutions slept.
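
The check itself, at least, would be trivial to script, for anyone wanting to experiment with the idea despite that concern. A sketch, assuming a hypothetical baseline CSV with Hash and Path columns, saved when the red flag files were planted:

 $baseline = Import-Csv 'E:\Safe\RedFlagBaseline.csv'
 foreach ($entry in $baseline) {
     $current = (Get-FileHash -Algorithm SHA512 -LiteralPath $entry.Path -ErrorAction SilentlyContinue).Hash
     if ($current -ne $entry.Hash) { Write-Warning ("Red flag file changed or missing: " + $entry.Path) }
 }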

Comparing Against Old Backups

Regardless of hash calculation and comparison method, it seemed that comparisons of sets of hash values would depend upon the user’s judgment and recollection. Software (be it my spreadsheet or a sophisticated PowerShell script) could only notify me that, today, some number of files on the source drive differed from their counterparts in a comparison set (from the backup drive or from the source drive previously). I would have to decide whether this difference was expected or unexpected.
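
The notification part, at least, is mechanical. A sketch of that part, leaving aside refinements the spreadsheet handles (such as renamed files and differing drive letters), and assuming two lists in the tab-delimited format sketched earlier (hash first, full path last; file names hypothetical):

 $old = @{}
 foreach ($line in Get-Content 'E:\Safe\Backup-HashList.txt') {
     $f = $line -split "`t"
     $old[$f[-1]] = $f[0]          # key = full path, value = hash
 }
 $changed = @()
 foreach ($line in Get-Content 'D:\Current\Source-HashList.txt') {
     $f = $line -split "`t"
     if ($old.ContainsKey($f[-1]) -and $old[$f[-1]] -ne $f[0]) { $changed += $f[-1] }
 }
 "{0} files have hashes that differ from the comparison set" -f $changed.Count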

Consider what happened when I compared new hash values from the source drive against newly calculated hash values for a nine-month-old backup. I found that, unfortunately, there were many discrepancies that I could not vouch for. In a comparison of today’s data drive against such an old backup, I might have no idea why D:\Folder\File001.doc on the source drive was different from its copy on the backup drive. There may have been a good reason – this may have been a change that I (not ransomware) made, months ago, with full knowledge of what I was doing. The point is just that I couldn’t say, with any confidence, that ransomware was not the reason for the change.

When I faced those many files that had changed in the past nine months, for reasons that might or might not be due to ransomware corruption, I had to decide whether to just accept the changes and hope for the best, or manually examine the contents of the changed files on a different (perhaps Linux) system that would hopefully not be infected with the same ransomware, where (as sketched in another post) corrupted files would not be decrypted on the fly, to fool me into thinking they were OK.

Predictably, in the real-life situation, I didn’t go to that trouble. I just took my chances: I connected the old drive to the Windows computer and updated it to match the current state of the source drive. I hoped that my other comparisons against more recent backups would draw upon a better recollection of more recent file activities. If ransomware was corrupting my files, the comparison against this nine-month-old drive could not be trusted to alert me to that fact.

So if I was going to rely on comparisons of older and newer hash sets, it would help if I didn’t let the older ones get too old. If I ran a hash comparison every few days, for example, I could expect to remember most if not all of the file changes that it highlighted. If I made a regular habit of that, I wouldn’t have to worry so much about unexplained changes from nine months ago: I could be confident that I had been keeping up with changes on more of a day-to-day basis.

Against that nine-month-old drive scenario, consider what happened when I compared the current state of the source drive against a five-month-old backup. Would my experience be different with this more recent backup?

The answer was yes. In this case, I took a closer look at the discrepancies. My spreadsheet indicated that, after five months, ~85% of files remained completely unchanged – in their filenames, paths, and contents as summarized by SHA-512 hash. For most of the remaining ~15% of files, the names and/or paths had changed, but the hash values remained the same. This was not surprising: I recalled specific relevant activities, notably including a lot of file renaming and rearranging.

Thus, I was able to account for almost all of the discrepancies between the source drive and this five-month-old backup. Only 2.4% of the files on the source drive had no file with a matching hash on the backup drive, and 0.5% of the files on the backup drive had no match on the source drive. It was not surprising that there were more unexplained file changes on the source drive: that drive contained files that had been newly added in the past five months. I was able to identify those files by extracting timestamps from a directory listing of files on the source drive, for each file in that remaining 2.4%.

The few remaining unexplained files on the source drive were mostly files that others (e.g., developers) had created earlier, but that I had only downloaded or otherwise acquired in the past five months. On the backup drive, the 0.5% unaccounted for were mostly files that I had deleted in those past five months, or had updated with a newer version, or had archived into compressed .rar files.

Ideally, I would not wait five months to update a drive. Instead, these results seemed to support my general impression, as discussed in another post, that I would be better advised to set up a drive rotation plan that might require me to revisit a quarterly backup drive every three months, along with more frequent (e.g., monthly, weekly) backups, on my way to semiannual or annual immutable backups. This experience with the five-month-old backup seemed to suggest that such rotations would be frequent enough to let my memory contribute to a relatively rapid review of differences between source and backup drives.

Recently, I had initiated a five-day rotation scheme: one partition (on a large drive) for backups on the 5th, 15th, and 25th of the month; three other partitions for the 10th, 20th, and 30th, respectively (supplemented with other drives for cyclical monthly and quarterly backups). At present, a five-day period seemed to be an acceptable compromise between spending too much time computing and comparing file hashes, and not making ransomware-resistant backups often enough to give me reasonable protection and good recollection of my file-altering activities. Once I had brought my backup drives up to date, it would still take me 20-30 minutes to get the hash lists into my spreadsheet, but my actual review of discrepancies typically took only a few minutes: the reasons for the differences tended to be obvious to me.

Implementation

As indicated above, I developed a spreadsheet to compare hash sets and to help me figure out the reasons for differences between those sets. The spreadsheet is available for download at SourceForge. (Note: a later post explains changes in some ideas expressed in this post, and introduces a revised version of the spreadsheet.)

In its present form, the spreadsheet consisted of a half-dozen tabs (i.e., worksheets). I would start at the leftmost tab – importing and sorting values, and revising formulas – and work my way across to the rightmost one. I would freeze the values in a given tab when I had it in shape, so that Excel would not have to recompute that tab’s formulas during every subsequent recalculation.

I added a tab containing step-by-step instructions, designed to reduce the time spent waiting for Excel to sort and calculate. At a certain point, however, I didn’t need those instructions anymore; I knew what I needed to do. At this point, as noted above, I found that I could usually finish the hash comparison in 20 to 30 minutes. Again, though, that would vary according to the speed of the machine and the number of files compared.

While this solution did not facilitate the kind of efficient file change history envisioned above, it did offer the potential for at least a rudimentary file history record. After each file comparison session, the user could zip the spreadsheet and the file and hash lists used during that session, for possible future reference.
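
For instance (a sketch, with hypothetical file names and locations), the archiving step could be a one-liner:

 $stamp = Get-Date -Format 'yyyy-MM-dd'
 Compress-Archive -Path 'D:\Current\HashCompare.xlsx', 'D:\Current\*HashList.txt' -DestinationPath "E:\HashHistory\Compare-$stamp.zip"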

Generally speaking, I was updating backups often enough to remember why files had changed. That was somewhat less true when I updated one of the older cyclical backup drives. The spreadsheet was helpful when I didn’t remember: it allowed me to sort the discrepancies by folder, filename, extension, and to some extent by date, so as to identify sets of many files that had changed at the same time. That, plus occasional review of related materials, would typically help me to confirm that nothing amiss had occurred. I didn’t need to nail down every last file: I assumed that ransomware would tend to encrypt substantial numbers of files. Few users would pay a ransom to recover a small number of encrypted files, unless they were unusually important – and usually those would be older, in which case I would have redundant backups using another of my ransomware-resistant backup systems.

Finding the Union and Intersection of Multiple Sets of Files

I had several sets of files. I wanted to find their union and their intersection. The Math Lab defined the union of two sets as “the set of elements which are in either set,” and the intersection of two sets as “the set of elements which are in both sets.” So if set 1 consisted of A and B, and set 2 consisted of B and C, then the union of sets 1 and 2 would be A, B, and C (typically written without duplication — that is, not A, B, B, C), and the intersection of sets 1 and 2 would be B. In the case of filesets, the union would be the complete set of all files existing in at least one fileset, after eliminating duplicates, while the intersection would be the list of files that existed in all filesets.

Among other things, this post explores tools for calculating hash values for files. A later post provides an updated and more thorough look at some such tools.

Union of Multiple Filesets

To find the union of multiple filesets, I could use a duplicate file detection program. A search led to various lists of recommended duplicate detectors. I had been using DoubleKiller Pro for some years, and was familiar with its options, so that’s what I used in this case. I believed but wasn’t certain that the options described here were present in the free version.

If my duplication detection program found file X in fileset 1 and also in fileset 2, I could use that program to delete the copy of X that existed in fileset 2. For some purposes, I might have to remember that I had thus distorted fileset 2: it might now contain fewer files than it previously contained, but that would not necessarily mean that it was originally inferior to fileset 1. As with file operations generally, it would be a good idea to make backups at the beginning and also intermittently along the way, so as to be able to check back and see how things used to be, prior to one’s intervention.

Where the files had the same contents but not the same names, it would be advisable to examine the list of duplicates before deleting anything. As detailed below, one set could have names that did not accurately summarize file contents, or that would provide little information, while the duplicates in another set could have more accurate or informative names or extensions, or could be arranged in folders that the user would be unwilling or unable to reconstruct afterwards. Some of the techniques discussed below could be useful in deciding which duplicates to keep and which to delete, so as to preserve a union of files that had informative names without duplicates.

Intersection of Multiple Filesets

It seemed easier to obtain the union than the intersection of multiple filesets. It appeared that neither dupeGuru nor DoubleKiller had any built-in intersection capability, nor had I noticed it elsewhere. My searches didn’t lead to an easy technique. Evidently I would have to come up with my own way of doing that. The method I used was to produce a list of files, and then use a spreadsheet to examine the contents of that list. This section is mostly focused on the technicalities of the spreadsheet I used.

Producing the list was the easy part. There were at least two ways to do so. One way, detailed in a prior post, was to run a Windows DIR command, and capture its output. That method would be most useful where duplicate files were known to have the same filenames. Another way was to run a DoubleKiller comparison, and export its results. DoubleKiller had the advantage of being able to conduct a CRC32 and/or byte-for-byte comparison among files, so as to identify exact duplicates among those whose names were not identical, and to disregard those with identical names and reported filesizes that were actually not identical. If I wanted to achieve something like that on the command line — that is, if I wanted to devise a command that would produce a list of files with something like the CRC value for each file — my command would apparently have to use a tool capable of calculating those values and including them in the file list. For instance, it appeared that NirSoft’s HashMyFiles (below) might have that capability.

I used the DoubleKiller approach. Specifically, as I learned the hard way, I would want to run a DoubleKiller search that would identify exact duplicates among my filesets, not only byte-for-byte, but also using their CRC32 values. Obtaining both would ensure that I was seeing exact duplicates and would also produce CRC values in the output, assisting (as we are about to see) in making sure of which files were duplicates of which others.

Then I proceeded to examine the resulting list of files. Specifically, when the DoubleKiller search finished, I right-clicked on one of the listed files and chose Export all entries. That produced a text file listing the search results. I opened that text file in Excel 2016. The Text Import Wizard came up automatically. I indicated that the file was Delimited with tabs. Although the resulting spreadsheet was probably already sorted with duplicates next to each other, I sorted it on the CRC values, just to be sure. I added a column to count the number of duplicates in each set: basically, if the CRC from the previous row is the same as the CRC from this row, add 1 to the number immediately above; if not, enter 1 (i.e., restart the count). I used Quick Filter to verify that 3 was the maximum number of members in any set. I added another column to detect the sets with 3 members: if the value to the left is 3, put an X; if the value to the left and one row below is a 1, put a 1, otherwise copy the value of the cell below. I fixed those values so they wouldn’t change (i.e., select the column > Alt-E C > Alt-E S V > Enter > Enter). I sorted by that column and deleted all rows lacking an X. That gave me 1926 duplicates (presumably, 1926 / 3 = 642 sets). I could then use similar procedures to identify, say, the first member of each set of duplicates, and mark the second and third members for deletion, giving me a list of (in this case, 642) unique files. Each of these unique files was duplicated (though perhaps not with the same name, folder, or date) in each of my source filesets.
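
As an aside, a rough PowerShell sketch of the same intersection logic, bypassing DoubleKiller and the spreadsheet (fileset roots hypothetical; using SHA-256 rather than CRC32):

 $roots = 'E:\Set1', 'E:\Set2', 'E:\Set3'
 $all = foreach ($root in $roots) {
     Get-ChildItem -LiteralPath $root -Recurse -File |
         Get-FileHash -Algorithm SHA256 |
         Select-Object Hash, Path, @{ n = 'Set'; e = { $root } }
 }
 # a hash is in the intersection if it appears in every fileset at least once
 $intersection = $all | Group-Object Hash | Where-Object { ($_.Group.Set | Sort-Object -Unique).Count -eq $roots.Count }
 '{0} files are common to all {1} filesets' -f $intersection.Count, $roots.Count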

The remaining step would be to produce a set of commands that would extract these 642 unique files from the source filesets. But before doing that, I wanted to make one modification to the procedure just described. I had noticed that, within some sets of duplicates, some files had better names than others. For example, in one set of three identical files (i.e., one from each of my source datasets), I saw that two of the files were named FLARE4.BMP, while the third was named f1578175.bmp. It was possible that this .bmp image file really did display a flare, in which case the FLARE4.BMP name would be informative. But there was no obvious meaning to the f1578175.bmp name. Indeed, I was pursuing this topic as part of a larger project of recovering files from a trashed hard drive, and along the way I had noticed that the PhotoRec file recovery tool had recovered many files with meaningless names, beginning with f, that resembled this f1578175.bmp filename. So, if possible, I would prefer that my resulting list of files — the list representing the intersection of these several filesets — would feature informative names, like FLARE4.BMP, rather than generic names like f1578175.bmp.

To identify the best filenames in each set, I started by sorting the intersection list (of 1926 files, in this case) and looking for big groups of undesirable (e.g., uninformative) filenames, like f1578175.bmp or LostFile_dll_1383439.dll (among many LostFile entries created by the Lazesoft file recovery program). I created an Inferior Name column and put an X next to each of those. I added a column identifying each file’s extension (i.e., =RIGHT([previous column], 4)), sorted on that column, and added an Inferior Extension column, with an X for dubious extensions. In this particular project, I had noticed that .rbf was one such extension, accompanying many of the files retrieved by some file recovery tools. It was also possible to mark files for deletion according to whether they were found in undesirable folders. For example, a file called mscorier.dll was found in a subfolder named Unknown folder\1031, and also in one named Windows\system32. Guess which I found more informative. There could be other reasons to prefer one copy of a file over another.

When I had finished marking some relatively undesirable files, I resorted by CRC and added a column to mark, with an X, those instances where the present and the two preceding rows all had the same CRC and also all had an X in the Inferior Name column, to make sure I wasn’t about to wipe out an entire set of duplicates. I ran a Quick Filter and saw that, indeed, I did have some entries in that column, indicating that all of the recovered duplicates with that CRC had inferior names. I unmarked one file in each such set, so that one duplicate would survive. Then I deleted, from the Excel list, all the nonessential duplicates with inferior names.

That left me with a list of preferred duplicates. For those, I could use the spreadsheet to construct Windows commands that would copy or move the designated files to some other location. As detailed in another post, the Excel formula would use something like this:

="xcopy /s /h /i /-y "&CHAR(34)&E2&A2&CHAR(34)&" "&CHAR(34)&"D"&MID(E2,2,LEN(E2))&CHAR(34)&" > errors.txt"

In other words, for each desired file, the spreadsheet would construct a command that would say: copy the file (whose full pathname consisted of the path in column E plus the filename in column A) to a folder with the same name on drive T. Presumably drive T would not already have such a folder: the command would create it. The user could specify a subfolder after T if desired. The command could also be altered to steer all files into a single folder, though that might not work too well unless the user first sorted and checked the spreadsheet to verify that there were not multiple files with the same name. The purpose of CHAR(34) was to insert a quotation mark, to surround file or path names that might contain spaces. The actual quotation marks shown in the formula (i.e., not CHAR(34)) denoted text portions of this command. The >> and 2>&1 at the end made each command append its output and any error messages to errors.txt, rather than overwrite that file, so that errors.txt could be checked afterwards to make sure there were no problems.

A formula like that, copied onto each row listing a desired file, would produce a list of commands that could be run from a batch file. Each such command would instruct Windows to copy that particular file. To create a batch file to run those commands, I would simply have to copy the list of commands (as displayed in Excel) to a Notepad file, save it with a .bat extension (e.g., copylist.bat), and run it by selecting and hitting Enter (or by double-clicking) in Windows File Manager. After running that batch file, the target folder(s) should contain copies of the source files, constituting the intersection of the source filesets.

Finding a Batch Hashing Solution

As indicated earlier, it seemed that it might be helpful to use a third-party tool like NirSoft’s HashMyFiles to produce a directory listing that included the hash value for each file. With such a listing, I could use Excel to identify and examine features of duplicative files, potentially informing a decision on which files to save, move, or delete. At a later point, I returned to this post and added this section, with information on my use of such tools.

I had expected to run a command-line version of HashMyFiles, but for some reason I wasn’t finding one. So I started by downloading and running the 64-bit GUI version. I clicked on its folder icon and navigated to the folder containing the filesets I wanted to compare. They were large, and it looked like it was going to take a long time to process these files. I let it run, just to see what I’d get, but finally aborted after 36 hours. It seemed that HashMyFiles was calculating each kind of checksum it could, whereas I might need only one. The available options (in HashMyFiles > Options > Hash Types) were MD5, SHA1, CRC32, SHA-256, SHA-384, and SHA-512.

A Stack Overflow discussion offered a number of perspectives on those options. There appeared to be a difference between file verification and security. The latter, taking greater pains to verify that a file had not been deliberately altered, would apparently call for at least SHA-256 (e.g., How-To Geek, 2018). But for purposes of simple file verification, CRC32 was common and apparently adequate. So I chose CRC32. Therefore, in HashMyFiles, I went to View > Choose Columns and chose only these columns: Filename, CRC32, Full Path, Modified Time, File Size, Identical, and Extension. (The Modified Time column seemed to match the date and time shown in Windows File Manager.) I would soon learn that HashMyFiles did not remember these settings; I had to reset them each time I used the program. Then I re-ran the HashMyFiles calculation. Unfortunately, it seemed that HashMyFiles was still determined to calculate all possible values, even if it wasn’t displaying them. After another 12 hours, I canceled it again, and looked elsewhere.

A brief check suggested that the HashMyFiles values were probably accurate. Specifically, I ran HashMyFiles on a set of 79,113 files, and accepted its option of .csv output. That gave me a nice-looking spreadsheet showing the columns I had selected, including CRC32 values in hexadecimal. For some reason, a few files were left out: the spreadsheet listed only 79,102 files, and provided CRC32 values for 79,101. Using Excel’s VLOOKUP, I compared CRCs for files (sorted by full pathname) against the output of a tool called ExactFile, whose webpage said it was still in beta. ExactFile calculated CRC32 values very quickly — within a minute or two — but produced a report (in SFV format) listing only 54,611 files in that set, even though I did specify “Include subfolders” in its Create Digest tab. Of those, almost all (54,577) appeared in the HashMyFiles list, and most (52,943) had the same CRC32 values. The reasons for the discrepancies were not immediately obvious.
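For anyone wanting to replicate that comparison, a minimal sketch of the sort of VLOOKUP involved might look like this, assuming (hypothetically) that the HashMyFiles export sat on a worksheet named HashMyFiles with full pathnames in column A and CRC32 values in column B, and that the ExactFile pathnames were in column A of the current sheet:

=VLOOKUP(A2, HashMyFiles!A:B, 2, FALSE)

Placing that formula alongside the ExactFile CRC32 for the same row, something like =IF(B2=C2,"match","differs") would then flag any rows where the two tools disagreed.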

Raymond (2017) listed other hashing tools that I could try. At the top of his list: the portable IgorWare Hasher (4.9 stars from 15 raters on Softpedia). I tried version 1.7.1 (x64). Unfortunately, it did not seem able to process more than one file at a time. That was also the case with Hash Generator 7.0 (3.3 stars from 21 raters on Softpedia; no explanatory comments from users) and, apparently, with the other tools on Raymond’s list. Gizmo (2017) recommended Hashing (4.5 stars from 10 raters on Softpedia), but screenshots suggested that it was going to output its results across multiple lines, resulting in considerable hassle to move them all to a single line (below). It also did not offer a CRC32 option.

I tried running Checksum Calculator (4.0 stars from only 3 raters on Softpedia) on a fileset that I called WINXPFILES. It began loading files as soon as I selected the folder, at a rate of perhaps 50 per second at first. But by the time a few hours had passed, it had slowed down to more like five per second. Then three. The speed didn't seem to be related to file size: even really small files loaded just as slowly. It seemed that the large number of files might be clogging the calculator's memory. It took maybe 3.5 hours to load 79,113 files (i.e., ~22K files/hour) totaling 20.1GB (i.e., roughly 5.7GB/hour, averaging ~250KB/file). Let us be clear: this was not the CRC32 calculation. This was just the file loading process. When Checksum Calculator was ready, its buttons stopped being greyed out, and I was able to select CRC32 and click Calculate. I was surprised to see that it calculated CRC32 values for all those files in about 15 minutes. I clicked Save Results. It saved a comma-delimited text file that I could open in Notepad or Excel, containing each file's path and name and the calculated value. Unfortunately, the supposed CRC32 values did not look like the usual CRC32 values: they contained only numerals, no letters, and ran nine to ten digits long. Later, I realized that this was probably what HashMyFiles would have produced if I had chosen its Options > CRC32 Display Mode > Decimal (rather than Hexadecimal) option: the same 32-bit value, just shown in decimal rather than as eight hexadecimal characters.
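In other words, those decimal values were not necessarily wrong: the largest possible CRC32 value (4,294,967,295) is ten digits long in decimal. If I had needed to compare Checksum Calculator's output against a tool reporting hexadecimal CRC32s, a quick fix in Excel would have been a conversion along these lines (assuming, hypothetically, that the decimal value sat in cell B2):

=DEC2HEX(B2,8)

The second argument of 8 pads the result to the usual eight hexadecimal characters.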

I had another, much larger fileset that I also wanted to get CRC32 values for. I wondered if I would get faster and better results by using a command that would write its CRC32 calculations directly to a file. For that, a SuperUser discussion suggested using the certutil command:

certutil -hashfile "D:\Path Name\File to Hash.ext" MD5 >> output.txt

which offered options such as MD5 and SHA256, but not CRC32. I would have to use something like my Excel approach (above) to generate a specific command for each file that I wanted to hash, and would then have to include all those commands in a batch file. (I would shortly be getting into the details of that.) The situation was similar for a PowerShell command that, again, could not calculate CRC32 values:

Get-FileHash -Path "D:\Path Name\File to Hash.ext" -Algorithm MD5 >> output.txt

The output.txt files, in those examples, would contain unwanted additional information that could be filtered, at least to some extent, by a more refined command or by opening output.txt in Excel and sorting to identify and remove those lines. That appeared to be less of an issue in Ashley’s CRC32.exe utility, which I hesitated to use merely because it was relatively untested: it looked like it would produce the desired CRC32 output with minimal extraneous information. When I did test it (below), it didn’t work.
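As an example of the sort of more refined command that could reduce that extra information, PowerShell could be coaxed into producing a cleaner, spreadsheet-ready listing without any per-file batch commands. The following one-liner is only a sketch (the top-level folder name is hypothetical), but it illustrates the idea: recurse through a folder, hash each file with MD5, keep just the hash and path, and write the result to a CSV file:

Get-ChildItem -Path "D:\Top Folder" -Recurse -File | Get-FileHash -Algorithm MD5 | Select-Object Hash, Path | Export-Csv -Path output.csv -NoTypeInformation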

I would be able to avoid the need to generate file-by-file commands by using the generally reliable SysInternals — specifically, Russinovich’s Sigcheck, from which I got nice output for all files in a directory (including its subdirectories) in an Excel-friendly .csv file using this:

sigcheck64 -ct -h -s -w output.csv "D:\[top directory name]"

but it was slow, because it apparently could not be persuaded not to calculate MD5, SHA1, SHA256, PESHA1, and PESHA256 values for each file, and it did not offer a CRC32 option. I put in a feature request on that. It also gave no progress indication, so the user would have to remember to leave its command window alone for hours, if not days, even though it appeared that nothing was happening.

There were many other checksum utilities, but few seemed to have the kind of multi-file capability I was seeking. So far, for purposes of rapidly calculating checksums on tens or hundreds of thousands of files, it seemed my best solution would be to give up on CRC32 and go with MD5 instead. In that case, I could start with Hashing, if I really needed a GUI tool, but it seemed that the output would be easier to work with if I chose the certutil approach instead.

In that case, the specific procedure was as follows. In the top level of another directory tree containing the files I wanted checksums for, I ran dir /a-d /s /b > dirlist.txt. In this case, the dirlist.txt file contained information on more than a million files, so trying to open it in Excel produced a "File not loaded completely" error. There was a way to use Excel with millions of data records, but I was not confident that would work for my purposes. I chose instead to split dirlist.txt in two. Among tools for this purpose discussed at SuperUser, I chose GSplit (4.0 stars from 17 raters at Softpedia). In GSplit, I specified dirlist.txt as the file to split, and then went into left sidebar > Pieces > Type and Size > Blocked Pieces > Blocked Piece Properties > drop-down arrow > I want to split after the nth occurrence of a specified pattern > Split after the occurrence number > 1048570 (i.e., just short of the 1,048,576 rows Excel could handle) > of this pattern: 0x0D0x0A (i.e., the default, representing the carriage return and line feed that end each Windows text line, so that I would not be splitting the file in the middle of a line) > Split button (at top). That gave me disk1.EXE, disk1.gsd, and disk2.gsd. I renamed the last two to be Dirlist1.txt and Dirlist2.txt. (I had to delete one or two junk lines added by GSplit at the end of each of those two files.) I opened those two files in Excel, one at a time. Doing so opened the Text Import Wizard, to which I responded by selecting Delimited > Next > verify that Tab is selected > Finish. There were no tabs in dirlist.txt, so this put the full path information for each file in a single cell in column A. In column B, I entered an Excel formula, following the approach described above, to generate the certutil command (above) for each file, like so:

="certutil -hashfile "&CHAR(34)&A1&CHAR(34)&" MD5 | find /i /v "&CHAR(34)&"certutil"&CHAR(34)&" >> output.txt"

The new material at the end of that command, involving “find” and another reference to “certutil,” just said that I wanted to exclude, from the output, the line in which certutil would close each MD5 calculation by saying, “CertUtil: -hashfile command completed successfully.” I copied that formula down to each row in the Excel versions of Dirlist1.txt and Dirlist2.txt. Each occurrence of that formula produced a command that could run in a batch file. I copied the full contents of column B from each of those two spreadsheets into a single file (using Notepad, though some other plain text editor would be necessary for files above a certain size) called MD5Calc.bat. I saved that file. I deleted the output.txt file produced by my testing, so that the output.txt file produced by my commands wouldn’t contain that scrap information. In Windows File Manager, I right-clicked on MD5Calc.bat > Run as administrator (or I could have run it in an elevated command window). That seemed to run at a consistent pace that I estimated at ~500 files per minute. It ran for 30 hours, and I could see what it was doing the whole time, as it processed one file after another. Unfortunately, that visibility did not help me detect that, somehow, either I or one of the processes just described had gotten the commands wrong, so that my list of files being processed was out of order. After 30 hours of doing it wrong, however, I did begin to catch on. I was not confident that the result was going to be correct, so I canceled it.
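To illustrate what each line of MD5Calc.bat looked like: if cell A1 held a path like D:\Docs\Report.docx (a hypothetical example), the formula above would evaluate to the following command, which appends the remaining certutil output for that file (the line identifying the file and the line containing its MD5 hash) to output.txt:

certutil -hashfile "D:\Docs\Report.docx" MD5 | find /i /v "certutil" >> output.txt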

Rather than restart what would evidently be a 60-hour process, I decided to try Ashley’s CRC32.exe utility. As I explored that approach, however, I encountered some complications. Among other things, the CRC32.exe utility would put its output on two lines. This was not a problem with the smaller file set; but with the larger set I was trying to resolve, this would double its already very long file listing (i.e., 1.5 million files) — which, as demonstrated above, would make the Excel approach even more unwieldy. Seeking a solution on that front, I posted a question on SuperUser.

Pending a good answer to that question, I tried another search for CRC32 tools. This search seemed more successful. Several additional tools came to light, each seemingly capable of batch CRC32 processing. I started with Fsum Frontend (4.5 stars from 2 raters on Softpedia), which, as its name suggests, was a front end (i.e., GUI) for Fsum. I downloaded and ran the portable version (4.0 stars from 3 raters on Softpedia). Its ReadMe file said it contained code from Jacksum, another possibility that had emerged in my search, as well as from BouncyCastle and FlexiProvider. To use it, I went to its Tools > Generate checksum file > navigate to and select the folder containing the files to be checked > check the Recurse Subdirectories box > Format = SFV 512 > Mode: 1 file in any place > Save As > navigate to the desired output folder and enter output.txt. To run it, I clicked the icon next to the Mode box. That seemed to run for a few seconds, but then it produced an error, naming one of the files and saying, "The system cannot find the path specified." That seemed to kill the process: clicking OK left me back where I started. There did not seem to be an option to let it skip that problematic file, and I wasn't about to go digging manually through tens of thousands of files to fix any that this program might not like. It seemed wiser to try another tool.

The context menu (i.e., right-click) option installed in Windows File Manager by HashCheck Shell Extension (HSE) (4.4 stars from 46 raters on Softpedia) was simply to "Create Checksum File." When I went ahead with that option, it gave me an opportunity to choose a name and location for the checksum file, as well as a choice of filetype. I was interested in the CRC-32 option, which would create an .sfv file; this option would be much faster, and would append a much shorter checksum to each filename, than some of the other options. Then it said, "Creating [the specified filename]." It worked through the set of 79,113 files in about four minutes. As above, I imported that .sfv file into Excel and compared its results against those produced by HashMyFiles. With a few dozen exceptions, due primarily to Excel's apparent inability to look up filenames containing a tilde (~), and secondarily to a handful of files reported as UNREADABLE, HSE produced the same file listing and CRC32 values as HashMyFiles. It appeared that HSE could do the job.
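One note on the tilde problem: Excel's VLOOKUP treats the tilde as an escape character for its ? and * wildcards, so a lookup value containing a literal tilde can fail to match. The usual workaround, sketched here with hypothetical sheet and column references, is to double any tildes in the lookup value before looking it up:

=VLOOKUP(SUBSTITUTE(A2,"~","~~"), HashMyFiles!A:B, 2, FALSE)

That adjustment would likely have eliminated most of the tilde-related exceptions just mentioned.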