Summary
This post addresses a problem posed by the most commonly used type of ransomware today: it gradually encrypts files but decrypts them on the fly, making them available whenever the user wants them, so that the user doesn’t realize they have been encrypted.
The solution offered here involves comparing the current SHA-512 hash values of data files against the hash values of previously stored copies, such as those on a backup drive. The latter hashes can be calculated on another machine, so that if the working data drive is corrupted, that corruption will not spread to the backup.
The final section of this post links to the downloadable spreadsheet that I have been using to compare hash values. A more widely usable solution might involve PowerShell, which is reportedly able to detect file changes and calculate hashes in real time. A PowerShell script could ideally compare changes against a hash list saved (and perhaps password-protected, in the cloud) at the time of the most recent backup, and could notify the user of any excessive number of file changes.
I posted questions about the solution suggested in this post. Answers to those questions (in the Stack Exchange forums titled Cryptography and Information Security) seemed to confirm that the solution developed here, involving comparisons on two different operating systems, should work as long as the file-encryption ransomware is not running in a synchronized manner on both Linux and Windows platforms. A subsequent post refines this one on certain points.
Contents
The Problem
Hash Could Be the Answer
GUI Solutions
Command-Line Interface (CLI) Tools
Source Drive Hash Comparisons
Comparing Against Old Backups
Implementation
The Problem
As discussed in another post, I was developing methods of improving my system’s ransomware resistance. As part of that effort, I wanted a way to compare the contents of two drives. One might be the source drive and the other its backup, or perhaps they would both be backup drives, made at different times. The idea was that, if ransomware was encrypting the files on the data drive on my Windows computer (where I did most of my work), these comparisons would tend to make that corruption evident.
Rather than trust automated backup software, I had developed a cautious arrangement that I found useful for checking my backups. This approach tended to make me aware when I might have unknowingly deleted or relocated files, whereas automated backup software would happily relocate or delete those files on the backup as well. This cautious arrangement used Beyond Compare (BC); no doubt other drive comparison tools would have worked too.
But I didn’t want to use that drive comparison approach in this situation. If the Windows computer was infected with ransomware, it would not be wise to connect the backup drive to it. (The assumption, here, is that I would be dealing with file-encryption ransomware, which presently was the dominant type. This form of ransomware would quietly encrypt files, but would then decrypt them on the fly, whenever a program attempted to access them. That way, the user would not notice anything amiss: files would continue to be available as usual, until the ransomware suddenly flipped the switch, locking most if not all files at once.)
BC’s more exacting (especially its byte-for-byte) comparisons would notice differences between source and backup copies of a file, even if the ransomware was smart enough to reset filesize and timestamp (and possibly even CRC32) to their pre-encryption values. But it could take many hours to run those more exacting comparisons in BC. During those hours, the ransomware could be encrypting files on the backup as well – conceivably the same files as on the source drive, so as to minimize noticeable differences between the two drives.
Hash Could Be the Answer
Compared to my usual BC-based backup arrangement, it seemed that it could be safer and more thorough to conduct separate file verification procedures on two different computers. My source data drive was in my primary Windows desktop machine; I could connect a backup drive to another computer. For possible cross-platform ransomware resistance, I might even boot the secondary computer in Linux.
(As discussed in more detail in the other post and in posts that it links to, malware would be less likely to travel from the Windows machine to the Linux machine if my Linux installation wasn’t using anything like WINE, and if the malware wasn’t one of the relatively infrequent cross-platform types. Ransomware would also find it difficult to relay information to and receive instructions from its home server if, as envisioned, the Linux computer was disabled from going online.)
Wikipedia defined “file verification” as “the process of using an algorithm for verifying the integrity of a computer file.” The verification process would calculate a value, based on the contents of the file. If the calculation was sufficiently stringent, virtually any change to the contents of the file would change that value. So if ransomware encrypted a file on my source drive, recalculation of the algorithm would result in a new value.
To detect that the source file had changed, I wouldn’t have to compare it directly to its backup, and risk infecting the backup as well. Instead, I could just compare the values computed by the algorithm. The backup drive could remain air-gapped from the infected Windows machine, especially if I took sufficient precautions to prevent the malware from hitching a ride on the USB drive that I used to convey the algorithm’s results from one computer to the other. (One possible psychological benefit of using Linux on the backup computer could be that its mere presence – that is, the use of Linux – would remind me that the secondary machine was supposed to remain separated.)
In my situation, it may have been helpful that I often ran DoubleKiller to verify that the source drive did not have any exact duplicate files. If it did, I would usually zip or alter one, so as to eliminate the duplication. But perhaps exact duplicates would not be a problem. Regardless of the number of copies of a file, I would want to be notified of unexpected changes.
For present purposes, some file verification tools would be better than others. It seemed, in particular, that I would want to use a cryptographic hash function rather than a simpler checksum. Wikipedia said that relatively simple checksums (e.g., CRC32) “are not suitable for protecting against intentional alteration of data” (by e.g., ransomware) – because, according to a StackExchange answer, “[I]t is pretty easy to create a file with a particular checksum.” That is, conceivably ransomware could encrypt a file in such a way that its CRC32 checksum would remain unchanged, even though the contents of the file were now very different. By contrast, that answer said, “A cryptographic hash function … is secure against malicious changes. It is pretty hard to create a file with a specific cryptographic hash.”
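To make that distinction concrete: changing even one byte of a file yields a completely different SHA-512 digest. A minimal illustration, using Python’s standard hashlib module:

```python
import hashlib

# Two inputs differing by a single byte
original = b"Quarterly report, final version."
tampered = b"Quarterly report, final version!"

h1 = hashlib.sha512(original).hexdigest()
h2 = hashlib.sha512(tampered).hexdigest()

# Each digest is always 128 hex characters, and the two digests here
# bear no meaningful resemblance, so the alteration is easy to detect.
print(h1 == h2)  # False
```

Producing a *different* file with the *same* SHA-512 digest is what cryptographers consider computationally infeasible, which is what makes this family of functions suitable against intentional tampering.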
Of course, not all hashes were created equal. This was, after all, an attempt to represent an entire file’s contents with a relatively short string of letters and numbers. As just indicated, the CRC32 algorithm was no longer considered secure; likewise the MD5. Instead, we had the Secure Hash Algorithms (SHA) family. Wikipedia explained that this family consisted of several sets of cryptographic hash functions: SHA-0 (1993), SHA-1 (1995), SHA-2 (2001), and SHA-3 (2012), each providing greater security than its predecessor. Each of these sets offered a choice among hash value sizes. In SHA-2, those sizes ranged from 224 to 512 bits. (A Stack Exchange discussion supported my impression that, for example, 256-bit SHA-2 could have been written as SHA2-256, anticipating the later SHA3-256. But that wasn’t standard usage. Instead, it was as though SHA-2 would never be superseded: everyone wrote 256-bit SHA2 as simply SHA-256.)
Unfortunately, each improved level of security would require longer to compute. For some purposes, more advanced hardware could help with that. Various sources (e.g., Cryptopedia, 2021; BitcoinWiki, n.d.; Medium, 2019) explained that cryptographic hashing rates had increased by orders of magnitude with the introduction of successive generations of specialized calculating chips, from the computer’s general-purpose central processing unit (CPU) to graphics processing units (GPUs) to field-programmable gate arrays (FPGAs) to application-specific integrated circuits (ASICs) to cloud computing services (e.g., Elcomsoft) that would rent you as many fast calculators as you could afford. Not that computation speed was the only factor: there were indications (e.g., FreeCodeCamp, 2020; SysTutorials, 2015) that, on a local drive, data transfer speed could be the primary bottleneck.
For my purposes, advanced hardware might not be helpful. I wasn’t sure I wanted to give ransomware access to faster hardware. To the contrary, I wanted its encryption process to take as long as possible, to extend the timeframe in which I might become aware that something was happening to my files. There was also a budget factor. I wasn’t inclined to set up an expensive secondary computer to calculate hashes. At least to start, I was planning to use my old (2012) Lenovo E430 laptop as my Linux machine, producing sets of hash values for my backup drives.
For consumers without GPUs, whose 1TB and larger data drives were increasingly likely to fill up with video and other space-consuming materials, there could be hours of difference between using a relatively fast checksum and a relatively slow cryptographic hash to check for file changes. Until I felt like investing in faster hardware, there would be limits on my willingness to drag things out in hopes of giving the bad guys a hard time. At some point, I would like to actually complete the hash-checking project, so as to continue with my backup process, and then maybe have some time left over for a life. In that spirit, it seemed I should aim for the least demanding cryptographic hash that would still be relatively secure.
In my reading, SHA-256 seemed to be a reasonable compromise: secure, but not taking too long to compute. But then I discovered a wrinkle. As it turned out, SHA-256 was significantly faster to calculate than SHA-512 – but only on late-model CPUs supporting the Intel SHA extensions, and the CPUs in my computers weren’t among those specific recent models. Test runs using MD5 & SHA Checksum Utility on both the desktop and the Lenovo laptop illustrated what participants in a Cryptography discussion were saying: on machines without those extensions, SHA-512 file hashing could actually be much faster than SHA-256. At least until I upgraded my hardware, it seemed that SHA-512 was the hash value I would want to compute and compare.
GUI Solutions
Since I wasn’t seeing any tools designed specifically to compare file hashes from drives on two different computers, I figured I would use a spreadsheet. One tab of the spreadsheet would list the filenames and their hash values from the first computer (e.g., the source drive in my Windows 10 desktop), and the other tab of the spreadsheet would list the filenames and hashes from the second computer (e.g., a backup drive connected to the Lenovo laptop). Using spreadsheet techniques discussed in another post, it wouldn’t be hard to divide the path, filename, and hash values in each tab, and then sort the lists and use lookups to identify inconsistencies between the two lists. The Implementation section (below) provides more detail on the spreadsheet I developed in this case.
With that plan in mind, what I needed was a tool that would produce a single line of output for each compared file. That line would contain nothing other than the complete path and file name of the file being compared, and its corresponding SHA-512 value. (Spreadsheeting would have been easier if I had found a tool that would also produce the file’s timestamp on that same line.)
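The sort-and-lookup comparison that the spreadsheet performs can also be scripted. The following is a minimal sketch (Python, standard library only). It assumes the hash lists use sha512sum-style lines – a hex digest, two spaces, then the path – so any tool with a different output format would need a different parser.

```python
def load_hashes(list_path):
    """Parse a sha512sum-style hash list: '<hex digest>  <file path>' per line."""
    table = {}
    with open(list_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, _, name = line.partition("  ")
            table[name] = digest.lower()
    return table

def compare_hash_lists(source_list, backup_list):
    """Return (changed, only_in_source, only_in_backup) filename lists."""
    src = load_hashes(source_list)
    bak = load_hashes(backup_list)
    changed = sorted(n for n in src.keys() & bak.keys() if src[n] != bak[n])
    only_src = sorted(src.keys() - bak.keys())
    only_bak = sorted(bak.keys() - src.keys())
    return changed, only_src, only_bak
```

A long list of “changed” files, where no such changes were expected, would be the warning sign contemplated in this post.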
There seemed to be quite a few GUI hash calculators (e.g., Raymond.cc, 2019; Next Of Windows, 2017). Most seemed oriented toward providing multiple hashes (e.g., SHA1, MD5, CRC32) for a single file, or for a folder as a single unit (i.e., not for each file within a folder). Those that would calculate hashes for each file within a folder individually seemed to focus on providing mostly weaker (e.g., SHA1, MD5, CRC32) hashes. For example, Nirsoft’s HashMyFiles was user-friendly, and provided a lot of useful information; but it did not calculate SHA512, nor did its developer seem interested in my suggestion that he should develop a command-line version.
HashTools and MultiHasher appeared to be exceptions, capable of calculating SHA512 hashes for multiple files at once. Between the two, MultiHasher seemed more popular at Softpedia (4.7 stars from 41 raters). I personally found its display cluttered and unintuitive, and its output undesirable for my purposes: filename plus hash were not provided on one line. HashTools was much less cluttered, but unfortunately still lacking in basic instructions. Trial and error left me with the impression that you click Add Folder and wait for it to populate the list of files; that you do not check “Automatically save hash files with each file checked,” because apparently that would make the program write something onto, or adjacent to, the files being hashed; and that you then click your preferred hash button (in my case, SHA512).
I ran HashTools’s SHA512 hash on a Test Folder containing 1,994 files (total folder size: 9.1GB; average file size: 4.9MB; median file size: 27KB). HashTools had the nice feature of showing its progress. It did calculate a SHA512 value for every file in the folder. For some reason, it seemed incapable of copying both filenames and hashes to the clipboard simultaneously. I had to copy first one, then the other, yielding the risk that what I pasted to the spreadsheet would not match up. HashTools finished analyzing that folder in 7:19 (m:ss). That was less than half as fast as MultiHasher (3:12). The difference was so great that I would have to re-run that comparison to be sure. But I didn’t like HashTools enough to bother. These results were sufficient to remind me that a GUI was inevitably a front end for programming decisions that could vary from one product to the next.
As another GUI possibility, I ran QuickHash 3.3 on that Test Folder. Its interface was superior to HashTools and MultiHasher, as were its options and progress report (at least potentially – see below). In a brief test, its speed seemed to be only slightly below that of PowerShell (below). There were, however, some problems. For example:
- I ran into difficulty getting QuickHash to export in the desired format. In response to my emails, the developer kept telling me to check the User’s Manual. He oddly insisted that the manual was easy to find – even though there seemed to be no links to it anywhere on his website or in the program itself (which also had no Help feature). He felt it was logical that users would find it in the Downloads area – even though the drop-down menu for that section of the website did not list it. He said it was also included “inside the zip of the download for ultimate convenience” – but, obviously, there was no zip in the Linux .deb that I installed. He said it was an excellent manual, and that I should be able to find answers readily in it – even though it had no table of contents or index. He felt that users should just search for relevant terms, and then examine each occurrence of such terms until they found the right one.
- I belabor those communications to convey an impression of stubbornness that could impair quality. That is, the belief that your product is perfect may keep you from looking for its possible flaws. This was definitely not the kind of correspondence I had with developers who were working hard to ensure that their software got it right every time. Indeed, in my first use of QuickHash, in the very brief test of the Test Folder mentioned above, I found that the output misstated some filenames: in the actual folder, those names contained double spaces, but in the QuickHash output, those had been changed to single spaces. This meant that the program’s output would not match up with the actual filenames. The developer was not aware of this bug. If I could find it this easily, I had to wonder whether he was carefully testing to identify possible imperfections in his homemade hashing script, the handling of filenames, or other aspects of QuickHash.
- I ran QuickHash for five hours, to obtain hashes on every file on an entire hard disk drive (HDD). When I sought to save the results, the program crashed. I ran it again; it crashed again. The developer said that the data presented in the QuickHash output grid was also held in an SQLite file. The user manual added, however, that the SQLite file was deleted when QuickHash was closed. I did not find the SQLite file on either Linux or Windows. The developer did not seem to have a solution for my impression that the program had a hard time with large numbers of files. At present, this behavior rendered QuickHash useless for my purposes.
- In that five-hour run, the QuickHash progress report signaled rapid progress at first, but then slowed sharply. It seemed to be counting progress in terms of numbers of files rather than quantity of data. So it reported 80% completion at the two-hour mark, and then kept grinding away at that remaining 20% for the next three hours. Maybe the folders it tackled first happened to have many small files. Whatever the reason, the progress report would have been more accurate if it had been based on quantity of data rather than numbers of files. I mention this glitch, among others left unmentioned, because it now seemed that the progress report could be, but wasn’t yet, one of the program’s strong points.
I felt that QuickHash was a worthy program, and I recognized that the developer had done a lot of work on it over a period of years. I did not mean to bring difficulty into his life – but I did get the feeling that I was bothering him, when I presented seemingly reasonable issues arising from real-world attempts to use his software. I was disappointed, because I did prefer its interface over the other GUI tools that I tried.
Command-Line Interface (CLI) Tools
While QuickHash seemed useful for calculating hashes of individual files and folders, my experience (above) suggested that I might be wise to stick with more established tools for full-drive hashing.
There appeared to be a number of relatively solid CLI options. In Linux, it seemed that rhash or possibly sha512sum might be adequate. For rhash, the command could be something like
rhash --sha512 --uppercase -r /path/to/folder/filename.txt | pv > hashlist.txt
That formulation drew on the use of pv in another post. It wasn’t as informative here, but at least it stated the size of the output file and the elapsed time, either of which would be meaningful if the user already had a sense of how long the full job should take and/or how large its output file would tend to be. If I wanted to specify all files in a folder, rather than just one file named filename.txt, I would simply omit filename.txt from that command. If I meant the rhash command to refer to the current folder, I could replace the entire input path and filename with just a dot. (Note the irregular behavior I obtained with rhash in one situation. Note also that rhash might appear to hang when hashing large files. Be patient.)
There were also quite a few CLI hashing tools in Windows, most of which gave at least an initial impression of being well-developed:
- A possibility I did not explore: run rhash in a Linux virtual machine, to hash files on a physical drive. I assumed this would be slower.
- I tried Windows’ built-in certutil -hashfile "D:\Path\to\filename.ext" sha512. For recursively hashing all files in a specified folder plus its subfolders, answers at SuperUser pointed toward something like for /r "D:\Current" %F in (*) do certutil -hashfile "%F" SHA512 | find /i /v "certutil:" >> "D:\Current\CertUtilOutput.txt". I revised that command into a batch file that I posted as an answer at Stack Overflow. At least its optional find component would eliminate one line of clutter from the output. That command did produce results for all files in the folder and its subfolders, but it seemed to run much more slowly than PowerShell (below). I didn’t like that this certutil command didn’t provide any kind of progress report; it looked like adding a progress bar to a batch file could be equally problematic. I also didn’t like that certutil's output came in two lines: one for the path and filename, and then one for the hash. That wouldn’t be hard to work around in a spreadsheet; it would just be one more hassle, repeated every time I wanted to compare a drive to some other drive.
- Installing 7-Zip gave me the option of using its command: 7z h -scrcsha512 “D:\Path\to\filename.ext”. I could run that command in any directory, after I put a copy of 7z.exe in a folder on my system’s path (e.g., C:\Windows). I could omit filename.ext to run the command recursively (i.e., on all files in the specified folder and its subfolders). Unfortunately, what I just wrote was incorrect: a webpage presenting 7-Zip’s hash (h) option indicated that SHA512 was not supported. SHA256 was as high as it went. When I tried the command just shown, it said, “System ERROR: Not implemented.” It did seem to run OK when I changed it to SHA256.
- I tried DirHash by Mounir Idrassi, the developer of VeraCrypt. Its instructions led me to put a copy of DirHash.exe in C:\Windows, and then to add a bunch of parameters. The command that worked for me was dirhash D:\ SHA512 -progress -threads -skipError -nofollow -sum -nowait -overwrite -t "D:\Path\DirHashOutput.txt". For me, the primary drawback of this tool – indeed, of all these CLI tools – was that it displayed no progress indicator, though it sounded like I might have persuaded Mounir to add one. Without its -quiet parameter, at least I could sometimes get a glimpse of how far it had progressed. DirHash apparently locked the entire top-level folder while it was hashing its contents – or anyway, for some reason, when I tried to move a file, Windows said, “The file is open in Dirhash.exe,” and that problem persisted for a long time before I was finally able to move the file.
- There were multiple Linux emulators that would bring Linux tools over to Windows. Various sources seemed to agree that, in the words of a SUNY Binghamton webpage, Cygwin would require “more time and effort up front to install and maintain,” whereas Git-Bash (a/k/a Git for Windows) was “the quickest and easiest solution.” While there were enthusiasts for the newish Windows Subsystem for Linux (now WSL2) (at e.g., Reddit), for present purposes the more important observation (at SuperUser) was that “Git-Bash is a typical Windows program” and, as such, “sees C:\ as its root directory,” whereas WSL2 “is definitively not a typical Windows program” but instead “sees itself as running on Linux” and therefore uses “Linux’s directory structure.” Until I needed one of the others, Git-Bash seemed like the option for me. Geeks for Geeks (2021) walked me through the installation steps, starting with downloading and running the installer from the Git for Windows website. I went with the defaults almost entirely. That included not checking the new option to add a Git Bash profile to Windows Terminal. The Notepad++ option was grayed out, so I chose WordPad as Git’s default editor, for no strong reason other than that WordPad was always included in Windows. When installation was done, I checked the box to launch Git Bash. That gave me what looked like a Linux prompt. I tried sha512sum "D:\Path\to\folder" on my Test Folder (above). That produced an error message informing me that the designated folder was a directory: evidently sha512sum only worked on individual files. That was fine as far as it went, but I didn’t want to have to work up a list of complete commands for every file in every drive or folder. I quickly found that rhash was not installed in Git-Bash, and sudo apt install rhash produced an error. So I ran into Git-Bash limitations sooner than expected. Cygwin and WSL2 were still possibilities, but with more of a learning curve and investment of time and resources.
- A few years earlier, I had posted a similar question, seeking one-line output for CRC32 values. In that case, as detailed in the post leading up to that question, I relied on Ashley’s (2008) crc32.exe executable. Answers to my posted question offered two different batch scripts incorporating crc32.exe. This time around, I had those batch solutions – but now I was missing the SHA-512 executable that they could use in place of crc32.exe. I did find that assorted older and/or unverified sources had developed similarly named Windows CLI tools (e.g., md5.exe, sha512.exe; see also CRC32 Calculator); but there did not seem to be any widely recognized source (e.g., Microsoft) for such utilities, and I was running low on enthusiasm for experimentation with potentially undercooked fare.
- My posted inquiry at ComputerHope led at first to hashsum.bat. Unfortunately, as I reported in that discussion, hashsum failed to report hashes for about 1% of the files hashed, and also failed to specify those files (I had to identify them manually) or the reason(s) for its seeming failure.
- To make sure I was not overlooking anything obvious, I posted a question at Software Recommendations. In that forum, I might have expected a GUI suggestion. Instead, the suggestion was to use Windows PowerShell. Among the ways to start PowerShell, I used Win-R > powershell. There ensued some struggling to figure out PowerShell’s Get-FileHash for SHA-512. I found that the PowerShell solution calculated SHA-512 hashes, for the files in my Test Folder (above), as quickly as any other solution – as much as three times faster than some. The PowerShell hashes matched up perfectly with the results from certutil (above). My PowerShell solution incorporated several findings: I needed to output to text, so as to avoid PowerShell’s Export-CSV tendency to interpret commas in pathnames as delimiters; to specify ASCII encoding, because apparently this would reduce problems when viewing the results in Excel; to use -LiteralPath for success with square brackets in filenames (with no guarantees for square brackets in folder names); to include Format-Table -AutoSize to prevent truncation (which, for some reason, only happened when I was hashing an entire drive); and to set the width to 400, so as to allow for the 128 characters produced by SHA-512 plus the default maximum Windows path length of 260 characters. With those adjustments, at first this command seemed to work:
Get-FileHash -Algorithm SHA512 -LiteralPath (Get-ChildItem 'D:\folder\*.*' -Force -Recurse) | Select-Object Hash,Path | Format-Table -AutoSize | Out-File -filePath 'D:\Current\PSOut.txt' -Encoding ASCII -Width 400
There was no progress report feature in this PowerShell command, and various sources (e.g., Microsoft, SuperUser, Adam the Automator) conveyed an impression that it would be difficult at best to make a progress indicator work well in this context. Stack Overflow offered ways of including a progress report, but those appeared likely to involve scripting that, at present, was beyond my abilities. I was not sure whether they would also slow down the hashing.
Unfortunately, that PowerShell command was a work in progress. As suggested on various sites (including those linked in the last bullet point, above), PowerShell was complex to the point of seeming flaky: what worked in one setting did not work in another, nor did it necessarily work as expected. For example, as I would soon see, the PowerShell command just shown would terminate when it failed to find a file that it expected to find. I did hope that PowerShell (or something like it) would point the way to a superior solution, and therefore I posted a question about this. In the meantime, however, I didn’t have such problems with the rhash or DirHash commands shown above. Those were the tools I used, then, for the following explorations.
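For what it’s worth, the behavior I wanted from all of these CLI tools can be expressed in a short script: walk the tree, emit one digest per file, and record (rather than abort on) any file that is unreadable or has vanished mid-run. This is an illustrative Python sketch, not a substitute for any of the tools above:

```python
import hashlib
import os

def hash_tree(root, algorithm="sha512", chunk_size=1 << 20):
    """Yield (digest, path, error) for every file under root.

    digest is None and error holds the message when a file could not
    be read; the walk continues instead of aborting the whole run.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.new(algorithm)
            try:
                with open(path, "rb") as f:
                    # Read in 1MB chunks so huge files don't exhaust RAM
                    for block in iter(lambda: f.read(chunk_size), b""):
                        h.update(block)
            except OSError as exc:
                yield None, path, str(exc)
            else:
                yield h.hexdigest(), path, None
```

Each yielded tuple maps directly onto the one-line-per-file output format that the spreadsheet comparison needs, with failures reported explicitly rather than silently skipped (the hashsum.bat problem) or fatally (the PowerShell problem).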
Source Drive Hash Comparisons
The more frequently the user could compare hash values of source and target drives, the more quickly s/he could detect that ransomware was corrupting files. The principal barrier against more frequent hash comparisons was that each full set of hash values would take a substantial amount of time to calculate and compare. Calculation time would depend upon machine speed and the number and size of files. As a rule of thumb, hashing seemed to proceed at something like a half-terabyte per hour on a fast machine, and maybe a third of a terabyte per hour on a slow machine. For users with a lot of data and/or a slow machine, hashing could take many hours.
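That rule of thumb translates into a simple back-of-envelope estimate (the rates below are the rough figures just mentioned, not benchmarks):

```python
# Estimate hashing time from drive size and an assumed throughput.
def hours_to_hash(drive_tb, rate_tb_per_hour):
    return drive_tb / rate_tb_per_hour

print(round(hours_to_hash(2.0, 0.5), 1))    # 2 TB on a fast machine: 4.0 hours
print(round(hours_to_hash(2.0, 1 / 3), 1))  # 2 TB on a slow machine: 6.0 hours
```

So even a modest 2TB drive could tie up a machine for the better part of a working day.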
To illustrate: the Linux machine that I used to calculate hashes on the backup drive was an old (2012) Lenovo ThinkPad E430, the only computer I could set aside for hours of drive analysis. As discussed in another post, disabling this laptop’s ability to go online would presumably prevent it from coordinating effectively with a ransomware server. Thus, even if the Lenovo was infected, the data drives it interacted with would presumably differ significantly from drives on a Windows computer that it did not interact with, even if the same ransomware was at work on both. The laptop also had no internal drives installed; it could only corrupt data files when an external drive was connected. It had a dual-core Intel Core i3-2350M with a PassMark score of 1254, arguably adequate RAM (8GB DDR3), and USB 3.0 ports. Its minimal Ubuntu installation was running from a Samsung FIT USB 3.0 drive. The Samsung was relatively fast, as USB drives went, but this was surely a slow way to run an operating system. The backup HDD that I was hashing with rhash was mounted in an external drive dock, connected to the laptop via USB 3.0 cable – which was, again, much slower than an internal SATA connection.

A brief test suggested that running Ubuntu from a Samsung T5 external solid state drive (SSD) instead didn’t help much; possibly most of the relevant operating system software was loaded into RAM. In that case, if the Lenovo had lacked sufficient USB 3.0 ports, it might not have been terrible to boot the Samsung FIT from a USB 2.0 port. Putting the data files on the external SSD, rather than the external HDD, increased hashing speed by only about 20%, which suggested that hashing speed was mostly determined by the speed of the CPU. Even so, these hashing tasks did not seem to burden the laptop much: its fan rarely sped up beyond its normal, barely audible sound.
At least the process of computing hash values did not take much user time. S/he would just enter a command and then, hours later, the process would be finished. By contrast, the process of comparing source and target drive hash values could take a chunk of user time, depending again upon machine speed and files compared. Occasionally I saw forum comments from users who were working with millions of files. Although Microsoft Excel had a limit of about a million rows, it was possible to use it to analyze multiple millions of rows of data. But even with just a few hundred thousand rows, a computer that was not really fast could take a long time to perform the calculations needed to compare before-and-after hash values. On some systems, that could mean crashes of Excel or even of the operating system. (Calculations would be even slower in a virtual machine, but possibly the system as a whole would be more stable that way.) Some users might be able to mitigate such loads by setting up separate encrypted partitions for folders whose files were less or more frequently modified. In that approach, partitions might be hashed at different rates, depending on how often and/or how long they were mounted.
The large amounts of machine and user time required to compare hash sets meant that the user was unlikely to run such comparisons often. I, myself, was only willing to run them every five days, when I updated my backups (below). That was unfortunate. If I could have run them daily, I would have had faster warning of ransomware infection, as mentioned above. There could also be another benefit: I could compare today’s source drive file hashes against yesterday’s, without involving a backup drive. If I knew that there had been no unwanted changes to files on the source drive, I would have greater insight into the cause of a sudden, major difference between source and backup drives: I would suspect that, if there was an infection, it may have been on the Linux side. Without this separate check on the source drive, I could only assume (incorrectly, perhaps) that the infection probably came from the Windows side.
The right software could significantly decrease time burdens on the Windows (source) drive side, opening the way to faster ransomware detection. Consider, for example, reports that PowerShell was capable of detecting whenever a file had changed within a folder. With suitable exclusions (e.g., for user-approved changes in a smallish number of files, by a file-editing program presently in use), that sort of functionality might be able to notify users of ransomware file encryption in real time – without requiring my method’s full rehash of every file on a drive. Such software might also be able to log a history of hash changes efficiently, so as to help users identify the last good date of a now-corrupted file. With the right tool, same-drive hash comparisons (e.g., today vs. yesterday) could become the dominant form.
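PowerShell specifics aside (this post only speculates about its capabilities), the underlying idea – snapshot a folder, then flag an excessive number of changed files – can be sketched in Python. The function names and the change threshold below are purely illustrative assumptions:

```python
import os

def snapshot(root):
    """Record (size, mtime) for every file under root."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_size, st.st_mtime)
    return state

def changed_since(before, root, threshold=50):
    """Take a fresh snapshot and compare it against an earlier one.
    Returns the list of touched paths, plus a flag that is set when the
    count of changes looks ransomware-like (threshold is arbitrary)."""
    after = snapshot(root)
    touched = [p for p in after if before.get(p) != after[p]]
    touched += [p for p in before if p not in after]
    return touched, len(touched) >= threshold
```

A real monitor would use event-driven notification rather than polling, but the comparison step would look much the same.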
A different way to reduce the hashing burden would be to set red flags – to salt various folders, throughout my source drive, with specific files that I would never change, and to use those as canaries in the coal mine – as leading indicators, that is, that entire folders were being corrupted. So, for instance, if Red Flag File No. 3 was one that I would never change, and if one day I got an indication that its hash value had changed in the past 24 hours, I would know that something was wrong with that file and/or its folder. Relying on hash values for a few hundred red flag files would be much faster than hashing every file on the entire drive, every time I wanted to do a backup. Unfortunately, browsing revealed that I was not the only one who had contemplated this solution. I had to assume that creators of ransomware – spending all day, every day, thinking about this stuff, in pursuit of potential wealth – would be a step ahead of me in tricks of this sort. It was possible that ransomware, in touch with its home base through my Windows computer’s Internet connection, would relay information revealing something unusual about my red flag files (e.g., that they all shared the same creation time, similar names, or similar patterns of being listed or otherwise accessed around the times when I made backups), triggering instructions from the home base to refrain from encrypting those files. So then we would be back at the starting point: files were being encrypted while my anti-malware solutions slept.
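Setting aside the evasion problem just described, the red-flag check itself is straightforward. A minimal Python sketch, with hypothetical names, assuming a baseline of SHA-512 values recorded when the canary files were planted:

```python
import hashlib

def sha512_of(path):
    """SHA-512 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha512(f.read()).hexdigest()

def check_canaries(expected):
    """expected: {path: sha512 recorded at baseline}. Returns the paths
    whose hash has changed, or that are now unreadable -- either way, an
    early warning that the surrounding folder may be under attack."""
    tripped = []
    for path, want in expected.items():
        try:
            got = sha512_of(path)
        except OSError:
            got = None
        if got != want:
            tripped.append(path)
    return sorted(tripped)
```

Hashing a few hundred small canaries takes seconds, which is what would make this attractive if ransomware authors had not anticipated it.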
Comparing Against Old Backups
Regardless of hash calculation and comparison method, it seemed that comparisons of sets of hash values would depend upon the user’s judgment and recollection. Software (be it my spreadsheet or a sophisticated PowerShell script) could only notify me that, today, some number of files on the source drive differed from their counterparts in a comparison set (from the backup drive or from the source drive previously). I would have to decide whether this difference was expected or unexpected.
Consider what happened when I compared new hash values from the source drive against newly calculated hash values for a nine-month-old backup. I found that, unfortunately, there were many discrepancies that I could not vouch for. In a comparison of today’s data drive against such an old backup, I might have no idea why D:\Folder\File001.doc on the source drive was different from its copy on the backup drive. There may have been a good reason – this may have been a change that I (not ransomware) made, months ago, with full knowledge of what I was doing. The point is just that I couldn’t say, with any confidence, that ransomware was not the reason for the change.
When I faced those many files that had changed in the past nine months, for reasons that might or might not be due to ransomware corruption, I had to decide whether to just accept the changes and hope for the best, or manually examine the contents of the changed files on a different (perhaps Linux) system that would hopefully not be infected with the same ransomware, where (as sketched in another post) corrupted files would not be decrypted on the fly, to fool me into thinking they were OK.
Predictably, in the real-life situation, I didn’t go to that trouble. I just took my chances: I connected the old drive to the Windows computer and updated it to match the current state of the source drive. I hoped that my other comparisons against more recent backups would draw upon a better recollection of more recent file activities. If ransomware was corrupting my files, the comparison against this nine-month-old drive could not be trusted to alert me to that fact.
So if I was going to rely on comparisons of older and newer hash sets, it would help if I didn’t let the older ones get too old. If I ran a hash comparison every few days, for example, I could expect to remember most if not all of the file changes that it highlighted. If I made a regular habit of that, I wouldn’t have to worry so much about unexplained changes from nine months ago: I could be confident that I had been keeping up with changes on more of a day-to-day basis.
In contrast with that nine-month-old drive scenario, consider what happened when I compared the current state of the source drive against a five-month-old backup. Would my experience be different with this more recent backup?
The answer was yes. In this case, I took a closer look at the discrepancies. My spreadsheet indicated that, after five months, ~85% of files remained completely unchanged – in their filenames, paths, and contents as summarized by SHA-512 hash. For most of the remaining ~15% of files, the names and/or paths had changed, but the hash values remained the same. This was not surprising: I recalled specific relevant activities, notably including a lot of file renaming and rearranging.
Thus, I was able to account for almost all of the discrepancies between the source drive and this five-month-old backup. Only 2.4% of the files on the source drive had no file with a matching hash on the backup drive, and 0.5% of the files on the backup drive had no match on the source drive. It was not surprising that there were more unexplained file changes on the source drive: that drive contained files that had been newly added in the past five months. I was able to identify those files by extracting timestamps from a directory listing of files on the source drive, for each file in that remaining 2.4%.
The few remaining unexplained files on the source drive were mostly files that others (e.g., developers) had created earlier, but that I had only downloaded or otherwise acquired in the past five months. On the backup drive, the 0.5% unaccounted for were mostly files that I had deleted in those five months, or had updated with a newer version, or had archived into compressed .rar files.
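The bucketing just described – identical files, renamed-or-moved files whose hashes still match, and the unmatched residue on each side – can be sketched in Python. This is a hypothetical stand-in for my spreadsheet’s formulas, not the spreadsheet itself:

```python
def classify(source, backup):
    """source/backup: {path: sha512}. Buckets files the way the
    spreadsheet comparison does: identical (same path, same hash),
    moved/renamed (hash present on the other drive under a different
    path), and the unmatched residue on each side."""
    backup_hashes = set(backup.values())
    source_hashes = set(source.values())
    identical = [p for p in source if backup.get(p) == source[p]]
    moved = [p for p in source
             if backup.get(p) != source[p] and source[p] in backup_hashes]
    unmatched_source = [p for p in source if source[p] not in backup_hashes]
    unmatched_backup = [p for p in backup if backup[p] not in source_hashes]
    return identical, moved, unmatched_source, unmatched_backup
```

In my five-month comparison, the first two buckets covered roughly 97% of files, leaving only the unmatched residue for manual review.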
Ideally, I would not wait five months to update a drive. Instead, these results seemed to support my general impression, as discussed in another post, that I would be better advised to set up a drive rotation plan that might require me to revisit a quarterly backup drive every three months, along with more frequent (e.g., monthly, weekly) backups, on my way to semiannual or annual immutable backups. This experience with the five-month-old backup seemed to suggest that such rotations would be frequent enough to let my memory contribute to a relatively rapid review of differences between source and backup drives.
Recently, I had initiated a five-day rotation scheme: one partition (on a large drive) for backups on the 5th, 15th, and 25th of the month; three other partitions for the 10th, 20th, and 30th, respectively (supplemented with other drives for cyclical monthly and quarterly backups). At present, a five-day period seemed to be an acceptable compromise between spending too much time computing and comparing file hashes, and not making ransomware-resistant backups often enough to give me reasonable protection and good recollection of my file-altering activities. Once I had brought my backup drives up to date, it would still take me 20-30 minutes to get the hash lists into my spreadsheet, but my actual review of discrepancies typically took only a few minutes: the reasons for the differences tended to be obvious to me.
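The partition assignment in such a rotation is simple enough to script. A minimal Python sketch of the scheme just described (the partition labels are hypothetical; only the day-to-partition mapping comes from the text):

```python
def rotation_target(day_of_month):
    """Map a backup day to its partition under the five-day scheme:
    one shared partition for the 5th/15th/25th, and one partition each
    for the 10th, 20th, and 30th. Labels P1-P4 are illustrative.
    Returns None on non-backup days."""
    targets = {5: "P1", 15: "P1", 25: "P1",
               10: "P2", 20: "P3", 30: "P4"}
    return targets.get(day_of_month)
```

Such a helper could feed a backup script that picks the destination automatically from today’s date.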
Implementation
As indicated above, I developed a spreadsheet to compare hash sets and to help me figure out the reasons for differences between those sets. The spreadsheet is available for download at SourceForge. (Note: a later post explains changes in some ideas expressed in this post, and introduces a revised version of the spreadsheet.)
In its present form, the spreadsheet consists of a half-dozen tabs (i.e., worksheets). I would start at the leftmost tab – importing and sorting values, and revising formulas – and work my way across to the rightmost one. I would freeze the values in a given tab once I had it in shape, so that Excel would not have to recompute that tab’s formulas on every subsequent recalculation.
I added a tab containing step-by-step instructions, designed to reduce the time spent waiting for Excel to sort and calculate. At a certain point, however, I didn’t need those instructions anymore; I knew what I needed to do. At this point, as noted above, I found that I could usually finish the hash comparison in 20 to 30 minutes. Again, though, that would vary according to the speed of the machine and the number of files compared.
While this solution did not facilitate the kind of efficient file change history envisioned above, it did offer the potential for at least a rudimentary file history record. After each file comparison session, the user could zip the spreadsheet and the file and hash lists used during that session, for possible future reference.
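Bundling each session’s artifacts is easy to automate. A minimal Python sketch of that zip-and-keep step (the function name is mine):

```python
import os
import zipfile

def archive_session(files, out_path):
    """Bundle a comparison session's spreadsheet and hash lists into one
    compressed zip, building a rudimentary file-history record over time."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as z:
        for path in files:
            z.write(path, arcname=os.path.basename(path))
    return out_path
```

Naming each archive by date would make it easy to walk backward later, looking for the last session in which a now-corrupted file still had its good hash.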
Generally speaking, I was updating backups often enough to remember why files had changed. That was somewhat less true when I updated one of the older cyclical backup drives. The spreadsheet was helpful when I didn’t remember: it allowed me to sort the discrepancies by folder, filename, extension, and to some extent by date, so as to identify sets of many files that had changed at the same time. That, plus occasional review of related materials, would typically help me to confirm that nothing amiss had occurred. I didn’t need to nail down every last file: I assumed that ransomware would tend to encrypt substantial numbers of files. Few users would pay a ransom to recover a small number of encrypted files, unless they were unusually important – and usually those would be older, in which case I would have redundant backups using another of my ransomware-resistant backup systems.