As described in another post, I had created a large spreadsheet (formally speaking, an Excel workbook) to study the performance of various musical artists on the Billboard Hot 100. The file size was ~800MB. It had about 3,300 columns – one for each Hot 100 report (in most cases weekly) between 1958 and 2021 – and about 30,000 rows (one for each artist-song pairing). That gave it almost 100 million cells in the main tab, each containing a SUMIFS lookup formula drawing on data in a reference tab. This size made the spreadsheet very slow to calculate. This post describes how I worked around that.
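To make that structure concrete, here is a minimal sketch of what one of those lookup cells might look like. The sheet names and cell layout are illustrative assumptions, not the workbook's actual ones, written out via Python and openpyxl:

```python
from openpyxl import Workbook

# Hypothetical layout: the main tab has artist in column A and song in
# column B, with chart dates across row 1; a reference tab ("Data") holds
# one row per chart entry, with date, artist, song, and rank columns.
wb = Workbook()
main = wb.active
main.title = "Main"
wb.create_sheet("Data")

# One cell's formula: pull the rank from Data where the chart date (C$1),
# artist ($A2), and song ($B2) all match. Filled across ~3,300 date columns
# and down ~30,000 artist-song rows, this yields ~100 million such cells.
main["C2"] = ("=SUMIFS(Data!$D:$D, Data!$A:$A, C$1, "
              "Data!$B:$B, $A2, Data!$C:$C, $B2)")
wb.save("hot100_sketch.xlsx")
```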
As I had recently learned, when you ask for help with something like this in an Excel forum, the first thing people will tell you is to optimize the spreadsheet. There are many sources of advice on that. As far as I could tell, I had already optimized the spreadsheet as much as my objectives allowed, though there may have been further optimizations I was not aware of.
But for purposes of this post, that is immaterial. The question was not how to make this spreadsheet more efficient; it was how to help the computer complete certain desired steps in a very large spreadsheet – whether there were techniques that would help the hardware through each stage of my computing process.
The Problem
For this purpose, I had two machines to work with. Both were running Excel 2016 on Windows 10. One was a desktop machine with an Intel Core i7-4790 CPU and 32GB DDR3 RAM. The other was an Acer Aspire laptop with an Intel Core i5-7200U and 20GB DDR3 RAM. According to PassMark, the desktop machine’s Core i7-4790 (CPU Mark score: 7228) was about 2.1 times faster than the Acer’s Core i5-7200U (CPU Mark score: 3397).
For reference, at this writing, the Core i7-4790 was only about one-tenth as fast as the fastest sub-$1000 CPU listed on PassMark’s high-end CPU chart, namely, the AMD Ryzen 9 7950X ($699, CPU Mark score 64727). With the right hardware, the problems in this particular spreadsheet could be significantly ameliorated. Of course, one could always find a way to build a bigger or more complicated spreadsheet, to such an extent that even a mighty CPU could be humbled – in which case one might still need to deal with issues discussed here.
On my machines, CPU speed was an issue at certain points in the process. I found it convenient to use Moo0 System Monitor to keep me informed of system bottlenecks, CPU temperature, and the load on CPU, RAM, and SSD. The desktop machine seemed to become sluggish at just the time when Moo0 reported 100% CPU usage. To mitigate that, I used Task Manager (taskmgr.exe in Windows) > default view: Processes tab > right-click on Excel > Go to details (i.e., Task Manager Details tab) > right-click on Excel > Set affinity > uncheck one “processor” (i.e., thread). On my quad-core i7-4790 CPU (i.e., with eight threads), Moo0 indicated that this step reduced CPU load to ~89%; on the dual-core, four-thread laptop, it reduced CPU load to ~74%.
Unchecking one processor in Task Manager would thus slow processing. The tradeoff was that the machine stayed more responsive for other purposes. I hoped it would also eliminate the freezeups I experienced on the desktop machine, where I was performing multiple tasks simultaneously, mostly taxing RAM rather than CPU. On the laptop, I wasn't multitasking during this process and experienced no freezeups there, and unchecking a processor didn't noticeably slow the laptop's internal fan – so I set the laptop to use all processors.
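For anyone who would rather not click through Task Manager every session, the same affinity change can be scripted. Here is a minimal sketch using Python's psutil package – an alternative to the Task Manager route described above, not what I actually used:

```python
import psutil

# Find the running Excel process (assumes a single Excel instance).
excel = next(p for p in psutil.process_iter(["name"])
             if (p.info["name"] or "").lower() == "excel.exe")

# Pin Excel to all logical processors except one, leaving a thread
# free so the rest of the system stays responsive.
cpus = list(range(psutil.cpu_count()))
excel.cpu_affinity(cpus[:-1])
print("Excel now pinned to:", excel.cpu_affinity())
```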
The tasks that I needed to perform on this monster spreadsheet were as follows:
- Run a calculation once. I had set the spreadsheet to manual calculation (Excel > menu > File > Options > Formulas > Calculation options > Workbook Calculation > Manual > uncheck the option to Recalculate workbook before saving; see the sketch after this list). I had created the spreadsheet by copying a formula from one cell to many others, but I had not yet allowed Excel to begin the very long process of calculating those formulas in those many cells.
- Save the spreadsheet after calculating as version A. Then save again (F12) as version B.
- In version B, freeze values (i.e., convert formulas to fixed values). Then I could change to Automatic calculation: there would be no remaining formulas to calculate, and Automatic calculation would prevent the mistake of trusting displayed values that, under manual calculation, might not yet have actually been calculated. Autocalc would also retry automatically if a previous calculation attempt had been interrupted.
- Delete tabs (i.e., worksheets within this spreadsheet) whose data fed the values calculated in the main tab. Once I had frozen the final values in the main tab, I wouldn't need those subsidiary worksheets anymore.
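As promised above, here is a sketch of applying those calculation settings programmatically, using Python with pywin32 to drive Excel over COM (the file path is hypothetical). One wrinkle worth noting: Excel only exposes the Calculation property while some workbook is open, so the sketch opens a blank workbook first, switches to manual, and only then opens the big file – which therefore loads without triggering the marathon recalculation.

```python
import win32com.client

xlCalculationManual = -4135  # Excel object-model constant

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True

# The Calculation property is only settable while a workbook is open,
# so open a throwaway blank workbook first.
excel.Workbooks.Add()
excel.Calculation = xlCalculationManual   # Workbook Calculation > Manual
excel.CalculateBeforeSave = False         # uncheck "Recalculate workbook before saving"

# Now the big workbook opens without launching a full recalculation.
wb = excel.Workbooks.Open(r"C:\hot100\hot100.xlsm")  # hypothetical path
```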
The reason why I mention the puny laptop is that the desktop machine was not turning out to be an ideal partner for this project. For one thing, experience taught that Microsoft would be perfectly willing to induce system instability in order to compel users to reboot, so as to finish installing updates that Microsoft deemed essential – more accurately, updates that Microsoft made practically unavoidable. I was not sure whether that was the cause of the intermittent crashes or other instances of instability that plagued the desktop computer. Possibly in this case the instability was due to interactions among installed software and/or my own tweaks to the system.
Whatever the reason, I had to reboot this system periodically – perhaps every week or so, on average; more often when I stressed the system by demanding too much of it, as in the case of this large spreadsheet. After a few crashes, I realized that I wasn’t going to be able to just forge straight through this project on the desktop computer.
The desktop machine was also not ideal because, in some of the tasks just listed, Excel would abandon its calculations, taking me back to the starting point, whenever I turned to some other task, or when another spreadsheet did something, or basically whenever I breathed. Multitasking was essential: the desktop machine was my main tool for getting work done. Some of the steps just listed would require many hours, if not days, even on the desktop machine. I would not be able to devote that machine solely to this project for any such period of time.
Excel did not normally seem hostile to multitasking, and I wasn't sure why it was acting like that in this case. I posted a question about it; unfortunately, Excel Forum lost the discussion – not that we were reaching any solution to the problem I had posed. It seemed I would be most likely to finish this project if I attempted it on some computer other than the desktop. A virtual machine (VM) was a possibility: a VM would let me restrict the system resources allocated to this task, but it would still be vulnerable to desktop system instability.
The whole spreadsheet was about 800MB, but evidently its formulas would operate most efficiently if they could fill 20GB of RAM, or more. Unfortunately, Excel and my Internet browsers did not seem to coexist peacefully. I had many tabs open in my browsers, and they needed RAM. I did not sort out the exact nature of the problem, but the nature and frequency of crashes appeared to depend on whether I started Excel before starting the browsers or opening new browser tabs. At any rate, I was not able to work smoothly with that 800MB spreadsheet while doing my other work on the desktop.
Thus, I turned to the laptop. It had less RAM than the desktop, but apparently that wasn’t the problem. It seemed that Excel just needed to start by filling all available RAM. Once it finished doing that, it would be able to settle down and get to work – as long as I was not running browsers or anything else that might infringe upon Excel’s jealous seizure of resources. Excel might actually get to that point more quickly if there was less RAM to fill, though presumably the extra RAM would enable Excel to finish its calculations faster.
I was pleased and surprised to see that the laptop was able to open this monster spreadsheet. At that point, I had not yet switched the spreadsheet to manual calculation, so Excel immediately launched into an attempt to update all of those 100 million cells. I let it have its head, running for 24 hours, by which point it was 6% finished. At that rate, Excel would need about 16 days to complete one calculation step on the laptop.
There was the option to let it go ahead. The problem there was that Acer designed its laptops so that the battery would eventually die if you left the laptop plugged in constantly. The battery on this Acer, which I usually left plugged in, had died long ago. I did have it plugged into an uninterruptible power supply (UPS), which it shared with the desktop – which meant that the laptop would have only a few minutes of uptime, once the power went out. So if there was any appreciable power outage during those 16 days – and we had been known to have power interruptions from time to time – I would lose all of the unsaved calculations to that point. Since Excel was very touchy during some of its operations, I would also lose all unsaved work if I hit the wrong key at the wrong time. Excel also stopped calculating if I closed the lid on the laptop, even though I had adjusted the Windows power settings to prevent any actual system sleep, hibernation, or shutdown. There was also the question of whether it was good for the laptop to run full-bore for 16 days without a break.
In short, I didn’t want to gamble that the laptop would make it through 16 days’ worth of calculations without having to start over, or worse. It seemed I might have to complete this project on the laptop, and that doing so would require me to adjust the size of the project.
The Solution
After the abortive efforts described above, my first successful step was to break the spreadsheet into pieces. This particular spreadsheet was organized around dates, from 1958 to 2021, so I wound up breaking it into six pieces, one for each decade in that period (i.e., 1958-1969, 1970-79, 1980-89 …). This breakup took some time. I divided the big spreadsheet into two parts, each containing three decades, and then divided each of those two parts into three smaller parts. Each of these pieces was about 135MB and contained the raw data worksheet plus an average of about 16 million formula cells.
The actual procedure I used to create those pieces was to make a copy of the original spreadsheet in Windows – faster than making the copy in Excel – then open the copy, rename it (as e.g., 1960s.xlsm), and delete all columns that did not pertain to the specified decade. Each of these steps was simple for the user, but deleting columns took many hours of machine time. In retrospect, the fastest route may have been to subdivide into three larger chunks first: a 1960s spreadsheet containing columns for both the 1960s and the 1970s; likewise a 1980s spreadsheet also including the 1990s; and a 2000s spreadsheet including the 2010s. Then make copies of those three, rename them appropriately (so that e.g., 1970s.xlsm would initially be identical to 1960s.xlsm), and delete the extraneous columns from each (so that e.g., 1960s.xlsm would contain only columns for the 1960s).
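Here is a sketch of that column-deletion step, again driving Excel over COM with pywin32; the path, sheet name, and row-1 date headers are assumptions about the layout. Working from right to left means deletions don't shift the columns still waiting to be examined, and leaving calculation on manual avoids a recalculation after every delete:

```python
import win32com.client

xlCalculationManual = -4135  # Excel object-model constant

excel = win32com.client.Dispatch("Excel.Application")
excel.Workbooks.Add()                      # needed before Calculation can be set
excel.Calculation = xlCalculationManual    # no recalc after each deletion
excel.ScreenUpdating = False               # modest speedup while deleting

wb = excel.Workbooks.Open(r"C:\hot100\1960s.xlsm")  # a Windows-made copy, already renamed
ws = wb.Worksheets("Main")                 # hypothetical sheet name

# Assumes chart dates in row 1, starting in column 3 (after artist and song).
for col in range(ws.UsedRange.Columns.Count, 2, -1):
    header = ws.Cells(1, col).Value        # COM returns dates as datetime-like objects
    if hasattr(header, "year") and not (1960 <= header.year <= 1969):
        ws.Columns(col).Delete()           # drop columns outside the decade

excel.ScreenUpdating = True
wb.Save()
```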
If I had figured out those things at the start, I might have been able to do that subdividing on the desktop. What actually happened was that, once I had converted the one large spreadsheet into six smaller spreadsheets on the laptop, it occurred to me that the desktop might be able to work with those, without conflicting with other tasks underway. That seemed to be true. With a smaller chunk, Moo0 might still report that Excel was hogging the CPU (depending on the task), but now Excel seemed to be using only about half of the desktop system’s RAM. So I would guess that, depending on formula complexity, if you have 20+GB of RAM, you might be best advised to keep each of your chunks to somewhere around 15-25 million calculated cells.
The first step with each of these six manageable chunks was to hit F9 to run the calculation once, or (assuming no further calculations) just switch back to automatic calculation, which would trigger it. This calculation could take a long time. The laptop seemed to process formula cells at a rate of roughly a half-million per hour; each chunk required ~36 hours for the calculation step. Consistent with its PassMark score, the desktop's CPU processed formula cells about twice as fast – roughly one million per hour.
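The calculation can also be launched and timed over COM, which is one way to arrive at throughput figures like these. A sketch (pywin32; hypothetical path; note that the Calculate call blocks until Excel finishes, which on chunks this size means many hours):

```python
import time
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
wb = excel.Workbooks.Open(r"C:\hot100\1960s.xlsm")  # hypothetical path

t0 = time.time()
excel.Calculate()   # the F9 step; CalculateFullRebuild() would force everything
hours = (time.time() - t0) / 3600

# ~16 million formula cells per chunk, per the breakdown above.
print(f"Calculated in {hours:.1f} hours "
      f"(~{16_000_000 / max(hours, 0.01):,.0f} cells/hour)")
```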
Once that calculation finished, I saved the spreadsheet as version A. So, for example, 1960s.xlsm now became 1960sA.xlsm. Then I saved it again as version B. The latter was where I would finally start cutting this spreadsheet down to its essential size.
The first size-reduction step was to freeze the spreadsheet’s formulas into fixed values. Now that I had calculated them, I wouldn’t need to calculate them again. Freezing meant replacing the formulas with the resulting number. For example, a cell that previously contained =3+2 would now contain 5. It would not require any recalculation. In this case, that could make things a lot simpler. For instance, the performance of The Beatles’ “Penny Lane” on the Billboard Hot 100 for March 18, 1967 was a settled fact. Once Excel looked up that ranking from the raw data tab, the calculation step was finished: the main tab merely needed to contain the number 1, specifying that song’s Hot 100 rank on that date.
Freezing values was easy enough. My customary solution, almost second nature, was to select the range to be frozen and then hit Edit > Copy, Edit > Paste Special > Values. Since I still used Alt-key combinations from back in the day, for me it was Alt-E C, Alt-E S V > Enter.
Once the values were frozen, I no longer needed the raw data tab, so I could delete it. These two steps – freezing values and deleting the raw data tab – significantly reduced both the file size and the load time of each of my six decadal chunks. For instance, 1960sA.xlsm was 136MB, whereas 1960sB.xlsm was only 42MB.
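Scripted, that freeze-and-prune sequence looks something like this (a pywin32 sketch; file and sheet names are hypothetical). The PasteSpecial call is the COM equivalent of the Alt-E S V keystrokes, and DisplayAlerts is switched off so Excel deletes the raw-data worksheet without prompting:

```python
import win32com.client

xlPasteValues = -4163  # Excel object-model constant

excel = win32com.client.Dispatch("Excel.Application")
wb = excel.Workbooks.Open(r"C:\hot100\1960sB.xlsm")  # hypothetical path
ws = wb.Worksheets("Main")                # hypothetical sheet name

# Freeze: overwrite every formula in the used range with its current value.
ws.UsedRange.Copy()
ws.UsedRange.PasteSpecial(xlPasteValues)  # the Alt-E S V step
excel.CutCopyMode = False                 # clear the copy marquee / clipboard

# Prune: the raw-data tab is no longer needed once values are frozen.
excel.DisplayAlerts = False               # suppress the "permanently delete" prompt
wb.Worksheets("Data").Delete()            # hypothetical sheet name
excel.DisplayAlerts = True

wb.Save()
```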
When I finished those steps with each of the six decadal spreadsheets, the final task was to merge them back into one larger spreadsheet, so that I could conduct further calculations using this established data. With the aid of Excel Forum, that larger spreadsheet became the foundation for the Billboard Hot 100 analysis in another post.
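For completeness, the final merge can be scripted too. Here is a sketch using pandas, assuming each decade file's main tab has artist and song in its first two columns with the chart-date columns after them, so that the six frames line up row-for-row when joined on that pair (file and sheet names are hypothetical):

```python
import pandas as pd

files = ["1960sB.xlsm", "1970sB.xlsm", "1980sB.xlsm",
         "1990sB.xlsm", "2000sB.xlsm", "2010sB.xlsm"]

# Read each decade's main tab, indexing rows on the artist-song pair.
parts = [pd.read_excel(f, sheet_name="Main", index_col=[0, 1]) for f in files]

# Join the decade columns side by side on the artist-song rows.
merged = pd.concat(parts, axis=1)
merged.to_excel("hot100_frozen.xlsx", sheet_name="Main")
```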