Massive File Compression Using Deduplication in Btrfs and Borg

Summary

This post introduces various types of deduplication, and then focuses on block-level post-process local deduplication.

In other words, I was trying to save disk space by feeding selected files to a tool capable of identifying similarities among those files.

For example, if File A and File B both contained the same Paragraph 23, the tool might leave the copy of Paragraph 23 in one place; but in the other place, it would replace that duplicative text with a pointer to the place where Paragraph 23 was stored. So then the duplication would be gone and space would be saved.

At first, I thought I was going to be using the Btrfs file system for my deduplication. As I explored further, however, I concluded that the Borg backup tool offered a better solution. Thus, after an initial look at Btrfs, this post mostly focuses on using Borg. This post works through the details of getting oriented and set up, with a few experimental runs, looking at what I would eventually understand to be “inner” deduplication; another post applies this information to a Borg backup of an entire drive, seeking “historical” deduplication.

In the specific case that motivated my search, I began with a huge pile of drafts of a Microsoft Word document. These drafts varied somewhat over time, but there was still a great deal of similar information among them. I found that Borg was able to reduce the amount of disk space required for those documents by about 93%. This was far better than the compression I got by leaving all that duplication in place and merely compressing it, using a file-zipping program (e.g., WinRAR). So that was pretty amazing.

Unfortunately, that was the only test case in which Borg produced results that would justify its slowness and hassle. Borg didn’t achieve anything noteworthy when working with a random pile of data files, nor when trying to digest already compressed materials that might seem to contain similar material (e.g., image backups of a Windows system drive C).

[Later note: Waldmann’s comment, following this post, points out that my testing neglected scenarios, over time, where Borg could make a huge difference. I hope to update this post with further exploration of some such scenarios.]

Borg wasn’t hard to learn: there seemed to be only a few basic commands, though there were many possible uses. Nonetheless, setting it up and using it in Linux was much less convenient than using a right-click > WinRAR (or 7-Zip, or other compression tool) in Windows File Explorer.

At present, it seemed that my most effective way of using Borg might be to restrict it to those special-case scenarios where I had many copies of highly similar files (e.g., earlier drafts of a document) that I might want to keep for possible future reference – but that, most likely, I would never need to access.

My plan for those situations was to create a Borg archive – and then zip it, along with a link to this post and/or an instruction sheet (such as provided below), into a non-compressed .zip or other archive file. That .zip file would be a self-contained archive and road map, so that I or others could recover those deduplicated contents if needed.

Contents

Introducing Deduplication
Considering Btrfs
Learning Borg
Using Borg
Borg Command Summary


Introducing Deduplication

I had a couple of sets of files that I thought might benefit from something known as deduplication. This post discusses the need and a possible solution.

I will describe those filesets in a little more detail (below). For the moment, the general idea is that there was a lot of potential redundancy among the files in those sets. The files were not exact duplicates – I had already deleted duplicates using DoubleKiller in Windows. But they contained a lot of the same material. It seemed that deduplication could dramatically reduce that redundancy – saving the time required to make backups or other copies of those files, and saving disk space required for storage.

As Enterprise Networking Planet (Hiter, 2021) explained, deduplication came in multiple flavors:

  • File- vs. Block-Level Deduplication. File-level deduplication was simply the elimination of duplicate copies of a file. DoubleKiller was my preferred tool for this. Duplicate file deletion was generally not what people meant when they spoke of deduplication. Rather, the objective here was to find something that would work at the sub-file level, finding chunks of data that were the same even if the files containing them were different. The basic idea was that, if identical chunks appeared in three different files, you kept one copy of that chunk, and in the other two (or possibly all three) of those files you merely inserted a pointer to the place where that chunk was stored. So, not surprisingly, if you had a lot of highly similar files, deduplication could give you a 15-to-1 compression ratio (i.e., reducing the data to roughly 7% of its original size) – far better than the 30-40% that would be a great result from a file-by-file compression tool like 7-Zip or WinRAR.
  • Fixed vs. Variable Deduplication. The idea here seemed to be that, in fixed deduplication, the software would just grab the first 4KB (or whatever) and treat it as a chunk, regardless of what it contained. If two files were identical except that one had an extra character at the start, apparently fixed deduplication would not detect that similarity. Variable block-level deduplication was thus potentially superior: Hiter said that it divided blocks “based on their unique properties.” But as with other variations on the deduplication theme, the better solution could also require more resources in terms of CPU, RAM, and/or working disk space.
  • Source vs. Target Deduplication. Do you deduplicate the source drive or the backup drive? Deduplicating the source would be a one-time project, reflected onto all backups without further processing. But Hiter said, without explanation, that it required more CPU resources and was more difficult to implement.
  • Inline vs. Post-Process Target Deduplication. Assuming you were deduplicating the target rather than the source drive, you would have the further option of doing inline (i.e., on-the-fly) deduplication: the system would be checking the data for possible size reduction while the backup process was underway. Not surprisingly, inline deduplication would tend to slow the backup process. Post-process deduplication would take place after the data was copied to the backup drive: the system would go raking through that data, looking for duplicate blocks that it could boil down to a more reduced state. Post-processing would require some additional storage space to park the pieces being compared.
  • Local vs. Global Deduplication. Local deduplication would apply only to the local machine. Global deduplication would compare stored data across all nodes on a network, thus potentially increasing deduplication: duplicative chunks might not be stored on your machine at all. In a sense, the Internet was a global deduplication scheme, insofar as you didn’t need to keep a copy of the whole thing on your computer.

Enterprise Networking Planet (Hiter, 2021) cited a handful of important features in deduplication software. These features tended to emphasize that deduplication was typically a task for large systems, where storage savings could be substantial. For example, Hiter asked whether the deduplication software would “integrate with other relevant data platforms” and would offer security features to protect sensitive data from unauthorized users. In that spirit, Hiter listed deduplication providers – such as Veritas, which listed 87% of the Fortune Global 500 corporations as customers.

Considering Btrfs

My search did not turn up much end-user post-process deduplication software. As discussed in another post, I did see limited references to deduplication in some backup software – though, again, more for servers than for desktop computers. At first, I thought Acronis Cyber Backup (starting at $69/year) might be a viable alternative for Windows machines, but it seemed oriented toward cloud-based backup. In that realm, it did not appear to be a top option at the corporate level (e.g., PCMag, Gartner), and anyway cloud backup would depart from the focus here, which is on maximizing local storage.

Thus, it seemed that possibly the only way to implement deduplication on a single computer would be to choose a deduplicative disk filesystem. Post-process deduplication was one of the appealing features of Btrfs; as discussed in another post, it seemed that Btrfs might also be less complex than the ZFS filesystem. A Hacker News discussion (2020) favored Btrfs over ZFS for end-user deduplication.

Could I use Btrfs in Windows? The answer seemed to be no, for my purposes. Paragon offered a driver to allow Windows users to read drive partitions formatted as Btrfs, but not to write to them. Alternatively, WinBtrfs was a Windows driver that allowed reading from as well as writing to Btrfs, but I wasn’t sure I trusted it: its webpage didn’t mention deduplication; it said defragmentation was still on the to-do list; SuperUser comments indicated that it was considered experimental as of 2017, and still had problems in 2019; and the “use it at your own risk” and “be sure you take backups” warnings were still on the webpage at this writing in early 2022. As another possibility, a SuperUser answer suggested using a virtual machine with raw disk access to manipulate the Btrfs drive from within Windows, but that sounded a bit Rube Goldberg; anyway, I didn’t test it.

In short, it seemed that Btrfs was still best used via Linux. In that case, the question was, what tools are best suited to that environment? There did not yet seem to be a GUI capable of addressing all Btrfs features well. Snapper offered a Btrfs-compatible snapshot manager, but according to a Reddit commenter, Snapper “doesn’t get configured automatically and knowing how to handle snapshots is not an easy task for average users.” ButterManager seemed to be intended for those who planned to make Btrfs the filesystem for the entire Linux installation. Its page described it as “a BTRFS tool for managing snapshots, balancing filesystems and upgrading the system safetly [sic].” I didn’t plan to be doing any of those things.

I hoped that I could just use standard Ubuntu disk and file manipulation tools to produce a Btrfs-formatted drive, and then read from and write to it in the usual ways, just as if it were NTFS- or ext4-formatted. To test that, I plugged a 500GB external Samsung solid-state drive (SSD) into my Ubuntu machine and ran GParted. In GParted, I selected the SSD, deleted its existing partitions, created a new one, and tried to format it as Btrfs – but I couldn’t, because that option was grayed out. The solution (per Ask Ubuntu) was to begin with sudo apt install btrfs-progs and then restart GParted and try again. Once I had done that, Ubuntu’s Disks tool (i.e., gnome-disk-utility) acknowledged the 500GB Btrfs drive that I had named BTRFS, and was able to mount it at /media/[username]/BTRFS; and Ubuntu’s Files utility (f/k/a Nautilus) seemed to treat it like any other drive.
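For anyone preferring the command line to GParted, a rough equivalent of those steps looks like the following. This is only a sketch: the device name /dev/sdX1 is an assumption that must be checked (e.g., with lsblk) before running anything destructive, and the mount point is arbitrary.

sudo apt install btrfs-progs
lsblk                                   # identify the external SSD's partition (assumed to be /dev/sdX1 here)
sudo mkfs.btrfs -f -L BTRFS /dev/sdX1   # format and label the partition as Btrfs
sudo mkdir -p /mnt/btrfs
sudo mount /dev/sdX1 /mnt/btrfs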

In the absence of a GUI, Datahoards (2021) provided a helpful walk-through of commands to invoke Btrfs compression and deduplication. The effectiveness of the compression feature was apparently not much different from compression in other filesystems (e.g., NTFS): it was able to compress some types of files more than others. For at least one type of file, Datahoards achieved 33% compression, meaning that compression reduced them to one-third of their original size. Btrfs was also like Windows, apparently, in being able to compress files as they were added to the drive, and also to compress files that were already present on the drive when compression was turned on.
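I did not pursue Btrfs compression, but for reference, the usual commands (per the Btrfs documentation) run roughly as follows. The device name and mount point continue the assumptions from the sketch above, and compsize is a separate package (btrfs-compsize in Ubuntu).

sudo mount -o compress=zstd /dev/sdX1 /mnt/btrfs        # mount with compression enabled for newly written files
sudo btrfs filesystem defragment -r -czstd /mnt/btrfs   # recompress files already present on the drive
sudo compsize /mnt/btrfs                                # report actual on-disk compression ratios (optional)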

As Datahoards (2021) observed, there seemed to be several different deduplication tools for Btrfs. The Btrfs wiki indicated that inline (a/k/a inband or synchronous) deduplication “is not actively developed” in Btrfs; rather, the focus was clearly on post-process (a/k/a batch, offline, or out-of-band) deduplication. That was fine with me: as suggested above, post-process deduplication would hopefully yield greater (albeit slower) deduplication with less of a drain on system resources.

The Btrfs wiki (n.d.) listed three Btrfs post-process deduplication tools that were “known to be up-to-date, maintained and widely used.” From among those three, Datahoards (2021) chose duperemove, but the wiki said it was file- rather than block-based – which, as noted above, would be useless for me: I had already run a duplicate file detector. The two tools that (according to the wiki) were capable of working at the block level were bees and dduper. Checking those tools on those linked websites, I found that the wiki seemed to be wrong (or perhaps just outdated) regarding duperemove: its site said that it “will hash [file] contents on a block by block basis.” Of the three, duperemove also seemed to be the most extensively developed. (There were others: for instance, it seemed that some years had passed since the most recent work on btrfs-dedupe and bedup. See also Reddit.)

I did find some guidance (from LinuxHint, 2020, and Forza, n.d.) on using duperemove. But by this time I had encountered more warnings against using Btrfs, and had also discovered Borg. So at this point I paused my Btrfs inquiry and turned to Borg.
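For the record, before moving on: a minimal duperemove run (which I never attempted) might have looked something like this, based on its documentation. The hashfile location and target folder are assumptions.

sudo apt install duperemove
sudo duperemove -dhr --hashfile=/tmp/dedupe.hash /mnt/btrfs/   # -d deduplicate, -h human-readable sizes, -r recurse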

Learning Borg

According to its homepage, “BorgBackup (short: Borg) is a deduplicating backup program.” That page went on to say that Borg’s deduplication used the more sophisticated variable- rather than fixed-length chunks, and that Borg’s deduplication (unlike that of Btrfs, apparently) would not be broken if files were moved around.

Borg drew praise from a Reddit commenter, who said, “I would like to warn you not to rely on dedup for archival or backups. Use dedicated deduplicating archivers like Borg, ZPAQ or even .tar.xz instead.” Among such archivers, it appeared (at e.g., DEV, IT Notes) that Borg was favored over alternatives such as Restic. Participants at HostedTalk (2020) and Hacker News (2019) largely favored Borg, though at least one participant at the latter expressed significant reservations (see also Awesome SysAdmin).

I was interested in Borg, partly because Btrfs was not seeming like a very robust alternative, and partly because a deduplicating backup program sounded more appropriate for my situation. I wasn’t looking for a permanent deduplication solution covering an entire drive. I just wanted to shrink one set of files, on a one-time basis, and then archive them in that shrunk form.

Borg’s website said there were several ways to install Borg. At this writing, for my Ubuntu 20.04 installation, they said the repository contained version 1.1.15. I considered that close enough. (I could apparently have obtained version 1.1.17 by downloading binaries and then running some commands and dealing with some possible complications.) I installed from the repository by running sudo apt install borgbackup.

(Later, I would find that, actually, version 1.2 was a significant upgrade. Since that wasn’t yet available through the Ubuntu repository, I tried using the PPA. That failed. I opted to upgrade to Ubuntu 22.04 in March 2022, a month before its release, relying on upgrade advice from LinuxConfig (Brown, 2022). With that in place, I was able to upgrade to Borg 1.2 using sudo apt install borgbackup. That upgrade took place after I wrote this post. The commands discussed below were for Borg version 1.1.15. Possibly some of those commands would be different in version 1.2+.)
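Either way, it is easy to confirm which Borg version is actually installed, and therefore which command syntax applies:

borg --version
apt policy borgbackup   # shows the version offered by the Ubuntu repository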

According to the website, Borg was “easy to use”: I just had to initialize a new backup repository and then create a backup archive. OSTechnix (Karthick, 2021) explained that a repository was simply a directory where an archive would be stored, and an archive was just a backup copy of my data. To initialize the repository, I used this:

sudo borg init --encryption=none /media/ray/SSD/BorgArchive

As indicated, I did not want to encrypt this repository. Alternatively, instead of none, I could have specified repokey (which stores the encryption key inside the repository, protected by a passphrase) or keyfile (which keeps the key in a separate file outside the repository). Also, I could have given the repository a name other than BorgArchive. (I would find that pathnames containing spaces did need to be surrounded by quotation marks.)

After running that command, I used Files to verify that the BorgArchive folder did exist on the SSD.
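For reference, had I wanted encryption, the init command would have looked like one of these (sketches I did not use; a given repository is initialized only once):

sudo borg init --encryption=repokey /media/ray/SSD/BorgArchive   # prompts for a passphrase; key stored inside the repository
sudo borg init --encryption=keyfile /media/ray/SSD/BorgArchive   # prompts for a passphrase; key stored in a separate local keyfile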

Now I could create an archive. For this, the documentation and other sources (e.g., Ferrari, 2021) inspired me to use this form:

sudo borg create --progress --stats /media/ray/SSD/BorgArchive::WordDocs /media/ray/SAMSUNG_A/Test/

In that command, progress would reassure me, along the way, that Borg wasn’t hung – that it was actually doing something; stats would produce an ending report on what Borg had done; BorgArchive was the repository; WordDocs was an archive folder within the repository; SAMSUNG_A was the name of the source (USB) drive; and Test was the name of the folder, on that USB drive, that contained a few of the files I wanted to deduplicate. The --progress and --stats options were among the common options that were supposed to work with any Borg command.

I named that archive WordDocs because, in this archive, I wanted to store deduplicated copies of many Microsoft Word files. Each of these files contained a backup copy of a document that I had created in Word. The document was large, partly because it had a huge amount of text, but mostly because it incorporated many photos, charts, and other graphic materials. Each of these backup copies had mostly the same text and images as the one before it, so I hoped that deduplication would greatly shrink the size of the full collection. I doubted I would ever need these files, or an archive containing them, but this did seem like a good sample case for testing deduplication.

Borg finished executing that command almost instantly. It wasn’t surprising: the source folder (Test) contained only three of the large Word docs that I hoped to deduplicate. The stats option gave me the information that the original size of those three files was 58MB, the compressed size was 56MB, and the deduplicated size was 28MB. The compression was not impressive, partly because Word .docx files were already compressed – and perhaps also because, according to the documentation, Borg’s default lz4 compression algorithm was very fast, but had a low compression ratio. For better compression, I could use --compression zstd,22 (or a number smaller than 22, down to 1, for faster speed and less compression). I wasn’t sure, but it appeared that there would be diminishing returns beyond 10 (see e.g., Level1Techs; see also Squash Compression Benchmarks). The documentation also described other compression options, using the less efficient zlib and lzma algorithms.
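In other words, the same create command could carry a different compression specification. For example, this variant (which I did not run) would use the level-10 setting just mentioned:

sudo borg create --progress --stats --compression zstd,10 /media/ray/SSD/BorgArchive::WordDocs /media/ray/SAMSUNG_A/Test/
# other accepted values include lz4 (the default), zlib,0-9, lzma,0-9, and none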

The stats option also informed me that, in those three Word docs, there were 24 chunks altogether, of which 14 were unique. So apparently Borg found some reason or guideline by which to divide the Word docs into 24 distinct sections, of which 10 were duplicative. In other words, it seemed to work. It wasn’t waiting to begin its deduplication with later files, compared against this first set; even the files in the first set were deduplicated against one another.

I ran sudo nautilus and took a look in the BorgArchive folder. The README file seemed to be the only human-readable file there. (Without the sudo, Nautilus (i.e., Ubuntu’s Files utility) reported that BorgArchive had zero items.) A right-click > Properties yielded the information that the folder’s contents were unreadable. When I ran sudo borg list /media/ray/SSD/BorgArchive it listed WordDocs as an archive, with a time and a hash value. For information about the contents of that WordDocs archive, I ran sudo borg info /media/ray/SSD/BorgArchive::WordDocs. But that only repeated the stats information. For a list of the specific Word files, I had to replace info with list in that command.

The big remaining issue was restoring. The documentation recommended restoring as the same user, working on the same machine as where the backups were created, so as to avoid potential problems. At the moment, I was the original user, working on the original machine, so I didn’t get into any of those issues.

To restore, the documentation said I could use borg mount to poke around in the archive using a file manager, or borg extract if I was prepared to issue a single command that would haul everything back from the archive(s). Compare:

borg mount /media/ray/SSD/BorgArchive /mnt/borg
borg extract /media/ray/SSD/BorgArchive

In that mount command example, I would specify at least the repository I was mounting. For faster results, if there were many archives, I could add ::WordDocs after BorgArchive to specify a particular archive. The mount command also required me to specify a mount point (e.g., /mnt/borg). I might have to create that mount point first (e.g., mkdir /mnt/borg).
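Putting those pieces together, a complete mount-browse-unmount sequence might run like this (a sketch; borg umount releases the mount point when done):

sudo mkdir -p /mnt/borg
sudo borg mount /media/ray/SSD/BorgArchive::WordDocs /mnt/borg
ls /mnt/borg                   # or browse it with the Files utility
sudo borg umount /mnt/borg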

Similarly, in the extract example just shown, I could optionally specify ::WordDocs to indicate that I wanted to extract only that one archive. I could narrow it further with ::WordDocs SomeFolder (with no slash before the folder name) to specify that I wanted to extract only the named folder from within the WordDocs archive; a sketch of that narrower form follows the list below. Note that the extract example does not allow the user to specify a target: extractions come to the current folder. Hence, the user would want to navigate to the desired target folder before running a Borg extract command. The documentation said that extract was best if:

  • you precisely know what you want (repo, archive, path)
  • you need a high volume of files restored (best speed)
  • you want an as-complete-as-it-gets reproduction of file metadata (like special fs flags, ACLs)
  • you have a client with low resources (RAM, CPU, temp. disk space)
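As for the narrower extract form mentioned above (a sketch, not a command I ran): to pull back only the Test folder, I would change to the destination drive and name the stored path, which carries no leading slash. The stored path shown here is inferred from the create command above.

cd /media/ray/1TB
sudo borg extract /media/ray/SSD/BorgArchive::WordDocs media/ray/SAMSUNG_A/Test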

For my purposes, the simplest mount or extract solution was just to move to the folder where I wanted the files to land, and then reuse parts of the create command shown above, like this:

sudo borg extract /media/ray/SSD/BorgArchive::WordDocs

That command restored the archived files to the NTFS USB drive from which I ran that command, and Word on a Windows computer was able to read the archived-and-restored files.

It didn’t seem possible to edit an archive’s files directly, but the recreate command seemed to allow the user to specify an unwanted folder (or perhaps an unwanted file) for removal (from all archives in a repository). Thus, I used the following variation on the original create command (above) to delete the Test folder from the WordDocs archive:

sudo borg recreate --stats --progress /media/ray/SSD/BorgArchive --exclude media/ray

Note that there is no slash before the “media/ray” part of that command. After I ran that command, running sudo borg list /media/ray/SSD/BorgArchive::WordDocs listed nothing, suggesting that I had wiped out the contents of the archive as intended. I tried to verify that by running sudo borg mount /media/ray/SSD/BorgArchive /mnt/borg, but that produced an error (“Mountpoint must be a writable directory”). In a later try, I seemed to get past that by running sudo mkdir /home/ray/borgmnt – that is, by creating the mount point in my home directory – while reminding myself not to use that newly created (and soon-to-be-deleted) borgmnt folder for anything else.
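In hindsight, a more surgical recreate would presumably have targeted just the one archive and excluded just the one folder, something like this (untested; the stored path is again inferred from the earlier create command):

sudo borg recreate --stats --progress /media/ray/SSD/BorgArchive::WordDocs --exclude 'media/ray/SAMSUNG_A/Test'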

Finally, I wanted to delete the archive, so that I could start over with a fresh create command. To delete the archive, the documentation said I could use sudo borg delete /media/ray/SSD/BorgArchive::WordDocs. Now running sudo borg list /media/ray/SSD/BorgArchive produced nothing.

Using Borg

With those background explorations behind me, it seemed that I could proceed to use Borg on the full set of Word docs that I wanted to deduplicate. For that, I repeated the create command shown above, this time not limiting it to the Test subfolder. To maximize compression, I added the parameter suggested above. My command thus read as follows:

sudo borg create --progress --stats --compression zstd,22 /media/ray/SSD/BorgArchive::WordDocs /media/ray/SAMSUNG_A/

With that, Borg went to work. The documentation explained the abbreviated results displayed by the progress option as it worked: O = Original file size, C = Compressed, D = Deduplicated, N = Number of files processed so far. It finished in 1:08:56 (h:mm:ss). It reported that it had deduplicated 595 files with an original size of 71.3GB. Compression did not seem to improve much on what Microsoft built into its Word .docx compression: the report said that Borg’s compression reduced the size to only 68.3GB. That was pretty much what I found when trying to use WinRAR to compress these files. As noted above, a compression setting of 10 might have been more efficient.

The good news about the resulting WordDocs archive was that the deduplicated size was amazing: 4.8GB (i.e., less than 7%, or 1/15, of the original size). That was due, no doubt, to Borg’s finding that those files consisted of 31,470 chunks, of which only 2,706 were unique.

One nice thing about that outcome: an underpowered computer could do it. The Linux machine running Borg was an old (2012) Lenovo ThinkPad E430 boasting a dual-core Intel Core i3-2350M with a PassMark score of 1,254 – as compared to, say, the score of 10,171 for the Core i5-1135G7 in an HP laptop now available for $535. Worse, the ThinkPad was running Linux from a Samsung FIT USB 3.0 drive. And yet the machine did not seem overwhelmed: its cooling fan didn’t even speed up during the process.

Those Word docs were a more or less ideal test case: they contained a great deal of obvious redundancy. I wondered how Borg would fare in a more challenging test. I had a bunch of bulky system backups. Would Borg find anything deduplicable in these? The sets I tested, and the results, were as follows:

  • Drive Images of Windows 10 Installation. I had seven backup images of a desktop Windows 10 installation. Six were AOMEI Backupper .adi files; the seventh was an old Acronis .tib. Borg found almost nothing to deduplicate in these. The numbers were a little confusing. It said that only about 62% of chunks were unique. To me, that suggested that 38% of chunks were duplicative and could therefore be deduplicated, resulting in substantial space reduction. But the final stats report indicated that the original size was 304.79GB and the deduplicated size was 304.58GB. (Presumably the duplicate chunks were mostly small ones, so that eliminating them saved few bytes overall.)
  • Uncompressed Windows 10 Installation. The previous paragraph inspired me to take an unscheduled detour. Unlike the results from that attempt to deduplicate drive image backups of Windows 10 installations, I believed that Borg would probably find a lot to deduplicate in an actual Windows 10 installation. So, as the source drive for my next Borg create command, I named the Windows To Go (WTG) USB drive that I used sometimes to boot the Lenovo laptop in Win10. More specifically, I targeted that WTG drive’s partition containing the Windows system files. I wasn’t sure if there was a Borg command that would capture the entire drive, with the boot and other partitions Windows required. If not, one solution would be to capture the entire drive with the dd command, and then Borg the resulting image (see the sketch after this list). That would be much slower than using a tool like AOMEI or Acronis to create the simply compressed backup images tested in the previous paragraph. But, for situations requiring extreme compression, we were about to see whether Borg might produce something significantly smaller than those standard drive images. Of course, Borg would only run in Linux, so (aside from perhaps a setup featuring VM passthrough) there would not be a question of whether the Windows files being Borged were in use and might thus not be captured in a consistent state. In this case, the Borg create command reduced an original 95GB to 63GB compressed and 53GB deduplicated. I wasn’t sure how much it might have helped to use a more intense compression setting. Even with the default compression setting, that 53GB final result compared favorably against a recent .adi (running AOMEI on its maximum compression) of 57GB.
  • VMware VMs. I had a set of eight RAR files, totaling 143GB, each containing the VMware Player files for one Windows 10 x64 virtual machine (VM). I ran Borg create on just one of those RAR files, to see whether deduplication could find anything noteworthy inside the RAR. The result: nada: the deduplicated size was actually a wee bit larger than the size of the RAR. I tried again with the whole set of eight. Same thing. I figured that trying Borg on just one of the VMs unzipped (i.e., restored to full size from the compressed RAR file) would yield results like those of the preceding paragraph: the compression tool (e.g., AOMEI, WinRAR) would do about as well as Borg on a single installation, and would be more convenient and much faster. So instead I tried Borg on all eight of the VMs unzipped. Borg reduced the 346GB of unzipped contents to 201GB compressed and 143GB deduplicated. That was no better than the 143GB that WinRAR produced.
  • Windows 10 Installation ISOs. I had a folder containing five 64-bit and four 32-bit Windows 10 installer ISO files. Borg reduced their total size from 38GB to 34GB.
  • Data Files. I had a folder containing thousands of mostly smallish data files of various types (e.g., PDF, TXT, DOC). From this collection, I had removed exact duplicates. In this case, Borg’s deduplication reduced the original 48GB slightly, to 44GB. That was also the result I got with WinRAR (RAR 5, Good (i.e., between Normal and Best) compression setting, 32MB dictionary, solid archive).
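Returning to the full-drive question raised in the second item above: per the Borg documentation (and Waldmann’s comment below), two approaches should work, namely piping a raw dd image into Borg, or letting Borg read the block device directly via --read-special. A sketch, assuming the drive appears as /dev/sdY, is not in use, and noting that the archive names here are arbitrary:

sudo dd if=/dev/sdY bs=4M status=progress | sudo borg create --progress /media/ray/SSD/BorgArchive::WTG-full -   # the archive then contains one file named "stdin"
sudo borg create --progress --stats --read-special /media/ray/SSD/BorgArchive::WTG-dev /dev/sdY                  # alternative: read the device directly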

A month later, after moving the Borg archive containing the many draft Word documents (above) to another drive, I tried restoring it. For that, I began with sudo borg mount /media/ray/SP_7/BorgArchive /mnt/borg, reflecting the name of the new drive. (Sudo was necessary, apparently because of the “permission issues” that the documentation warned me about if I tried using Borg on a different machine – or, in this case, a different drive.) That command produced a warning:

The repository at location /media/ray/SP_7/BorgArchive was previously located at /media/ray/SSD/BorgArchive

Do you want to continue? [yN]

Once I approved that, the documentation reminded me to start by moving (in Terminal) to the drive (named 1TB) where I wanted to restore the files. In this case, the command for that was cd /media/ray/1TB. Then, reviewing my earlier remarks (above), I was reminded that extract was better than mount in this case. So I ran sudo rmdir /mnt/borg to get rid of the mount point I had just created. To verify that the files were still there, I followed the documentation and ran sudo borg list /media/ray/SP_7/BorgArchive. (As Thomas Waldmann pointed out in a comment (below), that folder wasn’t actually an archive; I should have called it BorgRepo.) This verified that I had a WordDocs archive in the BorgArchive repository. Now I thought I could retrieve its contents, using a command similar to the create command used above, with several adjustments. Here’s what I used:

sudo borg extract --progress /media/ray/SP_7/BorgArchive::WordDocs

Note that I had to alter the previous create command to reflect the change from the SSD (above) to the SP_7 drive. I didn’t need to specify compression in order to extract. Borg wouldn’t let me request stats in that command.

In terms of quality, the Borg process was a complete success: a DoubleKiller byte-for-byte comparison found that the files that had been Borged and then unBorged remained exactly the same as the originals of which they were a copy.

Borg Command Summary

This section summarizes the primary commands discussed above.

To install Borg on Ubuntu 20.04, I chose this option, for a slightly older but easy installation:

sudo apt install borgbackup

Next, to create a repository on the solid-state drive that I had named “SSD” (mounted at /media/ray/SSD while logged in as ray), I used this:

sudo borg init --encryption=none /media/ray/SSD/BorgArchive

I could have replaced none in that command with repokey (i.e., require a password) or keyfile (to use a keyfile to encrypt the archive).

To create an archive within that repository, I used commands like this one:

sudo borg create --progress --stats --compression zstd,22 /media/ray/SSD/BorgArchive::WordDocs /media/ray/SAMSUNG_A/Test/

where WordDocs was the name of the created archive and the Test folder on the Samsung USB drive was the source of the files being deduplicated into that archive. The stats and progress options used there were common options, apparently usable in many Borg commands.

The compression option used in that example was apparently the most intensive and slowest compression option available. For faster but less effective compression, I could replace the number 22 in that option with any number down to 1, or I could just leave out the --compression zstd,22 part of the command to use Borg’s default compression.

To list the archives existing within my BorgArchive repository, I used this:

sudo borg list /media/ray/SSD/BorgArchive

Similarly, to obtain information about a specific (e.g., WordDocs) archive, or a list of the files contained in that archive, I used these:

sudo borg info /media/ray/SSD/BorgArchive::WordDocs
sudo borg list /media/ray/SSD/BorgArchive::WordDocs > /media/ray/SSD/BorgArchive-WordDocs_List.txt

The last part of the list command is new, steering output of a long file list to a text file. That file could be useful both to check the contents of the archive and to zip along with the Borg archive for future consultation.

To mount my WordDocs archive and look at its contents using the Ubuntu file manager, apparently I could specify that archive thus:

borg mount /media/ray/SSD/BorgArchive::WordDocs /mnt/borg FolderName

where /mnt/borg was my specified mount point. I could leave off the FolderName if I wanted to look at the entire WordDocs archive, and I could leave off the “::WordDocs” part if I wanted to mount the whole BorgArchive repository. I did not use these commands, but my understanding was that the broader commands could be much slower if there were many archives in the specified repository.

To avoid such delays, if I knew exactly what I wanted to restore from an archive, I could use extract instead of mount, making sure to locate myself in the target folder (to which the files would be extracted) before running this command:

borg extract /media/ray/SSD/BorgArchive::WordDocs FolderName

I wasn’t sure if the FolderName part worked here; I only used this command at the archive (i.e., ::WordDocs) level.

The recreate command could manipulate an archive in various ways. I used it to delete the contents of my WordDocs archive by using something like this:

sudo borg recreate /media/ray/SSD/BorgArchive --exclude media/ray

To delete the entire WordDocs archive, I used this:

sudo borg delete /media/ray/SSD/BorgArchive::WordDocs

 

1 thought on “Massive File Compression Using Deduplication in Btrfs and Borg”

  1. Thomas Waldmann

    Good writeup, some feedback:

    When backing up stuff like text / word processor documents primarily and the overall volume being not too big (rather GBs than TBs), borg can maybe achieve better dedup with a more fine-granular chunker setting, see the --chunker-params docs. This comes at the cost of higher management overhead (esp. RAM need for hash tables), but might be tolerable or even preferable for small to mid sized repositories if space saving is your primary goal.

    One thing slightly confusing in your blog post / usage is that you name your borg repository “BorgArchive” (not: “BorgRepo”). “Archive” is also borg terminology, but refers to the result of a single backup operation, which creates one more borg archive inside your borg repository.

    About getting a full drive: yeah, either use dd and pipe it into borg or directly let borg read the device using --read-special. The drive contents must not change while you do that, of course – for consistency reasons.

    sudo usage: be careful, it is recommended to always use same user to access a borg repo to avoid creating a mix of differently owned files in the repo. so either always use root or always use your normal user. for FUSE (borg mount), check the fuse configuration man page if you run into permission issues.

    space saving: you were mostly referring to space saving due to compression and “inner deduplication” of your data set. if you use borg for longer (for many archives in a repo), you will also notice the “historical dedup” – creating another archive of similar data takes almost no additional space. Guess this is the biggest advantage, you can have a lots of backups without wasting lots of space (and, due to the way how it works, you can easily delete any of these backup archives without affecting another one).

