In a previous post, I worked through various difficulties toward the goal of producing a readable offline copy of selected websites. This post presents a simplified and refined (and in some instances perhaps corrected) version of that previous post’s contents.
First, let me review what the title of this post means. A readable copy of selected webpages would differ from an XML backup: the latter would be hard to read (at least without some tinkering), but would require seconds rather than days to restore. I wanted both kinds; each had advantages for certain needs.
The readable offline copy would be stored on my hard drive. Thus it would let me view the backed-up webpages when Internet connections were unavailable or unreliable. The readable offline copy would also be more useful than the online original for some purposes. First, views of different pages within the offline copy would be much faster than many Internet connections would allow. In addition, there might be some tasks for which a self-contained version of a blog might be preferable (e.g., when giving an exam about the contents of a website, without allowing students to actually go online). And if I did ever have to restore my online material from scratch, the offline copy would provide a point of reference, to confirm how things were supposed to look when I was done.
Next, regarding the “integrated” part of the title. Some websites, such as those of CNN and eBay, have very little in common. Users will rarely click on a link in eBay and expect to be taken to a CNN webpage, or vice versa. That was not my situation. The websites I wanted to back up were my own blogs. Many of my blog posts contained links to one another. If I backed up each blog separately, clicking a link in one would not open the relevant page in the other. I wanted a backup in which clicking a link to Ray’s Blog B from within a page in Ray’s Blog A would take me directly to an offline copy of that Blog B page, even though it was in another blog. In short, I wanted an integrated readable offline copy of multiple blogs.
The title of this post says that I had previously decided to use Wget to produce that offline copy. Wget was a command-line program. The mission, then, was to work up the command(s) needed to produce that offline copy. This post provides a short version of the more extended explorations reported in the previous post. This post also adds a few things that I hadn’t attempted previously.
Using Wget to Achieve the Objectives
As far as I knew, I would have to use a single Wget command, not multiple commands, if I hoped to wind up with an integrated copy of the online materials. A single command containing everything would be very long. Apparently Windows 7 would accommodate very long commands. But overly complicated commands could be disorganized and could make troubleshooting difficult.
The approach I took was, instead, to write a simple Wget command, and to let it draw upon related files that would contain some of the details that I could have packed into a longer Wget command. My wget command was as follows:
wget 2> "D:\Blogs Backup\wgeterr.log"
That command ran Wget and created a wgeterr.log file, in the Blogs Backup folder, containing any error messages. That folder had to exist already; if it didn’t, Wget would return an error: “The system cannot find the path specified.” The wgeterr.log file was overwritten on each run of the Wget command (it would have been appended to instead if the command had used “>>” rather than “>”), so I checked that log file frequently for clues as to what might or might not be working in my download.
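As a minimal sketch of what that redirection does: Wget writes its progress and error messages to stderr (stream 2), so “2>” captures them in a log file. The example below uses a POSIX shell, but the “2>”, “>”, and “>>” operators behave the same way at the Windows command prompt; the filenames are placeholders.

```shell
# Wget writes its messages to stderr (stream 2), so "2>" captures them.
# Here, ls stands in for wget; its error message lands in the log file.
ls no-such-file 2> wgeterr.log || true

# A single ">" truncates the target on every run; ">>" would append:
echo "run 1" > demo.log
echo "run 2" > demo.log     # overwrites: demo.log now holds only "run 2"
echo "run 3" >> demo.log    # appends: demo.log now holds two lines
```

This is why the log file contained only the most recent run’s messages.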
That Wget command did not specify where the backup itself should be created. In fact, that wget command failed to specify a lot of things. That’s because, as I say, I wanted to keep the Wget command simple. I had put most of the hairy details in a separate file called wgetrc. Wget would look automatically for that wgetrc file, and would use it if found. I stored the wgetrc file in the C:\Program Files (x86)\GnuWin32\etc folder. The full contents of my wgetrc parameter file appear in another post, with built-in descriptions of the settings I used. (The version shown there has been updated in light of the following discussion.)
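To give a rough idea of the format, a wgetrc along the following lines would cover the kinds of settings discussed in this post. The values here are illustrative placeholders, not my actual file (which appears in that other post); the parameter names themselves are documented wgetrc commands.

```
# Illustrative wgetrc sketch -- placeholder values only
recursive = on
page_requisites = on
convert_links = on
wait = 1
random_wait = on
dir_prefix = D:/Blogs Backup
input = D:/Blogs Backup/IncludeURLs.txt
```

Each line takes the form “name = value,” and lines beginning with “#” are comments.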
Some of the settings in wgetrc involved potentially long lists. It would have been awkward to jam those lists into wgetrc. Doing so would have created long lines that would be difficult to proofread and troubleshoot. Instead, I used options within wgetrc to point toward still other files containing those lists. Or at least I tried to. As noted in my wgetrc file, some of the desired options did not seem to work, regardless of how I tried to invoke them.
One option that did work was the “input” option. It told Wget which URLs I wanted to include in the download. As shown in wgetrc, I used the input parameter to point to another file containing the list of desired URLs. I called that other file IncludeURLs.txt. The contents of my IncludeURLs.txt file (created in a plain text editor like Notepad) were just the URLs of the blogs I wanted to back up. The list started with these:
The list continued from there. It included most of the blogs specified in the list on my homepage. Several others did not fit within the scope of this attempt: it may have been possible to use Wget to back them up, but I did not seriously pursue their inclusion in this project. Specifically, brief attempts to include the Scribd.com and YouTube.com accounts listed on my homepage were not successful. I had found, but had not yet succeeded with, a seven-year-old tutorial on using Wget to back up LiveJournal. The best I could manage for a WYSIWYG download of my LiveJournal blog was to use Semagic’s Links > Synchronization option, producing a single HTML page containing all posts.
The Blogger blog on my list might not have needed to be included, since I was no longer adding posts to it, except that I wanted to capture the latest comments. Moreover, I had discovered that Blogger no longer offered that blog’s template (i.e., the layout or theme that made it appear as it did); indeed, I had already had to restore that template from my own backup or else give the blog an entirely different appearance. So I also felt it would be best to have spare copies of the template in case of emergency.
The first time I listed the desired URLs in IncludeURLs.txt, I arranged them in approximate order of size, from smallest to largest. That is, I began the download with the blogs containing the least material, so as to see what was happening with different blogs, before getting tied up in the blogs at the end of the list, each of which contained many posts. For test purposes, while figuring out the necessary parameters, sometimes I also used an alternative form of IncludeURLs.txt, containing just one or two blog URLs. (I could have specified those on the command line or in wgetrc instead.)
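For illustration, the IncludeURLs.txt file was nothing more than one URL per line, along these lines (the blog names here are placeholders, not my actual list):

```
https://firstblog.wordpress.com/
https://secondblog.wordpress.com/
https://thirdblog.blogspot.com/
```

Trimming this file down to one or two lines was all it took to produce the small test runs just described.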
To emphasize, this wasn’t just a grab-bag of random URLs; these were websites that were especially appropriate for integration because of their numerous interconnections. If they had been substantially unrelated, it would probably have made more sense to devise separate Wget commands and wgetrc files for each, so that I could update them as needed without necessarily having to download all of them every time.
Then again, even a bunch of unrelated websites could probably be stitched together into a single package by creating an HTML page with links to the desired websites. That is, even if the downloaded webpages didn’t refer directly to each other, as my own blog posts often did, it would still be possible to go to that manually created HTML links page, within the offline download, and click on the links that would take the user to specified pages within the various downloaded websites. Most of my blogs (above) contained links, in the sidebar, to Archives and My Other Blogs pages that would serve a similar function. That is, as inspection of my blogs may demonstrate, it would be possible to use links between those two pages and my various posts to move around in my blogs and to view their respective contents.
Once all this was sorted out, the wget command ran. I found that the command window in which it ran would display activity, reporting many “Connecting” and “Request sent” messages. If this didn’t happen — if the command just seemed to complete without anything being downloaded — then something (probably something in the foregoing discussion) was not right.
Social Media Site Problems
As described above, I used an external text file (IncludeURLs.txt) to list the URLs that I wanted to download, and I invoked that file with an appropriate reference in wgetrc, which Wget would automatically use if available. As shown in the separate wgetrc post, I tried to do the same thing — to use external files, referenced within wgetrc — to handle lists of domains and directories that I specifically wanted to include or exclude. Unfortunately, it did not appear that those settings — or even their command-line or wgetrc-internal counterparts — were working.
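For reference, the exclusions I was attempting looked something like this in wgetrc. The parameter names (exclude_domains, exclude_directories) are documented wgetrc commands taking comma-separated lists; the particular domains shown are just illustrations of the social media services at issue, and as discussed next, these settings did not behave as I expected.

```
exclude_domains = twitter.com,facebook.com,digg.com
exclude_directories = /share,/like
```

In principle, these should have kept Wget from following links into those domains and directories at all.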
The specific problem I was having had to do with social media sites. It seemed that none of wgetrc’s “exclude” options were preventing Wget from including separate folders bearing the names of various social media websites (e.g., Twitter, Facebook, Digg). This was not a small problem. It looked like Wget was trying to download some kind of Twitter link (and similarly for several other social media services) for each of the posts that I was downloading. In addition to the unwanted clutter, this process added a huge amount of time to what should have been a relatively straightforward process. For instance, I let Wget run for ten hours, in an attempt to finish downloading just one blog, and finally I killed it. (Wget could be killed on the command line with Ctrl-C, or by closing the command window in which it was running, or by invoking the Windows 7 Task Manager via Ctrl-Alt-Del.)
Since the process had not terminated naturally (and might never do so), I was not entirely confident, even after all that time, that I did have a full download of the desired blog; and as described below, it was difficult to do a manual count of posts and downloads to be sure. (A few years later, doing another run of wget, I realized that I could make a copy of wgeterr.log and view it. Viewing a copy, rather than the original, would avoid causing problems if wget tried to write to wgeterr.log and found that it was not available because I was viewing it. The copy of wgeterr.log would show whether wget was still doing legitimate backup or was hung up somewhere. On that occasion, eight hours after starting, I saw that wget was in fact still backing up files. See below for an explanation of what it was doing during all those hours.)
Eventually, it occurred to me that I might be having these problems because links to social media sites appeared on each of my WordPress blog posts. My wgetrc file contained a parameter, “page_requisites = on,” that instructed Wget to download all pages needed for proper page display. Possibly Wget was interpreting the Twitter and Facebook “Like” and “Share” links on each blog post as something that would have to be “downloaded,” in some sense, in order to provide a proper page display. I could understand the need to copy the Twitter and Facebook icons, but it was not necessary to download the actual Twitter and Facebook pages to which the icon links would lead. So this appeared to be a bug.
Since Wget’s existing parameters appeared unable to override that behavior, I looked for a workaround. I went into the WordPress Settings > Sharing area for each of my blogs. As of summer 2016, this had changed: I now had to proceed further, into the > Sharing Buttons tab. The procedure had also changed. Formerly, I would drag all of the icons (e.g., for Twitter and Facebook) from the Enabled Services area to the Available Services area. Now, instead, I would just click on grayed buttons (e.g., Facebook) in the “Edit visible buttons” area (which became visible when I checked “Edit sharing buttons”), and that would remove them from the “Share This” section. Either way, once all the buttons were removed from the Enabled Services or Share This section, I had to go to the bottom of the Sharing Settings page and click Save Changes. This change in WordPress was helpful: under the old method, removing the buttons would only take effect if I removed them from the Enabled Services area, clicked Save Changes, and then went through those same two steps a second time.
So that was the solution. Once I had the sharing icons out of the Enabled Services or Share This area, the Wget command (above) gave me only the desired download folder, without any folders for Twitter or Facebook. I would have to remember to go back into the Settings > Sharing page, for each of my WordPress blogs, and manually restore the sharing buttons when I was done with the Wget process.
Recap and Results
In the end, I succeeded. To review briefly, I developed a Wget command, an accompanying wgetrc file with various notes and parameters, and (as explained in that wgetrc file) another accompanying file listing the URLs from which I wanted to download materials. The wgetrc file was customized for my WordPress blogs, but it also worked with my Blogger blog. Running the Wget command generated a log file that I checked to see if I was missing anything significant in the download process. To make the download relatively fast and compact — really, to get it to complete within a reasonable amount of time — I also had to go into the Settings in my WordPress blogs and remove social media sharing buttons, if only temporarily.
With those things done, I ran the Wget command and obtained an integrated, readable, offline download of my WordPress blogs. As described above, the download was integrated in the sense that, when I clicked links leading from one blog to another, and to different posts within the various blogs, those links would take me directly to the desired locations. The download was readable in the sense that it was not an XML download suitable for actual blog backup-and-restore operations: it looked just like my online blogs. The download was offline in the sense that, when I clicked links among my blogs, I was taken directly to the desired download page without having to go online. I had not attempted to download the many external pages to which my blog posts linked, so clicking on a link that did not lead to another of my own blog posts would have produced an attempt to go online and find the page in question.
This download of all of my blogs took a total of about 2.5 hours. I don’t know how much faster it would have run if I had not used a speed control parameter and/or had not required random delays in wgetrc; I also don’t know whether WordPress would have banned me for downloading in that system-hogging way. I let the download proceed while doing other work. It did not seem to slow down my computer at all, so I did not mind this slow pace. Note that some of my posts were really long, and few were very short, but few posts had pictures and none contained video, so downloads of other blogs might yield different results.
I prepared a spreadsheet to check the numbers of downloaded files, as shown in Windows Explorer (or, in the picture shown above, Explorer++). I hoped to verify that I was getting one index.html file for each post, page, and comment on each blog. It turned out that I was also getting a separate file within its own folder for each picture, and that some comments (perhaps those made by me) may have been generating more than one downloaded file. So a precise numerical check was not possible without investing more time than I could spare. It seemed I would be limited to a spot check consisting of random clicking around within the downloaded files, starting with a given index.html file, to achieve some degree of verification that the download seemed to be working as intended.
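Where a shell is available, a rough counterpart of that spreadsheet count would be to tally index.html files under each blog’s download folder. This is my own sketch, not part of the original procedure; “blogs-backup” is a placeholder for the actual Wget output folder.

```shell
# Count index.html files under each blog's download folder.
# "blogs-backup" is a placeholder for the actual Wget output folder.
for blog in blogs-backup/*/; do
  count=$(find "$blog" -name index.html | wc -l)
  printf '%s\t%s\n' "$blog" "$count"
done
```

As noted above, though, the per-post file counts were inflated by pictures and duplicated comment files, so any such tally could only support a rough comparison, not an exact verification.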
In that spot check, I noticed some failures of integration — some exceptions, that is, where the download saved a link to the online version of one of my blog pages or posts, rather than to the downloaded offline version. I noticed several such exceptions in the downloaded version of my list of blogs. I did not presently have an explanation for those exceptions. They appeared to indicate failures in wgetrc’s convert_links parameter.
Altogether, the target folder on my computer’s drive, containing all of the downloaded files produced by Wget, amounted to 101MB. I decided to zip it, so that its internal contents would not come up in searches and would not appear when I ran drive-cleanup searches for duplicate files. (Others, using other file retrieval schemes, might want those files to remain visible). Zipping it meant that I could save the whole download in a single compressed 27MB file. I could then do the same with later downloads, saving a series of them for purposes of keeping a series of historical copies without filling much drive space.
Updates: Subsequent Tries
October 2014. In another run in early October 2014, I found that I was getting lots of errors in my log file. Many of these errors said, “WARNING: cannot verify [my blogname’s] certificate, issued by . . .” and “Unable to locally verify the issuer’s authority.” A search led to the tentative impression that the URLs specified in my IncludeURLs.txt file (above) should begin with https rather than http. That run did complete; I would have to test this next time around. Also, I noticed that the download was much larger (286MB), possibly due to imperfections in my Wget download of a LiveJournal blog that I decided to include. In another change, this time I used the “solid archive” option in my zip program (WinRAR). That produced a much smaller zipped archive (11MB) with a much tighter compression ratio (3%).
January 2015. When I ran the program again in January 2015, I got an error:
wget.exe – Entry Point Not Found
The procedure entry point EVP_md2 could not be located in the dynamic link library LIBEAY32.dll
A search led to the thought that perhaps I needed to make sure that read-only permission was turned off in all relevant folders. But doing that did not solve the problem. Another possibility was that I needed an updated version of LIBEAY32.dll. A search of my computer revealed a number of files of that name, in varying locations, of varying dates. I guessed that the one I was concerned about here would be the one in C:\Program Files (x86)\GnuWin32\bin, where wget.exe was located. Alberto Martinez suggested links that ultimately involved downloading GnuWin32, which at that point had apparently last been updated in 2010 — which was well before the date of most copies of libeay32.dll on my computer, albeit not the one in the wget folder.
That did not seem likely to be the solution; I had not needed an updated version of LIBEAY32.dll previously, and I had not changed anything about wget since my last run. Instead, I tried renaming the copy in that wget folder to libeay32.dll.old, and copying one of the newest ones from elsewhere on my computer to the wget folder. (The contents of neither file were intelligible in Notepad.) That didn’t help either. It eventually seemed that the problem was simply that I was attempting to run a file called “Run wget.bat,” though I was not sure why that would be an issue. The problem seemed to change when I renamed it to wget.bat.
Once I did that, and tried running wget.bat again, the command contained in that batch file (wget 2> “I:\Current\wgeterr.log”) kept repeating very rapidly while producing no output. That also happened if I ran that command directly on the command line, without using a batch file. I tried uninstalling wget. That involved running the installation program and checking the Uninstall box. That attempt produced an error message: “Setup was unable to create the directory ‘C:\Program Files (x86)\GnuWin32\etc’. Error 5: Access is denied.” I think that was the installer’s bass-ackwards way of saying that the etc subfolder had now been deleted, because in fact it had. (Warning: this also wiped out that copy of my modified wgetrc file.) Then I re-ran the installer to install wget (and restored my modified wgetrc). Sadly, uninstalling and reinstalling did not solve the problem of rapid scrolling. Another possibility: I used my right-click Take Ownership option in Windows Explorer (installed via Ultimate Windows Tweaker > Additional Tweaks) to take ownership of all folders on my drive (other than special ones like System Volume Information). Still, alas, no joy.
It developed that the problem was simply that the wget batch file was not leading to the folder where wget.exe was installed (namely, C:\Program Files (x86)\GnuWin32\bin). I was not sure why this had changed. To resolve that problem, I revised my wget.bat file to look like this:
:: WGET.BAT
C:
cd "\Program Files (x86)\GnuWin32\bin"
echo off
cls
echo.
echo.
echo Now running wget to back up my blogs ...
echo.
echo.
wget 2> "D:\Current\wgeterr.log"
cls
echo.
echo.
echo wget backup should now be completed ...
echo.
echo.
pause
Also, this time I reviewed the wgeterr.log file in more detail than I had after other recent uses of the wget command. It was voluminous. I found that, by viewing the contents of the log file in a spreadsheet, I could apply various tests to narrow down the quantity of material to review. In particular, after eliminating rows containing extraneous information, I used the REVERSE function to identify entries immediately preceding “ERROR 404: Not Found” messages. From those entries I could build a batch file of commands to verify, first, that the specified blog posts did still exist online, like this:
start Firefox "http://blogname.wordpress.com/postname.html"
As it turned out, the ERROR 404 message was correct: those posts did not in fact exist, at least not at the addresses specified in wgeterr.log. So then there was the question of whether they should have existed, or why wget tried to download them in the first place. A few were references to external websites; possibly the ERROR 404 messages meant that the external links were no longer working, as I could perhaps investigate using a broken link checker. Others may have been deleted posts. There weren’t many, and as far as I could tell I was not missing any significant content. But the inquiry did raise a further question of whether I should perhaps be comparing present and prior versions of the Archives lists of files in my blogs — because those Archives lists were composed automatically, and would thus change without notice, permitting silent disappearance of a blog post that a blogger might not detect.
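A shell equivalent of that spreadsheet sifting (my own sketch, not the method described above) is to print the log line immediately preceding each 404 entry, working on a copy of the log so that wget can keep writing to the original; “wgeterr-copy.log” is a placeholder filename.

```shell
# Print the log line that precedes each "ERROR 404" line in a copy of
# the log. grep -B 1 includes one line of leading context per match;
# the trailing greps strip the matches themselves and the "--" group
# separators that grep inserts between non-adjacent matches.
grep -B 1 "ERROR 404: Not Found" wgeterr-copy.log |
  grep -v "ERROR 404" |
  grep -v '^--$'
```

The surviving lines are the requests that immediately preceded each 404, which in Wget’s log typically identify the URL that failed.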
Early 2016. In a run in early 2016, a review of wgeterr.log revealed that wget spent eight hours spinning its wheels in an attempt to back up a single blog post. The wget batch file (above) did finish, more than 13 hours after starting. The situation with that single blog post was that, for some reason, wget kept going into deeper and deeper subdirectories, each of which seemed to contain more or less the same material as its parent. The process ended at a point nearly 30 subfolders deep. The post itself was only about 300K, but in my wget backup, the entire subdirectory tree pertaining to that single post contained nearly 1,600 files comprising more than 400MB. This process was not prevented by wgetrc because I had set the recursion level to infinite.
To prevent that problem, I tried to figure out how many subdirectory levels would be normal. I used DIR and Excel to count the number of levels throughout the entire wget backup. I sorted the Excel list to exclude the posts that had somehow generated very deep levels of subfolders. There were not many of these. It appeared that my backup was complete if it went down to the level of the first subfolder named “amp.” That was the level at which I found an index.html file containing the complete blog post. That level was five levels below the level of the blog name (e.g., blogname.wordpress.com). Five also happened to be wget’s default maximum recursion level. In my Excel spreadsheet, I looked at a number of folders that were six levels deep. It appeared that, at that level, all were backing up the amp subfolder. Since the index.html file at that level already appeared to contain what I wanted to back up, it appeared the sixth level was superfluous. Therefore, I set reclevel = 5 in wgetrc. I set aside the previous backup and ran wget again. This time, the entire backup finished in just two hours. Instead of 3,039 files comprising 452MB, I had 1,826 files filling 111MB. I was concerned that comments on a post might be stored in a deeper sublevel. But a look at some posts with comments seemed to indicate that the post and the comments were all captured in one index.html file and appeared just as they would appear online.
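The depth count itself can be sketched in shell form. This is my own rough analogue of the DIR-plus-Excel count described above, not the original procedure; “blogs-backup” is a placeholder for the backup root.

```shell
# Report the maximum subdirectory depth beneath the backup root,
# counted relative to that root. "blogs-backup" is a placeholder.
root="blogs-backup"
base=$(printf '%s\n' "$root" | awk -F/ '{ print NF }')
find "$root" -type d |
  awk -F/ -v base="$base" '{ d = NF - base; if (d > max) max = d } END { print max + 0 }'
```

A post afflicted by the runaway-subfolder problem would show up here as an outlier far beyond the normal five levels, flagging it for the kind of inspection (and the reclevel = 5 fix) described above.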