This post reports on a long and detailed investigation of Wget, a command-line program that could be used to download a readable offline copy of a WordPress blog. The discussion begins with an explanation of the purpose and meaning of this quest. The final paragraphs of this post present the solution I arrived at. The extended discussion may be useful for those who want to understand what went into that solution. A subsequent post provides a refined version of the information provided here.
I was interested in finding a good tool to back up my WordPress and Blogger blogs. A review suggested that there were many possibilities, and also that there could be at least two distinct objectives: to have a backup that could be restored, in the event of disaster at the website host (e.g., WordPress), or else to download a useful copy of the blog (or any website) that I could store and read offline — to preserve a historical snapshot, for instance, or to help me review what I had written previously, if I again found myself on a camping trip without Internet access.
The latter kind of backup, for offline browsing, would contain altered copies of the original blog posts. Specifically, the links within those posts would no longer refer to other pages on, say, WordPress, but would rather be changed to point to other pages within the offline copy. Of course, it was not possible to download the entire Internet; some links would always remain “live” rather than “canned.” That is, some links in the offline copy might point to other pages stored on my hard drive (or DVD, or USB memory stick, or wherever), but other links would continue to point to actual webpages out there on the Internet somewhere.
So, to emphasize, my previous review had suggested that I could use a backup tool or procedure that would download the original files, in XML format, from my blog host (e.g., WordPress), suitable for restoration to that host or some other; or I could use an approach that would save the original XML files in HTML format that I could read offline, using a web browser (e.g., Firefox), with some links in those HTML files pointing to other downloaded files on my system, but with no easy way to use those altered offline files for purposes of restoring my original blog at my blog host. Of course, a person could use both approaches, for their separate purposes: keep a virgin XML backup, that is, and also keep a readable backup of the original blog for offline browsing. This post focuses on the latter.
To obtain a readable download of my blog, I had investigated HTTrack in its Windows version, WinHTTrack. That investigation had not convinced me that HTTrack was the way to go. In the course of that investigation, I learned that places like Softpedia and AlternativeTo listed a variety of other programs that one could use for purposes of generating an offline backup of a blog or other website. Wget was prominent, though not necessarily the leader, among those alternatives to HTTrack. Like HTTrack, Wget was available on multiple platforms, meaning that the work I did in learning and using Wget in Windows would also be useful if I found myself having to do similar work on a Linux or Mac computer. A comparison suggested that Wget had certain technical advantages over HTTrack.
As might be suggested by its ungainly name, Wget was a command-line program. It was possible to obtain GUI front ends that would dress up Wget in a putatively more user-friendly form. An example: VisualWget. I felt that HTTrack was already a sufficient GUI for Wget or something similar to it, so I did not investigate these other variations on that theme.
Of course, command-line programs were so called because they required users to type commands, on the command line, in order to make the programs run. Command-line programs could be complex, difficult, and unforgiving. Then again, I had invested an inordinate amount of time into my study of HTTrack and had come away disappointed. If I had invested that time into Wget, I would probably have been well on my way to achieving good results by now. At this point, in fact, I preferred a command-line program, because one of my objections to HTTrack had been that I was trusting the “black box” of a canned software program to deliver reliable results, when the output of my HTTrack efforts did not warrant such trust. With the command line, I was more likely to be aware of exactly what I was requesting and getting.
There was, of course, the need to find a command line on which I could enter a Wget command. One way to get a command line was simply to go to Start > Run (or Search, if that was the available option) > CMD. Another way was to install a context menu (i.e., right-click) option in Windows Explorer that would allow me to “Open command window here.” That option was available in Ultimate Windows Tweaker > Additional Tweaks. The system’s file finder might also find copies of cmd.exe lying around elsewhere; double-clicking one of those could open up a command prompt, or could conceivably have horrible consequences, though the latter was not likely. There were also other ways to open command prompts.
Finding and Installing Wget
I decided that a practical way to learn Wget would be to study it with a focus on the present need for a blog backup, as distinct from trying to master all of its options for all purposes. The specific command I would use would depend on the version of Wget that I installed, so it seemed best to begin by finding and installing an appropriate version.
Wget came in various flavors. To find the best version, I went back to its roots. Wget was originally a Linux program. It was sometimes referred to as GNU Wget, where GNU was one of those waggish geek names (short for “GNU’s Not Unix”). That is, Wget originated in the GNU Project and was a part of the GNU operating system. From there, it had been adapted into multiple versions for various operating systems.
It appeared that I could get an official version for Windows from SourceForge. At that SourceForge page, I downloaded the installable package and installed it. I soon observed that the version I downloaded from that site, 1.11.4, dated from 2008 or 2009. The latest release, 1.14, dated from August 2012. It appeared that that version, presumably available for Linux, had not yet been developed into a Windows version. This did not seem worrisome. The functions I was seeking were probably basic and, as such, had presumably been part of Wget for years.
The installation included documentation, in the form of “man” (short for manual) pages. Wget had been around since 1996, and it appeared there were various versions floating around out there, so it seemed advisable to stick with the documentation included in my installation, using online sources (two recommended examples: the Wget 1.13.4 official manual and the About.com guide) only as needed to clear up ambiguities in the included documentation. The included documentation consisted of files called wget.pdf and wget-man.pdf, along with a link to a Windows Help file. The two PDFs seemed mostly redundant of one another, so I tended to use the longer and more explanatory one, wget.pdf. Its full name was GNU Wget 1.11.4: The non-interactive download utility by Hrvoje Nikšić and others (referred to here as simply the “manual”). (Note that, at this writing, the Linux versions were more current than this Windows companion.) The installed program’s link to the Windows help file did not work until I installed WinHlp32.exe (which Microsoft would only download, for me, via Internet Explorer) and rebooted.
Command-line programs would not necessarily run after being installed. Windows had to recognize the command I was typing. The “wget” command was most likely to be recognized if my command prompt was located in the folder where wget.exe was stored. Otherwise, typing “wget” at a command prompt would likely result in an error: “‘wget’ is not recognized as an internal or external command, operable program or batch file.”
To avoid such errors and make Wget run from the command line, I would want to make sure that the folder containing wget.exe was included in the system’s “path.” To determine what the path now contained, I typed PATH at a command prompt. I saw that both C:\Windows and C:\Windows\System32 were named in my path. In some cases, I could just put the executable file (in this case, wget.exe) in either of those folders, and then the command (in this case, wget) would be recognized. But that technique didn’t work this time. I guessed that my version of wget.exe needed certain supporting files to function. (That problem might not exist for portable versions of files, or possibly for older or newer versions of Wget.) Apparently I had to leave wget.exe in the folder where the installer put it (namely, C:\Program Files (x86)\GnuWin32\bin), so that wget.exe could find its supporting files, and just add that folder to the system’s path.
To add that folder to the path, I searched for advice and proceeded as follows: Windows > Start > Run > SystemPropertiesAdvanced.exe (or run SystemPropertiesAdvanced.exe from a command prompt) > Advanced tab > Environment Variables > User Variables > Edit (if the PATH variable already appears there) or New > Variable name: path > Variable value: C:\Program Files (x86)\GnuWin32\bin (or wherever wget.exe was located) > OK. I had to get all of the way out of that dialog for the change to be effective; at that point, “wget -h” did produce the Wget help information. So it appeared my installation was complete.
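A quick way to confirm that a path change like this took effect is to ask the system where it finds the program. The following is only a minimal sketch in Python, not part of the procedure described above; it assumes Python is installed, and `find_program` is a hypothetical helper name chosen for illustration.

```python
import shutil

def find_program(name):
    """Return the full path of an executable found on the PATH, or None."""
    return shutil.which(name)

location = find_program("wget")
if location is None:
    print("wget is not on the PATH (or the command window predates the change)")
else:
    print("wget found at", location)
```

Running it from a newly opened command window matters, because a window opened before the change keeps the old path.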
Figuring Out the Right Wget Command
wget --mirror -w 2 -p --html-extension --convert-links -P -H -Dwordpress.com ~/path/to/save/locally http://yourblog.wordpress.com
(Note that, as of this writing, WordPress, host of this post, irritatingly insisted on converting nonsensical URLs like that one, yourblog.wordpress.com (and even -Dwordpress.com), into hyperlinks; there did not seem to be anything I could do to prevent that. Saving it in the Code format caused it to scroll off the right side of the page.)
To be clear, all of the words and other information shown in that command were part of a single long command. Although that material does not fit on a single line here, I would not enter it as multiple lines.
Thus, Matt reportedly used a single command to achieve what I had been trying to achieve in HTTrack. This was one of the advantages of a command-driven approach: if you know what you are doing, you can get right to it. Of course, you could also save the command — in, say, a batch file — and run it, or schedule it to run, whenever, without having to re-research the proper commands. You could also include comments, in your batch file, to make note of various explanations and possibilities, so that it wouldn’t be necessary to start from scratch, and all your notes would be right there, if you decided to do things a bit differently at some point.
To learn about using Wget for purposes of backing up my WordPress blogs, I examined Matt’s command, one piece at a time. My examination began with the first word: wget. I knew that, typically, you begin a command with the name of the program being used.
Next: “--mirror” (alternately, “-m”). Explaining that, the manual (p. 20) said this:
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
In other words, the “--mirror” option included in Matt’s command was shorthand for several options that, taken together, would do what people would tend to want, when they were trying to mirror a website. So my learning process required me to examine those several options, one at a time: what did “-r -N -l inf --no-remove-listing” mean? According to the manual (p. 19), -r (alternately “--recursive”) meant “turn on recursive retrieving.” Recursive retrieving meant the process of following links and directory structure to move through a website or server (p. 24). In an http or html document, that meant retrieving the specified document, then all documents to which that document’s links pointed, then all documents pointed to in those documents’ links, and so forth, to the specified level. In an ftp address, recursive retrieving meant starting with the given address, then going to its subdirectories, then to the subdirectories’ subdirectories, and so forth, again to the specified level. In short, Wget used recursive retrieval, which on WordPress would apparently mean following links from the specified URL.
Next, the -N flag (alternately “--timestamping”) meant “turn on time-stamping” (p. 7). (Note that these flags were case-sensitive; -N was different from -n.) -N meant that Wget would check the time at which the local and remote versions of a file were last modified (p. 29). Wget would download the remote file to the local (i.e., the user’s) computer unless there already existed a local copy that was (a) the same size as the remote copy and (b) not older than the remote copy. Vincent Danen suggested, however, that this -N option might often fail to work “with dynamic sites being the norm.” He apparently meant that many websites regularly update elements on webpages, making them appear to have been revised, and thus needing updating, even when there has been no substantive change.
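The time-stamping rule just described can be written out as a small decision function. This is a sketch of the rule as stated in the paragraph above, not of Wget’s actual code; `needs_download` and the (size, modified time) tuples are hypothetical names and structures used only for illustration.

```python
def needs_download(local, remote):
    """Apply the -N rule as described in the text.

    `local` is None when no local copy exists. Otherwise `local` and
    `remote` are (size_in_bytes, modified_time) tuples, with larger
    modified_time values meaning more recent changes.
    """
    if local is None:
        return True
    local_size, local_mtime = local
    remote_size, remote_mtime = remote
    same_size = local_size == remote_size
    not_older = local_mtime >= remote_mtime
    # Skip the download only when both conditions hold.
    return not (same_size and not_older)

print(needs_download(None, (500, 100)))        # no local copy yet
print(needs_download((500, 100), (500, 100)))  # identical size, same age
print(needs_download((500, 100), (500, 200)))  # remote is newer
```

Danen’s caution fits this model: a dynamic page whose reported modification time changes on every request would always look newer than the local copy.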
The “-l” option (that’s an L, not a one; alternately “--level”) specified the maximum recursion depth (see -r, above) (p. 19). The default was 5. In other words, if -l was not included in a Wget command, Wget would follow links up to five levels away from the starting URL. In an http URL, that meant that Wget would download the file found at the specified URL (depth 0), plus all files to which that file linked (depth 1), plus all files to which those files linked (depth 2), and so on through the files found five links away from the start. So if you had six files or blog posts; and if file A linked to B, and B linked to C, and C linked to D, and D linked to E, and E linked to F, but otherwise there were no links among them; and if you pointed Wget to file A and said Fetch, and if you specified “-l 3” or “--level=3,” then, as I read the manual’s own “-r -l 2” example in its discussion of -p, your downloaded mirror would have files A, B, C, and D, and not E or F. To get them all, you would have had to specify “-l 5.” “-l inf” meant infinite recursion (p. 21): that is, just keep following links throughout the blog or website being mirrored, until you’ve followed them all. (Perhaps unexpectedly, “-l 0” (that’s a zero) meant the same as “-l inf.”)
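The effect of a depth limit on a chain of links can be checked with a toy model. The sketch below is not Wget’s code; it assumes that the starting page counts as depth 0 and that a page sitting exactly at the limit is saved but not searched for further links, which is how I read the manual’s “-r -l 2” example. The `crawl` function and the `chain` dictionary are hypothetical.

```python
from collections import deque

def crawl(links, start, max_depth=5):
    """Return the pages reached within max_depth link hops of start.

    `links` maps each page to the pages it links to.
    max_depth=None models "-l inf" (no limit).
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # page is kept, but its links are not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.append(target)
                queue.append((target, depth + 1))
    return seen

# Six posts, each linking only to the next one.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"]}

print(crawl(chain, "A", max_depth=3))     # ['A', 'B', 'C', 'D']
print(crawl(chain, "A", max_depth=5))     # all six pages
print(crawl(chain, "A", max_depth=None))  # all six pages
```

Under that assumption, a limit of 3 reaches the page three links from the start, and the default of 5 is just enough for a six-page chain.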
Finally, “--no-remove-listing” meant “do not remove the temporary listing files generated by FTP retrievals” (p. 18). I didn’t think I would be using FTP retrievals to back up a WordPress blog, so this seemed extraneous for my purposes. But then, it didn’t seem to hurt anything either; it was just a part of what was meant by “--mirror,” the second word in Matt’s command. To sum up, then, Matt’s “--mirror” option said, “Run Wget until you have followed all links on the specified URL, using timestamping to decide whether a local copy of a remote file needs to be updated.”
After “wget --mirror,” the next part of Matt’s command was “-w 2” (alternately “--wait=2”). According to the manual (p. 8), that meant wait two seconds between retrievals. This was recommended to reduce the load on the remote server. Boutell warned that failure to throttle the download request could cause my IP address to be banned from the target server, at which point my downloading options would presumably be limited or terminated.
Next, “-p” (alternately “--page-requisites,” p. 20) would tell Wget to download all of the files necessary to make an HTML page display properly. So if one of my blog posts had a colorful header, stored in a separate file, and also a picture, likewise stored in another file, those two extra files would be included in the mirror of that blog post. According to the manual’s discussion of this option, that held even for pages at the edge of the recursion limit: with -p, the images and stylesheets needed by the last pages reached under -l would still be fetched, whereas with -r and -l alone those “leaf” pages could be left without their requisites.
This discussion takes the next two items out of sequence. First, as its name suggests, the “--convert-links” flag (alternately “-k,” pp. 19-20) converted all hyperlinks in the downloaded files so that they would work locally: they would refer, that is, to other downloaded files, rather than to the online locations specified in the original (i.e., remote, online) files. The local links would be relative; that is, their saved addresses would convey instructions like, “Go up one folder, and from there go down into the subfolder named ‘Arkansas,’ and in that subfolder retrieve the file called ‘Fort Smith.’” Thanks to this relative addressing (not dependent, that is, upon specification of the absolute address, including the drive on which the files were stored), the entire set of downloaded files could be moved from one drive or folder or computer to another: they would continue to work as long as the whole downloaded set was kept together and its internal directory structure was not changed.
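To make the idea of a relative link concrete, here is a small Python sketch that computes the link text leading from one saved page to another. It illustrates relative addressing in general, not Wget’s own conversion routine; the folder and file names are invented for the example, and `relative_link` is a hypothetical helper.

```python
import posixpath

def relative_link(from_page, to_page):
    """Return the relative link leading from one saved file to another."""
    return posixpath.relpath(to_page, start=posixpath.dirname(from_page))

# From a page in the "Texas" folder to a page in the sibling "Arkansas" folder:
print(relative_link("blog/Texas/dallas.html", "blog/Arkansas/fort-smith.html"))
# ../Arkansas/fort-smith.html

# Two pages in the same folder need no folder names at all:
print(relative_link("blog/index.html", "blog/about.html"))
# about.html
```

Neither result mentions a drive letter or a top-level folder, which is why the whole set can be moved without breaking its internal links.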
The “--html-extension” part of Matt’s command (alternately “-E,” p. 12) said, basically, Please give every downloaded HTML page a name ending in .html. The purpose of this was to ensure that pages served from URLs with other endings (e.g., .asp, CGI) would be saved under names that a browser would open as web pages. Because Wget could not tell that a renamed local file corresponded to the original remote URL, such files would be downloaded again on every re-mirror; to prevent that, the manual recommended including -k and -K in the command. “-k” is described in the previous paragraph. “-K” (alternately “--backup-converted,” p. 20) would tell Wget to back up the original version of a converted (e.g., .asp or CGI) file with an .orig extension. Timestamping (above) would then use this .orig file to see whether the file’s modification date had changed (p. 30).
The purpose of the “-P” flag (alternately, “--directory-prefix=prefix,” p. 12) was to specify the directory where the download would be stored. For instance, “-PDownload” would create a subfolder called “Download” in the current folder (p. 41), and “-P/tmp” would create (apparently at the top level of the drive, thanks to the slash) a folder called “tmp” (p. 42), and then the downloaded files and folders would go there. Note the use of a forward rather than backward slash for directories in Linux commands like Wget: in a Linux shell the backslash was an escape character, telling the shell to treat the next character literally, so that a backslash at the end of a line made the command continue onto the next line, like this:
command -a -b other command stuff continues \
and this is still part of that same command \
because it’s a long command filling three lines
On that basis, within my limited understanding, Matt’s command did not seem to use this flag correctly: no directory was specified immediately after it, so Wget would presumably have taken the next item (“-H”) as the directory name. It seemed this might have been the original purpose of a later portion of his command (i.e., “~/path/to/save/locally”): the tilde (~) was a Linux artifact. While a command of “-P D:/Folder1/Subfolder2/” might have been valid in the Windows version of Wget, designating Subfolder2 as the location where the download would be stored, I suspected that a backslash would be the proper punctuation when included within quotation marks, which themselves would be needed when the path name had spaces in it:
-P "D:\Folder 1\Subfolder 2\"
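I assumed that the trailing slash was needed to clarify that “Subfolder 2” was the name of a folder, and not the name of a single file that was somehow supposed to contain all of the downloaded material. (The manual describes the -P argument simply as the directory prefix under which everything is saved, so the slash may not actually have been required.)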
The “-H” flag (alternately “--span-hosts,” p. 22) enabled spanning across hosts when doing recursive retrieval. This command to “span to any host” was contrasted against “-D” (alternately, “--domains=domain-list,” p. 22) and “--exclude-domains” (p. 25). Matt’s command used two of these in conjunction: “-H -Dwordpress.com” (evidently making sure not to include spaces, so as to avoid confusing the program as to what all these different web locations meant). That combination would allow a search into any domain (-H), if that’s where the links from the targeted webpage led, provided the domain in question was part of wordpress.com (-D). The purpose of this -H -D combination seemed to be that I would want my download to look on another server, if necessary, to find a picture or other file used in one of my blog posts. It would be up to me to know whether I was, perhaps, storing my images in some other place (e.g., flickr.com), so as to specify that other place in this command. Without such specification, I think, the remote image would just not appear in the downloaded copy of the post. Matt explained that he included these -H and -D options to make sure of capturing the blog’s stylesheets and images. In other words, if -D was wordpress.com, then the links from the page to which I directed Wget’s attention would be included in the mirror if they led to files in bluejeans.wordpress.com and any other wordpress.com domain. A more complex example would be like this:
wget -r -H -Dmit.edu,stanford.edu --exclude-domains philosophy.mit.edu http://www.mit.edu/
This example would anticipate the last of Matt’s options. It says, run Wget, starting at http://www.mit.edu/ and spanning all servers everywhere (-H), with two limitations: they have to be at mit.edu or stanford.edu, and they cannot include MIT’s philosophy department site. (It would also have been OK to combine the -r and -H into -rH, but I guessed that I could run into problems combining any commands consisting of more than one letter. For example, -rHDmit.edu,stanford.edu would probably have confused Wget. The order of the flags apparently did not matter, though it seemed best to name the target URL at the end.) Matt’s command ended with http://yourblog.wordpress.com, representing the URL of the WordPress blog being downloaded.
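The accept-and-exclude logic of that example can be modeled in a few lines of Python. This is a simplified sketch of the behavior as described here, not Wget’s actual matching code: it assumes that a host matches a domain when it equals the domain or ends with a dot followed by the domain, and that exclusions take priority. The function name `host_allowed` is hypothetical.

```python
def host_allowed(host, domains, excluded=()):
    """Decide whether a host would be followed under -H -D ... --exclude-domains."""
    def matches(name, domain):
        return name == domain or name.endswith("." + domain)

    host = host.lower()
    if any(matches(host, d) for d in excluded):
        return False
    return any(matches(host, d) for d in domains)

accept = ["mit.edu", "stanford.edu"]
exclude = ["philosophy.mit.edu"]

for host in ["www.mit.edu", "philosophy.mit.edu", "cs.stanford.edu", "www.cnn.com"]:
    print(host, host_allowed(host, accept, exclude))
```

The same function shows why “-H -Dwordpress.com” reaches well beyond a single blog: every host ending in wordpress.com passes the test.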
So, except for one or two tweaks, the manual seemed to suggest that Matt’s command would work for my purposes too. I decided to check this approach against other approaches. In an otherwise similar approach, Richard Baxter added the “--user-agent” option (alternately -U). His purpose in using this option echoed a concern in the manual (p. 15): some sites apparently block downloading attempts, like that of Wget, unless the probing computer (e.g., mine) is identified as a desirable inquirer. Baxter suggested pretending to be Google, by adding this to the Wget command line:
Peter Upfold offered other examples of user agents. But the manual, discussing --user-agent, also said, “Use of this option is discouraged, unless you really know what you are doing.” I could imagine a situation in which Google or the target site changed their policies, such that this --user-agent addition ultimately wound up flagging me rather than disguising me. In other words, I didn’t “really know what [I was] doing,” so I figured I might best hold this option in reserve in case I really needed it. I doubted that I would need it for WordPress; the idea that bloggers would want to back up their websites seemed noncontroversial.
Another source recommended using the “--no-parent” (alternately “-np,” p. 23) option to prevent downloading from any directory higher than the specified target. So, for example, if I pointed Wget to http://www.website.com/miscellany/Alabama, no files stored in http://www.website.com/miscellany would be downloaded, even if my links led there. I didn’t plan to target a specific subfolder when mirroring raywoodcock.wordpress.com, however, so at the moment this option didn’t seem too useful for me.
That source also recommended the “--limit-rate” option (no short alternative; p. 8) to keep from overloading the server. By default, values were expressed in bytes per second; for instance, “--limit-rate=20000” would be similar to the valid alternative “--limit-rate=20k.” I had seen repeated references to that particular value, so tentatively I would be including that option in my own Wget command.
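The difference between those two values is small but real. The following sketch converts a rate written with a suffix into plain bytes per second; it assumes, as I understand Wget’s convention, that “k” means 1,024 bytes and “m” means 1,024 × 1,024 bytes. The helper name `rate_to_bytes` is hypothetical and is not part of Wget.

```python
def rate_to_bytes(value):
    """Convert a rate such as '20k' or '20000' to bytes per second."""
    multipliers = {"k": 1024, "m": 1024 ** 2}  # assumed meaning of the suffixes
    text = value.strip().lower()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(float(text))

print(rate_to_bytes("20k"))    # 20480
print(rate_to_bytes("20000"))  # 20000
```

Under that assumption, “20k” permits about 2% more throughput than “20000,” which is why the two are similar rather than identical.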
Another possibility was to use -A (alternately “--accept,” p. 22) to specify the only kinds of filename extensions or patterns that I wanted to mirror. For instance, “-A html” would download only files with an .html extension. While I could see that HTTrack had downloaded various blog posts as .html files, I suspected that this was part of HTTrack’s conversion process; the URLs for those posts did not end in .html. So it did not seem that I would use this option to download my WordPress blogs.
Assembling and Revising the Elements
The remarks in the previous section suggested the following command, to make a downloaded copy of a WordPress blog that would be readable offline in a web browser like Firefox or Internet Explorer:
wget --mirror --wait=2 --page-requisites --convert-links --html-extension --backup-converted --directory-prefix="F:\Some Folder\Blogs Backup\Blog 1\" --span-hosts --domains=wordpress.com https://raywoodcockslatest.wordpress.com/
or in the abbreviated version
wget -mpEkKH -w 2 -P "F:\Some Folder\Blogs Backup\Blog 1\" -Dwordpress.com https://raywoodcockslatest.wordpress.com/
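One way to double-check that a bundle of single-letter flags says the same thing as the long version is to expand it mechanically. The sketch below uses only the short and long names discussed in this post; the `LONG_NAMES` table and `expand_bundle` function are hypothetical helpers, and the sketch handles only flags that take no argument (so not -w, -P, or -D).

```python
# Short-to-long names, limited to the argument-free flags discussed above.
LONG_NAMES = {
    "m": "--mirror",
    "p": "--page-requisites",
    "E": "--html-extension",
    "k": "--convert-links",
    "K": "--backup-converted",
    "H": "--span-hosts",
}

def expand_bundle(bundle):
    """Expand a bundle such as '-mpEkKH' into its long option names."""
    return [LONG_NAMES[letter] for letter in bundle.lstrip("-")]

for option in expand_bundle("-mpEkKH"):
    print(option)
```

Because the lookup is case-sensitive, it also shows at a glance that -k and -K are different options.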
I copied and pasted the abbreviated version onto a command line (using right-click to bring up the paste option). This produced some unwelcome information. First, I got this:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
A search did not yield an immediately obvious explanation. It appeared, though, that this might have something to do with what the manual (p. 31) described as wget’s global startup file “wgetrc.” It also seemed that the lines just quoted might be merely informational in nature, probably reporting system values of some sort: a search led to several posts in which people were asking how they could suppress those lines. BobStein suggested this was a bug in Wget version 1.11.4, and proposed adding “2>>wgeterr.log” as a fix. That worked for me, for purposes of suppression; I did not appear to wind up with any resulting wgeterr.log file.
I also got another message each time I ran that command (i.e., not only the last time, when I had added that wgeterr.log part):
wget: missing URL
Usage: wget [OPTION]... [URL]...
Try `wget --help' for more options.
This seemed to be saying that Wget did not like the form in which I had provided the URL at the end of the command. I tried again without the trailing slash. That wasn’t the solution. A search led to the possibilities that one of my command-line options was claiming the ending URL as its own, and that perhaps I should be using single quotes. I tried that. Now I saw that, correction, I was getting information, in F:\Some Folder\Blogs Backup\wgeterr.log. The “>>” symbols in my revised command were causing error messages to accumulate, so I deleted the file and ran the command again. Now wgeterr.log contained the following:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
--2014-02-13 03:07:06--  http://folder%5Cblogs/
Resolving folder\blogs... failed: No data record of requested type.
wget: unable to resolve host address `folder\blogs'
After correcting the forward slashes, I was able to open C:\Program Files (x86)\GnuWin32\etc\wgetrc in Notepad. It began with these words, in one long paragraph evidently unchanged from the Linux version of Wget: “Sample Wget initialization file .wgetrc. You can use this file to change the default behaviour of wget or to avoid having to type many many command-line options.” I didn’t plan to take that approach at this time.
Next, I looked into the indication that my command had caused Wget to look for http://folder/blogs. Apparently my quotation marks (in -P “F:\Some Folder\Blogs Backup\Blog 1\”) were not working right. It seemed that P was finding nothing relevant in “F:\Some Folder\Blogs Backup\Blog 1\” and was therefore skipping on down the line, using the ending https://raywoodcockslatest.wordpress.com/ as its argument, thereby leaving Wget without a specified URL to search. (Note: at this point, some funky interaction between my mouse and the WordPress post editor resulted in the deletion of some material when I wasn’t looking; I am having to reconstruct what I originally wrote here. This has happened several times lately; I figure I had better mention it in case any of these efforts at reconstruction result in confusion.)
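There is another possible explanation, which I did not verify against this build of Wget. Under the argument-quoting rules used by Microsoft’s C runtime, a backslash placed immediately before a double quotation mark makes that quotation mark a literal character rather than the end of the quoted text. If Wget for Windows followed those rules, the closing quote after “Blog 1\” would not have closed anything, and the rest of the line, URL included, would have been absorbed into the -P argument. Python’s `subprocess.list2cmdline` function builds command lines according to those same rules, so it can show what a safely quoted version would look like:

```python
import subprocess

# The arguments as intended, one list item per argument.
args = [
    "wget", "-mpEkKH", "-w", "2",
    "-P", "F:\\Some Folder\\Blogs Backup\\Blog 1\\",
    "-Dwordpress.com",
    "https://raywoodcockslatest.wordpress.com/",
]

# Build the single command-line string that those rules expect.
command_line = subprocess.list2cmdline(args)
print(command_line)
```

The printed line doubles the final backslash of the folder name before the closing quote. Using forward slashes, or leaving off the trailing backslash, would sidestep the issue altogether.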
I tried again with various combinations of single and double quotes, and forward and back slashes. Eventually, I tried this:
wget -mpEkKH -w 2 -P "F:/Some Folder/Blogs Backup/Blog 1/" -Dwordpress.com https://raywoodcockslatest.wordpress.com 2>>wgeterr.log
As before, there was a delay. Then I realized that the Blog 1 folder was filling with subdirectories. The command appeared to be working. Too well, as it turned out. I stopped it after an hour and a half. It had downloaded more than 100MB in more than 1,400 files — and, judging from the folders in F:\Some Folder\Blogs Backup\Blog 1, it had drawn those downloads from a large variety of wordpress.com domains, and also from some sites outside of wordpress.com. Wgeterr.log appeared to contain thousands of messages. So I was still some distance from having worked out the right wget command.
I didn’t attempt, at this point, to troubleshoot wgeterr.log. It would be a start if I could just figure out how to restrain Wget from downloading all that stuff from all those other domains, and putting it into those 202 folders that it had created in my Blog 1 subfolder. Clearly, the search across wordpress.com was too permissive. I tried again, without the -H and -D options. The bet in this case was that pictures and other files relevant to the display of my blog posts would be included in my own raywoodcockslatest.wordpress.com domain. Before running this revised command, I deleted wgeterr.log, so that it would be newly created with only new error messages. I forgot, however, to delete the contents of the Blog 1 folder. Judging by timestamps, though, this version of the command didn’t create any new folders, and the process still ran for almost two hours.
I decided to see if there was a way to tell Wget to delete folders that were no longer needed to run the desired mirroring operation. A search turned up no immediate solutions. I deleted the contents of the Blog 1 folder and ran the command again. Its form was now as follows:
wget -mpEkK -w 2 -P "F:/Some Folder/Blogs Backup/Blog 1/" https://raywoodcockslatest.wordpress.com 2>wgeterr.log
That is, there was no -H or -D, and the ending part had been changed from “2>>” to “2>.” The new wgeterr.log indicated that it ran for about 110 minutes and downloaded 1,631 files. But after conversion to HTML, Windows Explorer indicated that the Blog 1 folder contained 2,571 files totaling 82MB. Brief exploration suggested that there might indeed be an .orig file corresponding to every file converted to HTML (above). I double-clicked on an HTML file. It was a nice-looking reproduction, seemingly identical to a post on my blog. I clicked some of its links. The internal ones went to other downloaded files, as desired; the external ones (to e.g., CNN.com) led to those external sites. This was all what I had experienced with HTTrack, with one exception: the Archives links on my blog pages did not lead to downloaded copies, but rather to live webpages. That is, when I went over to the sidebar, in one of the downloaded HTML pages, and clicked on the Archives entry for a given month (e.g., January 2014), I did not get a path-and-file result in the Address bar: I got a URL for a live webpage on my blog. The archives were not downloaded. According to my notes and my recollection, this was not always the case with HTTrack: sometimes the Archives links were converted too.
These results provoked further searching and insight. For one thing, I now understood that HTTrack was probably not behaving oddly when it gave me thousands of files in response to an attempt to back up 87 blog posts. Evidently the project of creating backups for all of the links on a page, including tags and everything else, could add up. I was also reminded that there was a more official way of creating a logfile: in place of the “2>wgeterr.log” approach, I could use “-o [logfile]” (alternately “--output-file=[logfile]”). Then again, an experiment demonstrated that this approach would not suppress the SYSTEM_WGETRC messages. As I reviewed wgeterr.log, I realized that it might be best to suppress some of the verbose output, so as to give me a more concise focus on error messages. It seemed the better approach might be to keep “2>wgeterr.log” (or whatever name I preferred for the output log file) and add “-nv” (alternately, “--no-verbose,” p. 4), which would still report error messages and basic information.
Another innovation would be to use the “-nd” (alternately “–no-directories,” p. 11) option to put all downloaded files into a single folder. It did seem that this innovation would simplify the thicket of nearly 1,500 folders and subfolders that Wget was giving me. An experiment indicated, however, that this would not necessarily provide good links among files for purposes of offline browsing. Also, Peter Upfold recommended using a random wait (“–random-wait,” pp. 8-9; no alternative) rather than “-w 2” to avoid being detected and possibly prevented from doing a Wget download. Finally, at about this time, I realized that I had forgotten about the “–limit-rate=20k” option discussed above, to practice considerate computing and avoid burdening the remote server.
As a separate insight, D’Arcy Norman pointed out that I could use the “-e” (alternately “–execute,” pp. 4, 48) to turn off the “robot exclusion standard.” He said this was a very bad practice if you are not the owner of the website. I wasn’t sure whether a personal blog on WordPress would make me the “owner” in the sense he intended. I wondered whether this standard was perhaps preventing me from getting downloads of the Archives links. Simon Schönbeck said this standard prevented him from downloading certain files; he circumvented it by using “-e robots=off.” There were other indications that server administrators might not like this option, for reasons I did not fully grasp; but there were also lots of suggestions to use it, and I was not yet enjoying the luxury of being able to dial back my downloading efforts; I still wasn’t consistently getting the desired material. I decided that other server load protections (especially the random-wait and limit-rate options, above) would probably be adequate.
I noticed, incidentally, that people were talking about other tools for related purposes. One was rsync, used for simple site mirroring, as distinguished from creation of a readable offline copy of my blog. In recognition of those comments, I added some rsync notes to the preceding post about selection of a blog backup tool. Another tool was cURL, whose website seemed to indicate purposes similar to those of Wget. A search led to a comparison — composed by cURL’s creator, but accepted by others — supporting the conclusion that cURL had many advantages but that Wget would probably remain the tool of choice for my purposes, due to its ability to parse HTML and do recursive downloading or copying of an entire website (rather than being limited to just the specified URL, such as the URL for a single blog post). A search of the cURL man page for “recurs” seemed to support other indications that cURL lacked recursive ability.
Another finding was the debug flag, “-d” (alternately, “--debug,” p. 4). It appeared that this option could yield information that would help me to figure out what was going wrong at various places. I was not sure how this flag would interact with the -nv (i.e., no-verbose) flag (above). I decided to try it and see what happened.
I wasn’t sure that the all-purpose -m (mirror) option was doing what I needed. In particular, as I reviewed the fact that -m was shorthand for -r -N -l inf --no-remove-listing, I had no problem with -r and -N; but I did not seem to need --no-remove-listing, and I wondered whether infinite recursion (-l inf) was really what the doctor ordered. My experiments with HTTrack had led me to conclude that a five-level recursion (i.e., “-l 5”) ought to be adequate. In other words, I thought I might try replacing -m with -r -N -l 5.
The mention of timestamping (i.e., -N) reminded me to check specifically for further information on updating. The use of dynamic webpages (above) seemed to suggest that Wget might not be very good at identifying only those pages that I had changed. It might proceed, in other words, to re-download everything, or at least a substantial portion of my blog, every time I ran it. That seemed to be what I was getting, not only in Wget but also in those recent HTTrack experiments. Then again, there had been occasions, previously, when I had found that HTTrack was quickly updating my blog backups. I did not know what might have been different, between those earlier attempts and these later uses.
Along with the question of updating pages that had changed, there was also the question of deleting pages that were no longer present on the blog or website. It appeared that Wget was not necessarily good at this. For instance, a search led to one person’s attempt to write a Python script that “makes wget remove files locally that are no longer on the server.”
The subdirectories that Wget was creating in my Blog 1 folder were not nesting in bothersome deep layers. For my purposes, it appeared unnecessary to reduce those layers, or the names given to them, by using the -nH (alternately, “--no-host-directories,” p. 11) or “--cut-dirs” (no alternative, p. 11) options. But I did want to make note of those options here, for possible use in other Wget projects.
Examining the Error Log
In response to the foregoing thoughts, I deleted the contents of the Blog 1 folder and tried another version of the command:
wget -rNmpEkKd -l 5 --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k https://raywoodcockslatest.wordpress.com 2>wgeterr.log
(It would be a long time before I would realize that I was still including the mirror (-m) option here. When I did finally recognize that, I conducted further experiments (below).)
The command just shown yielded a number of insights. For one thing, I should have indicated where I wanted wgeterr.log to be saved; otherwise it would apparently land wherever I executed the command, which, now that Wget was on the system path, could be any directory on any drive. It seemed that a better command would end by redirecting to a full path, in the manner of the 2>"D:\Selected Folder\wgeterr.log" ending used later in this post.
Also, it was probably for the best that I did retain the no-verbose (-nv) flag, because wgeterr.log was now truly voluminous: 171MB. In other words, somehow, Wget’s error messages for the 87 posts that I was trying to download filled, like, a lot of pages. I gave up trying to figure out how many pages, exactly; WordPad gagged and retched when I tried to cram all that text into its gaping maw, as did Microsoft Word; LibreOffice Writer simply crashed. I tried to print the contents of wgeterr.log to PDF while viewing it in Notepad, but the PDF printer page counter claimed it was printing more than 40,000 pages. At that point, the count was still rising, but I canceled.
The contents of the Blog 1 folder were not, themselves, completely out of line with what I had been getting in previous Wget and HTTrack downloads: 81MB in 2,588 files. The sidebar links had been downloaded, but the Archives links still had not been; clicking on those still took me back to the blog online. In other words, the results did not seem to have improved much, but now I had thousands of pages of messages about the process.
It seemed that those messages probably had some kind of meaning. Possibly my effort would even be improved if I could understand them. I decided to look first at the most frequent messages, so as to eliminate those from consideration, if possible, and focus on whatever else the log might be telling me. As I viewed the log in Notepad, I saw that one very common type of message looked like this:
F:/Some Folder/Blogs Backup/Blog 1/raywoodcockslatest.wordpress.com/2013/07/31/dragon/index.html: merge("https://raywoodcockslatest.wordpress.com/2013/07/31/dragon/", "https://raywoodcockslatest.wordpress.com/2013/07/31/dragon/") -> https://raywoodcockslatest.wordpress.com/2013/07/31/dragon/
appending "https://raywoodcockslatest.wordpress.com/2013/07/31/dragon/" to urlpos.
That message, or set of messages, appeared on just two lines of wgeterr.log (when viewed with Word Wrap turned off). The message seemed to be telling me that Wget had merged two identical URLs (and, presumably, their included and attached files), each ending in “dragon,” into a single URL and, ultimately, into a single index.html file in the dragon folder, and that the URL of that folder was being appended to urlpos. A search led to the tentative conclusion that urlpos was a programming concept with more complexity than I was inclined to explore.
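For a log too large to open in an editor, one way to see which kinds of messages dominate is to read it line by line and tally a rough "signature" for each line. The following is only a sketch in Python, not something used in this investigation. The function names and the normalizing rules (replacing timestamps, URLs, and Windows paths with placeholders) are assumptions chosen to fit the log lines quoted above.

```python
import re
from collections import Counter

# Patterns for the variable parts of a log line (assumed formats).
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
URL_RE = re.compile(r'https?://[^\s"]+')
WINPATH_RE = re.compile(r'[A-Za-z]:[/\\][^:"]*')


def signature(line):
    """Replace the variable parts of a line so similar messages compare equal."""
    line = TIMESTAMP_RE.sub("<time>", line)
    line = URL_RE.sub("<url>", line)      # URLs first, so "https:/" is not seen as a drive
    line = WINPATH_RE.sub("<path>", line)
    return line.strip()


def count_signatures(lines):
    """Tally how often each message signature occurs."""
    counts = Counter()
    for line in lines:
        sig = signature(line)
        if sig:
            counts[sig] += 1
    return counts


def most_common_in_file(path, top=10):
    """Stream a log file from disk and return its most frequent signatures."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_signatures(handle).most_common(top)
```

Calling `most_common_in_file("wgeterr.log")` would list the ten most frequent message shapes with their counts, without loading the whole file into memory.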
I thought that perhaps I could find a tool that would delete certain lines from wgeterr.log, so as to whittle it down to manageable size, leaving its more pithy insights exposed to view. A search took me to advice suggesting a command like this:
findstr /v /c:"to urlpos." wgeterr.log > wgeterr-cleaned.log
This command would perhaps liberate me from news about urlpos — that is, would produce a copy of wgeterr.log, named wgeterr-cleaned.log, that would not contain lines containing (most likely, ending with) the characters “to urlpos.” According to various DOS batch reference sources, the /c flag specified the text that findstr would work with, and the /v flag would print only lines that did not contain that text.
I ran that command. It was almost instant. It looked like some unwanted lines of wgeterr.log were removed from the resulting wgeterr-cleaned.log file. But we still had a ways to go. It seemed I might want to include several such commands in a batch file, each working with the output of the previous command, so I changed the preceding line and added to it as follows:
findstr /v /c:"to urlpos." wgeterr.log > wgeterr-clean-01.log
findstr /v /c:".html: merge" wgeterr-clean-01.log > wgeterr-clean-02.log
With those two lines, I had reduced the file size from 171MB (in wgeterr.log) to 68MB (in wgeterr-clean-02.log). Still massive, but shrinking. Now, in wgeterr-clean-02.log, I saw many references to “the black list,” followed by indications that Wget had decided not to load specific files. A blacklist did not sound good, and that surprised me in some of these cases. For instance, wgeterr.log had a blacklist remark in connection with some URLs for valid posts in my blog. A search led to little insight, but one source suggested that the “--no-cookies” option (no alternate; p. 12) might help. At this point it occurred to me, though, that maybe a better approach would be to remove the debug option, at least temporarily, so as to focus on a relatively small number of errors. To that end, I tried this version of the command:
wget -rNmpEkK -l 5 --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k --no-cookies https://raywoodcockslatest.wordpress.com 2>wgeterr.log
This version left off -d and added --no-cookies. The resulting download took about an hour and gave me 81MB in 2,588 files, identical to what I had gotten last time. (Note that this file count came from Windows Explorer; wgeterr.log reported only 1,547 files backed up.) The resulting wgeterr.log was, of course, much shorter, without all that debug information. I was able to copy and paste its contents into Excel, unlike the previous round, and thus did not need batch commands to manage its size and parse its contents. Wgeterr.log had many lines with the “->” character combination, as in this example:
2014-02-14 22:10:49 URL:https://raywoodcockslatest.wordpress.com/2014/02/14/configuring-wget/ -> "F:/Some Folder/Blogs Backup/Blog 1/raywoodcockslatest.wordpress.com/2014/02/14/configuring-wget/index.html"
That line, provided without surrounding commentary, appeared to indicate that this very post, now in revision, had been backed up, in a subfolder in the Blog 1 folder, as index.html. I put a Sort Order column as column A in the Excel spreadsheet, containing ascending numbers (Alt-E,I,S), so as to remember the original order of the lines, in case I did any sorting on other columns. I created a Status column as column C, using this formula:
to mark those rows that appeared to tentatively denote a desired outcome. I used COUNTA to determine that 1,547 files were accounted for by the “->” characters. I converted those formulas to fixed values with Edit > Copy and then Edit > Paste Special > Values, and used a filter to display only the Successful lines or, alternately, to hide them from view.
Displaying those Successful Download rows (and adding more FIND columns) revealed that 1,186 of these 1,547 files consisted of tag files. For example, one or more of my blog posts had apparently used the word “httrack” as a tag; there was apparently a tag file for that httrack tag, somewhere in my WordPress blog; and now Wget had downloaded that file as F:/Some Folder/Blogs Backup/Blog 1/raywoodcockslatest.wordpress.com/tag/httrack/index.html. I created a column that used FIND and MID formulas, in that Excel spreadsheet, to identify the tag words I had apparently used. A filter for unique values demonstrated that I had used a total of 593 unique tags. A sort of the Excel spreadsheet indicated that each tag page, on WordPress, was accompanied by a matching tag feed page, hence the 593 tags had produced a total of 1,186 pages.
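The same bookkeeping could be done outside Excel. The sketch below, in Python, picks out the “->” lines of a no-verbose log and counts the distinct tags among the saved files. It is an illustration only: the function names are made up, the line format is assumed to match the example quoted earlier (with straight quotation marks around the local path), and the location of each tag's feed page (tag/[name]/feed/index.html) is an assumption.

```python
import re

# A "saved" line is assumed to look like:
#   <date> <time> URL:<url> ... -> "<local path>" ...
SAVED_RE = re.compile(r'URL:(\S+).*?->\s+"([^"]+)"')

# Tag pages are assumed to be saved as .../tag/<name>/index.html,
# with the matching feed page at .../tag/<name>/feed/index.html.
TAG_RE = re.compile(r"/tag/([^/]+)/(feed/)?index\.html$")


def saved_files(lines):
    """Return (url, local_path) pairs for lines that report a saved file."""
    pairs = []
    for line in lines:
        match = SAVED_RE.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def summarize_tags(pairs):
    """Return the set of tag names and the number of tag-related files."""
    tags = set()
    tag_files = 0
    for _url, path in pairs:
        match = TAG_RE.search(path.replace("\\", "/"))
        if match:
            tags.add(match.group(1))
            tag_files += 1
    return tags, tag_files
```

With the figures reported above, such a summary would be expected to show 593 tags and 1,186 tag files (593 x 2), leaving 1,547 - 1,186 = 361 saved files that are not tag files.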
Using an Excel filter, I viewed the 361 (i.e., 1,547 – 1,186) Successful Download rows that did not pertain to tag files. It appeared that my copy-and-paste approach had resulted in truncation of some long lines in wgeterr.log. Possibly I should have used File > Open to import the log file as text. I also saw that Wget had downloaded separate pages for comments, replies, month folders (i.e., the folder that would open upon clicking its link in the Archives box), and other miscellaneous items.
That gave me a rough sense of what had been downloaded. I could now see how my 87 blog posts might translate into more than a thousand downloaded pages. I turned my attention to the many lines of the Excel spreadsheet (that is, the many messages in wgeterr.log) that did not pertain to successful file downloads. Among these, many contained the message, “Last-modified header missing -- time-stamps turned off.” In many if not all cases, these preceded the “Successful Download” lines (above). A search led to the conclusion that many web servers (in this case, WordPress.com) would fail to share the Last-Modified header accompanying a file, thus preventing Wget from determining whether the file on the server was more recent than the copy (in this case, nonexistent) on my computer. In other words, it appeared that the Wget timestamp option (-N) was superfluous in the case of a WordPress.com backup, and that every Wget download would be a full download of all files in the blog.
After filtering out those “Last-modified header missing” lines, I saw that other messages were repeated on many rows. These frequent messages included the following:
- https://raywoodcockslatest.wordpress.com/2013/07/31/dragon/'%20+%20liker.profile_URL%20+%20': [Example: specific URL differs in each case.] [Along with that example referring to “liker.profile,” there would also be one nearby for “liker.avatar” and another for “wrapper.data.”]
- 2014-02-14 22:12:58 ERROR 404: Not Found. [Example: the time stamp differed with each such 404 error.]
- ERROR: cannot verify raywoodcockslatest.wordpress.com's certificate, issued by [identifying information]. [Similarly for “twitter.com's certificate” and “www.facebook.com's certificate.”]
- Unable to locally verify the issuer’s authority.
- To connect to raywoodcockslatest.wordpress.com insecurely, use `--no-check-certificate'. [Similarly for connecting insecurely to twitter.com and www.facebook.com.]
- Unable to establish SSL connection.
- Cannot back up F:/Some Folder/Blogs Backup/Blog 1/raywoodcockslatest.wordpress.com/2012/06/29/index.html as F:/Some Folder/Blogs Backup/Blog 1/raywoodcockslatest.wordpress.com/2012/06/29/index.html.orig: File exists. [Example: file name differed in each case.]
It appeared, in other words, that an attempt to account for these frequent messages might take care of most of the contents of wgeterr.log, at least when the -d option was not used to fill wgeterr.log with thousands of pages of information. Moreover, as I began to explore these frequent messages, I found that several tended to occur in proximity to one another:
- Message types 1 and 2 from the foregoing list often occurred together. It appeared, specifically, that wgeterr.log would contain a liker.profile message, a Not Found error, a liker.avatar message, another Not Found error, a wrapper.data message, and a third Not Found error, in that order. I was not sure what triggered these messages, but they appeared to be related to the act of clicking “Like” or adding a comment to a certain blog post. And yet spot checks indicated that Wget was capturing comments along with blog posts onto my local hard drive. As far as I could tell, these particular messages were not important for my purposes.
- Message types 3 through 6 often occurred together. A search led to some indications that I could try using the “--no-check-certificate” option. The manual (p. 16) provided no shorthand alternative for that option. It said that Wget would check the server’s certificate against “the recognized certificate authorities, breaking the SSL handshake and aborting the download if the verification fails.” Thus, it said, I should use the --no-check-certificate option only if I was certain of, or indifferent to, the site’s authenticity. I did not really know or care why Wget would have been unable to verify certificates for Twitter or Facebook. They weren’t part of my blog, and I had no interest in backing up their connection. I guessed they were mentioned because I had set up (but had since removed) automatic sharing of WordPress blog post information with those two social media sites. But I didn’t need any indications of that in my blog download. It seemed odd, however, that Wget was saying that it also could not verify the certificate for this very blog, raywoodcockslatest.wordpress.com. One source said this could happen when trying to download from https sites, but that was not the case here. A search did lead to at least one person who seemed to be having similar problems with WordPress. I tentatively decided that I could afford not to care about this, and could ignore the --no-check-certificate option and its attendant security risks, as long as a careful check of the final results demonstrated that Wget was capturing all of my blog posts. I would assume, that is, that the links producing this set of error messages (3 through 6, above) were not directly related to blog posts, which were my primary focus here.
Message type 7 in the foregoing list did not necessarily appear in conjunction with other messages. It seemed, rather, to be a simple indication that Wget would not overwrite an existing file in my local Blog 1 folder. These type 7 messages all appeared at the very end of wgeterr.log. It seemed that they all reported failure of an attempt to back up a .html file as a .html.orig file. As such, the messages appeared to be reporting on the results of the -K option (above). The purpose of that option was to create .orig files that would be used, with timestamping, to see whether a file had been modified and would thus require a new download. Since I had already learned that timestamping was not working with this WordPress blog, it seemed I could discard the -K option. That would not explain why Wget seemed to be attempting a duplicative creation of .orig files; it would just avoid the question, at least when dealing with WordPress blogs. My backup of these 87 posts had generated nearly a thousand .orig files; surely it would save a lot of time and disk space to stop creating them, especially if they weren’t accomplishing anything. It occurred to me that an alternate possibility was that WordPress would support timestamps in some circumstances. That could explain why I was sure I had witnessed rapid HTTrack update downloads of WordPress blogs. Possibly some aspect of my Wget command was causing WordPress to turn off timestamps, or to function as if there were none?
These remarks accounted for virtually all of wgeterr.log. That is, after investigating the lines of that file that pertained to successful downloads, timestamps, and message types 1-7 (above), only a small number of messages remained to be examined. About a dozen of these involved files that were reported as “ERROR 404: Not Found.” Spot checks suggested that the blog posts named in several of those messages were in fact being downloaded. I was not sure what these messages were saying, but if files were missing, hopefully I would catch that in my final review. Aside from these error messages, there were only two others, and they appeared not to be important.
Refining the Wget Command
That concluded my review of the short version of the error log. Rather than revisit the long version (containing the debug information), I decided to verify that all of my blog posts had been backed up. I would not want to lose users’ comments and other information, but my core concern — the one that I was willing to invest time in, sufficient to do a manual check — involved the text of the blog posts themselves. To check this, I wrote a command that would produce a list of the files in my download:
dir /s /b /a-d > dirlist.txt
If I ran that command in the root folder where Wget had stored its download (i.e., in F:\Some Folder\Blogs Backup\Blog 1), it would give me a list of file names, complete with their paths. I could copy that list from dirlist.txt to Excel, massage it, and use it to generate batch file commands that would put renamed copies of those downloaded files in another folder, where I could count and inspect them. I would want to use renamed copies because it appeared that all of the blog post downloads created by Wget were named index.html. They could exist because Wget created a separate folder for each such downloaded file, but no folder could have more than one file with a given name (e.g., index.html): those copies would mostly wipe each other out if I just copied all of those index.html files to a single folder. The rename could consist of a massaged version of the path plus filename (e.g., index.html becomes 2013-06-09-yumi-unlisted.html). If all went well, the target folder would contain 87 different html files, and hopefully each would consist of a perfect offline-readable copy of one of my blog posts.
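The rename-and-copy step described above could also be scripted directly, without the detour through Excel and generated batch commands. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, posts are assumed to sit exactly at [year]\[month]\[day]\[post name]\index.html below the folder passed as root, and the target folder is assumed to lie outside that root.

```python
import shutil
from pathlib import Path


def flat_name(index_file, root):
    """Turn root/2013/06/09/yumi-unlisted/index.html into 2013-06-09-yumi-unlisted.html."""
    parts = index_file.relative_to(root).parent.parts
    return "-".join(parts) + index_file.suffix


def copy_posts_flat(root, target):
    """Copy each post's index.html into one folder under a unique name."""
    root, target = Path(root), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for index_file in sorted(root.rglob("index.html")):
        parts = index_file.relative_to(root).parent.parts
        # Assumption: a post lives exactly four levels down, under numeric
        # year/month/day folders; feed, tag, and archive pages are skipped.
        if len(parts) == 4 and all(part.isdigit() for part in parts[:3]):
            destination = target / flat_name(index_file, root)
            shutil.copy2(index_file, destination)
            copied.append(destination.name)
    return copied
```

If all went well, the length of the returned list would match the number of posts (87 in the case discussed here), which gives a quick first check before inspecting the copies by hand.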
I decided not to follow that procedure yet. In the foregoing discussion, I had tentatively concluded that the -N and -K flags were not working properly, at least not with my WordPress blog and the other elements of my current Wget command. Hence, there was no point in spending time and space to create and store the .orig files needed for timestamping. So I emptied out the download folder, F:\Some Folder\Blogs Backup\Blog 1, and prepared to start over with a new download that would not use the -NK flags. I had thought it would be better to use timestamping to keep one up-to-date offline copy of my blog, but there was another possibility. Maybe it would be better to keep several distinct full downloads; maybe the download process would be much faster if .orig files were not part of the picture. I could keep the most recent download in readable form while zipping and archiving previous downloads to save several generations of the blog. In this light, the state-of-the-art Wget command for my purposes would be
wget -rmpEk -l 5 --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k --no-cookies https://raywoodcockslatest.wordpress.com 2>"D:\Selected Folder\wgeterr.log"
As just noted, this command varied from the previous one by removing the -N and -K options and by using the more elaborate (but not yet tested) 2>"D:\Selected Folder\wgeterr.log" piece that I had neglected to use in the previous go-round. This command also reflected my decision not to incur the risks of the “--no-check-certificate” option that would have eliminated a few download errors.
Then it occurred to me that there were two questions to which I did not yet know the answer, but that I would like to answer before proceeding. I wrote up those questions and posted them, seeking illumination. Those questions were (1) could we address the problem where Wget was failing to capture my Archives links in its canned download and (2) was there perhaps a way to train Wget to download only the blog posts, without all those tag files and other detritus? Because if I could combine the answers to those questions, I might wind up with an offline copy that would offer full working links (including Archives links) among my posts, without all kinds of other stuff that I didn’t need. My readable offline copy could be much more useful and much faster to create and easier to store. I didn’t get much action in response to those questions, but when I reposted them in two posts on StackExchange, one response suggested that maybe I could just tell Wget to exclude the tag folder. By mousing over the tags that appeared (in my WordPress theme) at the bottom of each post, I could see that the tag files were all saved as subfolders of https://raywoodcockslatest.wordpress.com/tag/. The relevant flag would be -X (alternately, “--exclude-directories=,” p. 23) followed by the folders to exclude, presumably with commas but no spaces between them (if there was more than one), using quotes if there were spaces inside the path name. So here the -X flag (and the specified tag folder) would be added to the command shown above as follows:
wget -rmpEk -l 5 --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k --no-cookies -X https://raywoodcockslatest.wordpress.com/tag/ https://raywoodcockslatest.wordpress.com 2>"D:\Selected Folder\wgeterr.log"
I tried running that. I got an error: “The parameter is incorrect.” I tried it without the new -X addition. Again, parameter incorrect. So that wasn’t the problem. Oh, do you suppose that could be because there was no “Selected Folder” directory on my drive D? Or, for that matter, because drive D was not currently accessible? I corrected it, specifying a folder that actually existed, and tried again:
wget -rmpEk -l 5 --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k --no-cookies -X https://raywoodcockslatest.wordpress.com/tag/ https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
Yeah. That ran. The resulting download consisted of 41MB in 1,642 files, amounting to about half the size, in about 60% as many files as I’d gotten before adding the -X flag. While I was on the file-reduction warpath, I noticed that the downloaded subfolder for each blog post contained a “feed” subfolder with its own index.html file (e.g., F:\Some Folder\Blogs Backup\Blog 1\raywoodcockslatest.wordpress.com\2012\06\29\welcome\feed\index.html). I wondered if it was possible to add “*/feed/” to the -X list. A search led to an indication that Wget was not quite that flexible; instead, it would be necessary to exclude “feed” at each of several layers.
But before getting to that, I realized that I wanted to reduce the number of folder levels in my download. That is, I didn’t need a folder structure that consisted of folders like “F:\Some Folder\Blogs Backup\Blog 1\raywoodcockslatest.wordpress.com\2012\06\29\welcome.” I already knew that the Blog 1 folder was going to contain the raywoodcockslatest.wordpress.com download, so I could eliminate that longwinded extra folder layer, leaving me with “F:\Some Folder\Blogs Backup\Blog 1\2012\06\29\welcome.” It sounded like the “-nH” (alternately, “--no-host-directories,” p. 11) option might take care of that.
So now, returning to the issue of unwanted “feed” folders, it seemed that the recommended approach was to start the calculation at the host level (i.e., at the level of https://raywoodcockslatest.wordpress.com/). Looking at my download, it appeared that the “feed” folders occurred at http://[host]/[year]/[month]/[day]/[post name]/feed (e.g., https://raywoodcockslatest.wordpress.com/2012/06/29/welcome-to-my-blog/feed). That is, to stop Wget from downloading the feed folders, I would need to use something like this:
-X /[year]/[month]/[day]/[post name]/feed
To get that -X option to apply to all years, months, days, and posts, I would use wildcards four layers deep (i.e., /*/*/*/*/feed). Now I had to combine that -X statement with the one already used in the previous command. I found, through trial and error, that I didn’t need single or double quotation marks for this; I just had to separate them by a comma. So I emptied out the download folder and tried this:
wget -rmpEk -nH -l 5 --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k --no-cookies -X /tag,/*/*/*/*/feed https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
That worked. This time, Wget created download folders like F:\Some Folder\Blogs Backup\Blog 1\2012 and 2013, each containing the posts from those years, without the excess “raywoodcockslatest.wordpress.com” verbiage in the path. And most of the feed folders were gone. A search revealed that there were still a half-dozen left, because they had occurred at higher levels: I would need to add more -X items at different levels (e.g., “/*/feed”) if I wanted to get rid of those. This time, without the feed folders, my download consisted of only 360 files (23MB). Judging from the starting and ending times in wgeterr.log, it took only 27 minutes.
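Since the feed folders had to be excluded one level at a time, the comma-separated value for -X can be generated rather than typed by hand. This Python sketch only builds the string; it does not claim anything about how Wget matches the patterns. The function names are hypothetical, and the depths to pass in are whatever levels the feed folders turn out to occupy in a given download.

```python
def feed_excludes(depths, name="feed"):
    """Return one pattern per depth, e.g. depth 4 gives /*/*/*/*/feed."""
    return ["/" + "*/" * depth + name for depth in sorted(set(depths))]


def exclude_option(fixed_dirs, depths):
    """Join fixed folders and per-depth feed patterns into one -X value."""
    return ",".join(list(fixed_dirs) + feed_excludes(depths))


# Feed folders at the top level, one level down, and four levels down:
print(exclude_option(["/tag"], [0, 1, 4]))
```

Depth 0 stands for a feed folder directly under the host, and depth 4 for the [year]/[month]/[day]/[post name] layout described above.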
To inspect the download, I double-clicked on F:\Some Folder\Blogs Backup\Blog 1\index.html. It appeared that the links among posts in this blog had been successfully canned; in other words, those links referred, as intended, to other files in my download, and not to webpages at WordPress.com. It occurred to me that, at the risk of having a sidebar that would run far beyond the end of a post’s text, I could adjust my Recent Posts widget (in WordPress > Appearance > Widgets) so that it would show a large number of posts; then I would be able to click directly from any post to any other. Another possibility was to create an archives page. I did that. This raised the question of whether I still needed to restrict how many levels of links Wget would follow (“-l 5”). I decided to try again with “-l inf.” So this time, the command looked like this:
wget -rmpEk -nH -l inf --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" -e robots=off --limit-rate=20k --no-cookies -X /tag,/feed,/*/feed,/*/*/*/*/feed https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
As expected, the download contained hardly any feed folders. As I had hoped, its size was substantially unchanged: 364 files, 23MB, 28 minutes. Opening the top-level index.html file, I saw that the Archives page links had been incorporated as a working part of the download, providing a convenient two-step route from any post to any other.
There were still a few more items I wanted to experiment with. For one thing, I was still not sure if I should use the robots=off option. Also, I had just learned that downloading from some websites would be helped (and apparently no harm would be done in other cases) by including the “--ignore-length” (p. 14) option. In addition, the download still contained a few stray folders that I wanted to get rid of.
Finally, I now realized, very belatedly, that I had inadvertently continued to include the -m option in these many permutations on the Wget command line. Now it was time to remove it and see what difference it made, given that I already incorporated some of its components and decided against others. Thus, I ran this Wget command:
wget -rpEk -nH -l inf --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" --ignore-length --limit-rate=20k --no-cookies -X /tag,/feed,/*/feed,/*/*/*/*/feed,/author,/category,/i,/wp-includes https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
To my relief, these changes seemed to achieve the desired effects without unwanted surprises. The download now consisted of 22MB in 355 files, and was completed in 27 minutes. When I came back to it and tested its error log with the wgetlogcleaner.bat file described in the next section, I found that it, too, contained no surprises.
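For repeated runs, the final command could be assembled from its parts in a small script, which also makes it easy to catch the earlier mistake of pointing the log at a folder that does not exist. The Python sketch below is an assumption-laden illustration, not part of the procedure described here: the function names are made up, it uses only the options that appear in the command above (with -rpEk written out as separate flags), and it assumes wget is on the system path.

```python
import subprocess
from pathlib import Path


def build_wget_args(url, download_dir, excludes):
    """Assemble the argument list for the final command shown above."""
    return [
        "wget", "-r", "-p", "-E", "-k", "-nH",
        "-l", "inf",
        "--random-wait", "-nv",
        "-P", str(download_dir),
        "--ignore-length",
        "--limit-rate=20k",
        "--no-cookies",
        "-X", ",".join(excludes),
        url,
    ]


def run_wget(args, log_path):
    """Run wget with its stderr messages written to log_path (like 2>)."""
    log_path = Path(log_path)
    if not log_path.parent.is_dir():
        # Fail early with a clear message if the log folder is missing.
        raise FileNotFoundError(f"log folder does not exist: {log_path.parent}")
    with open(log_path, "w", encoding="utf-8") as log:
        return subprocess.run(args, stderr=log).returncode
```

Passing the arguments as a list avoids the quoting questions that come up with spaces in folder names such as "Some Folder".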
Automating Scrutiny of the Error Log
I returned to the wgeterr.log file resulting from the foregoing command. Using the same kinds of batch commands as before, I sought to slim down the log file, so as to highlight the relatively small number of unexpected remarks. Using Notepad (not something like Microsoft Word, which would insert invisible codes that would frustrate the desired results), I put the following commands into a file called wgetlogcleaner.bat, saved in I:\Current (where wgeterr.log was being created):
findstr /v /c:" -> " wgeterr.log > wgeterr-clean-01.log
findstr /v /c:"Last-modified header missing -- time-stamps turned off." wgeterr-clean-01.log > wgeterr-clean-02.log
findstr /v /c:"ERROR: cannot verify twitter.com's certificate" wgeterr-clean-02.log > wgeterr-clean-03.log
findstr /v /c:"ERROR: cannot verify www.facebook.com's certificate" wgeterr-clean-03.log > wgeterr-clean-04.log
findstr /v /c:"To connect to twitter.com insecurely, use `--no-check-certificate'." wgeterr-clean-04.log > wgeterr-clean-05.log
findstr /v /c:"To connect to www.facebook.com insecurely, use `--no-check-certificate'." wgeterr-clean-05.log > wgeterr-clean-06.log
findstr /v /c:"Unable to locally verify the issuer's authority." wgeterr-clean-06.log > wgeterr-clean-07.log
findstr /v /c:"Unable to establish SSL connection." wgeterr-clean-07.log > wgeterr-clean-08.log
findstr /v /c:"ERROR 404: Not Found." wgeterr-clean-08.log > wgeterr-clean-09.log
findstr /v /c:"liker.profile_URL" wgeterr-clean-09.log > wgeterr-clean-10.log
findstr /v /c:"liker.avatar_URL" wgeterr-clean-10.log > wgeterr-clean-11.log
findstr /v /c:"wrapper.data" wgeterr-clean-11.log > wgeterr-clean-12.log
copy wgeterr-clean-12.log wgeterr-cleaned.log
start Notepad.exe wgeterr-cleaned.log
Of course, I could have modified or added to those lines if needed. But for my purposes, running wgetlogcleaner.bat sufficed to reduce the log to a small number of items that I could inspect quickly, to see if there were any surprises. Having used the foregoing commands to delete the expected error messages, I assumed that any lines remaining in the final wgeterr-cleaned.log file were indications of a problem; I could search the original wgeterr.log to clarify what errors or other information might have accompanied them.
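For readers who would rather not chain a dozen intermediate files, the same filtering idea can be sketched in Python as a single pass over the log. This is only an illustration, not the batch file described above: the names EXPECTED_NOISE, clean_log, and clean_log_file are hypothetical, and the substrings are shortened versions of the findstr strings (an assumption made so that they do not depend on the exact quote characters in the log). Like findstr without /i, the match is case-sensitive.

```python
from pathlib import Path

# Substrings adapted from the findstr lines above (shortened; an assumption).
# A log line containing any of them is treated as expected noise.
EXPECTED_NOISE = [
    " -> ",
    "Last-modified header missing",
    "ERROR: cannot verify twitter.com",
    "ERROR: cannot verify www.facebook.com",
    "To connect to twitter.com insecurely",
    "To connect to www.facebook.com insecurely",
    "Unable to locally verify the issuer",
    "Unable to establish SSL connection.",
    "ERROR 404: Not Found.",
    "liker.profile_URL",
    "liker.avatar_URL",
    "wrapper.data",
]


def clean_log(lines, noise=EXPECTED_NOISE):
    """Keep only the lines that contain none of the expected-noise substrings."""
    return [line for line in lines if not any(s in line for s in noise)]


def clean_log_file(src="wgeterr.log", dst="wgeterr-cleaned.log"):
    """Read src, drop the expected noise, write what is left to dst."""
    text = Path(src).read_text(encoding="utf-8", errors="replace")
    kept = clean_log(text.splitlines())
    Path(dst).write_text("\n".join(kept) + ("\n" if kept else ""), encoding="utf-8")
    return kept
```

Calling clean_log_file() in the folder that holds wgeterr.log would leave the unexpected lines in wgeterr-cleaned.log, ready to open in Notepad.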
Using Other Files to Streamline Wget Commands
By this point, my Wget command was getting long. That had two potential drawbacks. First, it was possible for a command to become so long that it would exceed Windows limits, though there were evidently ways to get around that. Second, even with such workarounds, I didn’t like having an overly long and complicated command. As I had just seen, it was possible for things to get lost or overlooked in there. The time had arrived to organize the Wget command by delegating some of its duties to other files.
Potentially the most important external file was wgetrc. A search revealed that a default version of this file had been created in a subfolder where Wget was installed. On my machine, wgetrc was in C:\Program Files (x86)\GnuWin32\etc. Wgetrc was Wget’s startup or initialization file (p. 31). Commands in wgetrc would apply to all Wget sessions. It was also apparently possible to override them by specifying something different when invoking Wget.
As noted just before the preceding section of this post, my state-of-the-art Wget command now looked like this:
wget -rpEk -nH -l inf --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" --ignore-length --limit-rate=20k --no-cookies -X /tag,/feed,/*/feed,/*/*/*/*/feed,/i,/wp-includes,/author,/category https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
I proceeded to convert the options in that command to wgetrc parameters, to the extent possible. The wgetrc file that I created looked like this:
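The file listing itself is not reproduced here, so what follows is only a sketch of how such a conversion could go, not a copy of that file. It pairs each option dropped from the command line with the wgetrc command that the Wget manual documents for it, then renders the result as wgetrc text. The names OPTION_TO_WGETRC and render_wgetrc are hypothetical; the -P and -X options are left out because they stayed on the command line.

```python
# Hypothetical mapping from the command-line options used in this post
# to wgetrc commands as named in the Wget manual.
OPTION_TO_WGETRC = [
    ("-r", "recursive = on"),
    ("-p", "page_requisites = on"),
    ("-E", "html_extension = on"),  # newer Wget releases call this adjust_extension
    ("-k", "convert_links = on"),
    ("-nH", "add_hostdir = off"),
    ("-l inf", "reclevel = inf"),
    ("--random-wait", "random_wait = on"),
    ("-nv", "verbose = off"),
    ("--ignore-length", "ignore_length = on"),
    ("--limit-rate=20k", "limit_rate = 20k"),
    ("--no-cookies", "cookies = off"),
]


def render_wgetrc(pairs):
    """Return wgetrc text: each setting preceded by a comment naming the option it replaces."""
    lines = []
    for option, setting in pairs:
        lines.append(f"# replaces {option}")  # lines starting with # are wgetrc comments
        lines.append(setting)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_wgetrc(OPTION_TO_WGETRC), end="")
```

The printed text could be pasted into wgetrc. Anything given on the command line would still take precedence over these settings for a particular run.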
So it looked like I was going to be able to offload a lot of material from the Wget command line to the wgetrc parameters file. Another potential source of command-line bulk was the URL, or list of URLs, that I wanted Wget to download. In the command that I had been using up to this point, the URL was stated in full: https://raywoodcockslatest.wordpress.com. The command line would be shorter if I used the “-i” (alternatively, “--input-file=”; p. 4) option. This option would allow me to designate a separate file to list one or more URLs that I wanted Wget to download. In other words, I would use “-i URLlist.txt” as a Wget option, where URLlist.txt would contain a plain-text list of the URLs (e.g., https://raywoodcockslatest.wordpress.com) that I wanted Wget to download. Then I wouldn’t have to list those URLs in the Wget command (though apparently I still could use both the command line and URLlist.txt to designate websites for downloading, if I had some reason for doing so).
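As a rough illustration of the “-i” approach, the sketch below writes a URL list file with one URL per line and assembles a matching Wget argument list without running it. The helper names write_url_list and build_wget_args are hypothetical, and the paths are simply the ones used elsewhere in this post.

```python
from pathlib import Path


def write_url_list(urls, path="URLlist.txt"):
    """Write one URL per line, the plain-text layout that wget -i reads."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


def build_wget_args(url_file, dest_dir, excluded_dirs):
    """Assemble the Wget argument list; nothing is executed here."""
    return ["wget", "-i", str(url_file), "-P", dest_dir, "-X", ",".join(excluded_dirs)]


if __name__ == "__main__":
    args = build_wget_args(
        "URLlist.txt",
        "F:/Some Folder/Blogs Backup/Blog 1/",
        ["/tag", "/feed", "/*/feed", "/*/*/*/*/feed",
         "/i", "/wp-includes", "/author", "/category"],
    )
    print(args)
```

Keeping the arguments as a list, rather than one long string, avoids quoting problems with the spaces in the destination path if the list is later handed to something like subprocess.run.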
So, to repeat, this was the state of the art in my Wget command line, as last used (i.e., just before the preceding section of this post):
wget -rpEk -nH -l inf --random-wait -nv -P "F:/Some Folder/Blogs Backup/Blog 1/" --ignore-length --limit-rate=20k --no-cookies -X /tag,/feed,/*/feed,/*/*/*/*/feed,/i,/wp-includes,/author,/category https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
And here was the new and improved version of the Wget command line, slimmed down in recognition of the contents of wgetrc:
wget -P "F:/Some Folder/Blogs Backup/Blog 1/" -X /tag,/feed,/*/feed,/*/*/*/*/feed,/i,/wp-includes,/author,/category https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
Well, it was a little shorter, at least. It would have been a lot shorter if I had made maximum use of those auxiliary files (i.e., wgetrc and URLlist.txt). But at least for the time being, I wanted to specify values for -P and -X and the URL list (i.e., https://raywoodcockslatest.wordpress.com) manually, as required in the particular case.
I decided to try using Wget to download a readable offline copy of my WordPress blog. It took a lot of time to work through and understand its options and limits for this purpose. I ended up with a Wget command and an accompanying wgetrc file that, taken together, gave me a relatively light and fast download. Now I wanted to apply what I had learned in this situation to a more complex set of needs.