In a previous post, I investigated tools that would give me a readable offline copy of a WordPress blog. This would not be a backup intended to preserve the blog, in case I needed to restore it to WordPress or elsewhere after a disaster. As described in that previous post, WordPress offered its own XML download tool that appeared adequate for that purpose. (A subsequent post provides a refined version of the information provided here.)
The purpose of the readable offline copy was to give me a set of files that I could read on my computer: files that, viewed in an Internet browser like Firefox or Internet Explorer, would look as they did on my blog. These files might not be easily restored in the event of disaster, but they would be useful if, say, a hardware problem or the lack of an Internet connection left me unable to go online and view the blog itself. There was also a possibility that the readability of this copy would prove useful in an attempt to restore a backup: it might help me reconstruct how the blog was supposed to look and function.
That previous post also contained brief looks at various programs designed to produce readable offline copies. After some efforts to use HTTrack for this purpose, I decided to learn how to use Wget. An extended discussion of Wget command syntax did yield a working Wget command and an accompanying wgetrc parameter file. This setup gave me a readable offline download that was considerably faster to download, and more compact to store, than that which I had originally obtained from HTTrack or Wget.
My problem was that I had a number of blogs, and so far I had worked out the solution to just one. Most of these blogs were on WordPress, so it seemed I could probably repeat the same approach with each blog — using, say, a total of eight different Wget commands to make readable offline copies of eight different WordPress blogs, one command per blog. But I wondered whether it would also be possible to use just one Wget command to create a single download combining all of those blogs.
This objective might not make sense in some situations. In my situation, however, there seemed to be some potential advantages to an integrated readable offline copy. My blogs tended to be interrelated with one another. Often I would link from a post in one of my blogs to a post in a different one of my blogs. The user experience would be more seamless if those links worked directly — if clicking on one such link, in the offline copy, took the user directly to the offline copy of the target post in that other blog. Also, my blogs tended to offer a list of my other blogs, in a sidebar. It might be easier to test that the whole download was working if I could click on any one of those links to “My Other Blogs” and verify that doing so would take me to the offline copy of that other blog, not to the URL for the copy at WordPress.com.
(Note: this writeup follows shortly after the previous posts linked above. As such, this post may assume familiarity with some concepts developed in those previous posts. It has unfortunately not been possible to give this post a space of several days, so as to clear my head of the prior investigations and start over at a beginning level.)
Developing the Wget Command
In this bid to capture multiple blogs with one command, I decided to start with the command I had already developed to download a readable offline copy of one WordPress blog. That command, appearing near the end of a previous post, was as follows:
wget -P "F:/Some Folder/Blogs Backup/Blog 1/" -X /tag,/feed,/*/feed,/*/*/*/*/feed,/i,/wp-includes,/author,/category https://raywoodcockslatest.wordpress.com 2>"I:\Current\wgeterr.log"
That command would have been longer, but for the concomitant development of a wgetrc parameter file specifying numerous default options. The construction and contents of that wgetrc file are likewise laid out at length near the end of that preceding post. The basic idea was that wgetrc could spell out numerous parameters at length, whereas the command line might not be able to accommodate all that information.
That Wget command could have been shortened even more, with the aid of additional parameters in wgetrc, as noted in several comments within the wgetrc file. In particular, I could have replaced those multiple items after the -X option (e.g., “/tag,/feed”) with an exclude_directories list in wgetrc. There were also ways of naming multiple input URLs, in place of the foregoing command’s reference to just one blog (i.e., https://raywoodcockslatest.wordpress.com). (Note that, at this writing, WordPress.com could not be prevented from converting all sorts of things into links, including the ending words of the preceding sentence.)
The involvement of multiple blogs would also require me to change the specified target folder for the download. In addition, it belatedly seemed that it might have been helpful to include the no_parent option in wgetrc.
With these thoughts in mind, I revised the Wget command and wgetrc, so as to test them with a set of three distinct WordPress blogs. The revised Wget command looked like this:
wget -P "F:/Some Folder/Blogs Backup/Blog 1/" 2>"I:\Current\wgeterr.log"
It had thus been shortened, thanks to the following revisions and additions to wgetrc:
# Specify list of directories to exclude from download.
# Command-line equivalent: -X [directory list]
# or --exclude-directories=[directory list]
exclude_directories = /tag,/feed,/*/feed,/*/*/*/*/feed,/i,/wp-includes,/author,/category
# Specify list of URLs to include in download.
# Command-line equivalent: -I [URL list] or --include [URL list]
# URLs are separated by commas, no spaces.
# Could also use "-i URLlist.txt" to point toward list of URLs in URLlist.txt.
# Command-line equivalent for that: --input-file=URLlist.txt
include_directories = https://raywoodcockslatest.wordpress.com,http://leisurepurpose.wordpress.com/,http://raywoodcockbio.wordpress.com/
# Prevent retrieval of material above the specified URL.
# Command-line equivalent: -np or --no-parent.
no_parent = on
Clearly, it would have been extremely cumbersome to include all those included and excluded directories, and all of the other parameters set in wgetrc, within a single Wget command. The question now was whether this revised Wget command, aided by wgetrc, would produce an integrated and organized download. I ran the command to find out. I immediately got an error message:
The filename, directory name, or volume label syntax is incorrect.
Careful scrutiny revealed that, somehow, the quotation marks in the Wget command had been changed to smart quotes — that is, to a pretty kind of quotation mark that was not recognized by various command processors. I was not sure when or how that had changed. As far as I recalled, I had originally typed the command in Notepad or the WordPress editor. It now appeared that smart quotes also pervaded the previous post. Somehow, they had run successfully then, but were not doing so now. Well, whatever. I corrected the smart quotes back to ordinary quotation marks (on my command line, and also in the foregoing reproduction of the command) and tried again. Now I had another error:
wget: missing URL
Usage: wget [OPTION]... [URL]...
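Smart quotes of the kind behind the first error are easy to miss by eye. As a minimal sketch, assuming Python is available (the function names here are made up for illustration), a pasted command could be checked and cleaned before it reaches the command line:

```python
# Map typographic ("smart") quotes to their plain ASCII equivalents.
SMART_QUOTES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}


def has_smart_quotes(text: str) -> bool:
    """Report whether text contains any smart quote character."""
    return any(ch in SMART_QUOTES for ch in text)


def straighten_quotes(text: str) -> str:
    """Return text with every smart quote replaced by a plain quote."""
    return text.translate(str.maketrans(SMART_QUOTES))


# The -P portion of the command, as it looked after being pasted.
pasted = "wget -P \u201cF:/Some Folder/Blogs Backup/Blog 1/\u201d"
cleaned = straighten_quotes(pasted)
print(has_smart_quotes(pasted), has_smart_quotes(cleaned))
print(cleaned)
```

This only handles the four most common curly quote characters; other typographic substitutions, such as an ellipsis character in place of three periods, would need their own entries.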
Was Wget not getting the include_directories information from wgetrc? I reviewed the latter. Now I saw that I had misunderstood the include_directories option. It was an option to include multiple directories, not URLs. I should perhaps have been using this option to include the few WordPress directories that interested me, rather than using the -X option to exclude the many that didn’t. Maybe I would do exactly that, once I learned more about the different kinds of files and folders that these various blogs were offering. But this was not the command for specifying source URLs. I revised wgetrc again, changing the middle part of the text shown above, as follows:
# Specify list of directories to include in download.
# Command-line equivalent: -I or --include, followed by
# comma-delimited list of directories.
# Not presently used.
# include_directories = [comma-delimited list of directories]
# Specify list of URLs (separated by commas; no spaces)
# to include in download.
# Could also use to point toward list of URLs in URLlist.txt.
# Command-line equivalent: -i URLlist.txt or --input-file=URLlist.txt
input = "I:/Current/URLlist.txt"
In this case, I could have used the command line or wgetrc with equal effect; both pointed toward the external URLlist.txt file. I wasn’t sure whether the three source URLs had to be separated by commas in URLlist.txt. I tried putting each on a separate line. I also wasn’t sure whether my use of quotation marks and subfolders for the input parameter would work. With wgetrc thus revised, I ran the Wget command again. This gave me an error in wgeterr.log:
"I:/Current/URLlist.txt": Invalid argument
No URLs found in "I:/Current/URLlist.txt".
I verified that there definitely was a file called URLlist.txt, located in I:\Current, and containing URLs. Focusing on the first line of the error, I wondered if perhaps Wget suddenly wanted to see backslashes here. It seemed like everything in this Linux-based program had preferred forward slashes, but I tried changing that part of wgetrc. That didn’t help. Maybe the quotation marks (plain, not fancy) were the problem? I tried removing them. (I didn’t need them in this case, as there were no spaces in the name or path of URLlist.txt, but I hoped their use would be available in other settings.) That ran. The download began. So the backslashes were OK — not sure if forward slashes would have worked too — but the quotation marks were apparently not permitted.
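The two findings from this round (one URL per line in the list file, and no quotation marks around the path in wgetrc) can be captured in a short sketch. This assumes Python; the helper names are hypothetical, and a temporary folder stands in for I:\Current so that the example can run anywhere:

```python
import tempfile
from pathlib import Path

BLOGS = [
    "https://raywoodcockslatest.wordpress.com",
    "http://leisurepurpose.wordpress.com/",
    "http://raywoodcockbio.wordpress.com/",
]


def write_url_list(urls, list_path):
    """Write one URL per line, the layout that worked in the run above."""
    list_path = Path(list_path)
    list_path.write_text("\n".join(urls) + "\n", encoding="ascii")
    return list_path


def wgetrc_input_line(list_path):
    """Build the wgetrc 'input' setting with no surrounding quotation marks."""
    return f"input = {list_path}"


with tempfile.TemporaryDirectory() as folder:  # stand-in for I:\Current
    saved = write_url_list(BLOGS, Path(folder) / "URLlist.txt")
    lines = saved.read_text(encoding="ascii").splitlines()
    setting = wgetrc_input_line(saved)

print(lines)
print(setting)
```

Because the setting is written without quotes, this approach would presumably run into trouble with a path containing spaces; that case was not tested here.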
At this point, I saw that the download was going into the wrong folder. That reminded me that I had also wanted to change something else about wgetrc, so I stopped the download and made this further revision to wgetrc:
# Specify where download will be stored.
# Command-line equivalent: -P or --directory-prefix= followed by
# specified path (e.g., -P "F:/Some Folder/")
dir_prefix = "F:/Some Folder/Blogs Backup/"
With that change, the Wget command became shorter:
wget 2>"I:\Current\wgeterr.log"
But when I ran that, I got error messages in wgeterr.log. There were three sets of them, and they all said the same thing:
"F:/Some Folder/Blogs Backup/": Invalid argument
"F:/Some Folder/Blogs Backup/"/index.html: Invalid argument
Cannot write to `"F:/Some Folder/Blogs Backup/"/index.html' (Invalid argument).
This seemed to be saying that Wget could not create the index.html file (or anything else) for any of the three URLs provided, because I was mistakenly telling Wget to store the download in F:/Some Folder/Blogs Backup/. My mistake apparently had something to do with slashes and quotation marks. I was not sure why this would suddenly be a problem, when I had been using nearly the same folder in numerous prior Wget commands, complete with slashes and quotation marks. I guessed that the problem was due to some bug or foible in wgetrc, where it would not accept something that had worked just fine with -P on the command line. Testing that hypothesis, I commented out the relevant line in wgetrc (that is, I put hash marks (#) in front of the line regarding dir_prefix) and put the -P option back on the command line:
wget -P "F:/Some Folder/Blogs Backup/" 2>"I:\Current\wgeterr.log"
That ran. And this time, I let it run.
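Much of the trouble so far came from quotation marks being read differently in different places. One way to sidestep that, sketched here under the assumption that Python and Wget are both installed, is to pass the arguments as a list, so that a path with spaces needs no quoting at all. The function names are illustrative, and all other settings are assumed to come from wgetrc as before:

```python
import subprocess


def build_wget_args(target_folder):
    """Return the Wget argument list; a path with spaces stays one argument."""
    return ["wget", "-P", target_folder]


def run_wget(target_folder, log_path):
    """Run Wget, sending its messages to log_path as 2>log does on the command line."""
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        completed = subprocess.run(build_wget_args(target_folder), stderr=log)
    return completed.returncode


args = build_wget_args("F:/Some Folder/Blogs Backup/")
print(args)

# Uncomment to start the download (requires Wget on the PATH):
# run_wget("F:/Some Folder/Blogs Backup/", r"I:\Current\wgeterr.log")
```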
Results of the Wget Command
When that command ran, it created subfolders for year, month, and day of month. That is, if I had a blog post dated March 12, 2013, it would create a 2013 folder, and a 03 folder under that, and a 12 folder under that, and then it would put an index.html file and perhaps other stuff somewhere under that. What was different now, as compared to the situation in the previous post, was that I had specified one target folder for all three downloaded URLs. So evidently my March 12, 2013 subfolder could have materials for a post from Blog No. 1 and also for a post from Blog No. 3, if I happened to post both of those items in those different blogs on that same day. I was not yet sure of this, nor could I be sure how it was going to work if that was the situation.
When it was done, I clicked on the top-level index.html file. I discovered, right then, that this approach was not going to work as hoped. It appeared that the top-level post or page in one blog had overwritten the top-level page from another blog. Basically, if I opened up lower-level index.html files, it all worked fine; but if I wanted to click on those top-level posts in the list of my blogs, I needed to be assigning these blogs to their own separate folder systems. Which probably meant that I needed to set “add_hostdir = on” in wgetrc.
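The overwriting can be illustrated with a rough model of where each home page lands. This is a simplified sketch in Python, not Wget's actual naming logic: it assumes that every URL is saved as an index.html file under its path, with the host name prepended only when host directories are enabled.

```python
from collections import defaultdict
from urllib.parse import urlsplit

BLOGS = [
    "https://raywoodcockslatest.wordpress.com",
    "http://leisurepurpose.wordpress.com/",
    "http://raywoodcockbio.wordpress.com/",
]


def local_path(url, add_hostdir):
    """Approximate the saved location of a page, relative to the download folder."""
    parts = urlsplit(url)
    pieces = [parts.netloc] if add_hostdir else []
    path = parts.path.strip("/")
    if path:
        pieces.append(path)
    pieces.append("index.html")
    return "/".join(pieces)


def collisions(urls, add_hostdir):
    """Return the local paths that more than one URL would be written to."""
    by_path = defaultdict(list)
    for url in urls:
        by_path[local_path(url, add_hostdir)].append(url)
    return {path: found for path, found in by_path.items() if len(found) > 1}


without_hostdir = collisions(BLOGS, add_hostdir=False)
with_hostdir = collisions(BLOGS, add_hostdir=True)
print(without_hostdir)
print(with_hostdir)
```

In this model, all three home pages compete for the same top-level index.html when host directories are off, and none of them do when each blog gets a folder named for its host.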
Overall, the download comprised 459 files (29MB) and was done in 36 minutes. Application of a modified wgetlogcleaner.bat script (see previous post) to the wgeterr.log file left few error messages of interest. Spot checks at other downloaded webpages suggested that, except at the top level, the download had gone well: I did have offline copies of many blog posts. I emptied out the target folder, set add_hostdir = on and added a few more folder names to the exclude_directories option in wgetrc, and ran the command again. This time, thanks to that add_hostdir setting, I had a separate top-level folder for each blog (e.g., raywoodcockslatest.wordpress.com). The download ran for 36 minutes and produced 467 files (28MB). In addition to the desired three top-level folders (one for each of the three blogs), there were also undesired top-level folders for Digg.com and StumbleUpon.com. Information in the previous post suggested the following addition to wgetrc to eliminate those unwanted folders:
# Restrict list of permitted domains
# Command-line equivalent: -D or --domains=[domain list]
domains = wordpress.com
Within the three desired folders, I saw one subfolder for each year in which I had posted to the blog in question, along with one page for each WordPress page. For example, the top bar of my leisure blog listed only two pages: Home and About. Home existed by default — it was not a separate page to be backed up — so there was just one non-year folder in that blog’s download, titled “about.”
This all looked good. There was no top-level index.html file: the highest-level index.html file I could find was the one at the top of each of the three blog subfolders. I could fashion a top-level lead-in page manually, but that would apparently not be necessary. Once I started up one of the index.html files at the top of a blog subfolder, things functioned as hoped: links led to various posts within the blog, and also to the home pages of the other downloaded blogs, all showing a “file” location on the Address bar. That is, I was looking at their downloaded versions, not the versions online, and they looked good.
I ran the wgetlogcleaner.bat batch file against wgeterr.log and observed no worrisome messages. I emptied the download folder, re-ran the Wget command, and saw that the foregoing addition to wgetrc did not eliminate the undesired folders. I was drawing a blank as to why this would be. I decided to experiment by turning off the robots option discussed in the previous post. This did not fix the problem; those Digg and StumbleUpon folders were still there. I took a closer look at the contents of those two folders. The Digg folder consisted of 27 items, each named like this:
submit@url=http%3A%2F%2Fleisurepurpose.wordpress.com%2F2010%2F06%2F25%2Fabout-the-leisure-purpose-of-life-blog%2F&title=About This Blog.html
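A name like that is easier to read once its percent-encoding is undone. The following Python sketch rests on two assumptions about how the name was formed: that the "@" stands in for the "?" of the original share link, and that the trailing .html was appended during the download. The function name is made up for illustration.

```python
from urllib.parse import parse_qs

name = (
    "submit@url=http%3A%2F%2Fleisurepurpose.wordpress.com%2F2010%2F06%2F25%2F"
    "about-the-leisure-purpose-of-life-blog%2F&title=About This Blog.html"
)


def decode_share_name(filename):
    """Split a saved share-link filename into its url and title fields."""
    query = filename.split("@", 1)[1]       # text after "submit@"
    if query.endswith(".html"):
        query = query[: -len(".html")]      # drop the appended extension
    fields = parse_qs(query)                # also undoes the %XX escapes
    return fields["url"][0], fields["title"][0]


share_url, share_title = decode_share_name(name)
print(share_url)
print(share_title)
```

Decoded this way, the item is recognizable as a Digg submission link that points back at one page of the leisure blog, rather than as a page of Digg's own content.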
So, OK, there were 26 posts plus the About page, comprising the total contents of my leisure blog. At the bottom of each of those pages, I had “Share” buttons for a variety of social media services: Facebook, Twitter, and so on, including Digg and StumbleUpon. Were those perhaps being interpreted in a way that would not only make them count as webpages, but would put them into new Digg or StumbleUpon folders, rather than as otherwise unnoticed contents of the three blog folders? Part of the answer was yes, perhaps: I had turned on the option to add .html endings to files that did not have them. Could I do without that? I didn’t think so. The blog posts were originally XML, not HTML; I believed they, too, needed that option. So maybe I just had to exclude Digg and StumbleUpon explicitly? In wgetrc, I turned off the domains setting (above) and added these lines:
# Exclude specified domains
# Command-line equivalent: --exclude-domains=[domain list]
exclude_domains = digg.com,stumbleupon.com
And yet that didn’t even do it! I was out of ideas. For the time being, I would have to live with these extraneous folders, or delete them manually after doing a backup. They did not seem likely to have any effect on casual browsing.
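Manual deletion after each backup could be scripted. The following is a minimal sketch in Python, assuming that every wanted top-level folder name ends in .wordpress.com and that anything else at the top level is extraneous. The folder names used in the demonstration (digg.com, stumbleupon.com) are guesses at what the download produced, and a temporary folder stands in for the real download folder.

```python
import shutil
import tempfile
from pathlib import Path


def remove_extra_host_folders(download_root, wanted_suffix=".wordpress.com", dry_run=True):
    """List (and, unless dry_run, delete) top-level folders not ending in wanted_suffix."""
    extras = []
    for folder in sorted(Path(download_root).iterdir()):
        if folder.is_dir() and not folder.name.endswith(wanted_suffix):
            extras.append(folder.name)
            if not dry_run:
                shutil.rmtree(folder)
    return extras


with tempfile.TemporaryDirectory() as root:  # stand-in for the download folder
    for host in ["raywoodcockslatest.wordpress.com", "leisurepurpose.wordpress.com",
                 "digg.com", "stumbleupon.com"]:
        (Path(root) / host).mkdir()
    listed = remove_extra_host_folders(root)                  # dry run: report only
    removed = remove_extra_host_folders(root, dry_run=False)  # actually delete
    remaining = sorted(p.name for p in Path(root).iterdir())

print(listed)
print(remaining)
```

The dry run is the default so that the list of folders can be reviewed before anything is deleted.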
I did not do an item-by-item check to verify that every last post from all of these blogs made it into the download. I was content, at least for now, that I had not hit any nonworking links. The links that I expected to be preserved in the download were preserved, in every case.
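Short of an item-by-item check, a count of downloaded posts per blog could at least be compared against the number of posts that WordPress reports. This Python sketch assumes that each post is saved as year/month/day/post-name/index.html beneath its blog folder, which matches the folder structure described above but was not verified for every post. The function name and the sample folders are illustrative.

```python
import tempfile
from pathlib import Path


def count_posts(blog_folder):
    """Count index.html files that sit at year/month/day/post-name/index.html."""
    blog = Path(blog_folder)
    count = 0
    for page in blog.rglob("index.html"):
        parts = page.relative_to(blog).parts
        if len(parts) == 5 and all(part.isdigit() for part in parts[:3]):
            count += 1
    return count


with tempfile.TemporaryDirectory() as root:  # stand-in for the download folder
    blog = Path(root) / "leisurepurpose.wordpress.com"
    for rel in ["index.html", "about/index.html",
                "2013/03/12/some-post/index.html",
                "2010/06/25/another-post/index.html"]:
        page = blog / rel
        page.parent.mkdir(parents=True, exist_ok=True)
        page.write_text("<html></html>", encoding="utf-8")
    found = count_posts(blog)

print(found)
```

The home page and the About page are deliberately not counted, since they do not sit under a dated folder in this layout.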
In a previous post, I developed some familiarity with Wget and with auxiliary files that it might draw upon, especially wgetrc, for purposes of creating a readable offline copy of a WordPress blog. In this post, I built upon that familiarity to create an integrated readable offline copy of several blogs. This copy was “integrated” in the sense that links between the several downloaded blogs took me seamlessly from one blog to another, each saved on my local drive in a form that was apparently identical to its online form. My next step would be to add Archives pages to each of my blogs, so that I could use those to find my way quickly through all of my posts on all blogs.