As you see, I had some blogs. I wanted to back them up. At other times, I had wanted to download a whole website and store it on my hard drive. There seemed to be numerous approaches to these sorts of tasks. This post reports on my investigation.
The XML Approach
It was possible to save a single webpage in various HTML and related formats, using File > Save and similar menu picks in a web browser (e.g., Firefox, Internet Explorer). But for a blog backup that would include hundreds or thousands of blog posts in a single operation, it seemed inevitable that I would have to use a program or tool that would download and save those files in their original XML format.
There seemed to be two ways to download XML files en masse. One was to use the export and import features built into the weblog website. At this writing, my blogs were hosted by two different providers. For my Blogger (a/k/a Blogspot) blog, the advice was to start with the public face of my blog — viewing it as everyone else would do — and click on Design > Settings > Other > Export Blog > Download Blog. This would give me an .XML file on my hard drive, with a name like myblog-02-07-2014.xml. I would also download the accompanying template by clicking on Template (at the left side of the Settings > Other page) > Backup/Restore > Download full template. For my WordPress.com blogs, the recommended approach was to go, within the blog in question, to Dashboard > Tools > Export. (More recently, it was My Sites > WP Admin > Tools > Export.) For bloggers with multiple blogs, this would have to be repeated within each such blog. This process yielded files with names like myblog.wordpress.2014-02-07.xml.
There was another bulk download method. This method involved more steps than the simple approaches offered by blog hosts like Blogger and WordPress (above). It seemed advisable to learn and use this approach, if possible, in case some unforeseen failing in the hosts’ simple approaches left me with a backup that would not work at all, or would not work anywhere except on the original host. (I had known the experience of finding that a backup was corrupt or otherwise useless.) At least when used for WordPress backup, there were two parts to this alternate method: I had to back up my WordPress database, and I also had to back up my WordPress files, which were apparently distinct from the database. To back up the database, I would use a program called phpMyAdmin or perhaps a lightweight alternative. To back up the files, I would choose among the many programs offered as free or paid FTP clients. FTP (i.e., File Transfer Protocol) was a venerable means of transferring all sorts of files for a variety of purposes. A search suggested that, at this writing, FileZilla and WinSCP (or possibly CyberDuck or FireFTP or others) were the best or most popular free FTP clients. I was quickly able to find three different sites offering advice on this approach.
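To make the two-part method concrete, here is a minimal sketch of what such a backup script might look like. Everything in it — the database user, database name, FTP host, and remote path — is a placeholder, and the commands that need a live server are left commented out; the sketch only shows the shape of the procedure.

```shell
#!/bin/sh
# Hypothetical sketch of the two-part WordPress backup described above.
# All hostnames, usernames, and database names are placeholders.

DB_USER="wp_user"                  # placeholder database user
DB_NAME="wordpress"                # placeholder database name
STAMP=$(date +%Y-%m-%d)
DB_DUMP="myblog-db-$STAMP.sql"

# Part 1: dump the database. mysqldump does, roughly, what phpMyAdmin's
# export feature does. Commented out: it needs a live server and credentials.
# mysqldump -u "$DB_USER" -p "$DB_NAME" > "$DB_DUMP"

# Part 2: fetch the WordPress files over FTP. A scriptable client such as
# lftp could mirror the remote directory; again, commented out here.
# lftp -e "mirror /public_html ./myblog-files-$STAMP; quit" ftp://user@example.com

echo "Database dump would be saved as: $DB_DUMP"
```

Keeping both parts in one dated script means each run produces a matched pair of database and file backups.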
It appeared that some people were using rsync in similar efforts. According to Wikipedia and others, rsync could be very useful for mirroring websites. For example, David Madore offered a ready-to-use rsync command for purposes of mirroring his own website, and Melanie Pinola suggested a combination of wget (below) and rsync for purposes of converting a dynamic website, with usable comments and other features, into a flattened version that would still be available online but would no longer have those dynamic features.
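For readers unfamiliar with rsync, a mirroring command of the sort these writers described might look like the following sketch. The host and module path are placeholders; the flags are standard rsync options. The command is printed rather than run, since it would need a live server.

```shell
#!/bin/sh
# Sketch of an rsync mirroring command (placeholder host and paths).
#   -a  archive mode: recurse and preserve permissions and timestamps
#   -v  verbose output
#   -z  compress data during transfer
#   --delete  remove local files that have disappeared from the server
RSYNC_CMD="rsync -avz --delete rsync://example.com/blog/ ./blog-mirror/"

echo "$RSYNC_CMD"   # printed, not executed
```

Because rsync only transfers changed portions of files, re-running the same command later would refresh the mirror cheaply.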
For my purposes, it did not presently appear necessary to learn how to use FTP and/or rsync. I decided instead to employ the simple approach outlined at the start of this post. Once I did succeed in downloading the XML files from my blog host via that approach, there was a question of whether I could use them for anything, other than just to hold onto them in case I ever needed to restore them to recreate my blogs. When I double-clicked on one of those XML files, I saw that my system’s default XML reader was Notepad. Viewed in Notepad, the XML file was an ugly mass of code. I hoped it would be useful for restoring my blog in the event of disaster. I was not inclined to experiment with my hundreds of posts, so for this I was proceeding on faith that an XML backup was just what the doctor ordered.
I had the impression that XML was like HTML — that, viewed in the proper viewer, I should see, on my hard drive, a reproduction of my Blogger blog that would look more or less like what I would see if I went to my Blogger blog in Firefox or some other web browser. This proved to be a mistaken impression. I tried opening the downloaded XML backup using Internet Explorer (File > Open). It gave me the same mess of code that I had seen in Notepad, with this comment at the top:
This is an Atom formatted XML site feed. It is intended to be viewed in a Newsreader or syndicated to another site. Please visit Blogger Help for more info.
It wasn’t a site feed for syndication, so I didn’t go to Blogger Help. I did search, however, for an Atom newsreader. It appeared that RSSOwl had made a good impression (4.5 stars from 97 users; 16,108 downloads at Softpedia), so I downloaded the portable version for permanent addition to my customized, relocatable, shareable Start Menu. But then I saw that the portable version had not been updated, so I went with the installed, nonportable version after all. I ran RSSOwl > File > Import. It wasn’t designed for this. Wrong program. While I was there, though, I went into Feedly (my RSS reader) > Organize > Save as OPML, and I imported the resulting OPML file into RSSOwl. I liked the results in some ways, but there was something missing, so I posted a question on it.
Anyway, it seemed that Internet Explorer had misdirected me toward an RSS newsreader. I tried again, with another search for something that would read the XML file. A helpful About.com page offered tips and pointed me toward two options: either convert the file to another format or find a program that could view it as-is. The conversion options didn’t look great — at best, it seemed, they would convert the XML to plain text — so I considered suggested tools such as Notepad++, XML Notepad 2007, and XmlReader. It appeared, however, that these were treating XML as a programming language, and were thus positioning themselves as programmer’s tools. That is, they were not producing nice, pretty, end-user output like you’d get from an HTML viewer. At that rate, I might just as well view the XML by opening the file in Firefox or, for that matter, in LibreOffice Writer or Microsoft Word, both of which seemed able to handle the 2500 pages of code and text comprising the contents of my Blogger blog. Incidentally, if I did open the XML file in one of these programs, and inadvertently introduced and saved a change, I might make it useless for purposes of restoring my blog to Blogger or elsewhere.
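One safe way to get something readable out of the export without risking an accidental edit is to extract text from it with ordinary command-line tools, leaving the XML file untouched. The sketch below works on a tiny made-up stand-in for a real export; a real Blogger file would have entries spread over many lines, so a proper XML parser would be the more robust choice there.

```shell
#!/bin/sh
# Minimal illustration: pull post titles out of an Atom-formatted XML
# export using standard text tools. The two-entry feed written below is
# a made-up stand-in for a real Blogger export file.

cat > sample-export.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Blog</title>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>
EOF

# Keep only the titles that sit inside <entry> elements, then strip the tags:
grep -o '<entry><title>[^<]*</title>' sample-export.xml \
  | sed 's/<entry><title>\(.*\)<\/title>/\1/'
```

Run against this sample, the pipeline prints the two post titles, one per line — a quick index of what the backup contains, with the export file itself left read-only in effect.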
Simple Blog Backup Alternatives
There were alternatives to the approach of downloading (and preferably not fooling around with) XML exports containing the full contents of the blogs being backed up. Among those commonly mentioned, some were free; some were not. Also, some provided automatic or scheduled backups, while others didn’t.
One option: BlogBackupr. This website would make and store a free, quick, and simple backup of WordPress and Blogger blogs. It also appeared to be prepared to export those backups to restore a blog if needed. As a cloud solution, unlike the XML approach, it would not give the user a backup that s/he could hold and save in a secure local place. Also, unlike readable approaches (below), it would not save the blog in a form that could be read offline. I tried it and found that, for some of my blogs, it downloaded only some of the posts. There did not seem to be a way to get it to delete those attempts and try again.
Other relatively simple blog backup options, not free, included Mozy, VaultPress, Automattic, blogVault, BackupBuddy, Snapshot, and CodeGuard. Bloggers with paid subscriptions on WordPress.org (as distinct from WordPress.com) could also accomplish backup via plugins (some free, some not) like Google Drive for WordPress, WP-DBManager, BackUpWordPress, WordPress Database Backup, BackWPupFree, and WordPress Backup to Dropbox. As those titles suggest, it was possible to link some blog backup efforts to a general- or multi-purpose cloud backup solution like Google Drive, Dropbox, Amazon S3, or SugarSync.
Readable Offline Approaches
The foregoing approaches were of varying usefulness for purposes of keeping a mere backup, so as to enable restoration to a previous or new blog somewhere online. As a separate matter, instead of or in addition to those backup approaches, some users might want a different kind of backup — one that would allow them to view the contents of their blogs, or of other websites, without actually browsing online to those blogs. This sort of backup could be useful in situations where it would be faster or easier to view blog posts or other webpages stored on one’s hard drive, without having to download and browse among such materials online. Such an arrangement could be essential due to, say, problems with networking hardware, or an absence of safe Internet connections. For instance, I could have used this sort of thing on my ten-week camping trip in remote parts of Kansas and Texas.
The Static Image Approach
For offline viewing that would look much like the blog looked online, it seemed there were two possible routes. The first would be static or relatively unchanging. An example of a static approach would be a simple hard-copy printout (in B&W or color) of each blog post. To improve somewhat on that simple approach, one could print each blog post to PDF. The viewing experience would still tend to be quite different from viewing HTML webpages, but the PDF approach could produce clickable links — typically leading, by default, to webpages online, not to other PDF pages in the offline output.
There were a variety of ways to produce static PDF images of blog posts and other webpages. One could simply print them. This seemed to produce varying results, depending on how the webpages were set up. When I had done this, sometimes I had wound up with a set of decent-looking PDF pages; sometimes I had wound up with a mess. Results could vary according to whether I was trying to print selected portions as distinct from the entire page, and perhaps according to the PDF printer I was using. It would be possible to print a large number of blog posts at once, if the blog was set up to display multiple posts at a single time. But they might all appear as part of a single document, without page breaks for the start of each new post. There would also be the possibility of using a batch file to automate the printing of many distinct webpages. Another option was to use a PDF program, like Adobe Acrobat, that could capture an entire website and save it in PDF form.
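The batch-file idea mentioned above could look something like the following sketch: loop over a list of post URLs and convert each one to its own PDF, which sidesteps the missing-page-break problem of printing many posts as one document. It assumes a command-line HTML-to-PDF converter such as wkhtmltopdf is installed; the URLs are placeholders, and the actual conversion line is commented out since it needs the tool and a network connection.

```shell
#!/bin/sh
# Sketch: one PDF per blog post, from a list of URLs (placeholders).

cat > posts.txt <<'EOF'
http://example.com/blog/post-1
http://example.com/blog/post-2
EOF

n=0
while read -r url; do
  n=$((n + 1))
  out=$(printf 'post-%03d.pdf' "$n")
  # wkhtmltopdf "$url" "$out"    # real conversion; needs the tool + network
  echo "Would convert $url -> $out"
done < posts.txt
```

Numbering the output files keeps the PDFs in the same order as the URL list, so the backup reads in sequence.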
Instead of going straight from webpage to PDF, sometimes I had achieved good results by taking screenshots of the webpage, and then massaging and printing those screenshots to produce PDFs. There was also the option of forgetting about PDFs, and just saving screenshots in an image format (e.g., JPG, PNG). There were different ways to do that. Sometimes there was literally no alternative but to take a series of screenshots (hitting PrintScreen or PrtSc, or using a screenshot utility), and then stitch the resulting screenshots together, using a panorama image editing tool like Microsoft’s Image Composite Editor. (Sometimes it was necessary to edit stray material out of the screenshots before attempting to stitch them together. For this purpose, I found IrfanView very effective. The basic procedure in IrfanView was to use Shift-drag to draw a cropping box, Ctrl-Y to crop, and then save and move on to the next screenshot.) Note that the multiple-screenshot approach would be improved, in some cases, by first rotating the monitor’s orientation 90 degrees (that is, put the top of the page at the right or left side of the screen, so that you would have to lie on your side to read the thing comfortably). This approach would produce a long, pagelike view, reducing the number of screenshots needed to capture some webpage layouts. Some graphic software control panels would allow Ctrl-Arrow (or other) key combinations to switch orientations easily. Failing that, it might be necessary to use Control Panel > Display > Change display settings > Orientation.
It was possible to employ a kind of automation, capturing a series of screenshots, by using a batch file. So, for instance, a person could open the desired pages in a web browser; start the batch file; let it take a few screenshots per second; close or change the screens being captured by using Ctrl-W (or, in other situations, by hitting PageDown); use a duplicate photo detector (e.g., VisiPics, Awesome Duplicate Photo Finder) to eliminate the duplicates; and then stitch together the resulting images with that panorama editor as desired.
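The timed-screenshot loop described in that batch-file approach might be sketched as follows. The capture command itself is platform-specific — ImageMagick’s `import` is one Linux option, and Windows would need a different utility — so it is left commented out, along with the half-second pacing delay; the loop just shows the numbered-filename bookkeeping.

```shell
#!/bin/sh
# Sketch of a timed-screenshot loop: numbered captures taken while the
# user pages through the browser in another window.

SHOTS=5     # how many captures to take in this run
i=0
while [ "$i" -lt "$SHOTS" ]; do
  i=$((i + 1))
  file=$(printf 'shot-%03d.png' "$i")
  # import -window root "$file"   # capture the whole screen (X11/ImageMagick)
  echo "Captured $file"
  # sleep 0.5                     # pacing between captures
done
```

Zero-padded names (shot-001.png, shot-002.png, …) keep the files sorted in capture order, which matters when feeding them to a duplicate detector and then a panorama stitcher.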
In good times, I was able to get a single screenshot of an entire webpage at once, including parts that extended beyond the visible viewing area. For a while, I’d had good luck with the Screenshoter (sic) add-on for Firefox, but lately it had stopped working. Fireshot (installed automatically from Firefox to Internet Explorer; also available in Chrome) looked more promising at the moment. These long screenshots could be saved as image files; alternatively, they could be converted to PDF by selecting a custom paper size. I had the best luck when keeping the resulting page sizes no longer than 78 inches (44 inches, if I intended to OCR them).
Timethief suggested using an offline blog editor (e.g., Windows Live Writer) to perform backups. The idea here seemed to be just that the offline editor would save your work, at times when the online editor offered by blog hosts (e.g., WordPress) would sometimes hiccup and lose your work. This would not provide a ready way of backing up multiple files at once. Another technique she recommended was to subscribe to your own blog’s RSS feeds, though the items listed in that feed would just lead back to the webpage. Subscribing to receive email notices of new posts would require making sure that a subscription button was visible on the blog; but, again, unless I was missing something, this gave me only brief email introductions to my new blog posts; I still had to go back to the blog website to get the full text.
A way of getting the full text by email would be to create a label in Gmail and then use that label to burn a feed. The Gmail label part went as follows: at the left side, choose More (if necessary) > Create New Label > BlogBackup (or some other name). The Feedburner part followed advice to begin by verifying your Gmail address in Feedburner’s My Account page. Next, create an RSS feed for the blog: paste the address of the blog being backed up into the Burn a feed right this instant box > Next > choose the default (text of blog posts) or alternative (comments on blog posts) > click Next a few times to complete the process. Now go to Feed Management (i.e., My Feeds) > click on the newly added blog feed > Publicize tab > Email subscriptions > Activate. Now add the resulting code into a widget in the WordPress theme, so as to offer an email subscription option. The remaining advice, which I did not pursue, was to insert a customized email address, somehow, into that widget, using a specialized format. To illustrate, if my Gmail address was username@gmail.com, then I would insert username+blogbackup@gmail.com, with the plus sign and all, where “blogbackup” was the Gmail label I had just created (above). I did not go through the complete process of trying to figure this out because it seemed complicated and unnecessary, prone to break if anything changed in the Feedburner setup, and duplicative of the email subscription widget I was already using in my blogs. Other than that, it seemed like a great idea.
Static approaches would tend to require manual updates. That is, I would have to check when I last did a printout (or whatever form of backup I was using), and then I would have to seek out all blog posts published since then. I did not explore the question of whether it would be possible to find and sort a list of blog posts according to the last time they were updated. Such a list could make it easier to determine whether previous blog posts had been modified and would therefore have to be reprinted. The blogger’s notification of comments, by email, might be his/her only way of knowing which posts had received comments since the last manual backup.
There were some tools that would provide an easier and more organized approach to the capture of static images of blog posts and other webpages. BlogCollector offered the capability of downloading blogs from some hosts (notably Blogger) and saving the individual posts as PDF files collected into a single set and, optionally, exported into a single PDF book comprising the entire blog. BlogBooker offered similar capabilities for Blogger, LiveJournal, and WordPress blogs. PDFMyURL went further, offering PDF capabilities for any website, though in my quick test that webpage froze.
The Linked HTML Page Approach
This main section of the post discusses readable offline approaches to backups and copies of blogs and other webpages. The first part of this section (above) discusses static readable approaches, like those found in PDF printouts of webpages. Now we turn to more dynamic approaches that capture websites in such a way that the user’s computer provides an environment similar to that encountered when reading a blog online: you view a page that looks like a page in your blog (because it was copied from your blog), you click a link, that link leads you to another page, and so forth. It goes without saying that, by altering a blog’s contents from the original XML format, tools designed to provide a dynamic offline environment would produce data that could not be uploaded in bulk to restore or recreate a blog. Any restoration of a blog from contents saved with these tools would have to be done manually, copying and pasting the text of one post at a time.
The dominant player in this sphere was HTTrack, whose Windows version was also known as WinHTTrack. On Softpedia’s list of offline browsers, WinHTTrack was far ahead in terms of downloads, and near the top in ratings by users and editors alike. This was especially so in light of the fact that most others had few reviews. In my experience, ratings tend to come down as people become more familiar with a piece of hardware or software. It seemed to me that the potential complexity of HTTrack had been somewhat tamed by a default configuration whose values would work for most people, and that its remaining complexity would work in my favor in some situations, giving me options that other programs might not offer.
The output from an HTTrack downloading session would be stored in a set of folders whose constituent files and subfolders would be viewable in Windows Explorer, just like any other files and folders. The contents of individual webpages could be approached by double-clicking on the top-level index.html file and then following links into the various blogs and among those blogs’ constituent posts. Those posts would all be stored as files on the hard drive, and internal links among those posts would be converted during the HTTrack download and formatting process, so that a link from one blog post to another would lead from one file to another, among the files that HTTrack had downloaded and stored on the local hard drive. After the initial download, the user might never have to go online, though of course the downloaded content could become outdated if meanwhile new content was being uploaded to the blog online.
HTTrack’s chief problem was that it could be somewhat complicated and frustrating. Its interface was OK but not great; it did not make clear, or provide much help in figuring out, how to capture the desired website (e.g., your blog) without (a) unwittingly overlooking important pages or (b) at the opposite extreme, taking a week to download and store half of the Internet on your hard drive, because if you go down far enough, everything on the Internet is connected to everything else, and you may have essentially ordered HTTrack to bring it all home. The situation prompting me to compose this post, in fact, was that I had previously been using WinHTTrack with what seemed like considerable success, but then I returned to it and found that, for some reason, all of my previous settings had disappeared and I really had no idea what I had done to make it work (if, indeed, it was working right; I had faith but not proof).
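HTTrack also has a command-line form, and the half-the-Internet problem is what its filter arguments address: a `+pattern` filter restricts the crawl to URLs matching the pattern. The sketch below only builds and prints such a command (the site URL and output folder are placeholders), but it records in one line both where to start and how far to roam.

```shell
#!/bin/sh
# Sketch of a scope-limited HTTrack command (placeholder URL and paths).
#   -O        output directory for the mirror
#   "+..."    filter: only follow URLs matching this pattern
#   -v        verbose progress output
HTTRACK_CMD='httrack "http://example.com/blog/" -O ./blog-copy "+example.com/blog/*" -v'

echo "$HTTRACK_CMD"   # printed, not executed
```

Without the `+` filter, links leading off the blog would be eligible for download, which is exactly the runaway-crawl scenario described above.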
An alternate approach: wget. This was a command-line program, available across multiple platforms (e.g., Windows, Linux), with great flexibility in terms of its options. Learning its commands could be a hassle, but those commands could be stored and commented within a batch file. In that sense, the wget user might not experience the complete loss of settings, and disorientation as to how to start over, that I had experienced when my saved HTTrack settings disappeared: s/he could just view the saved batch file, read the comments it contained, and decide whether any of its settings needed to be tweaked for a particular task. Then again, HTTrack had command-line options too. In this sense, HTTrack was something like a front end or GUI for wget, offering many of its options but without much onscreen explanation (though of course one could research various problems within HTTrack’s internal help files or elsewhere). Since my previous efforts to use wget had not been richly rewarding, it seemed that the HTTrack GUI might be close to what I needed.
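The self-documenting script described in that paragraph might look like the sketch below. The URL is a placeholder; the flags are standard wget mirroring options, each annotated so that a future reader of the file can see what the settings do and why. The assembled command is printed rather than run, since mirroring needs a live site.

```shell
#!/bin/sh
# A commented, reusable mirror script of the kind described above.
# Keeping options and explanations together in one file avoids the
# lost-settings problem. The URL is a placeholder.

SITE="http://example.com/blog/"

WGET_OPTS="--mirror"                       # recurse, re-checking timestamps on later runs
WGET_OPTS="$WGET_OPTS --convert-links"     # rewrite links to point at the local copies
WGET_OPTS="$WGET_OPTS --page-requisites"   # also grab the images/CSS each page needs
WGET_OPTS="$WGET_OPTS --no-parent"         # never climb above the starting directory
WGET_OPTS="$WGET_OPTS --wait=1"            # pause between requests; be polite to the server

echo "wget $WGET_OPTS $SITE"   # printed, not executed
```

Re-running such a script later would update the mirror in place, since `--mirror` re-checks timestamps rather than re-downloading everything.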
Across four different reviews of or guides to website copying programs, the most frequently named alternatives to HTTrack were BackStreet Browser (3 mentions), PageNest (3 mentions), Local Website Archive (2 mentions), and wget (2 mentions). At both Softpedia and CNET, there was a wide difference between HTTrack and these contenders, in terms of the numbers of downloads and reviews and the ratings given in those reviews. At the introductory stage, there did not seem to be a good reason to choose one of these other programs (some of which were not freeware) instead of HTTrack.
There was one other possibility that I did not explore. It did not seem directly relevant to the goal of backing up or copying a website. Then again, it was interesting. Evidently it was possible to download WordPress system software and set up a WordPress environment on one’s home computer. I found an introduction to this process in an article titled “How to Create a Local WordPress Website in Windows with Xampp.” It seemed worth mentioning because of the possibility that such a local website could function as a backup of one’s WordPress blog. It didn’t seem that I needed to become embroiled in that kind of undertaking, not if I could get HTTrack to function as desired.
I set out to learn more about ways of copying and backing up websites and blogs. It developed that there were several different categories of tools and techniques, designed for different purposes.
There was, first, the need to make a backup of my blogs, for purposes of restoring those blogs in the event of some calamity at the blog host (e.g., WordPress). The solution to this problem seemed to be simply to use the host’s export capability to obtain a download of the blog’s contents (and, where applicable, its template). Of course, I would want to update these downloads periodically, and keep backup copies. There was an alternative possibility, advisable but not developed in this post, of using FTP and a program like phpMyAdmin to back up the blog in a different way.
In addition, I thought it might be useful at times to have readable, easily updated versions of my blogs on my local hard drive. It appeared that WinHTTrack would be the tool of choice for that. WinHTTrack would change the links within files, and was thus not useful as a backup, except in the sense that I could use it to recover my content if all else failed (and would have to manually reconstruct all of the links that would have been changed within its stored files). I would have to work through the technicalities of these backup and offline browsing methods at another time.