A look at ways of backing up and copying blogs and websites led to several tools and techniques. WinHTTrack was one of those tools. I did not fully understand its settings, but I did think it could be useful for me. I decided to devote a bit of time to an effort to learn how to use it properly. This post discusses what I learned. In a word, I learned a fair amount about HTTrack, and on that basis I decided to try alternatives. Another post provides a more recent statement of where those alternatives took me.
WinHTTrack was the Windows version of HTTrack Website Copier, which described itself as a free offline browser. When it worked as intended, HTTrack would produce a copy, on one’s hard disk or elsewhere, of an entire website, or of a designated portion thereof. Contrary to the program’s name, one would probably not use HTTrack to browse that downloaded copy. The downloaded copy (sometimes referred to here as the “mirror”) of the website would consist of one or more HTML files, just like the website from which it was copied. Double-clicking on an HTML file in the mirror would open it in one’s default browser, which in my case was Firefox. Once that page was open, its links would function like links online, except that they would not lead to online pages: they would lead to downloaded pages contained in the mirror. So if, while browsing the mirror, I clicked on a link that said About Ray Woodcock, I would be taken to a mirror image of the online page that discusses me — but that mirror would be stored on my computer, and I could take it with me on my laptop or on a USB drive or CD. I would have to update it to keep it current, but otherwise I could browse the contents of the downloaded website without going online. This could be handy in case of hardware difficulties or lack of an Internet connection. It could also be useful for capturing and preserving the state of a website as of a certain date.
I had used HTTrack previously. At its best, I had been very happy with it. Most recently, however, I had somehow lost the settings that I had used to make a successful backup of my blogs. (Once backed up, it was not necessary to re-download the full website; HTTrack would just copy the pages that had changed.) I hadn’t done a careful job of exploring and understanding those settings in the first place, nor had I checked what was actually getting downloaded, so it seemed that I might as well sit down at this point and look into it more thoroughly.
I saw that there were downloads for both 32- and 64-bit systems, with and without installers. I assumed the non-installer version was what others would call the portable version. I hoped to preserve my settings, once developed, over a long period of time, and anyway I preferred not to have to reinstall a boatload of programs whenever I installed Windows. So I opted for the 64-bit portable version. The version I downloaded was called httrack_x64-noinst-3.47.27.zip. It dated from September 30, 2013. I put the entire contents of the portable version into my customized Start Menu, not located on drive C, so that I could back it up, share it with other computers, and preserve it for years, regardless of Windows reinstallations.
I ran HTTrack while consulting its documentation page. Given my particular interest in using HTTrack to back up my blogs, especially but not only those in WordPress, I also ran a search for advice. This led to a bemusing spectrum of suggestions. On one extreme, eHow said that, to back up a Blogger blog (presumably not very different from a WordPress blog) using HTTrack, you just basically point and shoot. I saw comments by others who had done exactly that, with acceptable results. On the other extreme, Noumaan Yaqoob, evidently working with WordPress.org blogs, said that I would have to disable all dynamic features of my blog. To the extent applicable, this could include Search and other widgets, plugins, comments, and metatags, as well as code minification. His concern seemed to be that HTTrack would not mirror all of the programming that would underlie a blog, and therefore some links (e.g., Search) would not function or might yield undesirable results if clicked on in the HTTrack mirror.
I guessed that it was not necessarily as simple as some might believe. Within HTTrack’s documentation page, Fred Cohen’s HTTrack User’s Guide (version 3.10, 2007, printed as a PDF) ran to 46 pages of fine print. Granted, that guide seemed to be oriented toward using HTTrack from the command line and/or from Linux, and as such it had the luxury of being detailed and complex. Still, it appeared that many of the possibilities discussed there were available within the WinHTTrack GUI as well. (There was also a concise, command-line manual for the Linux version.)
It seemed my best approach might be to begin with the program’s own Step by Step guide. That guide (apparently a bit outdated, but hopefully still useful) advised me to start by entering a project name and base path. The name I chose was simply the title of the blog I was backing up, so that I would not accidentally mix up its files with those of another blog. The so-called “base” path turned out to be what Windows people might call the top-level folder or directory. So if my base path was F:\Some Folder\Blogs Backup, then everything that HTTrack would download would go into F:\Some Folder\Blogs Backup or into some subfolder under it (e.g., F:\Some Folder\Blogs Backup\Blog 1\stats.wordpress.com).
I chose a base path that was not located on drive C. That is, I wanted the HTTrack backup of my blogs to be on a data partition, not on the Windows drive C programs partition. This was because (a) I backed up the data partitions much more frequently and (b) drive C would be wiped out if Windows stopped working and had to be restored from a drive image. In the image shown above, for demonstration purposes, I was actually creating the HTTrack folders on an 8GB MiniSD card that I could take with me (along with a copy of the portable HTTrack program) and use elsewhere.
As soon as I entered F:\Some Folder\Blogs Backup as my base path, HTTrack created several files and one subfolder in that folder. The subfolder was called hts-cache. This folder reportedly contained information about what had been copied previously, so as to speed up subsequent updates of the copied website. The files (blackblue.gif, fade.gif, and index.html) were apparently what HTTrack needed in order to provide its basic display, when I viewed F:\Some Folder\Blogs Backup\index.html in Firefox. HTTrack would add to the contents of that index.html file as I added new projects to my list.
My first Project Name was Blog 1. (The Project Name box had a drop-down menu, but at the start there was nothing in it; it was OK to type Blog 1 as a project name there.) I also entered a project category (e.g., WordPress Blogs). Then I clicked Next. That created the Blog 1 subfolder. Eventually, I backed up several blogs, resulting in multiple subfolders under my base path. At that point, my directory tree looked like this:
- F:\Some Folder\Blogs Backup\Blog 1
- F:\Some Folder\Blogs Backup\Blog 2
- F:\Some Folder\Blogs Backup\Blog 3
Each of those subfolders contained an hts-cache subfolder. But that would come later. At the moment, my focus was just on mirroring Blog 1.
When I clicked Next, I moved on to a new screen. Here, I had to choose an Action. According to the Step by Step guide, the meanings of those options were as follows:
- Download web site(s): uses default options
- Download web site(s) + questions: pauses the download to ask questions as needed about particular links
- Get separated files: user specifies the files to be downloaded
- Download all sites in pages (multiple mirror): user can put links to multiple websites in one or more specified webpages; HTTrack will then download all of those sites at once
- Test links in pages (bookmark test): tests the links in the specified pages, to see if they lead to valid webpages
- Continue interrupted download: permits user to resume a prior download that did not complete
- Update existing download: prior download completed but may have since become outdated
I figured that downloads of my WordPress blogs would ordinarily involve (a) an initial Download Web Site (perhaps with questions) and then (b) an occasional Update Existing Download. Next, I had to specify the URL of the site I was downloading. For Blog 1, I used the basic URL of this blog: https://raywoodcockslatest.wordpress.com/. I was able to paste that directly into the Web Addresses box. Note: Tidosho said that, for some websites, pasting would lead to failure; it was better to enter the URL by clicking on the Add URL button.
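Though I was working in the WinHTTrack GUI, the command-line version of HTTrack offered the same basic actions. As a rough sketch (the URL and output path are from my own setup; -w and -W are the mirroring modes documented in the HTTrack manual, though how exactly the GUI's menu maps onto them is my assumption):

```shell
# Initial mirror (the GUI's "Download web site(s)" action); -w mirrors
# without prompting, while -W pauses to ask questions as needed.
httrack "https://raywoodcockslatest.wordpress.com/" -w \
    -O "F:/Some Folder/Blogs Backup/Blog 1"

# There is no separate "update" command: re-running the same command in the
# same project folder makes HTTrack consult hts-cache and fetch only the
# pages that have changed since the previous run.
```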
Next, I was able to set Preferences and Mirror Options. Clicking that button opened a new dialog:
The Step by Step guide seemed to indicate that some of the tabs in this dialog might be relevant to me, while others were probably not:
- Proxy: like most people, I hadn’t made arrangements to use a proxy server, so I had nothing to enter here.
- Scan Rules: HTTrack came pre-configured to allow or deny a few file types. When I began using the program (below), I would have reason to add scan rules and consult the relevant page in the Step by Step guide.
- Limits: as shown in the foregoing image, this tab offered a number of possible parameters. The settings seemed to depend upon the particular website. I decided to defer analysis of this item, too, until I could explore it in some detail (below).
- Flow Control: intended at least partly for the benefit of the site being mirrored. HTTrack’s Advice & What Not to Do page offered several tips on responsible downloading, so as to avoid overloading the target server. Examples from that page: “Use bandwidth limits,” “Use connection limits,” “Use size limits.” Cohen’s Guide (above) noted that some websites may ban users who abuse their privileges. The relevant page in the Step by Step guide suggested using a limit of eight connections (but as few as one or two, if mirroring big files, and as many as 42 on standard websites). Cohen said that if you go above 1 or 2 connections, especially during busy times of day, you might not get all of the downloads. HTTrack offered a Persistent Connections option that came checked by default. The recommended Timeout was 120 seconds, less for fast connections, more for slow ones. But Cohen said 30 seconds, with a couple of retries. For the minimum transfer rate, Cohen offered an example of 10 bytes per second.
- Links: the Step by Step guide led me to think I would want to check the boxes for “Get non-HTML files” and “Get HTML files first.” I chose the former because, while the images on my blog posts were always saved within the website, I could readily imagine trying to mirror a website where the images were stored elsewhere. I hoped there would not be too many bulky ZIP files dragged in incidentally. I checked the “get HTML files first” box because the guide said this could speed things up.
- Build: Cohen explained that the Local Structure Type option on this tab was concerned with how the downloaded files would be named and organized. In the default Site-structure option, files and folders in the mirror would be arranged the same as in the source website. (Characters not allowed in path or file names (e.g., : \ x @) are replaced by underscores.) Alternative structures (refined as needed via the Options button) could be useful in special situations (e.g., law enforcement wants a copy of exactly what was on the website, without changing links to refer to local files). I could see that some of these options could be useful; I was just not able to invest time in fine-tuning them at this point. None seemed essential for purposes of capturing the contents of my blog. Otherwise, this Build tab offered options to save filenames in DOS format; eliminate superfluous error pages where a link leads nowhere; insert a message in place of a working link to an external page not included in the mirror; and hide passwords (encountered during the mirroring process, perhaps entered automatically by the computer) that would otherwise be written into the revised links in the mirror, and thus be made discoverable. The Step by Step guide added that the Hide Query Strings option might be helpful on some basic browsers — if, that is, I intended to use something other than an advanced browser like Firefox to view the mirror. The Do Not Purge option did not sound helpful in most cases. The ISO9660 Names option would apparently be useful if not essential for mirrors running from CD or DVD, but there was an indication of a bug in that option at some point.
- Spider: according to Cohen, “These options provide for automation with regard to the remote server.” Thus, in my version, most of these options were checked by default. The Step by Step guide seemed consistent with my sense that these options were fairly complex and did not clearly require tweaking for my purposes. Then again, Wouter said that, for him, it was necessary to set the Spider option to “no robots.txt rules.” Tidosho agreed with that.
- MIME types: this tab called for site-specific information that I did not seem to have and, again, was not sure I would need.
- Browser ID: the “browser identity” setting seemed to default to the best choice, but would perhaps have to be tweaked for some fussy websites. I was content to leave the default HTML footer in place for the time being.
- Log, Index, Cache: the options of potential interest to me, on this tab, included “Do not re-download locally erased files.” The idea there was that HTTrack would keep a record of files that I had manually erased from the mirror, and would not try to restore them at the next update. That would be useful in some cases, though probably not for purposes of periodic updating of my blog. The options to Create Log Files and Make an Index were checked by default; I left them that way.
- Experts Only: I was content, after a look at the relevant Step by Step guide page, to leave these settings unchanged. Cohen’s discussion referred, here, to a command-line option to check for invalid links, but I was inclined to use other tools (e.g., AM-Deadlink) for that purpose. Here, again, there appeared to be refinements that might be useful for those who would become expert in HTTrack. I hoped not to need to do that.
That concluded my overview of the Options dialog. Now I needed to return to its Limits and Links tabs for closer study.
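For reference, several of the settings from the tabs discussed above had command-line counterparts. A sketch, using the flag meanings given in the HTTrack manual; the specific values are the ones mentioned above, not recommendations:

```shell
# Flow Control: -c8 = up to 8 simultaneous connections.
# Limits: -A25000 = bandwidth cap in bytes per second;
#         -T30 -R2 = Cohen's 30-second timeout, with 2 retries;
#         -J10 = abandon a link if the rate falls below 10 bytes/second.
# Spider: -s0 = the "no robots.txt rules" option Wouter suggested.
# The "+" patterns are Scan Rules: always fetch these file types.
httrack "https://raywoodcockslatest.wordpress.com/" \
    -O "F:/Some Folder/Blogs Backup/Blog 1" \
    -c8 -A25000 -T30 -R2 -J10 -s0 \
    "+*.png" "+*.gif" "+*.jpg" "+*.css" "+*.js"
```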
A Closer Look at Limits and Scan Rules
I turned to my WordPress blogs, using the present blog as an example. Its URL was https://raywoodcockslatest.wordpress.com/. Looking at the first setting on the Limits tab, I wondered: what should I enter as the “Maximum mirroring depth”? It had some preset choices but would allow new entries (e.g., 6). It seemed that more than just a few levels would create a potentially voluminous mirror: the Step by Step guide said that the number chosen here would indicate how far links would be followed. If I understood correctly, a Maximum mirroring depth of 1 would capture only the single page found at the designated URL. If I entered 2, it would capture that single page plus all pages or files that one could reach by clicking every link on that single page. If I entered 3, it would capture the single page at that URL, every page linked to from that single page, and every page linked to from each of those pages. And so forth.
My blog was currently set up so that the top page (i.e., https://raywoodcockslatest.wordpress.com) contained only one post. That top page also contained links to other recent blog posts and to my other blogs. I certainly did not want the mirror of this site to include posts from those other blogs. That problem was apparently addressed by the next option, “Maximum external depth.” The Step by Step guide said that the default was blank, and that a blank was equal to zero. It cautioned that any entry other than zero for the external depth could be problematic; there was no telling what I would be dragging into my mirror. I didn’t want anything from outside my website. So, just to be sure, I set the external depth to zero.
That resolved one concern. But I still didn’t know what to enter as the Maximum mirroring depth. If nothing from outside my target URL (i.e., https://raywoodcockslatest.wordpress.com) was going to be included, then what would I lose by setting this depth to the maximum possible value? The highest given value was 20, but it appeared that the box would accept 20000. I wondered whether a value that high would just make the thing churn unnecessarily, as it rehashed the same links within my blog, over and over again. The Step by Step guide said “the depth is infinite” if you leave it blank — that is, any value entered would be a constraint upon, not an expansion of, the list of links to explore. Presumably HTTrack started by making a complete list of all of the links that it would be checking. Its search would be infinite only in the sense that it would keep looking for more links, as long as there were any nooks and crannies that it had not yet explored, within the blog or other website that it was copying. But once it had worked through the links on a page, apparently it would not try to do so again. So it seemed there was not much to be lost from leaving Maximum mirroring depth at a blank (i.e., infinite) setting. Then again, Cohen said, “In some cases, a symbolic link will cause an infinite recursion of directory levels as well, so placing a limit may be advisable.” Another user echoed the possibility of infinite regression. Cohen offered an example with a limit of 50, and said 20 would usually be sufficient. I decided there was not much to lose from entering a Maximum mirroring depth of 50. Of course, if I had been copying some random and potentially voluminous webpage (within, e.g., CNN.com), a sane person might ask why I would set this depth option at more than 1 or 2.
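In command-line terms, the depth decisions just described would look roughly like this (again a sketch; -r and -%e are the depth flags given in the HTTrack manual):

```shell
# -r50  = maximum mirroring depth of 50 (leaving the GUI box blank = unlimited).
# -%e0  = maximum external depth of zero, i.e., fetch nothing outside the site.
httrack "https://raywoodcockslatest.wordpress.com/" \
    -O "F:/Some Folder/Blogs Backup/Blog 1" \
    -r50 -%e0
```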
So, OK, I had excluded the outside world by setting maximum external depth to zero. I had also enhanced my chances of capturing all of the webpages within my target URL by setting a high (50) if not infinite (blank) value for maximum mirroring depth. But could I be sure that HTTrack would find every post in my blog, even if it were allowed to follow infinitely many links within my blog? In other words, was it possible that one or more posts within my blog would never be reached, regardless of how many links HTTrack might follow? In asking this question, I visualized a solitary post, not linked to by any other post. Of course, there was that list of recent posts at the right side. But what if it wasn’t a recent post, or what if one of my blogs had no such list? Well, there was also the Archives option, with a drop-down menu leading to lists of items posted, for each month since the blog was first started. But what if one of my blog themes didn’t have that widget? And would a drop-down menu count as a source of links for HTTrack?
I experimented briefly, using a free website host. I created a main page, a subfolder, and a file within that subfolder. I ran HTTrack on that new main page. It was captured. What I quickly realized, however, was that it didn’t really matter, for my purposes, whether an isolated, unlinked page would be captured, because as a practical matter that page might as well not exist. For purposes of my HTTrack mirror, the fact that there was no link to it meant that I would basically not be able to find or see it, within the mirrored page structure. In other words, the mirror was just going to reflect the website or blog. If it was going to take me 14, or 50, or infinite steps to find the page online, then the same would be true — at best — in the mirror.
So now I tried a different kind of experiment. I pointed HTTrack toward the top page in this blog, and played with the maximum mirroring depth. I used the same project name as before (i.e., Blog 1), but changed the URL. HTTrack defaulted to the “Update existing download” action choice, and that was fine; I assumed the update process would detect that everything had changed, and would delete the results described in the preceding paragraph accordingly. First, I tried setting the mirroring depth to zero. I ran it, and then, in Windows Explorer, I went into F:\Some Folder\Blogs Backup\Blog 1. The folder created in the process described in the preceding paragraph was still there. In a moment, I would delete that, because now there was a new folder for this blog. (Note that Windows might not want to delete files or folders if HTTrack or some other program (e.g., Windows Explorer) seemed to be using them. Unlocker could be helpful in that situation.)
I went into that Blog 1 folder and clicked on its index.html file. Sure enough, it displayed the post currently shown on the main page of this blog. I verified, in Firefox’s Address bar, that it was indeed displaying a file on my drive, and not a web address. While viewing that, I clicked on one of the links shown in that file. It took me to a web address. So, OK, in this case limiting the search to zero depth did not kill the links; it just prevented them from being brought into the mirror. In Windows Explorer, I went to the other index.html file, at a higher level: F:\Some Folder\Blogs Backup\index.html. This one showed me the “Index of locally available projects.” The problem was that it listed nothing; something was wrong there. Now I went into the folder for the previous experiment (i.e., the one described in the previous paragraph). That folder was F:\Some Folder\Blogs Backup\Blog 1\a2038260.net78.net. There was nothing there. I deleted that folder.
I ran the same thing again, except this time I set the maximum mirroring depth to 1 instead of zero. As expected, this gave me the same results as before: a depth of 0 or 1 meant just the page specified in the URL, without anything else, and its links led to real webpages. This time, the top-level index.html file — that is, the one in F:\Some Folder\Blogs Backup, not F:\Some Folder\Blogs Backup\Blog 1 — did open an HTTrack page listing Blog 1 as a project; and when I clicked on it, it flashed something else and then, finding only one page in the project, opened the top page of this blog at https://raywoodcockslatest.wordpress.com, as just described.
Now I tried again, notching it up once more: this time, mirroring depth was set to 2. This process took longer, though still only a minute or so. While it was underway, HTTrack reported interesting statistics — number of files scanned and mirrored, bytes downloaded, etc. Unfortunately, that page disappeared immediately upon completion of the project. I wondered if I could get it back. I clicked the View Log File button. It gave me this:
HTTrack3.47-27+htsswf+htsjava launched on Sun, 09 Feb 2014 18:32:19 at https://raywoodcockslatest.wordpress.com/ +*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qir2%e0C2%Ps0u0%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2013], %s -->" -%l "en, *" https://raywoodcockslatest.wordpress.com/ -O1 "F:\Some Folder\Blogs Backup\Blog 1" +*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
18:33:12 Warning: Warning, link #79 empty
HTTrack Website Copier/3.47-27 mirror complete in 53 seconds : 78 links scanned, 78 files written (3139715 bytes overall), 61 files updated [782819 bytes received at 14770 bytes/sec], 2545727 bytes transferred using HTTP compression in 59 files, ratio 29%, 3.3 requests per connection
(No errors, 1 warnings, 0 messages)
So there, OK, toward the end: 78 files written in 53 seconds, for a total of about 3MB. How it managed to get 78 files out of links from my top page, I was not yet sure. There was apparently some overhead: Windows Explorer > right-click > Properties reported that the Blog 1 folder actually contained 93 files, totaling 4.66MB.
I closed that log page, declining for now to investigate the other logs named above (i.e., hts-log.txt and the hts-cache folder). This time, instead of clicking on the file in Windows Explorer, I used HTTrack’s own Browse Mirrored Website button, right there in that same screen. That evidently did the same thing as I had been doing manually: as I could see in Firefox’s Address bar, it opened F:\Some Folder\Blogs Backup\Blog 1\raywoodcockslatest.wordpress.com\index.html, though maybe it got there by first opening the higher-level index.html file; it occurred to me that the latter was going to keep flashing by until I had saved more than one project to choose from. As I browsed the mirrored website, I saw that the external links were still live, as expected — that is, they still led to external sites. If I wanted the mirror to contain the external webpages linked to, obviously, I would need to set the external mirroring depth to a number greater than zero. This time, internal links shown on the top page (i.e., at https://raywoodcockslatest.wordpress.com/) were all canned (i.e., referring to files on my hard drive, mirrored by HTTrack) — again, as expected. Clicking on one of those canned links took me to another mirrored webpage.
Basically, at this point, with an internal mirroring depth of 2, I had a mirror containing the main page of my blog and each of the posts for that blog listed on that page. But what about the Archives? On the webpage at this time, there was a set of links to Archives — that is, to items that I had posted to this blog in prior months. The Archives list was not exactly a list: it was a drop-down menu, from which I would select a specific month. Did that count as a level-2 link, for purposes of HTTrack mirroring? I went to the archives for September 2013. Yes! Firefox’s Address bar told me that that September page, too, was mirrored, not live. But in the theme I was using, only one post from that (or any) month was showing; the “Older Posts” link appearing there was not mirrored.
I felt that I had a general sense of where this was going. Specifying an internal depth of 3, 4, or 5 would reduce the number of posts, on this blog, that would remain live rather than being included in the mirror. How about trying the other extreme? I re-ran HTTrack with a mirroring depth of 50. This was a very different deal. I left the computer for a while. When I came back, it was still running. It had been running for nearly two and a half hours. It had downloaded about 225MB in almost 2,700 files. My blog had only 87 posts. What those other 2,600 files were, I had no idea. When I stopped HTTrack, the progress reporting information indicated that it was saving the contents of numerous Wikipedia pages. Those were obviously external to my blog. The “View error log” button was flashing, so I clicked on it. There were hundreds of warnings, though the program had reported only 20 errors. I reviewed my settings. It definitely said zero for “Maximum external depth.” I could only guess that, at a mirroring depth of 50, HTTrack must have been finding external links (e.g., to Wikipedia) that somehow looked internal.
I tried again, same settings exactly, except that this time the maximum mirroring depth was set to only 20, not 50. Twenty was the largest number pre-supplied in the mirroring depth drop-down box. At the two-hour mark, this one still seemed to be downloading pages from within my blog, though I could not imagine how; it was already up to almost 1,900 files. I say it “seemed” because HTTrack wasn’t showing the full file and path name being processed. I was just seeing hints that it was downloading pages from my blog, albeit with unfamiliar file and/or path names, like “feed” and “root.” Just a few minutes after the two-hour mark, the Wikipedia pages started appearing again. Something was plainly not working right. Once again, I killed it. This time, though, I didn’t completely and immediately cancel out; I let it wrap up its pending transfers. That took another 20 minutes or so.
I browsed the resulting website (3,012 files, 209MB). As expected, any link I could find that was internal to my blog was captured on a file in F:\Some Folder\Blogs Backup\Blog 1. A glance with TreeSize indicated that about 84MB of the bulk was in a folder called F:\Some Folder\Blogs Backup\Blog 1\en.wikipedia.org. There were also several other seemingly related folders (e.g., upload.wikimedia.org, commons.wikimedia.org). In Windows Explorer, I went into F:\Some Folder\Blogs Backup\Blog 1\en.wikipedia.org\wiki. It contained 836 files and folders. The files had names beginning with A, like Aardvark and Aardwolf and Abacus. It looked like the program had been trying to download everything in Wikipedia. That wikipedia.org\wiki folder did not contain its own index.html file, but opening the Abacus file in Notepad gave me what looked like HTML source code. I was able to open it in a web browser by double-clicking on it. Sure enough, it was the Wikipedia webpage. Its “See also” links (e.g., “Slide rule”) had been mirrored too. HTTrack had clearly been trying to mirror Wikipedia. But why mirror any external site, and why Wikipedia in particular?
A general search didn’t lead to any immediate explanations. I tried an HTTrack forum search. The page was not scrolling properly, so I redid the search. One suggestion was to explicitly exclude the sites in question. To try that, I went into HTTrack > Set options > Scan Rules tab > Exclude link(s) > Criterion > Folder names containing. I entered wikimedia, wikipedia, twitter, wp.com, amazon, google, paypal, and one or two other sites that were among the folders listed in TreeSize and in Windows Explorer (e.g., unl.edu). I entered them all at once, separated with just a single space. When I clicked Add, HTTrack entered them as separate lines, each formatted like so: -/wikimedia/. I clicked OK and re-ran the operation, noting that HTTrack had specified the action to be “Continue interrupted download” rather than “Update existing download.” Had it done that the last time? I didn’t recall looking at that specifically. Would it still have downloaded Wikipedia pages if I had made sure it was doing an Update rather than trying to Continue my first, 50-depth process? To find out, I temporarily deleted the Exclude Link rules I had just created (e.g., -/wikimedia/) and ran that Update, with depth = 20. I canceled after just a few seconds, wishing to double-check that I had definitely left it at the Update option. HTTrack froze and remained frozen for several minutes. I killed and restarted it. I later decided that killing it had not affected its subsequent behavior (by, e.g., clearing its memory). I ran the Update process. It was slow to start. Possibly its earlier frozen state had involved corrections to its internal database, despite the absence of action on my system’s hard drive activity LED. I should perhaps have been running something like Moo0 SystemMonitor, to see if I could get a sense of what was happening during that freeze. Anyway, the Update process ran for several minutes. I could already see that it was updating some file at wp.com.
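On the command line, exclusions of this sort are expressed as negative scan-rule patterns appended to the command. A sketch (the exact pattern text that WinHTTrack generates for its "Folder names containing" criterion may differ slightly from the simple forms shown here):

```shell
# Negative patterns tell HTTrack never to fetch matching URLs;
# positive patterns, if any, would take the same form with a leading "+".
httrack "https://raywoodcockslatest.wordpress.com/" \
    -O "F:/Some Folder/Blogs Backup/Blog 1" \
    -r20 -%e0 \
    "-*wikimedia*" "-*wikipedia*" "-*twitter*" \
    "-*amazon*" "-*google*" "-*paypal*" "-*unl.edu*"
```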
In Firefox, I went to one of the wp.com locations for which there was a folder: s2.wp.com. It looked like this might contain WordPress (wp.com) program files. I decided not to exclude these wp.com URLs from the HTTrack search. The Update process continued for another two hours; then, once again, it was scanning Wikipedia files. I killed it. I decided to delete the contents of F:\Some Folder\Blogs Backup\Blog 1 and do a new website download (rather than an update), just in case the contents of that folder were confusing HTTrack somehow. Wiping out those contents eliminated all of my customized settings, so once again I set depth = 20 and external depth = 0, leaving all other settings at their default values. It took a while to delete the contents of that folder, but when that was done, I ran this new HTTrack download. This time, in Windows Explorer (and also in the left-hand pane of HTTrack, which I had been ignoring), I noticed that HTTrack had almost immediately recreated folders in F:\Some Folder\Blogs Backup\Blog 1 for Amazon and Twitter. I let it run for a while, in case these, apparently like the wp.com folders, were somehow an integral part of the WordPress scheme. As I watched, I gathered that a lot of the file creation activity in HTTrack had to do with miscellaneous links that I rarely used in WordPress — links to tags, for example, where clicking would open up all posts that I had tagged with a certain word. HTTrack did once again create Wikipedia and related folders, starting approximately an hour after I started the search. So it seemed there was no alternative but to set Scan Rules to exclude those external sites (e.g., -/wikimedia/), as described above. I did that, and ran it as a new “Download web site(s)” action, rather than an update; but HTTrack did not begin by deleting the existing folders in F:\Some Folder\Blogs Backup\Blog 1, as one might have expected. I stopped HTTrack, deleted those folders, and started again.
This time, it ran, completed, and stopped. The log information was still bizarre: 2,009 files written (58MB) in 75 minutes. There were hundreds of error messages, many pertaining to things I didn’t want to be bothered with, given my exclusion of external webpages — messages having to do with Amtrak and Twitter, for instance. It seemed I still had some work to do to figure out how to use HTTrack. But at least it had downloaded my blog. I opened index.html and clicked around for a while. It looked like HTTrack had indeed captured my blog.
I was still not sure whether it was necessary to go down 20 levels to capture everything in my blog. Was I perhaps creating unnecessary bulk and complexity by forcing HTTrack to rehash various links — was the program not yet refined enough, perhaps, to know that it had already recreated the same link a dozen times before? It seemed, really, that every post in my blog should have been accessible within just a few clicks of the current page: just go to the Archives drop-down and pull up the entries for a given month. Except that there was a problem: I had configured the blog to show just one post per page. So when it pulled up the archives, more clicking would be necessary to move among, say, the posts from September 2013. I didn’t like that, so I decided to change it. I went into my blog’s dashboard, in WordPress, and went to Settings > Reading. I changed it to say, “Blog pages show at most 100 posts.” I was pretty sure I didn’t have 100 posts in any month, since I had only 87 posts altogether. But as a precaution, I went into Dashboard > Appearance > Widgets > Archive and checked the box for the option that said, “Show post counts.” I confirmed that I could now see all posts for a month as soon as I clicked on the Archives widget. Now there was another wrinkle: in the theme I had chosen for this blog, the Archive page for a given month (e.g., September 2013) did not show the full text of each blog post; it just showed the first several lines and a “Continue reading” link. So I would have to allow one extra HTTrack layer for that link. Finally, since the new setting would show 100 posts to every visitor, and I had written only 87 posts altogether, the top page of my blog would now show all 87 posts. To fix that, I went into Dashboard and created a new front page (i.e., one that would provide links to, but not the text of, recent posts); then I set Dashboard > Settings > Reading to show that new welcome page to visitors.
So now it really looked like HTTrack ought to be able to get the full text of any post within two clicks from the main page: one click to reach the truncated display on the monthly archives page, and one more click from there to get the full text. Three levels. To be on the safe side, I set HTTrack’s maximum depth at 5. Once again, I deleted the folders (this time excepting hts-cache) from F:\Some Folder\Blogs Backup\Blog 1, so that I would be making a fresh start. I verified that the Scan Rules (e.g., -*/wikimedia/*) were still in place. Then I ran HTTrack again. Except as just described, it was doing exactly what it had done before (see previous paragraph). This time, it copied 1,696 files (54MB) in 64 minutes, with 373 warnings and, as before, about 15 errors. Not a huge change from the depth = 20 process described in the previous paragraph: 15% faster, 7% smaller. A brief spin through the resulting product suggested that I was still capturing all of my blog posts, directly and via tags, archives, and other links. There may have been some unforeseen reason why I would need more than three steps to get to a certain file within the blog, so I wasn’t inclined to go lower than depth = 5 at this point.
I decided to run a comparison, to see what depth = 5 was lacking, compared with depth = 50. That is, I ran a new download, this time as Blog 2. Everything was the same, except that I set mirroring depth to 50. Given the results in the two preceding paragraphs, I might have guessed that Blog 2 would take maybe 90 minutes and copy around 2,500 files. Instead, it took 45 minutes for 1,692 files (54MB). Same as depth = 5, but faster. Was HTTrack settling down into a pattern or rut somehow? I ran TreeSize on each of the resulting folders, F:\Some Folder\Blogs Backup\Blog 1 and F:\Some Folder\Blogs Backup\Blog 2. The only significant difference was that the hts-cache folder was about twice as large in Blog 1. Apparently I should have emptied or deleted it before reusing it; it had evidently saved data from the previous go-round. I tried once more, this time with a new Blog 3 and maximum mirroring depth = 20, just to double-check the previous results. This time, the results were almost identical to those with depth = 5: 1,693 files written (54MB) in 45 minutes. I could not explain why it had taken so much longer previously, nor why so many more files were involved. In a blog like mine, one would expect any searches beyond a few links deep to be redundant and quickly resolved, and that was what these latest results suggested. As long as the mirroring was not taking extra time or involving extra files, it would make sense to go with the greater depth (in this case, 20), just to avoid any chance of missing an important file. Then again, if it took 20 clicks to find one’s way from the main page to a particular file, that file would not usually be very important.
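The comparison I did with TreeSize — counting files and total bytes under each mirror folder — could also be scripted. This is a minimal sketch; the paths in the commented-out usage are placeholders for the actual backup folders:

```python
import os

def mirror_stats(root):
    """Count files and total bytes under a mirror folder (TreeSize-style)."""
    files = 0
    size = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            files += 1
            size += os.path.getsize(os.path.join(dirpath, name))
    return files, size

# Hypothetical usage; substitute the real backup folders:
# for blog in ("Blog 1", "Blog 2", "Blog 3"):
#     n, b = mirror_stats(os.path.join(r"F:\Some Folder\Blogs Backup", blog))
#     print(f"{blog}: {n} files, {b / 1e6:.0f} MB")
```

Comparing the two runs this way would also have flagged the oversized hts-cache folder, since it counts every file, cache included.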
Incidentally, now that I had more than one project (namely, Blogs 1, 2, and 3) set up, clicking on F:\Some Folder\Blogs Backup\index.html took me to the overview webpage, the Index of Locally Available Projects, where the three stored projects were listed. From there, I could click on any one of them and view the mirrored blog. This would produce a more interesting demonstration when I reached the point of having arrangements in place to back up several different blogs.
The preceding section of this post concludes that depth = 20 might be safe and hopefully not overkill. The section also observes that it may be necessary to use Scan Rules to counteract an apparently buggy tendency of HTTrack to reach out to external sites. This section looks at other settings beyond internal and external depth and scan rules. I deleted all files and folders in the Blog 1, 2, and 3 folders, and then experimented with some other settings, to see what difference they would make.
On the Limits tab, I had been using the default maximum transfer rate of 25,000 bytes per second, but Tidosho suggested 50,000. The Step by Step guide cautioned that this higher value might allow HTTrack to monopolize one’s bandwidth (i.e., I would not be doing much web surfing while the mirroring process was underway). For the “Max connections / seconds” option, the guide said the default was 10. That was also the highest number pre-programmed into HTTrack’s GUI. According to the guide, going too high would pose a risk of server overload. The option of setting a maximum number of links meant that HTTrack would analyze no links beyond that number. The guide said the default was 100,000, and that more would take more memory. I had memory to spare, and HTTrack offered an option for 1,000,000, so I went with that. Later, I realized that, in choosing the option at the bottom of the list, I had actually chosen 10,000,000.
So now I did the same download as before (depth = 20), but this time I changed various settings as suggested above (see especially the HTTrack Settings section of this post), whereas previously I had used default values except for scan rules and maximum depth setting. Briefly, the settings I changed from default at this point were as follows:
- Scan Rules: exclude wikipedia etc. (above).
- Limits: mirroring depth = 20, external depth = 0, max transfer rate = 50,000 B/s, max connections/second = 10, max links = 10,000,000.
- Flow control: 8 connections, timeout = 30 seconds, 3 retries, minimum transfer rate = 10 B/s.
- Links: Get non-HTML; get HTML files first.
- Build: site-structure (default), no error pages, hide passwords.
- Spider: no robots.txt rules.
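As an aside, my understanding is that roughly the same configuration could be expressed as a single command to HTTrack’s command-line version. The flag mapping below is per my reading of the httrack man page, and I did not verify each one against the GUI; the URL and output path are placeholders:

```shell
# Approximate CLI equivalent of the GUI settings above (unverified mapping):
#   -r20        mirroring depth 20      %e0    external depth 0
#   -A50000     max transfer rate       %c10   max connections/second
#   -#L10000000 max links               -c8    simultaneous connections
#   -T30        timeout (seconds)       -R3    retries
#   -J10        min transfer rate       -s0    ignore robots.txt
httrack "https://example.wordpress.com/" -O "/path/to/Blogs Backup/Blog 1" \
    -r20 %e0 -A50000 %c10 '-#L10000000' -c8 -T30 -R3 -J10 -s0 \
    '-*/wikimedia/*' '-*/wikipedia/*' '-*/twitter/*'
```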
When I ran this mirror, I saw immediately that HTTrack had more things going on: there were about six to ten threads underway at any time, in contrast to the three to five I had been seeing previously. Oddly, this time the mirror created subfolders for about a dozen external sources that had not appeared before (e.g., cchsv.com, trusteer.com, gmpg.org, acs.org.au). As before, I did not know of any reason why my selection of the foregoing options would have opened the door to more external sites. The mirror ran for 87 minutes, produced 23 errors and 1,646 warnings, and copied 3,000 files (94MB). This was obviously very different from what I had expected: larger, for no obvious reason, and slower, with more errors and more unwanted external material. I deleted the contents of the Blog 1 folder and tried again, changing nothing except that, this time, I made mirroring depth = 5. Both of these mirrors ran in late-night and early-morning hours. This one completed in 67 minutes, produced 22 errors and 1,551 warnings, and copied 2,896 files (91MB). There were still numerous external-website subfolders in the Blog 1 folder — more than there had been in any previous download attempt.
The result just described, from depth = 5, was not very different from the result when depth = 20. Previously, a setting of depth = 20 had sometimes caused a dramatically larger download. This time and once before, that had not happened. I did not like the fact that I had gotten several seemingly inconsistent results from HTTrack. I guessed that perhaps the earlier divergence was a fluke, a result perhaps of some other unnoticed change of settings or related conditions.
But I was still amazed that my 87 posts, hardly any of which had photos or other large accompanying files, should have produced a backup approaching 100MB, or that it should have taken more than an hour. Without denying the obvious difference in connection speeds, it did seem relevant that copying those downloaded files from one drive to another on my system took less than a minute. It also seemed worrisome that HTTrack would be so persistent in including external websites and in generating hundreds of warnings related to external sites, when I had explicitly set external depth to zero.
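Since HTTrack creates one top-level folder per host, a quick way to audit its persistence with external sites would be to list the mirror’s site folders and flag any that don’t belong to the blog itself. A sketch, with hypothetical folder and host names:

```python
import os

def external_site_folders(mirror_root, own_hosts):
    """Return top-level folders in an HTTrack mirror that belong to
    hosts other than the ones we intended to download."""
    unexpected = []
    for entry in sorted(os.listdir(mirror_root)):
        path = os.path.join(mirror_root, entry)
        # HTTrack creates one folder per host, plus its own hts-cache.
        if not os.path.isdir(path) or entry == "hts-cache":
            continue
        if not any(host in entry for host in own_hosts):
            unexpected.append(entry)
    return unexpected

# Hypothetical usage:
# print(external_site_folders(r"F:\Some Folder\Blogs Backup\Blog 1",
#                             own_hosts=["example.wordpress.com"]))
```

Run after each mirror attempt, this would have shown at a glance whether the scan rules were actually keeping Wikipedia, Amazon, and the rest out.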
At this point, I had been at this for days. Granted, these tests were running in the background, not requiring much of my time. Still, HTTrack kept raising questions and uncertainties. My feeling, as I continued to use the program, was that it needed to be tightened up. On prior occasions, I had used it in a sort of black-box mindset, plugging in the target URL and hoping that it would do the job OK. And, frankly, in this testing it had done so. I had not tested what would happen with mirroring depth set to 3 or 4, but at 5 and above every link I had tried, in every download, had worked perfectly. My feeling was simply that it was not reassuring to think that, at some point, I might rely on this program to deliver a solid offline mirror, and might then discover that it had not done so, or that it was requiring an inordinately long time to do so.
Backing up a WordPress blog seemed terribly simple, at least in comparison to an effort to download an offline copy of a portion of some sprawling and complex website. If there was an application that should have been amenable to handling by one of HTTrack’s simpler competitors, this was it. Thus, before proceeding further with HTTrack, I decided to try one or two alternatives.