31 | July | 2012 | Ray Woodcock's Latest

I had a bunch of blog posts that all contained some of the same errors, mostly in the URLs used for links. I wanted to fix those errors en masse, rather than go through each post and make the corrections manually. This post describes the steps I experimented with to try to achieve that.

I already knew that it was possible to make a backup of a blog, with all its posts. Using WordPress as the example, this process called for simply going to the Dashboard and selecting Tools > Export > Export > All Content > Save. That would produce an XML file with a long name in the default download location on my computer. (I could also have used Save As to designate a name and location.) I could open that file in Notepad and, in that one file, I would see what appeared to be all of the HTML code from all of my posts on that blog. I did one such backup now, and saved a copy of it as RevisedBlog.xml.

In a different project, I had also done some tinkering with the task of automating changes to multiple files. In that project, I had used especially the ancient and powerful but complex Emacs, and I had also used the new and simple TexFinderX. The latter required me to indicate the files to be changed. In this case, that was easy enough: File > Add File > RevisedBlog.xml.

TexFinderX needed to know what Replacement Table I wanted to use. This involved searching my computer for “Accented to None,” which was the name of one of the prepackaged TexFinderX replacement tables. In Windows Explorer, I opened the folder where that file appeared. I copied “Accented to None – UTF8.txt” to a new file called BlogFixer.txt. I opened BlogFixer.txt in Notepad and, as described more fully in the writeup of that earlier project, I replaced that file’s preexisting list of changes with my own. So, for example, on the first line under “/////true/////” I deleted the “à [tab] a” pairing, which would replace occurrences of the letter à with occurrences of plain old a. Likewise, I deleted all of the other before-and-after pairs in that file: that is, everything after “/////true/////” to the end of the file.

Now I inserted my own tab-delimited list of things to change. I wasn’t concerned with “à [tab] a.” I was concerned, for example, with replacing one URL with another. So if I wanted to replace all occurrences of http://www.cnn.com with http://www.abc.com, I would have a before-and-after pairing, after “/////true/////” in the BlogFixer.txt file, that would look like this:

http://www.cnn.com [tab] http://www.abc.com

where [tab] meant that I would actually hit the Tab key after typing http://www.cnn.com, before typing http://www.abc.com. I created similar before-and-after pairings for everything that I wanted to replace in RevisedBlog.xml. This was obviously a powerful tool, which meant that I could mess up a huge amount of work with no trouble at all. All I needed was a single replacement line that would replace, say, all occurrences of the letter e with smiley-faces. Presto! All my blog posts would be trash — or, worse, they would be ridiculous.
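To make the mechanics concrete, here is a minimal Python sketch of what a tab-delimited replacement table does when applied to a block of text. It is not TexFinderX's implementation: the function names parse_table and apply_table are hypothetical, the sketch ignores the header and the “/////true/////” marker, and it assumes plain literal (non-regex) replacement applied in the order listed.

```python
def parse_table(table_text):
    """Parse tab-delimited before/after pairs, one pair per line."""
    pairs = []
    for line in table_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        before, sep, after = line.partition("\t")
        if not sep or not before:
            raise ValueError(f"not a before[tab]after pair: {line!r}")
        pairs.append((before, after))
    return pairs


def apply_table(text, pairs):
    """Apply each literal replacement in order.

    Returns the new text plus a count of how many times each
    'before' string was found at the moment its rule ran.
    """
    counts = {}
    for before, after in pairs:
        counts[before] = text.count(before)
        text = text.replace(before, after)
    return text, counts


# Hypothetical one-line table and a made-up scrap of post HTML.
table = "http://www.cnn.com\thttp://www.abc.com\n"
post = ('<a href="http://www.cnn.com">news</a> and '
        '<a href="http://www.cnn.com">more</a>')

fixed, counts = apply_table(post, parse_table(table))
print(counts)  # {'http://www.cnn.com': 2}
print(fixed)
```

Reporting the hit counts is a deliberate choice: a rule that matches thousands of times (the letter e, say) stands out immediately, before the changed text goes anywhere important.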

I got the list of URLs that I wanted to change by running a link checker. There were various websites and tools that would check all of the links in your blog and let you know which ones did not actually lead to a working webpage. After looking at several, I decided I especially liked the Online Broken Link Checker and the W3C Link Checker. The former gave me a single list, whereas the latter gave me a lot of clutter. I did nonetheless find the latter useful for providing an organized guide through the blog pages, one at a time, especially if I checked pretty much all of its options (e.g., “Summary only,” “Hide redirects”), ignored its messages about robot exclusion rules and about duplicate anchors, and focused on the 404 Not Found errors. Some of those same 404s appeared in one post after another, and that’s why I came around to this experiment with TexFinderX.
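For readers who would rather script the check, the following is a rough Python sketch of the same idea: collect the links from a page's HTML and report the ones that return 404. It does not reproduce how either of the link checkers above works. The names LinkCollector, extract_links, http_status, and find_not_found are hypothetical, the example.com addresses are placeholders, and the demonstration at the bottom uses a stand-in status table so that it runs without touching the network.

```python
from html.parser import HTMLParser
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived.

    Uses a HEAD request; some servers answer HEAD differently from GET,
    so treat the result as a hint rather than a verdict.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return None  # DNS failure, refused connection, timeout, etc.


def find_not_found(urls, get_status=http_status):
    """Return the distinct http(s) URLs whose status is 404, sorted."""
    web_urls = {u for u in urls if u.startswith(("http://", "https://"))}
    return sorted(u for u in web_urls if get_status(u) == 404)


# Offline demonstration: a made-up page and a stand-in for real lookups.
page = ('<a href="http://example.com/ok">fine</a> '
        '<a href="http://example.com/gone">dead</a> '
        '<a href="#top">anchor</a> '
        '<a href="http://example.com/gone">dead again</a>')
fake_status = {"http://example.com/ok": 200, "http://example.com/gone": 404}
print(find_not_found(extract_links(page), get_status=fake_status.get))
```

Calling find_not_found without the get_status argument would make real requests. Collapsing repeated URLs into a set mirrors the situation described above, where the same dead link turned up in one post after another.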

I kept the list of URLs to be changed in Excel. In an adjacent column, I listed the URLs that should replace them. I copied and pasted this table into BlogFixer.txt. Excel automatically inserted Tab characters between these two columns, so I didn’t have to do that manually in order to wind up with what seemed to be a properly formatted before-and-after list for TexFinderX. At first, I wasn’t seeing BlogFixer.txt when I clicked on the Replacement Table button in TexFinderX, but then I saw that was because I had put BlogFixer.txt into the wrong folder. It needed to be in the TFXTables folder. I clicked through and it ran and was done in about two seconds.
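A table pasted from a spreadsheet only "seems" properly formatted until something checks it. The sketch below is one hedged way to do that in Python before running any replacements. The function name check_table and the threshold of 8 characters are my own assumptions, not anything TexFinderX provides, and the checks are deliberately crude.

```python
def check_table(table_text, min_before_length=8):
    """Return (line_number, problem) tuples for suspicious rows.

    min_before_length is an arbitrary threshold: very short search
    strings are flagged because they are likely to match too much.
    """
    problems = []
    seen = set()
    for number, line in enumerate(table_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated fields, found {len(fields)}"))
            continue
        before, after = fields
        if len(before) < min_before_length:
            problems.append((number, "search text is very short"))
        if before == after:
            problems.append((number, "search and replacement are identical"))
        if before in seen:
            problems.append((number, "duplicate search text"))
        seen.add(before)
    return problems


# Hypothetical table: row 2 is dangerously short, row 3 uses a space, not a tab.
table = (
    "http://www.cnn.com\thttp://www.abc.com\n"
    "e\t:-)\n"
    "http://old.example.com/a http://new.example.com/a\n"
)
for number, problem in check_table(table):
    print(f"line {number}: {problem}")
```

An empty result does not prove the table is safe; it only rules out the mistakes this sketch knows how to look for.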

Now, was I ready to wipe out all of my blog posts on WordPress and replace them by uploading this modified RevisedBlog.xml file? No way! I decided to create a new temporary blog and upload there instead. I had the impression that, once WordPress created a blog address, it would never allow anyone else to use that address. I didn’t want to waste a good address for nothing, so I invented a blog address of gibberish characters that probably nobody would ever try to use. WordPress kindly populated it with a “Hello World!” post. I decided to leave that there and see what would happen to it in the import process. In that new blog’s Dashboard, I went to Tools > Import > WordPress > Browse > select RevisedBlog.xml > Upload file and import > Submit. It gave me a message:

Processing

We’re processing your import now and will send you an email when it’s all done. If you don’t hear from us within 24 hours, please contact support, and we’ll get everything fixed.

That was at 6:30 PM. At 7:04 PM, I received an email telling me the import was successful, and giving me the link to the blog. The “Hello World!” post was still there, so evidently an import just added to whatever was already in the blog (which meant I would presumably need to delete the old posts before importing updated versions of them). I didn’t read through the blog posts, but in general terms the formatting looked right: indenting, links, etc.

I ran Online Broken Link Checker and the W3C Link Checker on this new temporary blog. OBLC still found some 404s, but a much smaller number than before. Other than the ones I had decided to disregard for now, there were only two URLs, used repeatedly in a number of posts, that weren’t working. But at a glance, they seemed to be correct. It was confusing. W3C Link Checker provided the answer by displaying the full URL (which was also available in OBLC if I clicked a link for each remaining URL). It seemed that some of the URLs had gotten extended somehow and were not correct. I wasn’t sure how this had happened. I opened RevisedBlog.xml in Notepad and made the corrections manually instead of using TexFinderX. In the temporary blog’s Dashboard, I noticed that the tags and categories all seemed to be in place. This export-import process really seemed to have worked.
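The cause of the extended URLs was never established here. One possibility, offered only as a guess, is a search string that also occurs inside its own replacement or inside a longer URL in the table. With plain literal replacement, that lengthens text that was already correct. The Python sketch below illustrates the effect and flags such overlaps; find_overlaps is a hypothetical helper and the example.com URLs are invented.

```python
def find_overlaps(pairs):
    """Warn about search strings that can match more than intended."""
    warnings = []
    for i, (before, after) in enumerate(pairs):
        if before in after:
            warnings.append(
                f"{before!r} occurs inside its own replacement {after!r}")
        for j, (other_before, _other_after) in enumerate(pairs):
            if i != j and before in other_before:
                warnings.append(
                    f"{before!r} occurs inside another search text {other_before!r}")
    return warnings


# Invented rule: add ".html" to a URL that lacks it.
pairs = [("http://example.com/page", "http://example.com/page.html")]

# One link needs the fix; the other was already correct.
text = "http://example.com/page and http://example.com/page.html"
for before, after in pairs:
    text = text.replace(before, after)

print(text)
# http://example.com/page.html and http://example.com/page.html.html
print(find_overlaps(pairs))
```

The second URL picks up an extra ".html" because the search string is a prefix of it. Running the same table twice over a file would produce the same kind of damage.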

After correcting the remaining 404s in RevisedBlog.xml, I deleted all existing posts in the temporary blog and re-imported the new, improved RevisedBlog.xml. Weird thing: WordPress took a while, sent me an email saying that the import was completed, but this time there was nothing there. Nothing got imported. Was that because I had deleted the “Hello World!” post — did WordPress need to find something there already in order to do its import properly? Had I closed the WordPress page too soon — did it need to have that “Processing” message visible in order to complete the import? I tried again, keeping the Processing page open. No, that wasn’t the solution: again, it said it was done, but there was nothing there.

I ran a search and found an instance where someone’s XML import timed out, due (they suspected) to the large number of tags appearing at the start of their XML file. But in another case, there was an empty file problem even though the upload was pretty small. I decided that this approach of editing the XML file and reimporting it was not necessarily as straightforward and reliable as one might hope. I did not want to jeopardize my existing blog. So that was pretty much the end of this approach, at least for now.
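One cheap check that can be run on an edited export before uploading it is to confirm that the file is still well-formed XML and still contains the expected number of posts. The Python sketch below does that with the standard library. It would not necessarily have explained the empty imports described above. The function name summarize_export is hypothetical, the sample is a tiny hand-made stand-in for a real export, and the sketch assumes posts appear as item elements with their HTML in a content:encoded child, as in RSS-style exports.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, used for <content:encoded>.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def summarize_export(root):
    """Count <item> elements and how many have a non-empty body."""
    items = list(root.iter("item"))
    with_body = sum(
        1 for item in items
        if (item.findtext(f"{{{CONTENT_NS}}}encoded") or "").strip()
    )
    return {"items": len(items), "items_with_body": with_body}


# Hand-made stand-in for an export file (not real WordPress output).
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <content:encoded><![CDATA[<p>See <a href="http://www.abc.com">this</a>.</p>]]></content:encoded>
    </item>
    <item>
      <title>Empty post</title>
      <content:encoded><![CDATA[]]></content:encoded>
    </item>
  </channel>
</rss>"""

# ET.fromstring raises ET.ParseError if the text is not well-formed XML.
print(summarize_export(ET.fromstring(SAMPLE)))
# {'items': 2, 'items_with_body': 1}
```

For a file on disk, ET.parse("RevisedBlog.xml").getroot() would supply the root element. Comparing the summary of the original export with the summary of the edited copy shows at a glance whether the editing step dropped or emptied any posts.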

Another approach, I thought, might be to take the HTML out of one blog post at a time, put it into a file called PostInProcess.txt, and repetitively run TexFinderX on that file. This would make the multiple changes listed in BlogFixer.txt, as applicable, on one file at a time. Then I would put the fixed HTML back in the blog post and save it. This would still require fixing one file at a time, but there would be a greater safety level in terms of being closer to the changes. I took that approach. It was slow, but it seemed to achieve the desired outcome.
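Part of the appeal of the one-post-at-a-time approach is seeing exactly what changed before pasting anything back. As a hedged illustration, the Python sketch below applies a list of before-and-after pairs to one post's HTML and produces a line-by-line diff using the standard difflib module. The function name preview_changes and the sample post are invented, and the replacement is plain literal substitution, which may differ from what TexFinderX does internally.

```python
import difflib


def preview_changes(original, pairs):
    """Apply literal replacements; return the fixed text and a unified diff."""
    fixed = original
    for before, after in pairs:
        fixed = fixed.replace(before, after)
    diff = list(difflib.unified_diff(
        original.splitlines(),
        fixed.splitlines(),
        fromfile="post (before)",
        tofile="post (after)",
        lineterm="",
    ))
    return fixed, diff


# Invented post HTML and a one-rule table.
post = ('<p>See <a href="http://www.cnn.com">this</a>.</p>\n'
        '<p>Unchanged line.</p>')
pairs = [("http://www.cnn.com", "http://www.abc.com")]

fixed, diff = preview_changes(post, pairs)
for line in diff:
    print(line)
```

Lines starting with - show the text being replaced and lines starting with + show what replaces it. An empty diff means none of the rules applied to that post.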

In other words, I did not really succeed in mass-producing changes across many blog posts. To work through the kinks in the TexFinderX-XML approach, it seemed that I would need to experiment with a different blog, where I was not worried about possibly wrecking the thing.


Making the Same Changes Across Many Blog Posts