Capturing a Long Webpage as a Single Image via ImageMagick

I had a long webpage that I wanted to save as a single image (e.g., JPG or PNG). This post describes the steps I took to achieve that, using ImageMagick.

(Note: I wrote this post because Firefox screenshot add-ons (e.g., Screenshoter, Awesome, Easy, Fireshot) were not capturing the whole length of very long webpages. Perhaps there really was a limit. Later, though, it seemed the problem may just have been in my procedure. I was using the screenshot add-on, telling it to capture the entire webpage, and then pasting the result into IrfanView, my default image viewer. This would truncate the image at a certain point. Eventually, it occurred to me to paste from the screenshot tool directly onto a large (perhaps several hundred inches long), empty canvas in Photoshop. That worked: Photoshop accepted the whole thing and could produce at least a TIF capturing it all in one step, and IrfanView could then view the TIF and save it as a PNG or JPG, if desired. That seemed to solve the problem. Yet later still, it stopped working, and I was not sure why. I have retained the balance of this post as an alternative solution. Another alternative, which I tested only in Chrome, was FireShot Pro ($40; free trial): it did capture, in one image, a webpage that was too long for other add-ons to capture.)

In a previous effort, I had started with a batch file that would capture screenshots every half-second, as I scrolled down the page. Then I tried using panorama-maker software (e.g., Photoshop; Microsoft’s Image Composite Editor) to combine those screenshots into a single image. Unfortunately, the panorama software had problems with this task: it might be able to combine only a few images at a time; it might balk at overlapping screenshots that captured a playing video, because those screenshots would capture different parts of the video and thus would not overlap precisely; it might not be able to produce an image beyond a certain length.

Preparing the Webpage

It had belatedly occurred to me that I might defeat one of those problems — the problem involving screenshots of playing video — by working from a copy of the webpage, rather than the original.

To do that, I would first prepare the webpage as desired. This preparation could involve pre-scrolling or other steps. For instance, if I was capturing a webpage that was broken up into separate pages (with, e.g., a set of links at the bottom, allowing the user to choose from pages 1, 2, 3 . . . Next), I would prepare by using a browser add-on (e.g., AutoPagerize) to display all those pages in one: it would keep unrolling the webpage as long as I kept scrolling down. Or if I was capturing something like a Facebook wall, where the user might have to click a link in order to see “more comments” following an entry, I would prepare by clicking all those links, so that everything I wanted to include was visible.

Once the page was prepared, I could save it as a file. Working from a copy of the webpage, rather than the original, would have the advantage of preserving these potentially time-consuming preparations. It probably made sense to make a file copy of the webpage, just in case, even if I did plan to work from the original. That way, it would not be necessary to start over with preparations, from the beginning, if for some reason the webpage changed, or if the tab or the browser was inadvertently closed during the process. I could just reopen the saved webpage file.

In Firefox, saving the webpage as a file would mean using the menu: File > Save Page As > enter filename > Save as Type: Web Page, complete. Note that, when saved in the “Web Page, complete” format, there would be a similarly named folder, containing material relevant to the saved .htm file. This method might unfortunately save a playing video as a vague blur. To correct that, a perfectionist might have to edit the final image, to insert a recognizable image from the relevant video. Of course, I could also avoid that problem by working from the original webpage, with each video paused at a frame that would convey a sense of its contents.

Alternative Approaches

Assuming I wanted to save the webpage as an image rather than as an .html file with that accompanying folder, the foregoing preparations would be just the starting point. Now I would have to find a way to convert the webpage, or the HTML copy, to an image format. I wanted to use PNG rather than JPG or some other image format, but I was open to JPG or others if necessary.

If the page was not too long, at this point the solution could be simple: I could use a converter. A search led immediately to a list of dozens of online and downloadable conversion solutions. It was not clear that the online solutions would handle a complete webpage, with that accompanying folder, as distinct from a single, standalone HTML file. I tried one such online converter with the HTM copy I had saved in this case. PDF was the only output format offered by this particular converter. It said it was uploading a total of 3.3MB, which was the size of the HTM file alone; the accompanying folder was 14MB. And that seemed consistent with the output: the resulting download was a 178-page PDF with nothing but text. No images. So that wasn’t too satisfying. Really, if I had wanted PDF output, I would have been further ahead just to use the browser’s print command and specify a PDF printer, except that, for a page of this length at least, an attempt to do so failed. (Note that Chrome > Ctrl-P could print the whole page as a PDF, but with limited options as to page size; it appeared unavoidable that there would be many page breaks.)

Another conversion possibility was to use a browser add-on that would capture the whole webpage as displayed. For example, Full Web Page Screenshots (which actually appeared to be a part of, or another name for, Fireshot) and Screenshoter were each able to capture about one-third the length of this page, and that was what I had previously experienced with other screen capture add-ons, some of which are discussed below.

As another approach, after unproductive searches for an alternative in 2012, I had used a command-line converter, wkHTMLtoPDF, to convert HTML pages and also MHT pages to PDF. A search led to a SuperUser discussion, drawing on the wkHTMLtoPDF manual, which suggested that it should be possible to specify an extremely long page length (e.g., 100 meters); but it appeared the user in that case was having difficulty implementing the theory in practice. A PDF solution would presumably yield multiple pages, which would still have to be combined somehow, in order to produce a single image file capturing the entire webpage.

In pursuit of a way to proceed directly from the HTML to the resulting PNG, a search led to numerous indications that such attempts would be difficult and imperfect. But then another search notified me of wkHTMLtoIMAGE, which was apparently part of the wkHTMLtoPDF project. That program’s homepage offered versions in Windows and other operating systems. Although the documentation seemed oriented toward wkHTMLtoPDF, a separate manpage suggested that the general syntax would be the same: wkhtmltoimage [options] <input file> <output file>. I wasn’t sure, though, how that would deal with the subsidiary folder containing files required by the “Web Page, Complete” format. A search produced nothing on that.
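As a rough illustration of that general syntax, here is a minimal Python sketch that assembles a wkhtmltoimage command line for a saved page. The function names and file names are hypothetical; the --format option comes from the wkhtmltoimage manual. Nothing here settles whether the program would pick up the files in the accompanying “Web Page, complete” folder.

```python
import subprocess


def build_wkhtmltoimage_command(html_file, output_file, image_format="png"):
    """Return the argument list for: wkhtmltoimage [options] <input file> <output file>."""
    return [
        "wkhtmltoimage",
        "--format", image_format,  # output format option, per the manual
        str(html_file),
        str(output_file),
    ]


def convert_page(html_file, output_file):
    """Run the command; raises CalledProcessError if wkhtmltoimage reports failure."""
    subprocess.run(build_wkhtmltoimage_command(html_file, output_file), check=True)
```

Calling convert_page("saved_page.htm", "saved_page.png") from the folder holding the saved page would attempt the conversion, assuming wkhtmltoimage is on the system path.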

Based on these findings, in this case it appeared I would need to capture the webpage or the HTML copy as a series of individual images, and then combine those images into a single final image.

Breaking the Page into Screenshots

As noted above, the approach I took in the previous post involved an automated screenshot process and then a panorama-style merging of those overlapping screenshots. This time, I wanted to try a different approach. I wanted screenshots that would not overlap. Instead of using their areas of overlap to inform the panorama software as to how the screenshots would relate to one another, my plan this time was simply to butt them up against one another. Basically, as I scrolled down the screen, viewing the finished product, I expected to see screenshot no. 2 starting where screenshot no. 1 ended. The purpose here was to skip the troublesome panorama software step, and go directly from the original screenshots to the combined result.

To capture the screenshots, I began as I had done previously: I rotated the display 90 degrees, so that the monitor’s long dimension ran vertically and each screenshot captured more of the page. This would reduce the number of screenshots needed, which was helpful not only because extra screenshots meant extra work, but also because pages would not necessarily display as desired when I scrolled or paged down. I might have to adjust the page manually, each time, so that the next screenshot began exactly where the previous one ended. In the present case, hitting the Page Down key would take me almost exactly one page down, but it looked like one or two rows of pixels were missing. That was not a killer, for my purposes, but it was undesirable, and I could not correct it by scrolling. So it did make sense to use a rotated display, and perhaps even to hit Ctrl-Minus once or twice, so as to zoom out and capture more material per page (at the potential cost of resulting image quality).
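The effect of rotating and zooming on the number of screenshots can be estimated with simple arithmetic. The sketch below is only that: the page height and the usable pixels per screen are hypothetical figures, not measurements from this webpage, and it assumes the page shrinks uniformly when zoomed, which real layouts only approximate.

```python
def screenshots_needed(page_height_px, viewport_height_px, zoom_percent=100):
    """Estimate how many non-overlapping screenshots cover the whole page."""
    # At a zoom below 100 the page is drawn smaller, so each screen covers more of it.
    scaled_page = page_height_px * zoom_percent
    capacity = viewport_height_px * 100
    return -(-scaled_page // capacity)  # ceiling division on integers


# Hypothetical figures: a 100,000-pixel page, with about 1,000 usable pixels
# per screen in landscape and about 1,800 once the display is rotated.
landscape = screenshots_needed(100_000, 1_000)
rotated = screenshots_needed(100_000, 1_800)
rotated_zoomed = screenshots_needed(100_000, 1_800, zoom_percent=90)
print(landscape, rotated, rotated_zoomed)
```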

There seemed to be several ways to capture the individual screenshots. One would be to proceed manually: use the Print Screen key to take a screenshot; hit Alt-Tab to switch to IrfanView; use Ctrl-V to paste the screenshot into IrfanView; hit S to save it as the latest in a series of sequentially numbered files; use Alt-Tab to get back to the webpage; hit Page Down; and repeat the process. That would be the slow route, subject to error and interruption. Another approach would be to use a browser add-on to capture and save the screenshot in one operation. In that case, I would just need to hit Page Down, click on the appropriate button, and then proceed to the next page. I thought I had seen a Firefox add-on that would do this, but I was having trouble locating it at this point. Grab Them All didn’t look quite right; Screengrab (fix version) was reportedly adware; Lightshot looked like it was going to require me to draw a rectangle around what I wanted it to capture, each time, which would be slow and would also produce inconsistent pages that would look bad when combined; Awesome Screenshot Plus looked like it was having problems at this point; Fireshot (above) would do it, but would require several steps at each screen; likewise, I suspected, with Nimbus Screen Capture.

It looked like this particular webpage would require somewhere around 100 screenshots. (After rotating the image 90 degrees and hitting Ctrl-Minus a few times, it actually turned out to be half that.) As in the previous post, then, it seemed the easiest solution would be to run my Shotshooter batch file as I paged downscreen, allowing enough time for an automatic capture at each page, and then use VisiPics on its strictest setting (and then Tools > Auto-Select > OK > Delete) to eliminate duplicates. I made a backup of the resulting screenshots and deleted the unwanted ones at start and end.
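For readers without VisiPics, a much cruder stand-in for the duplicate-removal step is to look for files whose bytes are identical. The Python sketch below does only that; it is not what VisiPics does (VisiPics compares image content), and two captures of the same screen will be missed if anything differs between them, such as the mouse pointer or a clock. The function name is hypothetical, and the function only reports duplicates without deleting anything.

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(folder, pattern="*.png"):
    """Return paths whose bytes match an earlier file, in sorted name order."""
    seen = {}
    duplicates = []
    for path in sorted(Path(folder).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)  # same bytes as a file already seen
        else:
            seen[digest] = path
    return duplicates
```

As with the VisiPics step, it would be prudent to back up the screenshots before deleting whatever this reports.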

Concatenating Images via ImageMagick

Previously, I had used ImageMagick’s convert and composite commands to combine images. That approach would make sense if, as in that case, I was manually aligning two images horizontally and vertically. But in the present case, my images were all vertically consistent, though I would be trimming unwanted material from the tops and bottoms of a few. In this situation, given a working ImageMagick installation as described in that other post (i.e., choosing the option to “Add application directory to your system path”; editing the system path, if necessary, to put its reference to ImageMagick first; and then rebooting), one source seemed to say this command should work:

convert *.png -append combined.tif

And it did. I chose a TIF extension (or could have chosen JPG) to keep the command from confusing the output PNG with the input PNGs. (If I had needed to specify the order of files to be combined, I might have started by using IrfanView > File > Batch Conversion/Rename, or something like Bulk Rename Utility, to give the PNGs shortened names, which would be easier to work into a command. I might also have used something like Microsoft Excel to construct the essential command.)
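If the order of files did need to be spelled out, a short script could do the job I had in mind for Excel. The Python sketch below is one hypothetical way to do it: it sorts the filenames so that numbered screenshots fall in numeric order (shot2 before shot10) and then lists them explicitly ahead of -append. The filenames and function names are made up for illustration, and newer ImageMagick releases prefer the program name magick to convert.

```python
import re


def natural_key(name):
    """Sort key that orders shot2.png before shot10.png."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def build_append_command(filenames, output="combined.tif"):
    """Return the convert command listing the inputs in top-to-bottom order."""
    ordered = sorted(filenames, key=natural_key)
    return ["convert", *ordered, "-append", output]


print(" ".join(build_append_command(["shot10.png", "shot2.png", "shot1.png"])))
```

The resulting list could be passed to subprocess.run, or the printed line could be pasted at the command prompt.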

The only problem with the output was what I had already seen when I was taking the screenshots: a bit of material was lost at the end of each screenshot, when I moved to the next screen using PageDown. This example illustrates roughly how much was lost from the particular webpage I was capturing:

For some purposes, losing an entire line of text could be problematic. One remedy, in that case, would be to take overlapping screenshots (given the difficulty of aligning webpage display precisely in the browser) and then edit out the duplicative material by hand, so that there would be neither surplus nor missing material. That might not mean editing every screenshot; perhaps some elisions would be unimportant.
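Rather than editing by hand, the trimming itself could be handed to ImageMagick, using its -gravity and -chop options to remove a given number of pixel rows from the top or bottom of a screenshot. The Python sketch below builds such a command. The function name, filenames, and the 40-pixel figure are hypothetical, and the number of rows to remove would still have to be worked out by eye for each screenshot.

```python
def build_trim_command(source, target, top_px=0, bottom_px=0):
    """Return a convert command that removes rows from the top and/or bottom."""
    command = ["convert", source]
    if top_px:
        command += ["-gravity", "North", "-chop", f"0x{top_px}"]
    if bottom_px:
        command += ["-gravity", "South", "-chop", f"0x{bottom_px}"]
    # Reset any leftover canvas offset (harmless if there is none).
    command += ["+repage", target]
    return command


# Hypothetical case: screenshot 2 repeats the last 40 rows of screenshot 1.
print(" ".join(build_trim_command("shot2.png", "shot2_trimmed.png", top_px=40)))
```

Writing to a new filename leaves the original screenshot untouched, in case the pixel count turns out to be wrong.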
