Capturing a Long Webpage as a Single Image via ImageMagick

I had a long webpage that I wanted to save as a single image (e.g., JPG or PNG). This post describes the steps I took to achieve that, using ImageMagick. Notes at the end discuss some alternate approaches. Over time, I actually started to prefer the last of those alternate approaches.

In a previous effort, I had started with a batch file that would capture screenshots every half-second, as I scrolled down the page. Then I tried using panorama-maker software (e.g., Photoshop; Microsoft’s Image Composite Editor) to combine those screenshots into a single image. Unfortunately, the panorama software had problems with this task: it might be able to combine only a few images at a time; it might balk at overlapping screenshots that captured a playing video, because those screenshots would capture different parts of the video and thus would not overlap precisely; it might not be able to produce an image beyond a certain length. (Another post tackles a similar problem, involving a horizontal rather than vertical concatenation.)


Preparing the Webpage
Alternative Approaches
Breaking the Page into Screenshots
Concatenating Images via ImageMagick
Appendix: Other Approaches

Preparing the Webpage

Eventually, while trying to capture a long Facebook page, I found that (whether because Facebook had changed or because I had gotten smarter) it became possible to pause each playing video. Once paused, a video would stay paused, so the captures worked better in that regard.

Before that development, I found that I could prevent active videos from disrupting the page-merge process by working from a copy of the webpage, rather than the original. To do that, I would first prepare the webpage as desired. This preparation could involve pre-scrolling or other steps. For instance, if I was capturing a webpage that was broken up into separate pages (with e.g., a set of links at the bottom, allowing the user to choose from pages 1, 2, 3 . . . Next), I would prepare by using a browser add-on (e.g., AutoPagerize) to display all those pages in one: it would keep unrolling the webpage as long as I kept scrolling down. Or if I was capturing something like a Facebook wall, where the user may have to click a link in order to see “more comments” following an entry, I would prepare by clicking all those links, so that everything I wanted to include was visible.

Once the page was prepared, I could save it as a file. Working from a copy of the webpage, rather than the original, would have the advantage of preserving these potentially time-consuming preparations. It probably made sense to make a file copy of the webpage, just in case, even if I did plan to work from the original. That way, it would not be necessary to start over with preparations, from the beginning, if for some reason the webpage changed, or if the tab or the browser was inadvertently closed during the process. I could just reopen the saved webpage file.

In Firefox, saving the webpage as a file would mean using the menu: File > Save Page As > enter filename > Save as Type: Web Page, complete. Note that, when saved in the “Web Page, complete” format, there would be a similarly named folder, containing material relevant to the saved .htm file. This method might unfortunately save a playing video as a vague blur. To correct that, a perfectionist might have to edit the final image, to insert a recognizable image from the relevant video. Of course, I could also avoid that problem by working from the original webpage, with each video paused at a frame that would convey a sense of its contents.

Alternative Approaches

Assuming I wanted to save the webpage as an image rather than as an .html file with that accompanying folder, the foregoing preparations would be just the starting point. Now I would have to find a way to convert the webpage, or the HTML copy, to an image format. I wanted to use PNG rather than JPG or some other image format, but I was open to JPG or others if necessary.

If the page was not too long, at this point the solution could be simple: I could use a converter. A search led immediately to a list of dozens of online and downloadable conversion solutions. It was not clear that the online solutions would handle a complete webpage, with that accompanying folder, as distinct from a single, standalone HTML file. I tried one such online converter with the HTM copy I had saved in this case. PDF was the only output format offered by this particular converter. It said it was uploading a total of 3.3MB, which was the size of the HTM file alone; the accompanying folder was 14MB. And that seemed consistent with the output: the resulting download was a 178-page PDF with nothing but text. No images. So that wasn’t too satisfying. Really, if I had wanted PDF output, I would have been better off just using the browser’s print command and specifying a PDF printer, except that, for a page of this length at least, an attempt to do so failed. (Note that Chrome > Ctrl-P could print the whole page as a PDF, but with limited options as to page size; it appeared unavoidable that there would be many page breaks.)

Another conversion possibility was to use a browser add-on that would capture the whole webpage as displayed. For example, Full Web Page Screenshots (which actually appeared to be a part of, or another name for, Fireshot) and Screenshoter were each able to capture about one-third the length of this page, and that was what I had previously experienced with other screen capture add-ons, some of which are discussed below.

As another approach, after unproductive searches for an alternative in 2012, I had used a command-line converter, wkHTMLtoPDF, to convert HTML pages and also MHT pages to PDF. A search led to a SuperUser discussion, drawing on the wkHTMLtoPDF manual, which suggested that it should be possible to specify an extremely long page length (e.g., 100 meters); but it appeared the user in that case was having difficulty implementing the theory in practice. A PDF solution would presumably yield multiple pages, which would still have to be combined somehow, in order to produce a single image file capturing the entire webpage.
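For illustration only, the long-page idea from that SuperUser discussion could be expressed as a wkHTMLtoPDF argument list like the one below. This is an untested sketch, not something I confirmed would work: the function name is hypothetical, the 210 mm width is an assumed value, and the 100-meter height simply echoes the figure mentioned in that discussion.

```python
def wkhtmltopdf_long_page_command(source, output_pdf, width_mm=210, height_m=100):
    """Build (but do not run) a wkhtmltopdf command for one very tall page.

    width_mm=210 is an illustrative assumption; height_m=100 mirrors the
    "100 meters" example from the discussion described above.
    """
    height_mm = int(height_m * 1000)  # convert meters to millimeters
    return [
        "wkhtmltopdf",
        "--page-width", f"{width_mm}mm",
        "--page-height", f"{height_mm}mm",
        source,
        output_pdf,
    ]
```

The list could be handed to subprocess.run() on a system where wkHTMLtoPDF is installed. Whether the program would actually honor a page of that length is exactly the point on which that user was having difficulty.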

In pursuit of a way to proceed directly from the HTML to the resulting PNG, a search led to numerous indications that such attempts would be difficult and imperfect. But then another search notified me of wkHTMLtoIMAGE, which was apparently part of the wkHTMLtoPDF project. That program’s homepage offered versions for Windows and other operating systems. Although the documentation seemed oriented toward wkHTMLtoPDF, a separate manpage suggested that the general syntax would be the same: wkhtmltoimage [options] <input file> <output file>. I wasn’t sure, though, how that would deal with the subsidiary folder containing files required by the “Web Page, Complete” format. A search produced nothing on that.
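As a rough sketch of that syntax, a wkHTMLtoIMAGE command could be assembled as follows. I did not test this against the saved “Web Page, Complete” copy, so how it treats the subsidiary folder remains an open question. The function name and the example file names are hypothetical; --format and --width are the options I would expect to use.

```python
def wkhtmltoimage_command(source, output_image, image_format="png", width_px=None):
    """Build (but do not run) a wkhtmltoimage command line as a list."""
    command = ["wkhtmltoimage", "--format", image_format]
    if width_px is not None:
        # Treated by the tool as a guideline for the rendering width.
        command += ["--width", str(width_px)]
    command += [source, output_image]
    return command
```

For example, wkhtmltoimage_command("saved_page.htm", "saved_page.png", width_px=1200) yields a list that could be passed to subprocess.run() on a machine where the program is installed.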

Based on these findings, in this case it appeared I would need to capture the webpage or the HTML copy as a series of individual images, and then combine those images into a single final image.

Breaking the Page into Screenshots

As noted above, the approach I took in the previous post involved an automated screenshot process and then a panorama-style merging of those overlapping screenshots. This time, I wanted to try a different approach. I wanted screenshots that would not overlap. Instead of using their areas of overlap to inform the panorama software as to how the screenshots would relate to one another, my plan this time was simply to butt them up against one another. Basically, as I scrolled down the screen, viewing the finished product, I expected to see screenshot no. 2 starting where screenshot no. 1 ended. The purpose here was to skip the troublesome panorama software step, and go directly from the original screenshots to the combined result.

To capture the screenshots, I began as I had done previously: I set up the monitor to display and capture images rotated 90 degrees, so as to take advantage of its long dimension. This would reduce the number of screenshots needed, which was helpful not only because extra screenshots meant extra work, but also because pages would not necessarily display as desired when I scrolled or paged down. I might have to adjust the page manually, each time, so that the next screenshot began exactly where the previous one ended. In the present case, hitting the Page Down key would take me almost exactly one page down, but it looked like one or two rows of pixels were missing. That was not a killer, for my purposes, but it was impossible to correct with scrolling, and it was undesirable. So it did make sense to use a rotated display, and perhaps even to hit Ctrl-Minus once or twice, so as to zoom out and capture more material per page (at the potential cost of resulting image quality).

There seemed to be several ways to capture the individual screenshots. One would be to proceed manually: use the Print Screen key to take a screenshot; hit Alt-Tab to switch to IrfanView; use Ctrl-V to paste the screenshot into IrfanView; hit S to save it as the latest in a series of sequentially numbered files; use Alt-Tab to get back to the webpage; hit Page Down; and repeat the process. That would be the slow route, subject to error and interruption. Another approach would be to use a browser add-on to capture and save the screenshot in one operation. In that case, I would just need to hit Page Down, click on the appropriate button, and then proceed to the next page. I thought I had seen a Firefox add-on that would do this, but I was having trouble locating it at this point. Grab Them All didn’t look quite right; Screengrab (fix version) was reportedly adware; Lightshot looked like it was going to require me to draw a rectangle around what I wanted it to capture, each time, which would be slow and would also produce inconsistent pages that would look bad when combined; Awesome Screenshot Plus looked like it was having problems at this point; Fireshot (above) would do it, but would require several steps at each screen; likewise, I suspected, with Nimbus Screen Capture.

It looked like this particular webpage would require somewhere around 100 screenshots. (After rotating the image 90 degrees and hitting Ctrl-Minus a few times, it actually turned out to be half that.) As in the previous post, then, it seemed the easiest solution would be to run my Shotshooter batch file as I paged downscreen, allowing enough time for an automatic capture at each page, and then use VisiPics on its strictest setting (and then Tools > Auto-Select > OK > Delete) to eliminate duplicates. I made a backup of the resulting screenshots and deleted the unwanted ones at start and end.
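The duplicate-removal step can be approximated in a few lines of Python. To be clear, this is not what VisiPics does: VisiPics compares images visually, whereas this minimal sketch only flags files whose bytes are identical to an earlier file. It assumes that repeated captures of an unchanged screen produce byte-identical files, which may not hold for every capture tool. The function name is hypothetical.

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(paths):
    """Return the paths whose contents exactly match an earlier file in `paths`."""
    first_seen = {}   # content digest -> first path that had that content
    duplicates = []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest in first_seen:
            duplicates.append(path)   # same bytes as an earlier screenshot
        else:
            first_seen[digest] = path
    return duplicates
```

Passing the screenshots in capture order means the first copy of each screen is kept and only the later repeats are reported. As with VisiPics, it would be wise to back up the screenshots before deleting anything.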

Concatenating Images via ImageMagick

Previously, I had used ImageMagick’s convert and composite commands to combine images. That approach would make sense if, as in that case, I was manually aligning two images horizontally and vertically. But in the present case, my images were all vertically consistent, though I would be trimming unwanted material from the tops and bottoms of a few. In this situation, given a working ImageMagick installation as described in that other post (i.e., choosing the option to “Add application directory to your system path”; editing the system path, if necessary, to put its reference to ImageMagick first; and then rebooting), one source seemed to say this command should work:

convert *.png -append combined.tif

And it did. I chose a TIF extension (or could have chosen JPG) to keep the command from confusing the output PNG with the input PNGs. I didn’t know for a fact that it would; I just wanted to avoid problems. Another approach would have been to specify something like “convert f*.png,” if the names of all of the input PNGs began with the letter F.

If I had needed to specify the order of files to be combined, I might have started by using IrfanView > File > Batch Conversion/Rename, or something like Bulk Rename Utility, to give the PNGs shortened names, which would be easier to work into a command. I might also have used something like Microsoft Excel to construct the essential command.
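Instead of a spreadsheet, a short script could construct the command. The sketch below sorts the file names so that embedded numbers are compared as numbers (shot2 before shot10) and then builds the same kind of append command shown above. The function names and example file names are hypothetical. The program name defaults to convert, as used in this post; on ImageMagick 7 the program is normally invoked as magick instead.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'shot2' sorts before 'shot10'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


def build_append_command(filenames, output, program="convert"):
    """Return an argument list that stacks the images top to bottom, in order."""
    ordered = sorted(filenames, key=natural_key)
    return [program, *ordered, "-append", output]
```

For instance, build_append_command(["shot10.png", "shot2.png", "shot1.png"], "combined.tif") lists shot1.png, shot2.png, and shot10.png in that order before -append. The list can be passed to subprocess.run(), or joined with spaces and pasted at the command prompt.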

The only problem with the output was what I had already seen when I was taking the screenshots: a bit of material was lost at the end of each screenshot, when I moved to the next screen using PageDown. This example illustrates about how much was lost, from the particular webpage I was capturing:

For some purposes, losing an entire line of text could be problematic. One remedy, in that case, would be to take overlapping screenshots (given the difficulty of aligning webpage display precisely in the browser) and then edit out the duplicative material by hand, so that there would be neither surplus nor missing material. That might not mean editing every screenshot; perhaps some elisions would be unimportant.
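The hand-editing of overlapping screenshots could, in principle, be automated. The sketch below is not something I used: it treats each screenshot as a list of pixel rows (loading the actual images is left out) and looks for the largest run of rows at the bottom of one screenshot that exactly repeats at the top of the next. It assumes the screenshots have already been cropped to the page content, with no toolbars or scrollbars, and that the overlapping rows are pixel-for-pixel identical.

```python
def overlap_rows(upper, lower):
    """Return how many rows at the end of `upper` repeat at the start of `lower`."""
    max_overlap = min(len(upper), len(lower))
    for count in range(max_overlap, 0, -1):   # try the largest overlap first
        if upper[-count:] == lower[:count]:
            return count
    return 0


def stitch(upper, lower):
    """Join two screenshots, dropping the duplicated rows from `lower`."""
    return upper + lower[overlap_rows(upper, lower):]
```

One caveat: where the overlap region consists of identical rows (e.g., blank white space), the match can be ambiguous, so the result would still deserve a visual check.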

* * * * *

Appendix: Other Approaches

I wrote this post because I was finding that Firefox screenshot add-ons (e.g., Screenshoter, Awesome, Easy, Fireshot) were not capturing the whole length of very long webpages. And perhaps there really was a limit. Later, though, it seemed the problem may just have been in my procedure. I was using the screenshot add-on, telling it to capture the entire webpage, and then pasting it into IrfanView, my default image viewer. This would truncate the image at a certain point. Eventually, it occurred to me to paste from the screenshot tool directly to a large (perhaps several hundred inches long), empty canvas in Photoshop. And that worked: Photoshop ate the whole thing, and could produce at least a TIF capturing it all in one step — and then IrfanView could view the TIF and save it as a PNG or JPG, if desired. That seemed to solve the problem. And yet, later still, it didn’t. I was not sure what was happening there.

Another approach, which I tested only in Chrome, was FireShot Pro ($40; free trial): it did capture, in one image, a webpage that was too long for other add-ons to capture. But then, later, it too failed. Possibly my free trial had expired by the time I tried it the second time.

Another approach — and, as of my most recent try, this one worked best for me, merging the equivalent of 43 PDF pages into a single PNG: print as PDF in your web browser, using pages wide enough to prevent Facebook’s Chat button (which presently appears at the bottom right of the Facebook screen) from interfering; convert the PDF to image files, one per page; edit the image files as needed; and then join the image files edge-to-edge. In more detail:

  • For me, the PDF printing was better done with the Adobe PDF printer; the Microsoft Print to PDF printer did not render the Facebook page correctly.
  • With the Adobe PDF defaults, I found that I had to trim 0.4″ from the top and bottom of each page — more, for some pages. Another way to crop the pages would be to convert them to image files first and then use IrfanView’s File > Batch Conversion/Rename > Use Advanced Options.
  • I was able to convert the PDF pages to individual image files using Acrobat, but no doubt other PDF programs also have that capability. In Acrobat, this conversion required a high dots-per-inch setting (I used 2400) to preserve quality. That setting produced large PNGs, and it may have been counterproductive: the PNGs were so large that they crashed ImageMagick when I attempted to stitch them together, bottom edge to top edge, with no overlap. I had to use IrfanView’s batch conversion with advanced options to save them as JPGs with their short side reduced to 1000 pixels.
  • ImageMagick was able to merge the shrunk JPGs, using “convert f*.jpg -append combined.png” (i.e., basically the same command as discussed in the previous section). On the few occasions when the page joints occurred in the middle of an image, I got a thin white break, like this, that I might have been able to alleviate with a more refined ImageMagick command:

I reduced the number of such breaks by printing long PDFs. I found that Adobe’s PDF printer was able to print PDF pages 15″ wide x 190″ long (this time, at least), so that even a long webpage would have few breaks. I had to print 15″ wide to keep FB’s stupid Chat box off to the right side.
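Some arithmetic suggests why the 2400 DPI exports mentioned above were so unwieldy. The sketch below converts a page size in inches to pixels, estimates uncompressed size, and computes the dimensions after reducing the short side. The function names are hypothetical, and the 3 bytes per pixel (8-bit RGB) is an assumption; ImageMagick may use more memory per pixel internally.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_megabytes(size, bytes_per_pixel=3):
    """Estimate raw size in MB, assuming 8-bit RGB (3 bytes per pixel)."""
    width, height = size
    return width * height * bytes_per_pixel / 1_000_000


def shrink_to_short_side(size, short_side_px):
    """Scale (width, height) so the shorter side equals `short_side_px`."""
    width, height = size
    short = min(width, height)
    return round(width * short_side_px / short), round(height * short_side_px / short)
```

Assuming, for illustration, a Letter-size page (8.5″ x 11″; the post says only that these were the Adobe PDF defaults), pixel_dimensions(8.5, 11, 2400) gives 20400 x 26400 pixels, or roughly 1,600MB uncompressed per page under the 3-byte assumption. Reducing the short side to 1000 pixels brings that page down to 1000 x 1294.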

In a later try, I found that, for some reason, Facebook’s header was appearing at the top of every page in the PDF, blocking out some content. I rectified that by printing the same page with two slightly different (long) page lengths, cropping in Acrobat and exporting to JPG as described above. As long as the two page lengths differed by at least an inch or two and weren’t an easy multiple of each other, they wouldn’t share the same page-break point. (For instance, the least common multiple of pages 190″ and 175″ long would be 6650″. My webpage wasn’t that long, so the printouts wouldn’t encounter a page break shared by both printouts.)
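That least-common-multiple reasoning can be checked with a few lines of Python. This is a simplified sketch with hypothetical function names: it works in whole inches and ignores margins and cropping, so it only confirms the basic arithmetic.

```python
from math import gcd


def least_common_multiple(a, b):
    """Smallest length that is a whole number of pages in both printouts."""
    return a * b // gcd(a, b)


def shared_break_points(length_a, length_b, document_length):
    """List the positions (in inches) where both printouts break at once."""
    step = least_common_multiple(length_a, length_b)
    return list(range(step, document_length, step))
```

Here least_common_multiple(190, 175) is 6650, so shared_break_points(190, 175, 2000) is empty for a hypothetical 2000″ webpage. By contrast, page lengths of 100″ and 50″ would share a break at every 100″, which is why lengths that are easy multiples of each other are best avoided.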

Unfortunately, neither my version of Photoshop nor Microsoft Image Composite Editor could accurately merge pages of that size. I had to crop each page manually. This cropping sought (a) to make each JPG end at a place where no joint would be noticeable and (b) to eliminate all overlap. In effect, I made minor crops at the starts and ends of each long (190″) page, and I made major crops in the middles of the shorter (175″) pages, so that the latter were reduced to serve as short extensions, each just a few inches long, covering the page breaks in the longer pages. Then I renamed all of these long pages and short extensions as 01.jpg, 02.jpg, etc., so that each long page was followed by its short extension. That way, the ImageMagick append command (above) was able to butt them up against each other in the proper order, with no overlap and nothing missing. I still had to resize width and DPI of the resulting PNG (using IrfanView > Image > Resize/Resample) to get filesize down without losing quality (I used 10″ wide and 72 DPI as the final setting). That produced a PNG of about 40MB.
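The renaming step lends itself to a small script. The following sketch builds a rename plan that places each short extension immediately after the long page whose break it covers, using names like 01.jpg, 02.jpg. The function name and the example file names are hypothetical; the sketch only produces (old name, new name) pairs and leaves the actual renaming to the user.

```python
def interleaved_rename_plan(long_pages, extensions, suffix=".jpg"):
    """Pair each old name with a numbered new name, long page then its extension.

    Any extensions beyond the number of long pages are ignored.
    """
    ordered = []
    for index, page in enumerate(long_pages):
        ordered.append(page)
        if index < len(extensions):
            ordered.append(extensions[index])   # covers the break after this page
    digits = max(2, len(str(len(ordered))))      # at least two digits: 01, 02, ...
    return [(old, f"{number:0{digits}d}{suffix}")
            for number, old in enumerate(ordered, start=1)]
```

For example, two long pages and one extension would come out as 01.jpg (first long page), 02.jpg (extension), and 03.jpg (second long page). Each pair could then be applied with os.rename(), after which the f*.jpg pattern in the append command would need to change to *.jpg.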
