As detailed in the Conclusion (below), this post describes my exploration of a number of different tools and techniques for repairing damaged or corrupt PDF files. For the particular set of PDFs I was working on, the most effective technique would have started with an examination of file names and contents in Notepad (see Manual Repair section). The several canned programs I tried were also able to restore a few corrupt files. In addition, with the investment of sufficient time and effort, certain command-line tools might also have been helpful. Of course, a proactive backup system would be the first line of defense.
Introduction: Trying a Few Quick Tools
As described in another post, I had found a way to scan the PDFs scattered around my drive, to see if any were corrupt. Having found 34 corrupt PDFs, I looked for tools that might help me repair them. (I did have backups, but some of these files appeared to have been corrupt since before my oldest backup.)
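That kind of scan can be sketched in a few lines. This is a hypothetical standalone check on the %PDF- magic bytes, not the actual method from the other post:

```python
import os

def looks_like_pdf(path):
    """Return True if the file begins with the %PDF- magic bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

def find_suspect_pdfs(folder):
    """List *.pdf files under a folder that lack the PDF header."""
    suspects = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(".pdf"):
                path = os.path.join(root, name)
                if not looks_like_pdf(path):
                    suspects.append(path)
    return suspects
```

A missing header does not prove corruption (and a present header does not prove health), but it is a cheap first filter.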
First, I looked at several options highlighted in a search. I wound up trying two tools: PDF Repair Free by Recovery ToolBox (a portable utility) and Repair PDF by PDFaid (online). (Later, I found a similar tool by OfficeRecovery.) Both programs proceeded one file at a time. Pending a more thorough search for alternatives, it appeared I would have to pay for commercial software if I wanted something capable of repairing files en masse.
PDF Repair Free indicated that it was able to repair only three of the 34 corrupt PDFs. For the others, it gave me an error message: “Unable to recover file: no PDF data has been found.” I suspected that the “Acrobat could not open” error messages (above) corresponded to those that PDF Repair Free could not repair, and that the “There was an error opening this document” messages (above) corresponded to those that PDF Repair Free felt that it could repair. The analyze/repair sequence of buttons and messages in PDF Repair Free seemed somewhat illogical. But it did give me readable PDFs for those three that it had attempted to repair.
I tried Repair PDF (online) on those same 34 files. It indicated that it allowed a maximum file size of 20MB. Fortunately, only one of the 34 was larger than that, at 22MB. When I tried that one, I got an error: “Only 20MB max Allowed.” To use the online tool, I had to click its Browse button, navigate to the folder containing the corrupt PDF (which address it remembered after the first time), select the file, click the Repair PDF button, and wait for it to try to repair the file. For the most part, I got error messages: “Sorry! Your PDF is corrupted to a level that cannot be repaired.” For the file that it could repair, it said, “Success! Click Here to downloaded your File!” It gave the recovered PDF a strange name and put it into my default Downloads folder. Altogether, the online Repair PDF tool claimed success in repairing only one PDF. It was one of the three that PDF Repair Free had also been able to repair. These results suggested that not all PDF recovery tools were made alike.
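Given the tool's stated 20MB ceiling, a quick size pre-check avoids pointless upload attempts. A minimal sketch (the limit is the one the tool reported; the function name is my own):

```python
import os

MAX_BYTES = 20 * 1024 * 1024  # the online tool's stated 20MB ceiling

def small_enough(path, limit=MAX_BYTES):
    """Return True if the file is within the upload size limit."""
    return os.path.getsize(path) <= limit
```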
Another search led to additional tools, apparently capable of bulk repair, at a price. A search for free alternatives led to a suggestion to use Poppler, in Linux, or Ghostscript. The webpage for GSview, a Ghostscript GUI, provided links for downloading Ghostscript 9.15 and GSview 5.0. I installed those programs in that order and ran GSview. I got as far as File > Open before it crashed. Tried again; same thing.
Another search led to a page offering several command-line suggestions involving Ghostscript, pdfopt (which appeared to be a Linux and/or Ghostscript tool), or PDFtk. I installed PDFtk Free (the pro version was only $3.99), put a copy of pdftk.exe in the folder where I was working, and tried this command with each of the corrupt PDFs:
pdftk "J:\Input Folder\File name.pdf" output "J:\Output Folder\File name.pdf"
replacing the folder and file names as needed. That didn’t work: it produced this:
Error: Unable to find file.
Error: Failed to open PDF file:
[path and file name]
Errors encountered. No output created.
Done. Input errors, so no output created.
I guessed that perhaps PDFtk could not handle a path name in the input, so I tried again with a modified version of the command:
pdftk "File name.pdf" output "J:\Output Folder\File name.pdf"
But that produced the same input error message. Was PDFtk not able to handle quotation marks? I tried again with this form of command:
pdftk File name.pdf output "J:\Output Folder\File name.pdf"
But no, that definitely was not the solution: now I got a series of errors indicating that PDFtk was unable to find the file named “File” and the file named “name.pdf.” So, alright, evidently it couldn’t handle spaces or quotation marks: apparently the file name had to be one continuous word. I tried changing the input file name and ran it again, like this:
pdftk Filename.pdf output "J:\Output Folder\File name.pdf"
But that wasn’t it either! I was stumped. A search took me to a thread in which someone suggested that the problem (as of back then, anyway) might be that PDFtk was very old and had not been maintained. I wasn’t sure that would explain the failure of its Windows version to function reasonably in an old DOS/cmd type environment.
It occurred to me that maybe the advice (above) was mistaken. I went to a page full of PDFtk examples. It did seem to confirm what I was doing. But was PDFtk perhaps calling it an input error when, in fact, it was objecting to the spaces and/or the quotation marks on the output side of the command? I tried again with this:
pdftk Filename.pdf output Fixed.pdf
That didn’t work either. It was time to try something else.
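In hindsight, one way to take shell quoting out of the equation is to hand the tool an argument list rather than a single quoted command string. A minimal Python sketch of that idea (pdftk's availability, and the paths, are assumptions; the run() call requires pdftk on the PATH):

```python
import subprocess

def build_pdftk_repair(infile, outfile):
    """Argument list for pdftk's rewrite-to-new-file operation.
    Passing a list (rather than one quoted string) means each path,
    spaces and all, reaches pdftk as a single argument."""
    return ["pdftk", infile, "output", outfile]

def repair(infile, outfile):
    # Hypothetical invocation; only works if pdftk is installed.
    return subprocess.run(build_pdftk_repair(infile, outfile)).returncode
```

Whether this would have cured the "Unable to find file" errors on that PDFtk build is untested; it only rules out one common quoting pitfall.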
Several sites had said that Ghostscript could fix broken PDFs. That was consistent with the Ghostscript documentation on debugging, which said, “Normally, PDF interpreter tries to repair all problems in PDF files.” But I had previously attempted to use Ghostscript in a Windows environment without success, and for a while it looked like that was going to be the situation once again. I tried entering the recommended command:
ps2pdf damaged.pdf fixed.pdf
but Windows said, “‘ps2pdf’ is not recognized as an internal or external command, operable program or batch file.” A search for ps2pdf.exe indicated that no such file existed on my computer. Likewise for pdfopt.exe. It seemed that the page advising me to try those commands must have been thinking in terms of Linux systems.
The Ghostscript documentation said that the Ghostscript executable on a Windows system was gswin32c, not just gs as in a Linux system. On my system, Control Panel > Programs and Features confirmed that I did have GPL Ghostscript installed, but a search for gswin32c confirmed that the only such file on my system was gswin32c.exe, installed in a subfolder under my Bullzip PDF printer — not in the Ghostscript program folder. The Ghostscript documentation said that I should use gs, and also seemed to assume that I had ps2pdf. I was not sure why those executables would not have been included in my Ghostscript installation. But I could have used the gswin32c.exe from that Bullzip installation, if I had understood its relationship to ps2pdf.
Eventually I discovered that my system did have a number of ps2pdf executables, but they were all *.bat and *.cmd, not *.exe. The Ghostscript documentation said that, for example, ps2pdf14 (presumably referring, on a Windows system, to ps2pdf14.bat or ps2pdf14.cmd) would produce PDF 1.4 output, compatible with Acrobat 5 and later. (That appeared to be the most advanced PDF output level that Ghostscript offered.) A look at ps2pdf14.bat in Notepad seemed to indicate that it had not been formatted for Windows. Nonetheless, if I opened a command window in the folder containing ps2pdf14.bat (which, on my system, was C:\Program Files\gs\gs9.15\lib) and typed ps2pdf14, it ran, in the sense of giving me a tip on proper usage rather than a “not recognized” error. Responding to that tip, I tried this command in that folder:
ps2pdf14 "I:\Current\Corrupt PDFs\test.pdf" "testout.pdf"
but now it said, “‘gswin64c’ is not recognized as an internal or external command, operable program or batch file.” That was odd: a search within my computer indicated that gswin64c.exe existed in C:\Program Files\gs\gs9.15\bin. But in any case, now it seemed that this was why my search for gswin32c.exe had failed: I was using a 64-bit system, so it apparently had not been included in the installation. The rather convoluted situation now seemed to be that ps2pdf14.bat was calling ps2pdfxx.bat, and somehow that was eventually (not within ps2pdfxx.bat itself) resulting in a call for gswin64c.exe — and for some reason gswin64c.exe was not being found.
Trying Poppler in Ubuntu
I could have pursued that line of inquiry further, in a desperate bid to make Ghostscript work, but my searches were not leading directly to clear guidance, and now I was curious about another approach. A comment in one discussion suggested that the poppler-utils package in Linux was especially capable of opening PDFs that even Ghostscript couldn’t open, that Evince and Okular were common Linux PDF viewers, and that they used Poppler.
A search led to an indication that Evince was the default PDF viewer in Ubuntu. I did not have Linux installed anywhere at this point, but I booted Ubuntu 14.10 from a live CD (selecting the “Try Ubuntu” option, not “Install Ubuntu,” once the CD had booted) and tried to view my corrupted PDFs that way. To do that, I double-clicked on one of Ubuntu’s folder icons, navigated to the place where I had the PDFs, and double-clicked on one of them. I got an error message:
Unable to open document [filename]. File type HTML document (text/html) is not supported.
Apparently Evince was not going to tell me that the PDF was corrupt; it was just going to assume that I must be trying to view some kind of document that was beyond its capabilities. I tried on a few others and got the same message, or variations that said, “File type DOS/Windows executable (application/x-ms-dos-executable) is not supported,” and likewise for “File type unknown (application/octet-stream)” and various other types. So it seemed that perhaps Evince would read different kinds of PDF corruption as suggesting different types of files — which was very interesting, but I still couldn’t read my PDFs.
I clicked on all 34 PDFs. I noticed that, in Ubuntu, most were represented by a vaguely Adobe-looking icon, but a few were represented by thumbnails of the actual document. This seemed to be an indication of which PDFs Ubuntu was able to read. In other words, it appeared that I didn’t actually have to try to open the document to see whether Evince was going to be able to read it; I could tell just by looking at the thumbnails. Soon I discovered I could select multiple (or all) icons and hit Enter to open them all at once, rather than opening each one individually. Ctrl-W worked, as in Windows, to close individual windows.
In the end, I found that Evince was able to read only the three PDFs that I had already been able to read in the quick tools that I tried at the outset (above). There was a difference, though. PDF Repair Free had recovered three files, but each one was only two pages long, and the first page of one of those was blank. Repair PDF had recovered only one document, but had recovered its entire 31 pages. Evince, too, recovered that 31-page document. It recovered only the first page of another document, whereas PDF Repair Free had recovered the first two pages, but it indicated that that document had actually been 77 pages long. Finally, Evince recovered only the first page of the third document, but indicated that it had originally been three pages long.
So Evince in Ubuntu seemed to be just another possibility for recovering corrupted PDFs: better than some tools with some documents, perhaps, and worse with others. I wondered whether Okular (mentioned above) would be a better Linux alternative. My Ubuntu recollection had aged several years at this point, but Okular appeared to be the default PDF viewer for the KDE desktop, which was different from the Unity desktop that seemed to be the Ubuntu 14.10 default desktop. I found instructions on installing a different desktop environment, but that seemed like a lot of work. Nor did that seem necessary: it looked like I might be able to install Okular without changing the desktop. But doing so within a live CD setup would apparently require me to customize the live CD: it didn’t look like I could load packages (e.g., Okular) into RAM while the CD was booted.
There was another way to run Linux without formally installing it on my computer. Following advice, I downloaded and installed VirtualBox (choosing to install none of the subfeatures — i.e., no USB, networking, or Python support — and making sure to Register file associations). Then, in the new VirtualBox setup, I went into File > Preferences and set the default location where I would store my multi-GB virtual machines (VMs). Next, I clicked the New button, accepted the default RAM assignment, and created a new 8GB Ubuntu virtual hard drive (VDI) with dynamically allocated space. Now I went into Settings > Storage and clicked the blue disk+ icon at the bottom. The tooltip (coming up when I moused over that icon) said, “Adds a new attachment to the Storage Tree using currently selected controller as parent.” The currently selected (default) controller was the IDE (with an icon of a round disc), not the SATA. I clicked on that blue icon. It said, “You are about to add a new CD/DVD drive to controller IDE.” I selected “Choose disk” and navigated to my downloaded Ubuntu ISO.
Now, back in the Ubuntu 14.10 Settings dialog in VirtualBox, I clicked System at the left side, selected CD/DVD in the Boot Order box, and moved it up to the top of the list. I clicked OK and then double-clicked on the icon at left that said, “Ubuntu 14.10 – Powered Off.” That started the virtual machine (VM). Here, unlike the live CD (above), I clicked Install rather than Try Ubuntu. I checked “Download updates while installing” and then “Erase disk and install Ubuntu” (referring to the newly created virtual drive, not to my hard disk and all my real files). The VM was very slow; it took a while to install Ubuntu. When it was done, I tried to restart, but the Ubuntu shutdown sequence seemed to have frozen, so I killed it and restarted it in VirtualBox. Then, in the Ubuntu VM, as advised, I went into the top menu bar > Devices > Insert Guest Additions CD Image > Run.
When that was done, I tried to figure out how to add Okular to the Ubuntu VM. At the left side of the screen, I went into the top left Ubuntu-like icon (tooltip: “Search your computer and online sources”) and searched for Okular. It was there, under the Reference heading. (I could also have searched for Synaptic and taken that roundabout but familiar approach to find and install Okular.) I double-clicked on the Okular icon. That didn’t work: evidently it wasn’t showing me the Okular program after all. I tried a search for Terminal, and then double-clicked on the icon showing a command window. In that window, I typed “sudo apt-get install okular.” That worked. When the install process finished, I found myself at a command prompt. I typed “okular.” Okular opened. Unfortunately, it didn’t seem to offer any way to select my defective PDFs. I guessed that perhaps I did need to open it within the KDE desktop environment. I wasn’t going to go to that trouble in this extremely slow VM. I bailed out.
The JSTOR/Harvard Object Validation Environment (JHOVE, pronounced “jove”) was, among other things, an effort to validate digital objects, including PDFs. At this writing, its software had apparently last been updated on 9/30/2013. Despite its name, a parallel project, JHOVE2, had seemingly last been updated on 3/18/2013. According to COPTR, as of December 2014, much of the JHOVE “development effort has been diverted to JHOVE2” but JHOVE “is still actively maintained and developed (apparently as a solo project) as it supports some common formats that JHOVE2 does not.” COPTR specified that, among other formats, JHOVE supported, but JHOVE2 did not support, JPEG, JPEG2000, and PDF. Hence only JHOVE was relevant here.
JHOVE and other comparable tools (e.g., DROID, NLNZ, FITS, FIDO) appeared to be of interest especially to librarians and other digital preservation professionals. This was not to say that such tools were necessarily any more powerful or well-designed than other options considered above. For example, one source stated that JHOVE was, but DROID and NLNZ were not, able to confirm that a document was “well-formed and valid”; another source asked, “Why can’t we have digital preservation tools that just work?” Note that a common purpose of such tools (e.g., DROID), not very relevant here, was characterization as distinct from validation of files — determining, that is, whether they were indeed PDFs or might instead be misnamed files of some other type.
Another tool in this category was Apache PDFBox, whose standalone app was run from the command line with syntax like this:

java -jar pdfbox-app-x.y.z.jar PDFDebugger [inputfile]
To experiment with that, I downloaded and unzipped the pre-built PDFBox 1.8.7 standalone binary (pdfbox-app-1.8.7.jar). As I had suspected when I saw that command line syntax, however, it quickly developed that using these command-line tools required more programming knowledge than I possessed. FITS, the other non-JHOVE validation tool on Gary McGath’s list, indicated that JHOVE was the only tool it used for validation.
So I focused on JHOVE. According to Wikipedia, JHOVE sought to analyze documents for consistency with basic requirements of a specified format (e.g., PDF). Wikipedia indicated that the currently available JHOVE download (version 1.11, as of 9/28/2014) included both command-line and GUI versions. It appeared, moreover, that there were instructions for installing in Windows and also in an Ubuntu VM.
I downloaded and unzipped JHOVE 1.11 and looked for instructions. In the JHOVE tutorial, the sections on installation and configuration specified a number of ways in which it would be necessary to edit certain aspects of the JHOVE configuration file and related matters. While these might not be overwhelming to users with some command-line experience, they raised the concern of whether I, at best a novice to programming, would invest a lot of time and then find that I had not done it right, or that the program had not done what I wanted.
Skipping ahead to that latter issue, JHOVE relied on Adobe’s PDF Reference for the particular version of PDF (e.g., 1.6) to determine whether a PDF file would qualify as well-formed. The documentation on JHOVE’s PDF-hul Module said that “documents using features specific to PDF 1.7 or later may be reported as not well-formed or valid.” PDF 1.7 had first become available in October 2006, with Adobe Acrobat 8 and Adobe Reader 8. I had been using Acrobat 9 for years. Thus, it appeared that JHOVE might incorrectly report many of my PDFs as invalid.
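The PDF version that JHOVE would be judging a file against is declared on the file's first line. A small sketch of reading it (this parses only the %PDF-x.y header, nothing more):

```python
def pdf_header_version(data):
    """Return the (major, minor) version from a PDF's %PDF-x.y
    header, or None if the header is absent or malformed."""
    if not data.startswith(b"%PDF-"):
        return None
    try:
        major, minor = data[5:8].split(b".")
        return int(major), int(minor)
    except ValueError:
        return None
```

A result of (1, 7) or higher would flag a file that, per the PDF-hul documentation quoted above, JHOVE might misreport.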
Buried in the tutorial, I found documentation on the JHOVE GUI. It said that I could start the GUI with an “appropriate mouse click.” I looked at the files that had been unzipped from the download. The top-level folder contained a file called jhove.bat. I double-clicked on it. I got the telltale flash of a command window opening and closing, telling me that this was a command-line program. A search led to a guide for using JHOVE on Windows XP. It told me that the file to run was the JhoveView.jar file in the bin subfolder. But double-clicking on that achieved nothing.
It appeared possible that I would have to do some of the aforementioned manual configuring before I could get JHOVE to run. And then, if it did run, it appeared that JHOVE was outdated. I declined to pursue JHOVE further at this point.
Trying Manual Repair
In my browsing, I had encountered a few references to fixes that users had administered by opening a PDF file in Notepad. For instance, one person said that s/he had fixed a corrupted PDF by removing all text after the line that read %%EOF. I assumed EOF meant “end of file,” and that somehow some additional material had gotten added to the file where it shouldn’t. A search for more tips like that led to these suggestions:
- Adobe said it was possible to edit the registry so that its PDF readers would be less picky about what they would choose to open.
- PawPrint described a technique for removing JPGs from PDFs containing images.
- BrendanZagaeski offered details on the structure of PDF files.
- A tutorial pointed out that Microsoft Word 2013 was capable of editing a PDF.
- Firebird presented ways of improving a defective PDF.
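The %%EOF truncation mentioned above can be sketched in a few lines, operating on the file's raw bytes. It is shown here as the technique those users described, not as a guaranteed fix:

```python
def truncate_after_eof(data):
    """Drop everything after the last %%EOF marker in a PDF's bytes.
    Returns the bytes unchanged if no marker is present."""
    pos = data.rfind(b"%%EOF")
    if pos == -1:
        return data
    return data[: pos + len(b"%%EOF")]
```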
There were countless additional suggestions. At this point, however, I was not inclined to explore them in great detail. I opened my defective PDFs in Notepad and just looked at the start and the end of each, to see if I could gather anything obvious. I was able to open the corrupt PDFs quickly by using the spreadsheet (above) to produce a revised batch file with commands like this:
start Notepad "file name.pdf"
Looking at the corrupt PDFs in Notepad was certainly revealing. Several were completely blank. I would have expected these to be the ones that Windows Explorer reported as having a size of 1 KB (which I assumed probably meant closer to zero KB). In that, I was mistaken. The smallest files were not PDFs: they were text files that somehow had acquired a PDF extension by mistake. I renamed them with a .txt extension. One or two others contained text, but they looked like they might be URL files (i.e., links to webpages). (They contained several Prop lines and a URL line.) So I gave them a .url extension instead. When I double-clicked on these files with modified extensions in Windows Explorer, all but one opened or ran as expected.
For the ones that were completely blank, there did not seem to be anything I could do. Some of them had a lot of spaces, tabs, and/or empty lines: I could arrow around in them, down and across the page, but there seemed to be nothing there. (I viewed them with Notepad’s Word Wrap feature turned on, just in case there was text beyond the visible right margin.)
In some of the corrupt PDFs, the contents appeared to belong to an email message, not a PDF. For instance, one began with “X-Mozilla-Status” and proceeded to mention various email addresses. Changing the extension to .eml produced openable email files. Similarly, another contained various HTML codes, like these:
<HTML> <BODY> <TD>
Changing that file’s extension to .html produced an openable page. Another seemed to be a program file, containing plain-text instructions that apparently served as some kind of configuration file. I gave it a .txt extension. Two others looked like they might be XML files: they didn’t start out right — it looked like their beginnings might have gotten clipped somehow — but they contained stuff like this:
</Specs> <State> <ColCRCWidth Value="73"/> <ColSizeWidth Value="104"/>
Changing their extensions to .xml did not work: the attempt to open them produced XML parsing errors. Two other corrupt PDFs did appear to be PDF files: they began with the “%PDF-1.6%” that I had seen mentioned in some of my browsing for solutions. I searched those files for the “%%EOF” marker mentioned above, deleted everything after that, saved the files, and tried to open them. These, incidentally, were copies of two of the files that other tools (above) had at least partially recovered. In these cases, my edits made things much worse: when I tried to open the files that I had doctored, Acrobat said, “There was an error opening this document. The root object is missing or invalid.”
Most of the remaining corrupt PDFs were complete gibberish. There did not appear to be any human-readable text in them. Only one of them contained the “%PDF-1.6%” (or other number) marker just mentioned. That one, incidentally, was the 31-page document that one of the other tools (above) had been able to recover. It looked like that one might consist of two or three documents munged together: its “%%EOF” marker appeared before some references to “HTTP 1.1” and to “%PDF-1.3%.” I tried breaking it into two files, delimited by where the PDF and EOF markers were (or where I thought they should have been), and saving each as a separate PDF in Notepad. This attempt was largely a failure: the one document would not open as a PDF, and the other opened with errors that amounted to elimination of most of its text.
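Splitting a munged file at its PDF headers can be done mechanically. A sketch of that approach (each piece starts at a %PDF- header; any bytes before the first header are discarded, which is an assumption about where the boundaries lie):

```python
def split_at_pdf_headers(data):
    """Split a byte string into pieces, each beginning at a %PDF-
    header; leading bytes before the first header are dropped."""
    marker = b"%PDF-"
    starts = []
    i = data.find(marker)
    while i != -1:
        starts.append(i)
        i = data.find(marker, i + 1)
    return [data[a:b] for a, b in zip(starts, starts[1:] + [len(data)])]
```

As the result above showed, a clean split at the markers does not guarantee that either piece is a complete, openable PDF.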
It appeared, by now, that the search for PDF and EOF markers was of limited value. I did search the remaining files full of gibberish for “%%EOF” markers; none had that. There did not seem to be anything else I could do for those gibberish files.
That left me with a few files that did contain recognizable text. Three of these turned out to be Word documents: I saw a reference to Microsoft Word at the end of the file. I changed their extensions to .doc and they opened without problems in Word. (If that had failed, I was going to delete the gibberish and just save the text.) Finally, one was not a document: an inspection of its readable text indicated that it was a program file. Unfortunately, changing its extension to .exe did not produce a working executable.
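This triage, deciding what an alleged PDF really is from its visible contents, can be partly automated by sniffing for the signatures I was seeing. A sketch (the signature list reflects only the cases encountered here, and the fallbacks are my own guesses):

```python
def guess_extension(data):
    """Suggest a more plausible extension for a mislabeled 'PDF',
    based on the content signatures seen during this triage."""
    if data.startswith(b"%PDF-"):
        return ".pdf"
    if data.startswith(b"MZ"):               # DOS/Windows executable
        return ".exe"
    if b"X-Mozilla-Status" in data[:4096]:   # saved email message
        return ".eml"
    if b"<HTML" in data[:4096].upper():      # HTML page
        return ".html"
    if b"Microsoft Word" in data:            # Word document marker
        return ".doc"
    return ".txt"                            # fall back to plain text
```

A tool like this can only suggest a rename; as with the renamed files above, the proof is whether the file then opens.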
In this post, I described my efforts to find fixes for a few dozen corrupt PDFs. The strategies I explored included the use of one downloaded and one online PDF repair program; the PDFtk and Ghostscript command-line tools; two Ubuntu PDF readers; JHOVE and other digital preservation programs; and manual repair.
Manual scrutiny was the most effective of those strategies, at least with this particular group of PDFs. That was primarily because a number of the PDFs had simply acquired the wrong extension. Opening them en masse in Notepad quickly highlighted those capable of opening when renamed with a more suitable extension. This experience suggested that one might begin by simply reviewing filenames to identify any that seemed likely to be something other than a PDF — because no PDF repair tool will repair a file that wasn’t a PDF in the first place. For the files that actually were PDFs, manual editing only made things worse; viewing them in Notepad helped only insofar as it guided the change of filename extensions. Actual surgery on a PDF appeared to require more knowledge, and perhaps a hex editor rather than Notepad. Manual editing would have been helpful, with these files, only if the several containing recognizable text had not turned out to be .doc rather than .pdf files: in that case, editing would at least have allowed me to save a plain-text version of their textual contents.
At my level of knowledge and patience, the PDFtk, Ghostscript, and JHOVE command-line approaches were completely unhelpful. If the corrupted PDFs had been more numerous and/or more important, I would have been tempted to make this a Linux-based recovery operation, as it did appear that there were options in Linux that I was not finding in Windows. Then again, if I’d had time to learn and use Linux for this project, I probably would also have had the time needed to make PDFtk, Ghostscript, and/or JHOVE work.
Canned software was only slightly helpful in this project. Each of the three PDF repair or viewing programs that I used — two in Windows, one in Ubuntu — was able to recover at least a part of at least one of the files. It seemed that such programs could be fairly fast and easy to use. If an initial review of file names did not reveal alleged PDFs that were actually some other type of file, it seemed that it could make sense to start with one or more of these canned programs, just to see whether they might reduce the size of the problem.
In addition to the free approaches tried here, there were also commercial programs designed for PDF repair. No doubt such programs would vary in usefulness and design, just as the several programs reviewed here had done. On all of these points, comments are welcome.