This post offers a relatively simple and streamlined summary of techniques that I developed through a long, detailed discussion in another post. Portions of that post may be useful for people who want to learn how to use Windows 7 batch commands to automate processes applied to large numbers of files. That post also troubleshoots numerous issues that did not arise in the present effort, but that may arise in some other cases.
I wanted to test a large number of PDFs. I did not want my testing to cause unintended damage to any of those files, nor did I want to cause disorder by moving those files from the places to which I had sorted them. So I decided to do my testing on copies of those files.
Before copying, I used the Everything file finder to search for “~” (with quotes), so as to identify and rename files whose names contained the tilde (~) character. Tildes (and probably other odd characters that I had removed previously) would cause problems with some of the steps described here. They probably wouldn’t derail the entire procedure; they would just prevent operations from running on those particular files.
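For anyone who would rather script that search than run it in Everything, here is a minimal Python sketch. It is not what I used. The function names are hypothetical, and the set of problem characters is an assumption: the tilde is the only one I can name with confidence.

```python
from pathlib import Path

# Assumption: only the tilde is known to cause trouble; add others as needed.
PROBLEM_CHARS = "~"


def names_with_problem_chars(names, chars=PROBLEM_CHARS):
    """Return the names that contain at least one of the given characters."""
    return [name for name in names if any(c in name for c in chars)]


def scan_folder(root, chars=PROBLEM_CHARS):
    """Walk a folder tree and list files whose names contain a problem character."""
    return [
        path
        for path in Path(root).rglob("*")
        if path.is_file() and names_with_problem_chars([path.name], chars)
    ]
```

The sketch only reports the files; renaming them is left as a manual step, as it was in my case.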
I was able to find and copy the desired PDFs (as discussed in that other post, as are most topics in this one) by using DIR and a spreadsheet to generate individualized file-moving commands that would rename duplicates as needed to avoid overwriting. Alternatively, I could have done an Everything search for *.pdf; selected them all with Ctrl-A; unselected exceptions with Ctrl-leftclick; copied the selected files with Ctrl-C; and pasted them into the destination folder with Ctrl-V. This time around, I included *.fdf files, of which I had only a few. It appeared that these processes were not necessarily reliable: I had to make multiple attempts and comparisons before I was sure I had copied them all.
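The copy-with-renaming step could also be sketched in Python, as below. This is an illustration rather than the DIR-and-spreadsheet method I actually used. The function names and the " (2)" suffix style are assumptions. Returning the list of (source, copy) pairs makes it easy to check afterwards that nothing was missed.

```python
import shutil
from pathlib import Path


def unique_target(dest_dir, name):
    """Return a path in dest_dir that is not taken yet, adding ' (2)', ' (3)', ... if needed."""
    dest_dir = Path(dest_dir)
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate = dest_dir / name
    n = 2
    while candidate.exists():
        candidate = dest_dir / f"{stem} ({n}){suffix}"
        n += 1
    return candidate


def collect_copies(source_root, dest_dir, patterns=("*.pdf", "*.fdf")):
    """Copy matching files from a folder tree into one flat folder.

    Originals are never moved or modified. Returns a list of (source, copy) pairs.
    """
    source_root = Path(source_root).resolve()
    dest_dir = Path(dest_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Build the full list first, so files copied during the run are not rescanned.
    sources = []
    for pattern in patterns:
        for path in source_root.rglob(pattern):
            if path.is_file() and dest_dir not in path.parents:
                sources.append(path)

    copied = []
    for src in sorted(sources):
        target = unique_target(dest_dir, src.name)
        shutil.copy2(src, target)  # copy2 also preserves timestamps
        copied.append((src, target))
    return copied
```

Comparing the length of the returned list against the number of files the search found is one quick way to confirm that every file was copied.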
Space permitting, I would have liked to put the PDF copies into a folder on an internal drive, so as to reduce the delays I had experienced with an external drive in the previous effort. I called the source folder (containing these copies of my PDFs to be tested) the “InputPDFs” folder. I would be printing them to the “OutputPDFs” folder.
When I had the source PDFs combined into the InputPDFs folder, I removed the titles that were preserved in the Properties of some individual PDFs. In my version of Adobe Acrobat (9.5), those titles were visible at File > Properties > Description tab > Title. I removed the titles because I had found they would interfere with the process of testing PDFs in the InputPDFs folder by printing them to PDFs in the OutputPDFs folder. For the same reason, I also wanted to remove security from the PDF copies in the InputPDFs folder.
There were various ways to remove those titles and security settings. One practical measure was to run PDFInfoGUI on the InputPDFs folder and export the results to an Excel spreadsheet. The Title and Encrypted columns in the PDFInfoGUI output told me which PDFs had preset titles and security settings.
In PDFInfoGUI, I clicked on the Encrypted column heading to sort by that value, so as to see which PDFs had some form of security enabled. I hoped there were not many, because in my experience Acrobat could remove security from few files without the creator’s password. Fortunately, I had already disabled security in most of them; there were only three left. I unlocked two of those files using FreeMyPDF. I worked around security on the third one, for purposes of testing, by using Foxit to print it as a PDF. A rerun of PDFInfoGUI confirmed that titles and security fields were now clear for these PDFs.
Many of the PDFs had preset titles. I didn’t want to remove them manually. Acrobat and perhaps other PDF editors had the ability to automate their removal. To clear out the titles using Acrobat, I went into Acrobat > Advanced > Document Processing > Batch Processing > New Sequence. There, I selected Document > Description, with blanks in all fields (i.e., Title, Subject, Author, and Keywords). I designated the source and target folder for this operation and clicked Run Sequence.
Due to Acrobat crashes, it took several tries to get through the full list. Fortunately, Acrobat offered a no-overwrite setting, so when I did have to start it over again, it moved very quickly through the files it had already processed. Once Acrobat completed a full run through the entire list of input PDFs without crashing, I ran Beyond Compare, comparing input and output folders, to see which files had not changed or did not exist in the output folder.
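The folder comparison itself is simple enough to sketch in Python for anyone without Beyond Compare. The function names below are hypothetical, and the sketch matches files by exact name only (an assumption that ignores the case-insensitivity of Windows file names).

```python
import filecmp
from pathlib import Path


def missing_from_output(input_dir, output_dir, pattern="*.pdf"):
    """Return sorted names that exist in input_dir but not in output_dir."""
    input_names = {p.name for p in Path(input_dir).glob(pattern)}
    output_names = {p.name for p in Path(output_dir).glob(pattern)}
    return sorted(input_names - output_names)


def unchanged_in_output(input_dir, output_dir, pattern="*.pdf"):
    """Return names present in both folders whose contents are byte-for-byte identical."""
    same = []
    for src in sorted(Path(input_dir).glob(pattern)):
        out = Path(output_dir) / src.name
        # shallow=False forces a comparison of contents, not just size and date.
        if out.is_file() and filecmp.cmp(src, out, shallow=False):
            same.append(src.name)
    return same
```

Files reported by either function are the ones worth opening by hand: the process either skipped them or left them untouched.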
Beyond Compare indicated that Acrobat had not copied twelve files to the output folder. These twelve included the six that I had identified as corrupt in my first attempt to work through this process (see previous post). So it appeared that Acrobat’s attempt to remove Title and related data from the Properties of various PDFs did constitute a way of identifying bad PDFs. When I attempted to open these twelve, Acrobat produced error messages for most. Those error messages were as follows:
- Acrobat could not open [filename] because it is either not a supported file type or because the file has been damaged . . . .
- There was an error opening this document. The file is damaged and could not be repaired.
- [filename] is protected. Please enter a Document Open Password. (This document had been sent to me, but I had lost the password. I could neither open the document without it nor circumvent it for this particular file.)
- An error exists on this page. Acrobat may not display the page correctly.
- There was an error processing a page. There was a problem reading this document (109).
Five of the twelve did not open at all. One opened but displayed corruption in the form of blank pages where text should have been. One (the one with the “error exists on this page” message, above) displayed a map with missing data. Several others seemed OK, and did not display any error messages until I paged forwards or backwards through them, at which point I got the “error exists on this page” message and, in two instances, a blank page. Two of the twelve opened, and I was able to view all of their pages, without error messages. I was not sure why Acrobat had not put Title-free copies of those two in the output (Titles Removed) folder. I was able to remove titles manually from a few.
I would have liked to remove the Titles from the originals of these PDFs, rather than from the copies, but doing so would have required batch files and spreadsheets to identify the PDFs’ original folder locations, bring them together into a Workspace folder, run the Acrobat operation, and then return them to their original folders. That was not a priority.
Quick Tests for Corrupt PDFs
Now that I had set up the files to be tested, and had done some preliminary analysis and adjustment with the aid of Acrobat and PDFInfoGUI, it was time to try some relatively quick ways of testing PDFs.
As just described, I was able to do a form of bulk PDF testing just by running an automated process in Adobe Acrobat. This was not an elegant solution: Acrobat had crashed repeatedly in the process, and when it was done I had to compare the input and output folders, using something like Beyond Compare, to see what Acrobat had failed to process.
Also, as described above, I had used PDFInfoGUI to run a quick scan on the input PDF folder and produce a spreadsheet detailing various characteristics about the PDFs being tested. I wondered, now, whether the PDFInfoGUI output provided any clues that would tend to identify problem PDFs — specifically, the twelve files discussed above.
To explore that, I ran PDFInfoGUI on the PDFs that had now been created in the Titles Removed (output) folder. This confirmed, first, that the Titles and Encryption had been removed from all PDFs. There were three exceptions: the one PDF that I could not decrypt (above), and two others that were corrupted in such a way that they could not be saved after revising their Properties.
In fact — as I could see when I clicked on various column headings to sort by column — the data produced by this run of PDFInfoGUI identified eight files for which little data was available. That is, in the PDFInfoGUI screen showing the results of its scan of the output folder, there were eight files that showed no information for number of pages, encryption status, creation date, and so forth.
Five of those eight files were among the twelve problematic files discussed above. In other words, when PDFInfoGUI failed to display most items of data for a PDF, there was a good chance that the PDF had problems. The other three of those eight were not among the twelve that (as I learned from a Beyond Compare comparison of input and output folders) Acrobat failed to process. Acrobat seemed to have processed these three files successfully (after all, it had created these three output files), and yet manual inspection now demonstrated that I could not open them. That seemed odd. I checked the corresponding input files. They opened OK. It appeared that the Acrobat process itself may have corrupted these three output files.
These results suggested that files with most information missing, in the PDFInfoGUI output, were likely to be corrupted. Evidently the information was missing because PDFInfoGUI, like the user, was unable to open them. But PDFInfoGUI did not detect all corrupted files: somehow, it managed to report data for some input files that I could not open.
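Instead of sorting columns by eye, the "mostly empty row" test could be applied to the exported spreadsheet with a few lines of Python. This is a sketch under several assumptions: that the export has been saved as CSV, that the file name sits in a column called "File", and that the other column names shown in the usage example are merely plausible. None of these is confirmed against PDFInfoGUI's actual export. As noted above, this heuristic catches only some corrupted files.

```python
import csv
import io


def rows_with_missing_data(csv_text, name_column="File", min_filled=2):
    """Return file names whose rows have fewer than min_filled non-empty fields,
    not counting the name column itself."""
    suspects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        filled = sum(
            1
            for key, value in row.items()
            if key != name_column and isinstance(value, str) and value.strip()
        )
        if filled < min_filled:
            suspects.append(row[name_column])
    return suspects
```

For example, with hypothetical column names:

```python
sample = (
    "File,Pages,Encrypted,Created\n"
    "ok.pdf,12,no,2014-01-02\n"
    "bad.pdf,,,\n"
)
print(rows_with_missing_data(sample))  # ['bad.pdf']
```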
The files that PDFInfoGUI did not implicitly identify as problematic (that is, files for which it reported most items of data, as though everything were OK in those files) were of several types. One of those files displayed no error messages when opening or paging forward or backward through it, but there were several instances of blank pages where there should not have been blanks, and one page on which the image was visibly corrupt. Another of those undetected files, similarly, consisted of a single page, containing a visibly corrupted image, but without error messages. Another was badly corrupt: upon opening, it displayed the “There was an error processing a page” message, and about half of its pages were blanks, mostly reduced to tiny icons. Another displayed no error messages until I paged forward and then backward, at which point I got “An error exists on this page,” without visible corruption. Two others gave that error while paging forward through the document: one without visible corruption, one with a blank page at the site of the error. Another had lost three of its pages to tiny blank icons.
It appeared, in short, that PDFInfoGUI would be useful for some purposes; it might detect some corrupted files; but it was not designed for purposes of PDF testing, and it did not do well at it.
In the previous post, I had tried CorruptedPDFinder (a/k/a Recursive Finder of Corrupted PDF Files). I tried it again now. I tried installing a fresh download of the 2015-03-12 version, but the installer did not seem to think anything had changed since the version of December 2014; it offered merely to repair, modify, or uninstall.
When I ran CorruptedPDFinder the first time, my Avast! antivirus immediately popped up with a warning that something had installed, or was attempting to install, an undesirable toolbar. I didn’t write down the name of that toolbar — “Find” something, I think. Findbar? Whatever it was, with my approval, Avast! proceeded to uninstall it. That problem did not recur on subsequent runs.
I decided to run CorruptedPDFinder on the folder containing the output of the Acrobat title-cleaning process (above), since that folder now contained three additional corrupted files. CorruptedPDFinder quickly scanned that folder. It flagged six files as “Corrupt” and one as “May Be Corrupt.” The files that it flagged were exactly the ones for which most data was missing in PDFInfoGUI, except that it failed to identify one file for which PDFInfoGUI displayed no data. The problem with this file was not an error; it was the password-protected file that I could not decrypt (above). I was not sure how CorruptedPDFinder was able to conclude that this file was OK; presumably it, like me, had no idea what its contents were.
Since CorruptedPDFinder produced results nearly identical to (but not quite as good as) those produced by PDFInfoGUI, and since the latter provided other information as well and ran at comparable speed, I concluded that, for the time being, it would be sufficient to use PDFInfoGUI for quick checks that would identify some but not all corrupted PDFs.
Acrobat: Batch Create Multiple Files
At another point in the previous post, I had succeeded in using Acrobat 9 to create PDFs from PDFs and, in the process, to flag those input PDFs that were problematic for that purpose. I had found that this procedure worked fairly well. I decided to try it again now. The procedure involved simply going into Acrobat > File > Create PDF > Batch Create Multiple Files > click on the Combine icon at the top left > Add Folders > designate the source folder. Here, again, I decided to use the folder containing the output of the prior Acrobat process (above), with its three additional known corrupted files.
Once I had designated the source folder, Acrobat chewed on it. It did not give any indication of what it was doing. If I hadn’t been able to see the hard drive churning away, I might have thought that the program had frozen. Eventually it showed me the list of files it proposed to process. It did not provide a statement of how many files were in the list. I clicked OK. It commenced its batch creation process. Again, no indication of anything happening; in fact, the window said, “Not responding.” But eventually Acrobat requested the location of the output folder, and then we were off and running.
This batch file creation process was not as rapid as the file-checking done by PDFInfoGUI and CorruptedPDFinder. Acrobat often lingered for several seconds, or longer, while creating an individual PDF. Hence, it took quite a while to batch-produce the output set.
I knew, from the previous effort, that the output files would not necessarily have the same size or other characteristics as the files from which Acrobat had created them. The main thing, here, was simply whether Acrobat was able to use the input PDFs to create output PDFs. If not, that would be a sign that something might be wrong with the input PDF.
As noted in the previous writeup, Acrobat did not provide an exportable list of problematic files. Once again, fortunately, there were few error messages, so I was able to copy them manually. Those messages, in various forms (e.g., “Cannot open document”; “Object label badly formatted”), flagged all of the PDFs that PDFInfoGUI and CorruptedPDFinder had flagged, plus several of those that had been flagged in the prior Acrobat sequence (above). I was surprised that it did not flag exactly the same files as in that prior sequence — that it missed several. It did not flag any files that had not been flagged in one or more of those previous attempts. A comparison with Beyond Compare confirmed that the PDFs flagged by these error messages were exactly the PDFs that existed in the input folder but not in the output folder.
Testing PDFs by Printing in Adobe Reader
Among the various methods explored in the previous post, the most thorough was that of testing PDFs by printing them to other PDF files. This differed from the Acrobat process discussed immediately above. We were not merely going to create an output file from an input file, in a copying process that might test for some kinds of problems while neglecting others. Rather, we were going to run an input file through a PDF printer to produce an output file. It seemed that this slower, page-by-page approach might uncover PDF problems not detected in the quicker methods discussed above.
After trial and error with a number of different batch file and command line possibilities, the relevant section of the previous post came around to a solution in which a couple of different things would be going on at the same time. This solution would require a computer that could be dedicated to just this process for the hours or days necessary to PDF-print a potentially large number of PDFs.
First, I needed an input folder. I put this folder on the R: drive and called it simply R:\Input. It would hold the PDFs to be tested. I copied those over from the source folder that I had been using for the other testing approaches described above. I also needed an output folder, R:\Output.
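Next, I needed a one-line command that would do the PDF printing. I typed it at the command prompt; inside a batch file, each single percent sign in %i and %~nxi would have to be doubled (%%i, %%~nxi). That command was as follows: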
FOR %i IN (*.pdf) DO (IF EXIST "R:\Output\%~nxi" (IF EXIST "R:\Input\%~nxi" DEL "R:\Input\%~nxi") ELSE (SET printing="R:\Input\%~nxi" && SET printed="R:\Output\%~nxi" && START _Killer.bat && START Everything.exe -s "%~nxi" && START /wait AcroRd32.exe /t "R:\Input\%~nxi"))
In English, that command basically said, Do the following for each PDF in the current folder (which therefore had to be R:\Input when I ran the command). First, if a file of the same name already exists in the Output folder, then delete the file from Input, so it’s not in our way anymore. This feature meant that the process could be interrupted and restarted, and would not have to go back through a bunch of already-printed PDFs to find where it should resume. Views of file lists in Windows Explorer would incidentally be faster as the number of files remaining decreased.
Next, the command said, if there is not yet a file of the same name in Output, give the input file’s name, including its path, to the “printing” variable (e.g., R:\Input\File to be printed.pdf). Also, give a similar value to the “printed” variable (e.g., R:\Output\File to be printed.pdf). Now start the _KILLER.BAT batch file. Also, start the Everything file finding program with a search for that file name. Finally, open the input PDF in Adobe Reader, use Reader’s /t switch to print that PDF to the default printer, and wait for Reader to close before going on to the next PDF.
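The branching in that one-liner is easier to see when written out. The Python sketch below mirrors only the decision logic (which files would be deleted from Input, skipped, or printed); it does not launch Reader or print anything. The function names and the action labels are hypothetical.

```python
from pathlib import Path


def plan_action(name, input_dir, output_dir):
    """Mirror the IF/ELSE logic of the one-line command for a single file name.

    "delete-input": output already exists and the input copy is still there.
    "skip":         output already exists and the input copy is already gone.
    "print":        no output yet, so the file still needs to be printed.
    """
    in_path = Path(input_dir) / name
    out_path = Path(output_dir) / name
    if out_path.exists():
        return "delete-input" if in_path.exists() else "skip"
    return "print"


def plan_folder(input_dir, output_dir):
    """Return (name, action) pairs for every PDF currently in the input folder."""
    return [
        (p.name, plan_action(p.name, input_dir, output_dir))
        for p in sorted(Path(input_dir).glob("*.pdf"))
    ]
```

Running plan_folder before a restart gives a preview of how much work remains, since only the files marked "print" still need to go through the printer.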
Bullzip PDF Printer was my default PDF printer; it had certain features that were useful for this purpose. I had to be sure to set its default output folder to R:\Output. The other post contains further details on desired Bullzip settings.
So then _KILLER.BAT would go to work. (I put an underscore in front of its name so as to make it appear among the first files in the folder. For simplicity, it would be in the Input folder, and without the underscore would appear way down in the list, among the other files whose names began with K, where I could not readily find and open it for tweaking as needed.) _KILLER.BAT read as follows:
:: _KILLER.BAT
ECHO OFF
SET /a counter=0
:waiter
CLS
ECHO.
ECHO.
ECHO.
ECHO Printing %printing%
ECHO.
ECHO Waited about %counter% seconds so far ...
ECHO.
ECHO.
TIMEOUT 4 /nobreak && SET /a counter+=4
IF NOT EXIST %printing% GOTO resumer
:: Because by now I might have moved it manually
IF EXIST %printed% (DEL %printing% & GOTO resumer)
IF %counter% EQU 60 BEEP 250 500
IF %counter% GEQ 3600 (MOVE /y %printing% R:\Input\TestManually\ && GOTO resumer)
GOTO waiter
:resumer
TASKKILL /f /t /im AcroRd32.exe
TASKKILL /f /t /im gui.exe
RD /s /q "C:\Users\Ray\AppData\Local\Temp\BullZip\PDF Printer"
EXIT
The _KILLER.BAT batch file would wait, check on the situation every four seconds, and decide whether to wait another four seconds for results. It was looking, specifically, to see if a file with the name of the PDF being printed had now materialized in the Output folder. If that still hadn’t happened after an hour, it would give up and move the file being printed to the TestManually folder. It would also beep after 60 seconds, so that I could look up and see if progress had stalled due to some error message requiring manual intervention. When the input PDF was gone (because either I had moved it, or it had been moved to the TestManually folder, or _KILLER.BAT had deleted it after a successful print), _KILLER.BAT would proceed to terminate Adobe Reader and the Bullzip PDF Printer interface (i.e., gui.exe), and would also clean out the temporary folder where Bullzip PDF printing materials tended to pile up until they choked the program drive.
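The polling loop at the heart of _KILLER.BAT can be sketched in Python as follows. This is an illustration of the wait-check-give-up pattern, not a replacement for the batch file: it only reports the outcome and leaves the deleting, moving, and process-killing to the caller. The function name, its parameters, and the outcome labels are hypothetical. The sleep and alert functions are passed in so that the loop can be tried out without actually waiting an hour.

```python
import time
from pathlib import Path


def wait_for_print(printing, printed, poll_seconds=4, alert_after=60,
                   give_up_after=3600, sleep=time.sleep, alert=lambda: None):
    """Poll until the print job resolves.

    Returns "input-gone" if the input file has disappeared (e.g., moved by hand),
    "printed" if the output file has appeared, or "timed-out" if neither
    happened within give_up_after seconds.
    """
    printing, printed = Path(printing), Path(printed)
    waited = 0
    while True:
        sleep(poll_seconds)
        waited += poll_seconds
        if not printing.exists():
            return "input-gone"
        if printed.exists():
            return "printed"
        if waited == alert_after:
            alert()  # stands in for the beep after 60 seconds
        if waited >= give_up_after:
            return "timed-out"
```

A caller would then delete the input file on "printed", or move it to a manual-testing folder on "timed-out", before cleaning up the Reader and printer processes.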
For the long command (above) to work, the Input folder had to contain not only _KILLER.BAT, but also something like BEEP.EXE and the several program files needed for the portable Everything file finder (i.e., .exe, .ini, and .db). In addition, Adobe Reader (AcroRd32.exe) had to be installed on the system and included in the system’s PATH.
At this point, I ran into problems with Bullzip that I had not encountered the previous time around. It tentatively appeared that I might have to uninstall and reinstall Bullzip. I did not have time to troubleshoot or explore this, and therefore deferred further development of this restatement of the previous effort. The prior post does provide further detailed discussion.