All volunteers trying to convert docs to web-ready formats encounter problems with pdf and OCR tools and approaches. Some tools stink and some are almost OK once you learn how to use them effectively. I would like this thread to be a place to share observations and experience related to pdf and OCR tools generally and specifically.
To get things started ... I have been using various pdf and OCR tools for a long time. Most of the applications involved original paper documents that I could manipulate to generate better images and, thus, good pdf results and generally pretty good OCR results. I stopped using the Adobe Acrobat suite a few years ago when I realized that the "macros" it added to MS Office programs were corrupt and were changing custom menu items I had created. Presently, I use pdf tools from Foxit (Foxit Editor), Nitro (Nitro Pro 6) and Nuance; 90% of my use is Nitro Pro.
Virtually all pdf tools have some OCR capability, and for simpler applications they work fine. However, when documents (either paper originals or pdf files) are dirty or smudged, or have handwritten comments, classification stamps or overprinted headers/footers (sound familiar?), higher quality OCR tools might be needed. Documents that were manually typed also exhibit many problems that result from the particular typewriter, the cleanliness of the keys, the skill of the typist, etc., plus the copy you get is invariably the 3rd or 4th carbon. Unfortunately, most of the documents we are working on within the HyperWar project exhibit at least some of the problems mentioned; some are really bad and the OCR output looks like a combination of English, Arabic and Klingon.
I have been using ReadIRIS Pro 12 on my stuff. It is considered one of the better programs out there and has some nice features:
1- ability to work in just about any language you can imagine
2- supports most scanners
3- virtually any image file format can be used for input
4- outputs to just about any file type you might want
5- has multiple output modes, so you can convert just text; text and formatting; or text, formatting and images
6- allows you to control speed vs conversion accuracy
7- can handle (so they say) multiple column formats, tables, etc
8- is not very expensive at $60 for a single user license
But it also has some drawbacks (at least I have not been able to find workarounds yet):
1- it does not have a global cropping capability such that you can define the working zone for multiple pages. You can define the working zone one page at a time and that is useful, but tedious
EDIT/UPDATE: the program does have a global crop/window capability (it was just hard to find). You can define a crop window on a single page, save it as a layout template and then load the layout template applying it to a single page or all pages currently loaded. Works quite well and since you can name layout templates, you can develop a library of them
2- its "learning" capability does not seem to work. It does not seem able to apply what it learned on one page to the next
EDIT/UPDATE: the learning capability is limited to individual characters and character combinations. You have to set switches to turn on the learning, work through a page or two and then (key item) explicitly tell the program to start applying what it learned. It has some irritating GUI issues while in learning mode, but it is better than I previously thought.
3- it appears to have no inference capability by which it infers the correct text based on surrounding letters and words. MS Word has a pretty decent capability along these lines and pushing the output of IRIS through Word cleans up some fraction of the mistakes. As I have used IRIS and have gained experience with its "mistakes", I have written a VBA program that runs in Word to correct the consistent gross mistakes (like "ot" instead of "of") that Word's inherent capability might not pick up
4- it can handle only 50 pages at a time. This is not a big problem conceptually, but it renumbers its pages 1 to n regardless of the true page numbers in the target source file.
5- support (telephone or email) has not been very responsive
EDIT/UPDATE: customer support finally responded and they were helpful. They did not answer all my questions, but we are making progress
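For anyone who would rather not work inside Word, the idea behind the correction pass mentioned under drawback 3 can be sketched in a few lines of Python. This is only a rough illustration, not the VBA program itself: the only pair taken from the post is "ot" -> "of"; the other table entries and the names CORRECTIONS and fix_ocr_text are made up for the example. You would build your own table from the mistakes your OCR tool makes consistently.

```python
import re

# Hypothetical correction table: each key is a consistent OCR misreading,
# each value is the intended word. Only "ot" -> "of" comes from the post;
# the other entries are invented examples.
CORRECTIONS = {
    "ot": "of",
    "tbe": "the",
    "aud": "and",
}

# One pattern that matches any listed misreading, but only as a whole word,
# so "ot" is fixed while "not" and "otter" are left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CORRECTIONS) + r")\b",
    re.IGNORECASE,
)


def _replace(match):
    wrong = match.group(0)
    right = CORRECTIONS[wrong.lower()]
    # Keep a leading capital, e.g. "Tbe" becomes "The".
    return right.capitalize() if wrong[0].isupper() else right


def fix_ocr_text(text):
    """Return text with every listed whole-word misreading corrected."""
    return _PATTERN.sub(_replace, text)


if __name__ == "__main__":
    print(fix_ocr_text("Tbe fleet ot destroyers aud cruisers"))
```

Matching on whole words is the important design choice here: a blind search-and-replace of "ot" would wreck every "not" and "both" in the document. A table like this only catches mistakes that are always wrong; anything that could be a legitimate word in context still needs a human look.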
That's enough from me for now. What experiences do others have? What pdf and OCR tools have you used?

