PDF/OCR Tools and Techniques

This forum is for volunteers to get assignments, ask questions and discuss issues with scanning, proofing and coding documents.

Postby merlin » Tue Feb 23, 2010 8:05 pm

All volunteers trying to convert docs to web-ready formats encounter problems with pdf and OCR tools and approaches. Some tools stink and some are almost OK once you learn how to use them effectively. I would like this thread to be a place to share observations and experience related to pdf and OCR tools generally and specifically.

To get things started ... I have been using various pdf and OCR tools for a long time. Most of the applications involved original paper documents that I could manipulate to generate better images and, thus, good pdf results and generally pretty good OCR results. I stopped using the Adobe Acrobat suite a few years ago when I realized that the "macros" it added to MS Office programs were corrupt and were changing custom menu items I had created. Presently, I use pdf tools from Foxit (Foxit Editor), Nitro (Nitro Pro 6) and Nuance; 90% of my use is Nitro Pro. Virtually all pdf tools have some OCR capability, and for simpler applications they work fine. However, when documents (either paper originals or pdf files) are dirty, smudged, or have handwritten comments, classification stamps or overprinted headers/footers (sound familiar?), higher quality OCR tools might be needed. Documents that were manually typed also exhibit many problems that result from the particular typewriter, the cleanliness of the keys, the skill of the typist, etc., plus the copy you get is invariably the 3rd or 4th carbon. Unfortunately, most of the documents we are working on within the HyperWar project exhibit at least some of the problems mentioned; some are really bad, and the OCR output looks like a combination of English, Arabic and Klingon.

I have been using ReadIRIS Pro 12 on my stuff. It is considered one of the better programs out there and has some nice features:
1- ability to work in just about any language you can imagine
2- supports most scanners
3- virtually any image file format can be used for input
4- outputs to just about any file type you might want
5- has multiple output modes, so you can convert just text; text and formatting; or text, formatting and images
6- allows you to control speed vs conversion accuracy
7- can handle (so they say) multiple column formats, tables, etc
8- is not very expensive at $60 for a single user license
But it also has some drawbacks (at least I have not been able to find workarounds yet):
1- it does not have a global cropping capability such that you can define the working zone for multiple pages. You can define the working zone one page at a time and that is useful, but tedious
EDIT/UPDATE: the program does have a global crop/window capability (it was just hard to find). You can define a crop window on a single page, save it as a layout template and then load the layout template applying it to a single page or all pages currently loaded. Works quite well and since you can name layout templates, you can develop a library of them
2- its "learning" capability does not seem to work. It does not seem able to apply learnings from one page to the next
EDIT/UPDATE: the learning capability is limited to individual characters and character combinations. You have to set switches to turn on the learning, then work through a page or two and then (key item), explicitly tell the program to start applying what it learned. It has some irritating GUI issues while in learning mode, but it is better than I previously thought.
3- it appears to have no inference capability by which it infers the correct text based on surrounding letters and words. MS Word has a pretty decent capability along these lines and pushing the output of IRIS through Word cleans up some fraction of the mistakes. As I have used IRIS and have gained experience with its "mistakes", I have written a VBA program that runs in Word to correct the consistent gross mistakes (like "ot" instead of "of") that Word's inherent capability might not pick up
4- it can handle only 50 pages at a time. This is not a big problem conceptually, but it renumbers its pages 1 to n regardless of the true page numbers in the target source file.
5- support (telephone or email) has not been very responsive
EDIT/UPDATE: customer support finally responded and they were helpful. They did not answer all my questions, but we are making progress
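
The post-OCR cleanup pass mentioned in drawback 3 (fixing consistent mistakes like "ot" instead of "of") could look something like the sketch below. This is not the VBA program from the post, just a minimal Python illustration of the same idea. The table name `CORRECTIONS` and the function name `fix_consistent_errors` are made up for this example, and only the "ot" entry comes from the post; the other entries are invented placeholders.

```python
import re

# Hypothetical correction table: consistent OCR misreads mapped to the intended word.
# Only "ot" -> "of" comes from the post; the other entries are invented examples.
CORRECTIONS = {
    "ot": "of",
    "tbe": "the",
    "aud": "and",
}

# One pattern that matches any listed misread, but only as a whole word,
# so the "ot" inside "both" or "pilots" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b"
)


def fix_consistent_errors(text):
    """Replace whole-word OCR misreads using the CORRECTIONS table."""
    return _PATTERN.sub(lambda match: CORRECTIONS[match.group(1)], text)
```

The match is case-sensitive and whole-word only, which keeps the pass conservative; anything it misses is still left for the spelling checker and manual proofing.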

That's enough from me for now. What experiences do others have? What pdf and OCR tools have you used?
Last edited by merlin on Wed Feb 24, 2010 11:16 pm, edited 1 time in total.
merlin
Forum Member
 
Posts: 3
Joined: Tue Feb 23, 2010 7:16 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Tue Feb 23, 2010 8:28 pm

I use FineReader for most of my work. It came with my Mustek A3 scanner, and I started using it when I got frustrated with Omnipage. It will save in PDF and most other applicable formats. Its primary weakness (in version 9 anyway) is tables: it has a seriously annoying habit of not reading the text in them. Otherwise it does a good job. Tech support is via a "local rep", not direct with the company, so there is some back-and-forth involved.

FineReader (FR) has a "template" option that allows you to put the same fields on all, or selected, pages. If your pages are all exactly the same size this is useful as it stands. If there is variation in your scanning you'll have to double check the application of the template to each page.

BTW, for scanning, the Mustek is very good; it can do two 8 1/2 x 11 pages at once. This means most books can be scanned two pages per pass, reducing handling time.

FR also accepts images from my camera, so I can duplicate documents that can't be checked out. Very handy for documents when the librarian is hovering around. I use a PVC frame to hold my camera above the text so it's actually kinder than xeroxing for old docs.
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Tue Feb 23, 2010 9:57 pm

One more thought. It is sometimes necessary to use more than one program when doing OCR. For example, knowing that FR is weak with tables, I save the contents as text and then use Word to convert the text to tables. I also find that formatting changes are easier in Word or Open Office than they are in the OCR programs. Sometimes two steps are a shorter path than one.
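
As a rough illustration of the text-to-table step, here is a minimal Python sketch that turns space-aligned OCR text into tab-separated rows, which Word's "Convert Text to Table" can split on tabs. It assumes columns are separated by at least two spaces and that cell text never contains a double space; the function names `text_to_rows` and `rows_to_tsv` are hypothetical.

```python
import re


def text_to_rows(text):
    """Split each non-empty line into cells wherever two or more spaces occur."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_tsv(rows):
    """Join the cells of each row with tabs, one row per line."""
    return "\n".join("\t".join(row) for row in rows)
```

Rows that come out with the wrong number of cells are a quick flag for lines where the OCR collapsed or invented spacing.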
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

Re: PDF/OCR Tools and Techniques

Postby merlin » Wed Feb 24, 2010 11:31 pm

I agree that it makes sense to use the capability of available tools.

My present approach is to position Nitro (pdf tool) and IRIS (OCR tool) side by side (on a wide screen) for the OCR task. That way I have a pdf viewing window open while I am OCRing material in IRIS. That really helps during the Learn mode, as IRIS does not display very much "problem text" at a time and it is often hard to figure out what the right characters actually are. It also helps me remember exactly what the "real" page numbers are.

The output of IRIS is an rtf temp file that I then append to the Word master doc. I again have Nitro on the left side with the pdf doc displayed, and Word on the right side where I am doing cleanup. I first use some custom VBA procedures to clean up consistent problems and then use Word's spelling, grammar and context checker as a 2nd problem identifier/fixer. And then the inevitable manual tweaking.

It is still a very manual process, but both the applications and I are getting better at it.
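
Keeping track of the "real" page numbers across 50-page batches is simple arithmetic, sketched below in Python. It assumes every batch before the current one was a full batch of consecutive source pages; the function name `true_page` and its parameters are made up for this example.

```python
def true_page(batch_index, page_in_batch, first_true_page=1, batch_size=50):
    """Map a batch-local page number (1..n) back to the source page number.

    batch_index is 0 for the first batch; page_in_batch starts at 1.
    Assumes all earlier batches held exactly batch_size consecutive pages.
    """
    if not 1 <= page_in_batch <= batch_size:
        raise ValueError("page_in_batch out of range")
    return first_true_page + batch_index * batch_size + (page_in_batch - 1)
```

For example, page 7 of the third batch (batch_index 2) maps to source page 107 when the document starts at page 1.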
merlin
Forum Member
 
Posts: 3
Joined: Tue Feb 23, 2010 7:16 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Thu Feb 25, 2010 4:04 am

merlin wrote:I agree that it makes sense to use the capability of available tools.

My present approach is to position Nitro (pdf tool) and IRIS (OCR tool) side by side (on a wide screen) for the OCR task. That way I have a pdf viewing window open while I am OCRing material in IRIS. That really helps during the Learn mode, as IRIS does not display very much "problem text" at a time and it is often hard to figure out what the right characters actually are. It also helps me remember exactly what the "real" page numbers are.

The output of IRIS is an rtf temp file that I then append to the Word master doc. I again have Nitro on the left side with the pdf doc displayed, and Word on the right side where I am doing cleanup. I first use some custom VBA procedures to clean up consistent problems and then use Word's spelling, grammar and context checker as a 2nd problem identifier/fixer. And then the inevitable manual tweaking.

It is still a very manual process, but both the applications and I are getting better at it.

You really should give FineReader a try. I have one window with the whole scanned image, one with the text for proofing and one that zooms on the original image for the exact spot I'm proofing. http://finereader.abbyy.com/ You can get a trial version at that URL.
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Thu Feb 25, 2010 6:12 pm

Here's the frame I use to hold my camera when imaging documents. It's just some PVC fittings and the metal screws they use to put facing around doors. By not gluing the fittings I can adjust the frame for "special" tasks. I have two sets of uprights, one that puts the camera six inches higher than this set. You don't need to install screws in the uprights, so it breaks down into a pile that fits into a backpack.
[Attachment: DSCF0009.JPG]
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Fri Feb 26, 2010 6:06 pm

On HTML software:

I use a free WYSIWYG program called PageBreeze and a "conventional" HTML editor, CoffeeCup, for editing HTML packages. Both have advantages and disadvantages, but I'm not a codewizard so I use PageBreeze the most.

If you use Word for generating HTML documents please save it in the "filtered" version to avoid all the "word processor" statements going into the file and adding unneeded freight to the Net.
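
One way to check whether a file still carries the "word processor" markup is to count the Office-specific markers in it. The Python sketch below looks for three patterns that Word's unfiltered HTML export typically contains (Office XML namespaces, `mso-` style properties and `<o:p>` tags). The list is an assumption and not exhaustive, and the name `count_word_markup` is hypothetical.

```python
import re

# Markers that Word's unfiltered HTML export typically leaves behind.
# This list is illustrative, not a complete inventory.
WORD_MARKERS = [
    re.compile(r"urn:schemas-microsoft-com:office", re.IGNORECASE),  # Office XML namespaces
    re.compile(r"mso-[a-z-]+\s*:", re.IGNORECASE),                   # mso-* style properties
    re.compile(r"<\s*/?\s*o:p\b", re.IGNORECASE),                    # <o:p> and </o:p> tags
]


def count_word_markup(html):
    """Return how many Word-specific markers appear in the HTML string."""
    return sum(len(pattern.findall(html)) for pattern in WORD_MARKERS)
```

A count of zero does not prove the file is clean, but a large count is a good hint that it was not saved in the filtered version.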
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Fri Feb 26, 2010 6:30 pm

OpanaPointer wrote:Here's the frame I use to hold my camera when imaging documents. It's just some PVC fittings and the metal screws they use to put facing around doors. By not gluing the fittings I can adjust the frame for "special" tasks. I have two sets of uprights, one that puts the camera six inches higher than this set. You don't need to install screws in the uprights, so it breaks down into a pile that fits into a backpack.
DSCF0009.JPG

BTW, I timed myself yesterday and averaged 10 minutes for 100 pages (or images). So a 700 page book would take just over an hour to copy if you get a rhythm going.
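
The same estimate works for any page count. A tiny Python sketch, assuming the rate stays constant at the 10 minutes per 100 pages quoted above (the function name `estimated_minutes` is made up):

```python
def estimated_minutes(pages, minutes_per_100_pages=10.0):
    """Estimate imaging time in minutes at a constant rate."""
    return pages * minutes_per_100_pages / 100.0
```

At that rate, `estimated_minutes(700)` gives 70 minutes, which matches the "just over an hour" figure.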
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

Re: PDF/OCR Tools and Techniques

Postby merlin » Sun Mar 07, 2010 8:12 pm

OpanaPointer wrote:You really should give FineReader a try. I have one window with the whole scanned image, one with the text for proofing and one that zooms on the original image for the exact spot I'm proofing. http://finereader.abbyy.com/ You can get a trial version at that URL.

I downloaded and installed FineReader a few weeks ago and found it awkward to use. I admit that I only played with it for a few hours. The major issues were that I was not very happy with the intelligence of its output (it did not seem to have a very useful inference engine) and that it also installed a bunch of junk apps I had to clean out of my system. I also noticed some strange system behavior after I installed FineReader. I eventually tracked it down to some corrupted Registry entries. I cannot attribute the malware to FineReader, but the timing is about right. I also recollect that it was fairly expensive. Now that I have explored a few more tools, and have bolstered my malware defenses, I will probably go back and give FineReader another try but will be very careful during the install.

My assessment of IRIS is that it is fairly easy to use, but it has no inference engine and its output takes some time to clean up. I have exchanged half a dozen emails with a support tech at IRIS, and my issues with the program have no current solution.

I am currently working with OmniPage Pro 17.1. I had used Nuance products for PDF stuff a few years ago before I started using Nitro. Nuance continues to send me electronic flyers for their products and recently sent me a "special deal" where I could try OP Pro for 30 days and, if satisfied, pay $99. Since the current selling price on their web site was $400, the offer seemed like something I should explore. OP Pro seems to be more for batch processing of paper docs, where you set up a workflow, than for processing of PDF docs. OP Pro has a strange GUI and menu structure, but ... The one thing I really like so far is its inference engine, i.e., its ability to output 95% correct text. I compared the output of IRIS and OP Pro on a 20 page set from one of the docs I am working on and the difference was amazing. The only (consistent) situation that OP Pro did not handle easily was where a character was way above the normal text line and very faint. It also did not recognize the difference between footnote numbers and normal numbers. My post-processing time in Word has been cut by 80%.
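
A side-by-side comparison like this can be put on a rough numeric footing by scoring each program's output against a hand-corrected copy of the same pages. The Python sketch below uses the standard library's `difflib.SequenceMatcher`; its ratio is a similarity score, not a strict per-character accuracy, and it is not how the 95% figure above was obtained. The function name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Return a 0..1 similarity score between OCR output and a corrected reference."""
    # autojunk=False avoids difflib's heuristic that ignores very frequent
    # characters in long texts, which would skew the score on full pages.
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return matcher.ratio()
```

Running both programs' output for the same 20 pages through `similarity` against one proofed reference gives two numbers that can be compared directly.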
merlin
Forum Member
 
Posts: 3
Joined: Tue Feb 23, 2010 7:16 pm

Re: PDF/OCR Tools and Techniques

Postby OpanaPointer » Sun Mar 07, 2010 10:26 pm

FineReader (FR) compares well with OP 17 from my experience. They both take some getting used to in order to streamline the work. The editor window in OP is my biggest complaint with it. I just don't like the way it handles the text. I do like the "dynamic" bit, where the original image follows the cursor around. But FR's original image window is larger, and it's easier to put things in context when you have to make a call on questionable characters. The spell checker is obviously not English-based, though; some words don't show up in the dictionary, mostly the lesser-used forms of words.

As for installation problems, you may have another program that conflicts with FR.
OpanaPointer
Site Admin
 
Posts: 55
Joined: Mon Jun 29, 2009 8:00 pm

