Re. OCR tools

Subject: Re. OCR tools
From: Geoff Hart <geoff-h -at- MTL -dot- FERIC -dot- CA>
Date: Thu, 4 May 1995 13:31:34 LCL

Deborah Wood asked for advice about OCR tools and methods. Here are a
few quick pointers:

- Most major computing magazines cover the software at least annually;
check for recent reviews and read them carefully. The editor's choice
isn't always the best product for your use.

- If you're not having the pages scanned by a service bureau, get a
good recent scanner with 400-600 dpi resolution. Make sure you get one
with a reliable sheet feeder. Hewlett-Packard consistently sells fine
products, which include (if I'm remembering right) something called
"AccuPage" technology, which is supposed to improve scans of line art
(including text). You'll also need a fast computer with lots of
memory, typically a 486/Pentium or 68040/PowerPC with >8 Meg memory.
You can get by with slower hardware, but it'll take significantly
longer.

- If the pages are at all dirty, consider photocopying them onto clean
paper that will yield high contrast in the results. Use liquid paper
or some other workaround to clean up garbage on the page.

- One tip that can work wonders is to enlarge the pages 50-100% before
scanning, perhaps by copying the pages sideways onto larger paper.
This can make letter shapes more distinct and easier to recognize. (It
works particularly well for two-column documents or 7" X 9" manuals
where the blown up text will fit nicely on an 8.5" X 11" page.) It
works less well if you need to capture the identical layout of the old
manuals, since this technique preserves text sequence but not layout.
But don't get hung up on looking for features that preserve the layout,
as you can usually do better yourself. (Also, if you're putting the
documents online, the layout won't likely stay the same anyway.)

- If all the documents use the same typefaces, make sure that your
software is "trainable"; that is, the software should be able to
improve its accuracy with practice. The best OCR software doesn't
reach much more than 99% accuracy yet (a typo every 20 words or so),
so anything that bumps up the accuracy a notch or two is worthwhile.

- Get software that uses a spell checker and that can use a custom
dictionary if you have lots of proprietary words. This also bumps up
your accuracy a few notches.
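
To illustrate the last pointer, here is a minimal Python sketch of the
kind of check a spell checker with a custom dictionary applies to OCR
output. The function name flag_suspects and both word lists are made up
for the example; real OCR packages do this internally and far more
cleverly, so treat it as an illustration rather than how any particular
product works.

```python
import re

def flag_suspects(text, dictionary, custom_words=()):
    """Return the words in text found in neither word list (hypothetical helper)."""
    known = {w.lower() for w in dictionary} | {w.lower() for w in custom_words}
    # Treat runs of letters, digits, and apostrophes as words, so a
    # misread such as "c1eans" stays together as a single token.
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [w for w in words if w.lower() not in known]

# Tiny made-up word lists, for illustration only.
base = ["the", "scanner", "with", "cleans", "line", "art"]
custom = ["Accupage"]  # a proprietary term the general dictionary lacks

ocr_output = "The scanner with Accupage c1eans the 1ine art"

print(flag_suspects(ocr_output, base))          # flags the proprietary term too
print(flag_suspects(ocr_output, base, custom))  # flags only the misread words
```

Without the custom list, every proprietary term is flagged along with
the real misreads, which is why the custom dictionary saves so much
editing time.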

No matter what you do, you'll still need a lot of editing to catch
the 1% or so of the text that comes out wrong. A good question to
ask yourself before starting is whether the information is still
sufficiently up to date to be worth scanning; in some cases, you may
be better off simply retyping it from scratch, using the old manuals
as a guide for your writers. Sometimes, simply storing the scanned
pages as bitmaps will also work, particularly if you can do this in
software with good keyword searching facilities. Good luck!
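
To get a feel for how much editing that accuracy figure implies, the
arithmetic can be sketched in a few lines of Python. This assumes the
99% is measured per character, an average of 5 characters per word, and
about 2500 characters per page; none of those figures comes from a
particular product, so plug in your own.

```python
def words_per_error(char_accuracy, avg_word_len=5):
    """Average number of words between errors, given per-character accuracy."""
    chars_per_error = 1 / (1 - char_accuracy)
    return chars_per_error / avg_word_len

def expected_errors(char_accuracy, chars_on_page):
    """Expected number of wrong characters on a page of the given size."""
    return (1 - char_accuracy) * chars_on_page

# Assumed accuracy levels: 99% versus "a notch" better at 99.5%.
for accuracy in (0.99, 0.995):
    print(f"{accuracy:.1%} accurate: "
          f"one error every {words_per_error(accuracy):.0f} words, "
          f"about {expected_errors(accuracy, 2500):.0f} errors per page")
```

Under these assumptions, going from 99% to 99.5% halves the number of
corrections per page, which is why even a small gain in accuracy is
worth paying for.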

--Geoff Hart #8^{)}
geoff-h -at- mtl -dot- feric -dot- ca

Disclaimer: These comments are my own and don't represent the opinions
of the Forest Engineering Research Institute of Canada.

