TechWhirl (TECHWR-L) is a resource for technical writing and technical communications professionals of all experience levels and in all industries to share their experiences and acquire information.
Subject: Re. OCR for large, old manuals
From: Geoff Hart <geoff-h -at- MTL -dot- FERIC -dot- CA>
Date: Thu, 10 Aug 1995 10:58:48 LCL
James Swinburn asked about how to scan and OCR old technical manuals
for online use. The first question to ask isn't how, but rather
"whether" you should bother. A few leading questions:
1. If, as indicated, the material is 20 years old, don't you think
it's time for an update? At least consider doing an update to correct
the gratuitous errors that slipped through the original edits
(undoubtedly carried out under tight time pressure).
2. If the material was so poorly organized that it was hard to
navigate in the first place, putting it online won't help. Sure, you
get to do full-text searches, but with 20,000 pages of text, how many
"hits" will searchers get for a given topic? Will this be any more
useful than simply creating a usable index for the printed docs? Will
the A-size paper-based page layout and illustrations adapt well to the
typical computer display (about half that size)? (Generally, no, it
won't.)
3. Combining 1 and 2, could it perhaps be better just to hire several
good typists and an information designer to tell them what to type,
and in what order and format? A good typist will run about $4 per
page, and this may well work out to be cheaper than a dedicated OCR
operation.
Moreover, the best OCR typically maxes out at 99% accuracy, or one
typo per 100 characters. Let's assume an average of five letters per
word (for simplicity of calculation): this means one error per 20
words. Now let's assume 500 words per page: that's 25 errors per page.
Multiply this by 20,000 pages and you'll start to see the magnitude of
the problem. You'll probably have less trouble with the typists, and
you'll be giving useful work to someone who needs it instead of
promoting job loss through technology.
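
The arithmetic in point 3 can be checked with a short Python sketch. It
uses only the figures quoted above (one error per 100 characters, five
letters per word, 500 words per page, 20,000 pages, about $4 per typed
page). The function and parameter names are made up for illustration,
and the results are rough estimates, not measurements.

```python
# Back-of-envelope estimate using the figures quoted in the post.
# All names here are illustrative, not part of any OCR product.

def ocr_errors(pages, words_per_page=500, chars_per_word=5,
               chars_per_error=100):
    """Return (errors per page, total errors) for a given OCR error rate."""
    chars_per_page = words_per_page * chars_per_word
    errors_per_page = chars_per_page / chars_per_error
    return errors_per_page, errors_per_page * pages


def typing_cost(pages, dollars_per_page=4):
    """Return the rough cost of retyping the manuals by hand."""
    return pages * dollars_per_page


per_page, total = ocr_errors(20_000)
print(f"{per_page:.0f} errors per page, {total:,.0f} errors in total")
print(f"Retyping cost: ${typing_cost(20_000):,}")
```

Changing chars_per_error shows how sensitive the total is to the
accuracy you actually get from faded or dot-matrix originals.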
If you do choose the OCR route, plan carefully to overcome the
obstacles above. A few quick tips related to the previous points:
1. Plan to incorporate a technical review as part of your process.
You'll need this to catch errors anyway.
2. Determine how your audience intends to use the material, and
structure your scanning and organizing of information to meet these
needs.
3. Make sure the original pages are clear. For example, a good crisp
photocopy of faded pages (particularly dot matrix printouts) often
improves the text quality greatly, making OCR more successful; it also
makes it more likely that you don't accidentally omit the backs of
pages, since few OCR systems work on double-sided sheets. If the
manuals use the same fonts throughout, look for OCR software that
"learns" the fonts; its accuracy will gradually improve as you train
it (e.g., to distinguish the numeral 1 from a lower-case L in some
fonts). Don't even consider OCR that doesn't use a spell checker to
verify suspect words. Most do, but check whether you can add
subject-specific terms that aren't in the dictionary that ships
with the product.
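
To show why a user-extensible dictionary matters, here is a minimal
Python sketch of the "verify suspect words" step. It is not how any
particular OCR package works: the function name, the tiny word list,
and the sample sentence are all invented for illustration.

```python
import re

# Hypothetical helper: flag words that appear in neither the general
# dictionary nor the user-supplied list of subject-specific terms.
def suspect_words(text, dictionary, domain_terms=()):
    known = {w.lower() for w in dictionary}
    known |= {w.lower() for w in domain_terms}
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w for w in words if w.lower() not in known]


# Invented sample data: "va1ve" has a numeral 1 where an l belongs.
general = {"check", "the", "valve", "on"}
scanned = "Check the va1ve on the skidder"

print(suspect_words(scanned, general))
print(suspect_words(scanned, general, domain_terms=["skidder"]))
```

Without the domain term, the legitimate word "skidder" is flagged
along with the real error; with it, only "va1ve" is left for a
reviewer to check.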
A simpler solution might be to avoid the OCR step entirely, and just
archive the scanned bitmaps of the pages on a CD-ROM. The advantage of
this is that it's faster and more accurate than OCR; you can also
apply your own keywords (i.e., build an online index) to each page,
making the search function far more useful than an unconstrained
full-text search.
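
As a rough illustration of the keyword approach, the Python sketch
below builds a small index that maps each keyword to the scanned page
images tagged with it. The file names and keywords are invented, and a
real system would need to store the index alongside the bitmaps; this
only shows the lookup idea.

```python
from collections import defaultdict

# Hypothetical keyword index over scanned page images.
def build_index(page_keywords):
    """Map each keyword (case-insensitive) to the pages tagged with it."""
    index = defaultdict(set)
    for page, keywords in page_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(page)
    return index


def lookup(index, keyword):
    """Return the sorted list of pages tagged with the keyword."""
    return sorted(index.get(keyword.lower(), ()))


# Invented sample: page image file names and hand-assigned keywords.
pages = {
    "vol1_p001.tif": ["Installation", "Safety"],
    "vol1_p002.tif": ["Installation"],
    "vol2_p117.tif": ["Maintenance", "Safety"],
}
index = build_index(pages)
print(lookup(index, "safety"))
```

Because a person chooses the keywords, a search returns only the pages
judged relevant, rather than every page where a word happens to occur.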
However you look at it, this is a monumental task. Good luck...
you'll need it!
--Geoff Hart @8^{)}
geoff-h -at- mtl -dot- feric -dot- ca
Disclaimer: If I didn't commit it in print in one of
our reports, it don't represent FERIC's opinion.