Subject: How good is OCR?
From: "Hart, Geoff" <Geoff-H -at- MTL -dot- FERIC -dot- CA>
To: "TECHWR-L" <techwr-l -at- lists -dot- raycomm -dot- com>
Date: Thu, 13 Nov 2003 09:06:21 -0500
Marc Santacroce wonders: <<A competitor is offering another product that
consists of an MS Word outline into which customers can cut-and-paste
portions of their existing manuals. I can see this working for those who
have an MS Word version of their manuals, but many of the customer base just
have a hardcopy version. Has OCR software improved such that this is a
viable option?>>
OCR has gotten pretty good, but it's still topping out at an accuracy of
99.9% or thereabouts, with much lower rates if you don't know how to work
your scanner or the software properly. That still means an error rate of 1
in 1000 characters--or a typo every 200 words or so. That's probably
acceptable for quick and dirty work, but less so if you're trying to produce
a really professional-looking product.
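To make the arithmetic above concrete, here is a rough Python sketch. It assumes an average word length of 5 characters (my assumption, chosen because it reproduces the 200-word figure; counting the space after each word as a sixth character gives closer to 167 words). The function names are illustrative only, not part of any OCR package.

```python
def words_between_errors(accuracy: float, chars_per_word: float = 5.0) -> float:
    """Expected number of words between OCR errors.

    accuracy is the per-character accuracy (0.999 means 99.9%).
    chars_per_word is an assumed average word length.
    """
    error_rate = 1.0 - accuracy           # e.g. 0.001 for 99.9% accuracy
    chars_per_error = 1.0 / error_rate    # about 1000 characters per error
    return chars_per_error / chars_per_word


def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of wrong characters in a scan of char_count characters."""
    return char_count * (1.0 - accuracy)


if __name__ == "__main__":
    print(round(words_between_errors(0.999)))        # 200
    print(round(words_between_errors(0.999, 6.0)))   # 167
    # A hypothetical 300,000-character manual:
    print(round(expected_errors(300_000, 0.999)))    # 300
```

Even at the top-end accuracy, a manual of that hypothetical size would carry roughly 300 errors to find and fix, which is why proofreading time matters more than the headline accuracy figure.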
It can also be difficult to work with threaded multi-column layouts because
you have to manually define the text flow--and in poorly designed layouts,
that flow isn't always obvious even to the reader. Shouldn't be a problem
with manuals, but might be for fancy white papers and "tool tips"
newsletters, for instance.
One thing I can't say is how well current OCR handles mixed text and
numbers. In many fonts, the number 1 and the lower-case L are pretty much
indistinguishable to the human eye, so I can't imagine the software doing a
much better job. Shouldn't generally be a problem when the text and numbers
are separate (e.g., tables vs. body text), but might pose the occasional
problem elsewhere, such as in scientific or engineering manuals.
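One way to catch such mix-ups after scanning is to flag any token that combines digits with letters that resemble digits, then have a person review the flagged items. The Python sketch below is only a heuristic under my own assumptions: the set of confusable letters is illustrative rather than exhaustive, and the function name is hypothetical.

```python
import re
from typing import List

# Letters that are easily mistaken for the digits 1 and 0.
# An illustrative set, not an exhaustive one.
CONFUSABLE_LETTERS = set("lIOo")

# A token is any run of ASCII letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def suspicious_tokens(text: str) -> List[str]:
    """Return tokens that mix digits with digit look-alike letters."""
    flagged = []
    for token in TOKEN.findall(text):
        has_digit = any(ch.isdigit() for ch in token)
        has_confusable = any(ch in CONFUSABLE_LETTERS for ch in token)
        if has_digit and has_confusable:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    sample = "Torque to l5 Nm at 2O0 rpm, see table 12"
    print(suspicious_tokens(sample))  # ['l5', '2O0']
```

The check produces false positives on legitimate strings such as "H2O", so it is best treated as a list of places to look rather than a list of errors.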
<<I can't see this working for a pdf import.>>
Actually, it should work nearly flawlessly for PDF because the characters
are all clearly defined--there's no guesswork deciding what characters were
actually entered. Then again, I've edited manuscripts in which the author
typed capital-O instead of zero, lower-case L instead of 1, and ` (the grave
accent) instead of an apostrophe, so let's change the "flawlessly" to "quite
well given the limits of human technology". <g>
--Geoff Hart, ghart -at- [delete]videotron -dot- ca
Forest Engineering Research Institute of Canada
580 boul. St-Jean
Pointe-Claire, Que., H9R 3J9 Canada
"I have always wished that my computer would be as easy to use as my
telephone. My wish has come true. I no longer know how to use my
telephone."--Bjarne Stroustrup (originator of C++ programming language)