Subject: RE: Help with Docbook
From: "Mark Baker" <mbaker -at- ca -dot- stilo -dot- com>
To: "TECHWR-L" <techwr-l -at- lists -dot- raycomm -dot- com>
Date: Wed, 23 Jul 2003 16:11:32 -0400
Mike O. wrote
> Your point is well taken about DocBook being a general
> tag set, but it's a darn good general tag set. In
> DocBook's defense:
Well, I don't think it's a darn good general tag set, and you do not mount
any defense of it as such. What do you see as the specific virtues of
DOCBOOK's tag set?
> DocBook is valid XML, and is somewhat future-proof in
> the sense that unlike Word documents, you can script
> DocBook into any other format relatively easily
> (depending on the skill of the tagger/author, as you
> pointed out).
There are two problems in scripting one format into another. One is the
syntax problem, the other is the semantics problem. Each format has a set of
semantics (what it means) and a syntax (how it is expressed).
Syntax is really not a problem. You can get documents from Word syntax into
XML syntax using any number of different utilities, including Word itself in
the latest version.
As for the semantics problem, semantics range from smart to dumb. You can
easily transform a smarter format into a dumber format, but it is very
difficult, if not impossible, to mechanically transform a dumber format into
a smarter format, and the greater the distance the harder the
transformation.
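The asymmetry can be sketched in a few lines of Python. This is only an
illustration: the element names and the DOWNCONVERT table are hypothetical
and do not come from DocBook or Word. Going from the smarter vocabulary to
the dumber one is a simple lookup, while going back gives several equally
plausible answers.

```python
# Hypothetical mapping from structural/semantic elements (smarter)
# to formatting properties (dumber). Not taken from any real DTD.
DOWNCONVERT = {
    "command": "bold",
    "warning": "bold",
    "filename": "italic",
    "emphasis": "italic",
}


def to_formatting(element):
    """Smart -> dumb: every element has exactly one rendering."""
    return DOWNCONVERT[element]


def candidates_for(style):
    """Dumb -> smart: list every element that could have produced the style."""
    return sorted(name for name, fmt in DOWNCONVERT.items() if fmt == style)


print(to_formatting("command"))   # one unambiguous answer
print(candidates_for("bold"))     # more than one candidate; a script cannot choose
```

A script that sees only "bold" has no mechanical way to decide between the
candidates, and that is where the information was lost.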
Word is a formatting-based format. DocBook is a document-structure-based
format. Formatting formats are slightly dumber than document structure
formats, but not much. In fact, the only thing dumber than a document
structure format is a formatting format.
If you have a need to manipulate document structures without regard to the
semantics of the content, then DOCBOOK gives you something that Word does
not, at least in theory. But in practice you have to deal with authors. You
can get Word to simulate document structure markup using styles. The problem
is to get authors to apply the styles correctly, and you have exactly the
same problem with DOCBOOK: getting authors to use the structural elements
correctly.
Inventing your own smaller, stricter, more specific language for this
purpose will make life much easier for both authors and those who do the
processing.
> DocBook XML is usually processed with standard tools -
> xalan, saxon, xsltproc, jade, etc.
No, DOCBOOK XML is usually processed using specific scripts written in XSLT
and interpreted by these tools. As a user of these scripts it does not
matter to you if they are written in XSLT, C++, or COBOL. You only care that
they are capable of giving you specific results.
If you want to write your own processing applications for DOCBOOK you are
back to dealing with the problem that DOCBOOK is huge and that writing
applications to process it is therefore difficult.
> > The data format itself has its specific
> > strengths and weaknesses.
>
> The data format is standard XML, so DocBook has the
> same strengths and weaknesses of XML.
No. It is absolutely vital to understand this difference. DOCBOOK is a set
of data semantics. Those semantics can be expressed using either XML syntax,
or SGML syntax. You could also create a binary syntax for DOCBOOK and it
would have all the same semantics it does now. Microsoft has long had two
different syntaxes with which to represent the semantics of a Word document:
Word's binary format and RTF. It has now added a third: XML.
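As a rough sketch of the distinction, the Python below writes one small
document model out in two different syntaxes and reads it back. The model
(a title plus a list of paragraphs) and the function names are hypothetical,
and JSON stands in for "some other syntax"; it is not something DocBook or
Word defines.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical toy semantics: a document is a title and a list of paragraphs.
doc = {"title": "Install", "paras": ["Unpack the archive.", "Run setup."]}


def to_xml(d):
    """Express the model in XML syntax."""
    root = ET.Element("doc")
    ET.SubElement(root, "title").text = d["title"]
    for text in d["paras"]:
        ET.SubElement(root, "para").text = text
    return ET.tostring(root, encoding="unicode")


def from_xml(s):
    """Recover the model from its XML form."""
    root = ET.fromstring(s)
    return {
        "title": root.findtext("title"),
        "paras": [p.text for p in root.findall("para")],
    }


def to_json(d):
    """Express the same model in JSON syntax."""
    return json.dumps(d)


def from_json(s):
    """Recover the model from its JSON form."""
    return json.loads(s)


# Both syntaxes carry exactly the same semantics.
assert from_xml(to_xml(doc)) == from_json(to_json(doc)) == doc
```

Whatever the model can or cannot say about a document is the same on both
paths; only the serialization differs.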
The strength of XML is that it can be used to express the syntax of many
different data semantics.
DOCBOOK's strengths and weaknesses are the strengths and weaknesses of its
particular semantics. The same is true of Word, and of HTML. All three can
be expressed in XML syntax. You choose one or another based on the power of
their semantics, not on their common use of XML syntax.
> Most common technical documents can be
> tagged with a dozen or so DocBook tags; that's why the
> Simplified DocBook DTD was invented.
But when you have tagged your document with those dozen or so tags, how,
precisely, are you better off than you were before? What can you do that you
could not do before? How do you save time and money or improve the quality
of the documentation you present to your users?
> > Getting files in Frame format or Word format
> > into XML, so as to process them with XML
> > processing tools, is easy.
>
> If it were easy, then OpenOffice would be doing it
> today.
Getting files from Word format or Frame format into XML is easy. Getting
files from Word's semantics or Frame's semantics into OpenOffice's semantics
is an entirely different story. Note that we have had the problem of
converting between Word and WordPerfect for a very long time. The difficulty
with it is not that the syntaxes are difficult to process, but that the
semantics don't match.
> But try converting a large
> number of typical corporate Word docs; the lack of
> coherent structure will munge the results.
You can transform any Word document into an XML representation of Word's
semantics with complete reliability. The problem is moving from one
semantics to another. People often use Word's styles to represent another
layer of semantics not inherent in Word itself, and this can work quite
nicely. However, whatever format you use to capture the semantics you are
interested in, your results are only as good as the richness of the
semantics you define and the consistency with which they are applied by
authors.
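A minimal sketch of checking that consistency, assuming a simplified and
entirely hypothetical XML export in which each paragraph carries a "style"
attribute (this is not Word's real XML schema), and a made-up list of
allowed house styles:

```python
import xml.etree.ElementTree as ET

# Hypothetical set of styles that carry meaning in this house standard.
ALLOWED_STYLES = {"Title", "Step", "Note"}

# Hypothetical simplified export; real Word XML is structured differently.
SAMPLE = """<doc>
  <p style="Title">Installing</p>
  <p style="Step">Unpack the archive.</p>
  <p style="Normal">Run setup.</p>
</doc>"""


def unknown_styles(xml_text):
    """Return (style, text) for each paragraph whose style is not allowed."""
    root = ET.fromstring(xml_text)
    return [
        (p.get("style"), p.text)
        for p in root.iter("p")
        if p.get("style") not in ALLOWED_STYLES
    ]


print(unknown_styles(SAMPLE))
```

A check like this only catches styles outside the agreed set. It cannot
tell whether an author applied an allowed style to the wrong content.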
---
Mark Baker
Stilo Corporation
1900 City Park Drive, Suite 504 , Ottawa, Ontario, Canada, K1J 1A3
Phone: 613-745-4242, Fax: 613-745-5560
Email mbaker -at- ca -dot- stilo -dot- com
Web: http://www.stilo.com