TechWhirl (TECHWR-L) is a resource for technical writing and technical communications professionals of all experience levels and in all industries to share their experiences and acquire information.
Subject: Clean HTML from Word?
From: "Sandy Harris" <sharris -at- dkl -dot- com>
To: techwr-l -at- lists -dot- raycomm -dot- com
Date: Thu, 11 May 2000 12:06:37 -0400
I've got some HTML files that work fine as HTML, but I've been told to put them
into Word for compatibility with other documentation the client has and so that
they'll print out more prettily. Fine so far.
The docs will ship in two formats: PDF for printing and HTML for online use.
The company wants them maintained in Word. I can live with that, and Word-to-PDF
is not a problem.
My problem is that Word 2000's "save as HTML" gives me things like:
I can tolerate Word inserting an extra <a name="_Toc..."> since it seems not
to be smart enough to use my "intro" label, but I desperately want to get rid
of the rest of this crud so HTML style sheets can control how <h1> displays.
Word gives:
<p class="MsoNormal" style="margin-left:.5in;text-indent:-.25in;mso-list:
l0 level1 lfo7">
<strong><span style="font-family:Symbol;font-weight:normal">·
<span style='font:7.0pt "Times New Roman"'> </span></span>
</strong><strong>empowering users</strong>
to better control their own applications.
where the original was:
<li>
<strong>empowering users</strong> to better control their own applications.
</li>
Here it not only inserts extraneous crud, but also loses the list structure
information in the original tagging.
Is there some piece of Word magic that will make it export clean HTML? I can
fix most of the mess with other tools, like HTML Tidy (w3c.org), but I'd much
rather not have to.
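
To make the clean-up concrete, here is a minimal sketch in Python of the kind of post-processing I mean. It is not a general Word-HTML cleaner. It assumes the fragment looks like the list-item example above: a <p> whose style contains "mso-list", followed by a <strong> wrapping a Symbol-font <span> that fakes the bullet. The names clean_word_list_item, LIST_P_OPEN, FAKE_BULLET and SAMPLE are made up for illustration.

```python
import re

# Opening <p> tag that Word marks as a list item via "mso-list" in its style.
LIST_P_OPEN = re.compile(r'<p\b[^>]*\bmso-list\b[^>]*>', re.IGNORECASE)

# The fake bullet: <strong> around a Symbol-font <span> and a nested spacer <span>.
FAKE_BULLET = re.compile(
    r'<strong>\s*<span\b[^>]*font-family:\s*Symbol[^>]*>.*?</span>\s*</span>\s*</strong>',
    re.IGNORECASE | re.DOTALL,
)

# Optional closing </p>, in case the fragment includes one.
P_CLOSE = re.compile(r'</p\s*>', re.IGNORECASE)


def clean_word_list_item(fragment: str) -> str:
    """Turn one Word-style list paragraph back into a plain <li> element."""
    body, count = LIST_P_OPEN.subn('', fragment, count=1)
    if count == 0:
        return fragment  # not a Word list paragraph; leave it untouched
    body = FAKE_BULLET.sub('', body)
    body = P_CLOSE.sub('', body)
    # Collapse runs of whitespace (assumes no <pre> content in the item).
    body = ' '.join(body.split())
    return f'<li>{body}</li>'


# The Word output quoted earlier in this message.
SAMPLE = """<p class="MsoNormal" style="margin-left:.5in;text-indent:-.25in;mso-list:
l0 level1 lfo7">
<strong><span style="font-family:Symbol;font-weight:normal">·
<span style='font:7.0pt "Times New Roman"'> </span></span>
</strong><strong>empowering users</strong>
to better control their own applications.
"""

print(clean_word_list_item(SAMPLE))
```

On the sample this should give back the original list item on a single line. It does not wrap consecutive items in <ul> or <ol>, and it will miss any variation in how Word writes the bullet span, which is exactly why I'd rather have Word export clean HTML in the first place.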