Subject: Followup: Moving Word documents into HTML?
From: "Hart, Geoff" <Geoff-H -at- MTL -dot- FERIC -dot- CA>
To: "Techwr-L (E-mail)" <TECHWR-L -at- lists -dot- raycomm -dot- com>
Date: Thu, 4 May 2000 12:50:09 -0400
A followup on my earlier posting concerning moving Word documents into HTML.
This information is taken from the current issue of Woody's Office
Watch, a "do not live without it" resource for Office users.
Subscription information is available at the following address:
Join, Leave or change address: http://www.woodyswatch.com/
Email: send the message "subscribe" to wow -at- wopr -dot- com
3. WHEN THERE'S TOO MUCH HTML IN WORD
Word 2000 is promoted as having full HTML compatibility but
there are times when full HTML fidelity is a pain in the
neck.
When you copy some text from Word to an HTML editor like
FrontPage, Word does its best to send with it exactly what
you had in the document: the same fonts, the same spacing,
the works.
Sometimes that's what you want, but often you just want
the basic formatting (bold, italics, etc.) plus the raw text.
What Word sends with the copied text is a bloated set of
class settings, span statements and XML tag placements.
Try copying a few paragraphs from Word to FrontPage, then
look in the HTML view - you'll see things like:
"<span style="mso-spacerun: yes"> </span>" - this is
just for two or more spaces!
'class="MsoBodyText"' - this is the name of the original
Word style with 'Mso' in front of it.
"<o:p> </o:p>" -- apparently these are XML placeholders.
They are everywhere in text copied from Word but serve no
useful purpose in most cases.
'<span style="font-size:9.0pt;mso-bidi-font-family:'Courier New';color:black">'
- span statements like this bring across the exact
character formatting in Word.
All this extra code can severely bloat the size of an HTML
page, making it slower to download and display. Worse, it
can confuse whoever edits the page if they don't realize
the tags are there.
Sadly, there's no direct solution - Microsoft focused so
narrowly on HTML compatibility that it ignored the more
practical possibilities. Thankfully, you can identify most
of the surplus Word markup because of the liberal use of
'Mso' in the tags.
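
Because the surplus markup is flagged so consistently, it can be counted mechanically. The following is only a rough sketch in Python, not anything from the newsletter: the function name, the sample string, and the three patterns are assumptions based on the examples quoted above, and they are certainly not a complete list of what Word 2000 emits.

```python
import re

# Illustrative patterns only, derived from the examples quoted in this post.
# They are assumptions, not an exhaustive catalogue of Word's output.
WORD_MARKERS = {
    "mso_style": re.compile(r"mso-[a-z-]+\s*:", re.IGNORECASE),
    "mso_class": re.compile(r"class\s*=\s*[\"']?Mso\w+", re.IGNORECASE),
    "office_tag": re.compile(r"<o:\w+", re.IGNORECASE),
}


def count_word_markup(html: str) -> dict:
    """Count how often each kind of Word-specific marker appears in html."""
    return {name: len(pattern.findall(html))
            for name, pattern in WORD_MARKERS.items()}


# Hypothetical sample, modelled on the snippets quoted above.
sample = (
    '<p class="MsoBodyText"><span style="mso-spacerun: yes">  </span>'
    'Some <b>bold</b> text<o:p></o:p></p>'
)
print(count_word_markup(sample))
```

A page whose counts are all zero is probably free of Word residue; anything else is a candidate for cleanup.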
One workaround is to use Paste Special and choose the 'Normal
Paragraphs' option, but that removes all formatting codes,
including fundamental ones like bold and italics. The
same happens if you select text in FrontPage and choose
Format | Remove Formatting: all the formatting is lost.
Alternatively, you can go through the HTML manually and
remove the excess. Because FrontPage's replace function is
so basic compared to Word's, that is a tedious process.
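
The manual cleanup could in principle be scripted outside FrontPage. The following is a minimal sketch in Python, assuming the HTML has been saved or copied into a string. The function name and sample are hypothetical, and the approach is deliberately blunt: it unwraps every span, not only Word's, and regular expressions are not a real HTML parser, so check the result before trusting it. It keeps tags such as <b> and <i>, which is the "basic formatting plus raw text" result described above.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove Word-specific residue while leaving tags like <b> and <i> alone."""
    flags = re.IGNORECASE | re.DOTALL
    # Drop <o:p>...</o:p> placeholders, then any other stray <o:...> tags.
    html = re.sub(r"<o:p>.*?</o:p>", "", html, flags=flags)
    html = re.sub(r"</?o:[^>]*>", "", html, flags=flags)
    # Drop class attributes whose value starts with 'Mso'.
    html = re.sub(r"\s+class\s*=\s*(\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)",
                  "", html, flags=flags)
    # Unwrap every <span ...>...</span>, keeping the text inside it.
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=flags)
    return html


# Hypothetical sample, modelled on the snippets quoted above.
word_html = (
    '<p class="MsoBodyText"><span style="font-size:9.0pt;color:black">'
    'Some <b>bold</b> and <i>italic</i> text</span><o:p> </o:p></p>'
)
print(strip_word_markup(word_html))
# prints: <p>Some <b>bold</b> and <i>italic</i> text</p>
```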