XML, why would
you bother?
Author:
Managing director, Elkera Pty Limited
Date: 22 February 2006
Presentation by Peter Meyer
to the Australian Society for Technical Communication (NSW) meeting in Sydney
on 22 February, 2006.
None of us can avoid being
aware that documentation systems are changing, yet again. Microsoft is moving
Word to its own XML markup. Competing XML formats, such as DITA, are being
proposed as the new best thing for technical documentation.
This paper explores some of the fundamentals of XML, its different
flavours and how it can be used to automate the production of multiple formats
from a single data source.
The advantages and
disadvantages of using XML for technical documentation systems are explored.
The aim is to provide a basis for content managers to develop their own
application requirements and assess the business case to determine if XML is
right for their needs.
There is enormous corporate
inertia to continue using Word as the tool to create all enterprise content.
Unfortunately, Word was not designed to produce content that is to be used in
automated systems to produce multiple outputs from a single source. Structured
XML languages are designed to meet the needs of complex documentation systems.
However, there is a change management issue involved. This is fairly easily
addressed if attention is paid to writer needs and XML editing tools are
customized to truly simplify writing content in XML. The presentation will show
an example of such an application.
XML, why would
you bother?
Why are we talking about XML?
This
discussion is titled “XML, Why would you bother?” This is a rather loaded
question. It assumes that people are thinking about XML but they may have some
doubts.
Before taking on the question, it is
necessary to provide some context. What are the problems that we are trying to
solve when we think of XML? This will ensure that we can properly consider the
merits or otherwise of XML in solving those problems.
Some key needs in product documentation systems
Collaborative writing and quality control
Publications may be produced by technical
writers in different branches of an organization in different cities or
countries. Writers need controlled, shared access to content
components.
Parts of documentation may be written
by software developers and other parts by specialist technical
writers.
Contributions must be coordinated and
changes approved by persons with relevant knowledge of publication
issues.
Content must be maintained in a common
format that can be output to meet local market
requirements.
Knowledge management
Over time, personnel turnover will occur.
Experienced writers may leave and new writers will be introduced to the
project. It can't be assumed that writers know all existing content. In very
large projects, even experienced team members may not know everything that is
available.
Content must be exposed to writers in a
way that they can easily find existing content dealing with a particular
feature or process so as to not create unnecessary duplication or slow down
production times. This requires rich metadata models that tie content to
product components, features and user actions. This issue arises again in the
section dealing with product life cycle management.
Output file formats
Product documentation may require these and other
outputs: • printed manuals and
guides • ready to print,
electronic documents (PDF files for printed manuals) • HTML • Various flavours of Help, .hlp, .chm, Vista help
etc • Multi media training
materials • Outputs for
mobile devices.
Often these outputs
are quite distinct with little content that is common between, say print and
HTML/Help but this is not always so.
Market responsiveness
Market pressures
are promoting faster product development life cycles and, in the case of
software development, new development tools are making this possible. Product
release cannot be held up while documentation is prepared or updated. Thus
production processes must be optimized so that only necessary content
components are revised and reviewed. Production processes must be automated to
minimize the need for manual intervention in processes such as formatting and
hypertext link management.
Content sharing and re-use
In product families and even
among different product families, documentation may include common corporate
information and product information. It is costly and error prone to maintain
this separately in each publication. It should be possible to manage content in
components (topics) in one place and incorporate it into multiple
publications.
Stored content components must be
capable of being inserted at arbitrary points in document hierarchies and in
multiple publications using quite different formatting.
This is not just a technological problem. It requires great care
and skill in planning the information architecture and in the preparation of
content.
Integration with other production processes
Some product information such
as interface component descriptions and API descriptions may be written by
software developers and may be created within the code files. Using a literate
programming model, this documentation should be incorporated into the product
documentation. However, it may require review by documentation managers. To
ensure consistency in the code base and the product documentation, two way
linkages will be required.
When changes are made by
one or other contributor, these may need to be flagged for review by the
others.
Developers need access to the documentation
for incorporation of links for context sensitise help in interface
components.
Publication formatting
Publications may require quite different
presentation in print, on the web or in other formats. At the very least,
different fonts may be used but more extensive changes may occur. Contents
listings, may be output quite differently. Hypertext links must be activated
and validated in HTML and help outputs. Graphics must be rendered at
appropriate resolutions for the output medium. Footnotes may be displayed quite
differently in print and online versions. Print versions may be rendered in a
single output but online versions chunked in separate files by
topic.
Over time, software versions may be revised
that require changes to proprietary file formats. Corporate styles may be
revised to reflect new corporate images or changes following corporate mergers.
It should not be necessary to edit content to accommodate any of these
needs.
Product documentation begins with the specifications
Most products begin life in
a vision statement, requirements document and detailed specification. If well
written, particularly with use cases, this documentation may provide a
substantial input into later product documentation. It may be valuable if
specifications are managed in a compatible form with product documentation so
that content can be incorporated without format
transformation.
Product life cycle management
Increasingly, there is a need to more
actively manage the total product life cycle to achieve quality outcomes.
Product support and maintenance issues must be fed into the product development
process and then into product documentation and support knowledge bases. It may
be desirable to manage documentation in compatible formats at all stages in the
product life cycle for consistency of metadata, content sharing and
searching.
Translation management
Increasingly, companies must provide
documentation in multiple languages. Translation costs are very high so it is
necessary to manage content to minimize those costs. This may require the use
of fine grained translation memory systems.
The problem with native Word documents
Word processor applications are designed to let authors create
documents that can have almost any layout and style that might be needed in an
office environment. Writers can choose single or multiple columns, apply
paragraph and text style properties, create automatic numbering, create
headers, footers, indexes, contents listings and cover pages to meet a vast
range of document publishing needs. Applications such as Microsoft Word are
commonly known as What You See Is What You Get (WYSIWYG) editors.
What you see on screen ought to be the same or almost the same as what you see
in print. We all know that this is not always the case, particularly when you
send your Word document to someone else to print.
Word processing documents have three important
characteristics: (a) They store information
in a simple paragraph model in which each new paragraph is created when you
press the Enter key. This means that there is no explicit relationship in the
data to connect headings to the paragraphs to which they relate or introductory
text to the following list items. (b) Word documents store format information with the content, even
when named styles are used. This means that if you want to publish the document
in another output, other than print, it is necessary to manually change style
properties or apply new styles where the old styles don't match the new
output. (c) It is difficult
to store non printable metadata about particular components because the
components may not be defined (a chapter or because there is no place to insert
the information in a form that software can reliably
process.
The effect of these
characteristics is that if you want to publish a Word document to the web in
HTML, the HTML document will look fairly similar to the print. If you want to
change the fonts in particular places to improve the reader experience on the
web or chunk a large document into single pages for each chapter or schedule,
it is very likely this will not work as expected. Problems may include
inappropriate or inconsistent formatting, poorly defined tables, incorrect
chunking and broken or missing links to internal components and other
documents.
If the document contains rigorously
applied styles, it may be possible to get a better result. Unfortunately, most
authors don't use Word styles. Most who do so change them arbitrarily, override
them or apply local formatting instead. Style names may be difficult to
interpret by anyone other than the author if they are based on their
appearance, rather than the generic function of the content to which they are
applied. Even with styles, the absence of metadata can cause problems with
links and non compliance with web accessibility guidelines.
It is possible to use Word documents for smaller documentation
projects or those where many of the needs listed earlier are not of high
importance. The basic problem with using Word for content writing is that the
resulting system will not scale well to meet more of the listed needs. More and
more effort will be devoted to manual processes and correcting
errors.
How does XML meet these needs?
The big question
Before we can answer this question, we must understand some
important points about XML. Since Microsoft is increasing the use of XML in
Microsoft Word, we will also need to understand the implementation of XML in
Word by Microsoft.
The many languages of XML
XML is not a markup language but a tool to
define document or data description languages. Anyone can define an XML
language to suit their particular needs. Thus, it is not helpful to say that
something is “in XML” in order to convey important information. We must know
which XML language is in use and its characteristics. The XML language in
question may be useful or not for a particular purpose according to its
design.
There are many commonly available XML
document languages provided by standards bodies and developers. XML languages
are defined in a Document Type Definition (DTD) or other schema
definition language. These will be referred to as schema. The
schema defines the grammar for the XML language. This includes the names of
allowable elements and attributes, the order in which they may occur and other
properties. A schema for document markup is a means of enforcing desired
structure and business rules in content production. It ensures that data is
predictable for use in automated processing systems. This is something that is
not possible with word processing documents.
Some
schema such as XHTML 1.0 can be applied to a wide range of common
documents as they may appear on a web page. However, such a schema defines only
a very few generic structures in documents, such as heading, paragraph, list
and blockquote. Often, formatting information is added to distinguish different
kinds of information in the document. If you want to define a component of a
document for special processing and make sure it is only used once and in a
particular part of a document, you can't do this with XHTML 1.0. In addition,
you cannot reliably determine the hierarchy of your documents. Headings (H1 to
H6) can be used in any order and at any level in the document. They are not
tied to the paragraphs to which they relate.
Schema
such as XHTML 1.0 are said to be flat, presentation oriented
schema.
Other schema, including
XHTML
2.0,
DocBook,
DITA and
Elkera BNML provide various ways to represent the
true hierarchy of objects within the document. Some of these schema allow you
to generically define particular parts of documents that may require particular
processing in rendering applications or in content re-use or for searching.
These schema describe the generic hierarchical structure of documents
(chapters, parts, clauses, procedures, steps etc.). While each of the mentioned
schema have this characteristic, they are very different and suited to
different uses. A comparison of key features in the listed schema is available
in the paper
Comparison of XML schema for narrative documents.
Many of the benefits from using XML can be obtained only if the
application separates presentation information from the information itself.
This allows software applications to read the XML file and select chosen
components for particular kinds of processing that are needed for the
information defined by that component. For example, a document abstract or
synopsis may be suppressed in the print version but shown in a special location
on a web version. Alternatively, different style properties may be applied in
the print and web versions. XML schema achieve this by using names for elements
that describe their function in all documents of a particular kind. In this
way, the names are said to be generic. XHTML 2.0, DocBook, DITA and BNML are
considered to be generic structured schema.
Generic structure schema can capture a lot of information about
the content of a document, depending on the richness desired in the language.
Such schema can provide great flexibility in the way document content can be
searched, manipulated and rendered by software applications. This kind of
flexibility is not available for documents using flat, presentation oriented
schema such as XHTML 1.0 or the XML formats used in Microsoft
Word.
XML in Word
Versions of Microsoft Word up to Office 2003 create a native or
binary file format (.doc) by default when a document is saved. Word can also
save out a text based representation called Rich Text Format
(RTF). Recent versions of Word can also save HTML. From Office
2003, Word can optionally save WordProcessingML, Microsoft's own XML document
format. Today, it appears that most users continue to save their documents as
Word binary files (.doc).
From Microsoft Office 12
to be released in 2006, Microsoft advertises that Office applications,
including Word, will save files in the Microsoft Office Open XML formats by
default. From that point, there is no difference between a Word document and an
XML document, except that all XML formats may not be equal. The new XML format
for Word is an extension of the WordProcessingML format used in Office 2003. It
is necessary to understand what kind of XML is produced by Word and assess it
against your requirements.
In its preview materials
for Office products
Microsoft Office Open XML Formats Frequently Asked
Questions, Microsoft makes it clear that its XML formats
in Office 12 are “display oriented”, in the same way as WordProcessingML in
Office 2003.
The problem with presentation oriented schema
Writers of documents in
Word can create an XML document using all the same styles they use now. Using
the default schema, Word does not impose any meaningful structure rules on
content writers. There is nothing to tie headings to the content to which they
relate and nothing to enforce a regular document hierarchy. Based on a
formatting whim of the writer, a heading 1 can follow a heading 3 but
semantically be part of the heading 3 topic, just as in a conventional word
processor.
Presentation oriented schema are little
different to common word processing documents in providing a foundation for
automated processing. They do allow use of XML tools for processing but the
semantic information in the markup often is missing or cannot be trusted, just
as in traditional Word documents.
Structural schema
Structural schema
such as DITA, DocBook, S1000D, Elkera BNML and similar schema impose rules on
content writers that may require the use of elements in particular sequence and
mandate the use of certain elements. These rules are enforced through
validation of the document against the schema. Some or all these rules must be
understood by the writer before he or she can effectively write content using
an XML editor. This makes writing content with a structural schema different to
using a word processor such as Word.
The key features of structural schema include: • Information is self describing though the use of
descriptive element names. Presentation information is separated from document
structure so that any desired formatting can be applied automatically for each
output. Content does not have to be revised to change styles or file formats in
generated publications. • Consistent document structure can be enforced by the schema. These
characteristics remove ambiguity from processing systems and allow reliable
automation that minimizes human intervention. • Depending on the schema design, content
components can be incorporated into multiple documents at arbitrary levels in
the document hierarchy without re-tagging or
re-formatting. • Metadata
can be attached to content components to assist information retrieval and
processing. Structural containers ensure that metadata attaches to the right
content. This facilitates reliable processing and
searching.
It is clear from this
short list that structured XML can provide the foundation for meeting most or
all the needs listed earlier.
Some points about writing structured XML
What is unique about using an XML editor?
When using an XML editor, a writer must insert a valid element
defined by the schema before typing content. This is quite different to a word
processor where the writer can type, press Enter and continue typing to create
new content.
After completing
the content for an element in an XML editor, the writer has to: (1) identify and choose the next element they
wish to insert either inside or after the current
element; (2) if the element
is outside the current element, find the valid location at which to insert the
new element, usually by moving the insertion point; (3) insert the selected
element; (4) if elements are
nested, the writer may have to go back and pick another element to insert
before writing new content; and (5) continue entering content.
In many XML editors, steps 1 and 2 are bound together. Commonly,
the editor provides a pick list of valid elements at each location. The problem
is that when the writer finishes a paragraph, the allowable elements that may
follow that element cannot be seen until the insertion point is moved to the
correct location. This can be quite frustrating, particularly for writers who
are new to XML editors.
Other
demands imposed on writers in XML editors include: • The schema may require an element to be inserted or an attribute
value to be completed. If it is omitted, the document is
invalid. • When an element
has to be moved, it can be moved only to a valid location. If the writer does
not know how to find the valid location, moving content can be very
frustrating.
Many XML editors provide
multiple views of the document, including normal or Tags Off View, Tags On View
and a Structure View. To deal with the problems mentioned, writers often have
to work in a Tags On or Structured View.
The problem with XML editors
The
particular characteristics of XML editors impose several burdens that are not
as pronounced when using a word processor. This leads some organizations to try
to let technical writers work in Word using style templates. Documents may be
converted to XML later in the work flow. This is rarely entirely satisfactory.
It is difficult, if not impossible, to reliably convert word processor
documents to structural XML without some human intervention to correct
errors.
Writers using an XML editor may need a good
understanding of the schema rules to understand how to find valid locations for
new and moved elements. This imposes a high training burden and may lead some
writers to be intimidated by XML editors at first.
The additional steps introduced into the writing process by XML
editors can slow down work and break concentration while
writing.
If the use of XML editors is to become more
widespread, it will be necessary to simplify the process for writers and enable
them to work more easily.
Overcoming these problems
Schema design
All the problems with XML editors can be overcome through good
schema design and thoughtful customization of the editor to the
schema.
Ideally, the schema should be the simplest
model that will create the desired markup structure. Most documents conform to
very standard patterns for section/clause, paragraph and list structures.
Ideally, the schema should make it easy for the writer to understand how these
work. If the interface is then designed carefully, the writer won't really need
to know the finer points of the schema rules until they are ready to learn
them.
Editor customization
Most XML editors are designed to be
customized in some way to provide writers with various shortcuts and aids to
their work. XML editors must work with a wide range of schema. It is not
practicable to treat them as a ready-to-use product. Sometimes, extensive
customization is required to achieve a highly usable writing interface. The
capacity of an XML editor to facilitate that customization should be carefully
considered during evaluation.
The key to customizing
an XML editor is to fully understand the content to be created and the
conventions followed by writers. A customized interface need not try to deal
with every situation. It should aim to make work easier for that common actions
that will likely amount to 80% of the writer's work.
A good test of an effective interface is if the writer can perform
most work without seeing any tags. While some editors attempt to work wholly
without tags, Tags On Views can be very helpful to enable writers to quickly
and accurately perform unusual editing actions.
When
properly implemented, writers should find that content writing in XML is much
simpler than using a word processor. The freedom to forget about presentation
and concentrate solely on structure is liberating. Writers no longer have to
concern themselves whether automatic numbering and cross references will break
as content is edited or how various fiddly layouts will be created. All these
processes can be handled automatically.
Conclusions
The choice between
using Word, including Word XML or using structured XML involves a clear
trade-off.
The use of Word avoids a layer of change
management within the organization. Most users already have Word on their
desktops. The temptation is high to use it if you can. Senior managers who
don't understand document publishing issues rarely understand why Word is not
good enough. This makes it hard to obtain funding for new software. It seems
cheaper up front.
The costs of
using Word or format oriented XML may not show up immediately but they can be
substantial and occur over a long period. They may include: • low functionality in basic systems with poor
scalability to meet more complex needs as the business
grows; • costly manual
re-purposing of data to manage different outputs and hypertext linking or
simply not being able to meet real customer needs in
documentation; • excessive
duplication of content resulting in inconsistencies, errors and higher
production and maintenance costs; • slower production cycles with impacts on time to market for new
and updated products.
Put simply,
Word was never designed to produce content for use in automated document
publishing systems such as those needed for complex product
documentation.
The switch to structured XML involves
a higher level of change management at the outset. In a few cases, there is a
genuine fear of this change based on examples of poorly implemented XML systems
which did not take sufficient account of the needs of writers. In other cases,
it is based on no more than an assumption that “We must be able to do it in
Word”. Structured XML is designed to enable systems to meet the listed needs.
It will maximize functionality, reliability and scalability, thus reducing
costs over the medium to longer term.
It is amazing
that many organizations are making decisions affecting fundamental issues of
content management system design based on little more than a poorly defined
resistance to using a new writing tool.
In your
enterprise, do you want to just keep using a word processor for its own sake or
do you want to establish a framework that is designed to do the job at
hand?
|
Page Options
|