Comparison of XML schema for narrative
documents
Authors: Andrew Squire and Peter Meyer
1 Introduction
1.1 Purpose of this document
The purpose of this document is to identify and compare the key
features of four XML DTDs or schema that may be considered as candidates to
model a wide range of business, legal and technical documents commonly prepared
by business and government enterprises (narrative business
documents). These include reports, articles, contracts, technical
specifications and other similar documents. The key point about them is that
they are usually published as complete documents that provide the context for
all content components. Some components may be shared with other documents but
there is a high degree of interdependency between parts of these
documents.
Enterprises may wish to manage narrative
business documents using XML to support various requirements, including single
source publishing, automated content re-use and long term data management for
long life cycle documents.
These sorts of documents
are often created by non-technical authors who are familiar with word
processing software but have no experience with XML.
The schema to be compared in this document
are: (a) DocBook; (b) DITA; (c) XHTML 2.0; and (d) BNML.
The first three schema are standards or proposed standards. The
BNML schema is newly released by Elkera in July 2005. The BNML schema is to be
freely available under an open source licence and is not a commercial product
for Elkera.
While they overlap in many respects,
each of the four schema is designed with a particular focus. They each have
advantages for particular document types and application
requirements.
The selection of a schema is possibly
the most critical decision in the establishment of an XML based content
management system. Every application must be built around the schema. Once
applications are developed, it is very expensive to replace the schema or make
major changes to it. The schema design has major impacts on the cost of
application development and on the ease with which authors can create XML
content.
The aim of this document is to assist
technical architects to define their application requirements and identify
those schema that are likely to be suitable for more detailed assessment. Due
to the complexity of the schema and the wide range of potential user needs,
this comparison is not a comprehensive review of each schema. It focuses on
particular features relevant to narrative business documents.
1.2 Assumed requirements
To provide a basis for the comparison, certain high level
requirements are assumed:
(a) Documents are predominately text with graphics and tables that should be rendered using standard layouts for all or most documents of a particular class.
(b) Content must be created by authors from scratch using an XML editor.
(c) The schema must define a common markup language that can be used for a range of document types to minimise application development and author training effort.
(d) Metadata may be required to be added to documents and content objects to facilitate content management, publishing requirements and semantic processing needs.
(e) Documents may be revised from time to time by different authors and re-published.
(f) Documents may be published in print, on the web and often in other formats.
(g) Document components may require automatic numbering and dynamic cross references to numbered components.
(h) Content may be shared in multiple document renditions and content may be re-used in multiple documents, either by manual or automated copying of content chunks.
(i) It may be necessary to provide for collaborative authoring where parts of a document are produced by different people and assembled into a whole before publication.
1.3 Approach taken in the comparison
This document provides a two level comparative analysis of the
four schema covering technical issues and the business implications of using
each schema.
The technical assessment and
comparison considers these characteristics:
(a) the purpose or function of each schema, as stated by its developers;
(b) a description of the basic structural model defined by the schema and the way it may be applied to narrative business documents;
(c) the approach taken for component numbering and internal cross references;
(d) the schema's support for content re-use in authoring applications and in automated processing systems such as document assembly systems;
(e) particular specialized markup features provided with each schema;
(f) the support for conditional output; and
(g) the approach taken by each schema for customization and specialization.
The business
implications are based on an assessment of: (a) the likely application development and support effort imposed by
the schema design; and (b) the ease with which authors might be trained and supported to use
the schema consistently, assuming they use an XML editor for document
authoring.
2 Overview of the schema
2.1 DocBook
DocBook was
released in 1991 by the Davenport Group. The original design goal of DocBook
was to enable the interchange of computer documentation. It was not originally
designed as an authoring DTD. DocBook is now an OASIS Standard (www.oasis-open.org).
DocBook was originally developed for computer documentation and it
has a heavy bias toward those types of documents.
DocBook users typically use either the book or article as the
root element. The article element is commonly
used for a range of narrative business documents.
2.2 DITA
The Darwin
Information Typing Architecture (DITA) was originally developed by IBM. It
became an OASIS Standard (www.oasis-open.org) in June
2005.
DITA was designed to support topic based
content. Topic based content is information that is chunked into discrete
topics, usually found on web sites and computer help systems. It was not
designed with narrative documents in mind.
DITA
offers a sophisticated customization and specialization model that allows
customizations to be compatible with applications capable of processing the
base DITA.
DITA allows documents to be created
using either the topic element or the
map element. The topic element is used to model discrete chunks of
information but topics can be arranged hierarchically. The
map element is used to define the relationship
among a set of resources, such as DITA topics.
It provides an information framework and navigation structure for on-line
content. The map element can be used to
represent a hierarchy of topics.
2.3 XHTML 2.0
Extensible
Hypertext Markup Language (XHTML) 2.0 is currently in development by the World
Wide Web Consortium. At July 2005, it is a Working Draft and has not reached
the Recommendation stage.
XHTML is an XML version
of Hypertext Markup Language (HTML). XHTML 2.0 builds on XHTML 1.0 but is
primarily intended for publishing documents on the web. XHTML 2.0 adds the
ability to create document sections and grammatical paragraphs.
XHTML only provides a single root element html. All possible document types must be modelled using
this element.
2.4 BNML
The BNML schema was
developed by Elkera and released as an open source schema in July 2005. This
schema is the result of over 10 years' experience developing DTDs for narrative
business documents.
BNML is designed specifically
for the creation of narrative business documents, particularly legal and
general business documents. It aims to provide a simple structural markup of
documents that will support high quality print and web publications, facilitate
content re-use and allow authors to be trained and supported with minimal
effort.
The BNML schema provides four standard
document types: document (for normal business
documents and articles), contract,
correspondence and item (for discrete, reusable chunks).
The BNML schema provides a number of specialized elements that are
specifically suited to these document types. The adjunct element provides for appendices and attachments
to documents, contracts and correspondence. This element can contain normal
narrative content or a complete document,
contract or correspondence instance.
The BNML schema also includes the party-signature element and its components to represent
provisions in contracts and correspondence where written signatures and seals
must be applied.
3 Basic structural markup
3.1 Purpose of this section
The basic structural markup of narrative business documents can be
broken down into three distinct patterns: (a) the hierarchical structure used to create divisions or sections
within the document; (b) the paragraph markup; and (c) the list markup.
Each schema defines a basic document
hierarchy or structure. This section will compare the approaches taken by each
schema to model these basic patterns within a document. This will include an
assessment of the ability of each schema to support the consistent application
of markup to meet content re-use and collaborative authoring needs. This
characteristic is also a significant factor in usability for authors and in
reducing application complexity.
3.2 Hierarchical structures
3.2.1 Examples of common structures
Narrative business documents usually contain a hierarchical
structure, as shown in the following example:

Example 1 Numbered document hierarchy
1 First level
1.1 Second level
1.1.1 Third level
This is the paragraph text of section 1.1.1, the third hierarchical level.
2 First level
This is the paragraph text of section 2, of the first hierarchical level.

The levels in a numbered hierarchy can include numbered
hierarchical structures that may or may not have a heading or title, as in the
following example:

Example 2 Numbered document hierarchy with and without headings
1 First level
1.1 Second level
1.1.1 Third level
This is the text of the paragraph under the third level heading.
1.2 Second level
1.2.1 This is paragraph text of a numbered structure at the third level. This level does not have a heading or title. This kind of structure is common in legal and specification documents.
1.2.2 This is another paragraph of a numbered structure at the third level.

The levels in a hierarchy can also be unnumbered, as in this
example:

Example 3 Unnumbered document hierarchy
First level
Second level
Third level
This is paragraph text at the third hierarchical level.
First level
This is the paragraph text of the first hierarchical level.
3.2.2 Common hierarchical models
There are two common approaches to modelling the hierarchical
structure of narrative documents: (a) a
recursive model where the same element is used at all levels of the hierarchy
by nesting; and (b) a named
level model where different elements represent each level of the
hierarchy.
This comparison will
focus on the recursive model. In most narrative business documents there is no
meaningful distinction between the content at the different levels. Content at
level 1 in one document may be re-used at level 2 or level 3 in another
document. The recursive model is seen as a superior model in terms of author
usability, flexibility and convenience for the range of structures found in
narrative business documents. A recursive model allows the author to
restructure a document by moving document sections around without having to
rename the element.
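
The difference between the two models can be sketched using DocBook markup, which offers both (see section 3.2.4). The fragments below are illustrative only; the titles and text follow Example 3.

```xml
<!-- Recursive model: the same element is nested at every level -->
<section>
  <title>First level</title>
  <section>
    <title>Second level</title>
    <section>
      <title>Third level</title>
      <para>This is paragraph text at the third hierarchical level.</para>
    </section>
  </section>
</section>

<!-- Named level model: a different element represents each level -->
<sect1>
  <title>First level</title>
  <sect2>
    <title>Second level</title>
    <sect3>
      <title>Third level</title>
      <para>This is paragraph text at the third hierarchical level.</para>
    </sect3>
  </sect2>
</sect1>
```

In the recursive fragment, any section can be moved to another level unchanged. In the named level fragment, promoting the sect3 by one level requires renaming it to sect2.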
3.2.3 The optional heading or title
Example 2 shows a structure in which some numbered components
at the same hierarchical level have titles and others do not. The question is
whether the absence of a title on clauses 1.2.1 and 1.2.2 in
Example 2 makes them semantically different from clause 1.1.1 in
that example. The numbering sequence to be applied to the objects is the same.
The only practical difference is that, usually, the structures without a title
are not included in a contents listing. In another context, another author
could easily add a title to the clause.
Consistent with the point made in the previous section, it is
submitted that authors of the kinds of documents under consideration would
commonly regard the two structures in the examples as the same. However, the
optional title does have some implications. It may be necessary for the
developer to deal with inconsistent title usage in contents generation. In
addition, the layout properties may be different in each situation. When the
structure has a title, the content following the title might be rendered on a
new line. When the structure does not have a title, the paragraph content is
normally rendered inline, after the number.
3.2.4 DocBook
DocBook
provides: (a) the recursive
section element; and (b) named levels using the elements
sect1, sect2,
sect3, sect4
and sect5.
DocBook also has higher level structural elements such as
chapter and part.
The recursive
section element has the following (simplified)
model:
(sectioninfo?, (title, subtitle?, titleabbrev?), (((itemizedlist | orderedlist | para | ...)+, section*) | section+))
The key features
of the DocBook recursive model are: (a) The
section must have a title. (b) The number (or label) of the section is stored as the section label attribute
or derived by processing applications (numbering is further discussed in
section 4). (c) paragraphs and other content are allowed
before the lower level sections.
DocBook does not provide a practical way to mark up all the
structures in the examples in
section 3.2.1. The section element cannot be used
for the numbered structures in
Example 2 that do not have a title. A list might be used to
create the numbered component but this does not provide for the automatic
number sequence because the list cannot inherit the parent section number.
Additionally, it may be necessary to distinguish such structures from
conventional lists for both numbering and rendering
purposes.
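
As a sketch of this limitation, the fragment below marks up clause 1.2 of Example 2 in DocBook, using an orderedlist for the untitled clauses 1.2.1 and 1.2.2. The label values are illustrative assumptions, and clause 1.1 is omitted for brevity.

```xml
<section label="1">
  <title>First level</title>
  <section label="1.2">
    <title>Second level</title>
    <!-- Clauses 1.2.1 and 1.2.2 have no title, so they cannot be
         section elements. A list is the nearest available container. -->
    <orderedlist>
      <listitem>
        <para>This is paragraph text of a numbered structure at the third level.</para>
      </listitem>
      <listitem>
        <para>This is another paragraph of a numbered structure at the third level.</para>
      </listitem>
    </orderedlist>
  </section>
</section>
```

A processing application would typically number these list items independently of the enclosing section (for example, as 1 and 2 rather than 1.2.1 and 1.2.2) unless it is customized to treat such lists as part of the document outline.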
3.2.5 DITA
DITA provides the
recursive topic element to model hierarchical
structures. The topic element has the following
content model:
(title, (titlealts)?, (shortdesc)?, (prolog)?, body, (related-links)?, (topic)*)
The key features
of the DITA model are: (a) the
topic must have a title; (b) there is no provision for a topic
number. If numbering is required, this must be generated by rendering
applications and cannot be stored with the topic. (c) the body is required, though this
can be empty; (d) the
recursive topic element occurs outside the
body of the topic.
DITA
cannot mark up the numbered examples in
section 3.2.1. DITA does not provide for
container elements without a title that themselves contain untitled
elements such as paragraphs.
The DITA requirement that a topic
and any specializations of it must have a title prevents it from representing
the full range of structures required in narrative business
documents.
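
A minimal sketch of a DITA topic hierarchy follows. The id values are illustrative assumptions. Note the empty body on the outer topic and the position of the nested topic outside the body.

```xml
<topic id="first-level">
  <title>First level</title>
  <!-- body is required by the content model but may be empty -->
  <body/>
  <!-- nested topics follow the body; they are not children of it -->
  <topic id="second-level">
    <title>Second level</title>
    <body>
      <p>This is the paragraph text of the second hierarchical level.</p>
    </body>
  </topic>
</topic>
```

Removing either title element would make the fragment invalid, which is why the untitled clauses of Example 2 cannot be represented as topics.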
3.2.6 XHTML 2.0
XHTML 2.0 also provides both approaches to
model hierarchical structures within documents: (a) the recursive section element;
and (b) named levels using
elements h1, h2, h3,
h4, h5 and
h6.
This comparison examines only the recursive section element. The named level headings do not provide
a structured hierarchy.
Key features of the XHTML
2.0 model are: (a) The
title of section can occur in any order with other child elements
and is not constrained to be the first child element of the
section. (b) The section
title is not required. (c) The section
can directly contain #PCDATA or any of the XHTML heading, structural or text
elements in any order. (d) As with DITA, there is no provision to store a number with the
section.
XHTML 2.0 cannot mark up the numbered structure in
section 3.2.1, Example 1.
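
The following fragment is a sketch of the recursive XHTML 2.0 model, assuming the section and h (heading) elements of the July 2005 Working Draft. The text is illustrative.

```xml
<section>
  <h>First level</h>
  <section>
    <h>Second level</h>
    <p>This is the paragraph text of the second hierarchical level.</p>
  </section>
  <!-- The heading is optional, so an untitled section is permitted -->
  <section>
    <p>This section has no heading.</p>
  </section>
</section>
```

The untitled section is valid, but there is nowhere in the markup to record a number such as 1.2 for it.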
3.2.7 BNML
BNML provides one
way to model hierarchical structures using the recursive item element. The item
element has the following content model:
(metadata?, num?, title?, inclusion*, (block, inclusion*)*, (item, inclusion*)*)
BNML provides
three different patterns for the arrangement of item and block children
of item or specialist containers. These
patterns are a loose model, a standard model and a
tight model. The loose model allows item and block elements
to occur in any order. The standard model allows a block before the first item but not otherwise. The tight model prohibits
item and block
at the same level. Schema designers can choose the pattern that is desired for
any container. The content model shown above is the standard
model.
The key features of the BNML model
are: (a) The main hierarchy of the document
is defined by a recursive item element with a
block element for paragraphs of
narrative. (b) The
item number is stored as element content in the
num element (numbering is further discussed in
section 4). (c) Both num and
title are optional so the same element can be
used for all the hierarchical structures described in
Example 1, Example 2 and Example 3. (d) The inclusion element is used to
represent all block level components such as quotations, examples and notes
that need to be shielded from the normal narrative content for the application
of distinct automatic numbering or layout properties. It is also used to
provide for titles and numbers on graphic objects and
tables.
The BNML schema can model
all the example structures in
section 3.2.1 using the recursive
item structure.
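
As an illustration, Example 2 might be marked up along the following lines. This is a sketch based only on the content model shown above. It assumes that block can hold paragraph text directly and that num holds the full number string; neither detail is confirmed here.

```xml
<item>
  <num>1</num>
  <title>First level</title>
  <item>
    <num>1.1</num>
    <title>Second level</title>
    <item>
      <num>1.1.1</num>
      <title>Third level</title>
      <block>This is the text of the paragraph under the third level heading.</block>
    </item>
  </item>
  <item>
    <num>1.2</num>
    <title>Second level</title>
    <!-- The same element is used with no title -->
    <item>
      <num>1.2.1</num>
      <block>This is paragraph text of a numbered structure at the third level.</block>
    </item>
    <item>
      <num>1.2.2</num>
      <block>This is another paragraph of a numbered structure at the third level.</block>
    </item>
  </item>
</item>
```

Omitting the num elements as well would give the unnumbered structure of Example 3 with the same element.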
3.3 Paragraphs
3.3.1 Concepts
There are two
common approaches to modelling the paragraph content in narrative business
documents using XML markup: (a) a
grammatical paragraph model, where all the content of a
grammatical paragraph is contained within the paragraph element including
#PCDATA and other block style elements such as lists and quotations;
and (b) a loose
paragraph model, where the paragraph element content is not contained
within the paragraph element. The paragraph element usually sits at the same
level as the block level elements such as lists and quotes. This is the way
paragraphs are handled by word processor
software.
These concepts are best shown by example.

Example 4 A grammatical HTML paragraph
<p>This is a list of primary colours:
  <ol>
    <li>red</li>
    <li>blue</li>
    <li>yellow</li>
  </ol>
</p>

Example 5 A loose HTML paragraph
<p>This is a list of primary colours:</p>
<ol>
  <li>red</li>
  <li>blue</li>
  <li>yellow</li>
</ol>

The
grammatical paragraph encapsulates all components that are semantically part of
the paragraph. This facilitates more precise rendering layout control and
easier manipulation of content during editing.
In
the loose paragraph model, there is no relationship between the text that introduces
the list and the list itself. The grammatical paragraph model maintains the
relationship between all components of a paragraph. This is a logical concept
for authors to grasp. The container based approach to the paragraph makes it
easier for application developers to manipulate content where content reuse is
a priority.
3.3.2 DocBook
DocBook offers
both the grammatical paragraph and the loose paragraph models. Block level
content and other paragraph components, such as lists, can occur directly
inside the section element or they can occur
within one of two paragraph elements: (a) para – this is a standard
grammatical paragraph; or (b) formalpara – this is a standard
grammatical paragraph with a heading.
DocBook also offers the simpara
element as a simple paragraph that does not allow block style content to occur
within it.
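
The two models can be sketched in DocBook as follows. Both fragments are illustrative, and a schema that permits both gives authors a free choice between them.

```xml
<!-- Grammatical paragraph: the list sits inside the para -->
<section>
  <title>Colours</title>
  <para>This is a list of primary colours:
    <orderedlist>
      <listitem><para>red</para></listitem>
      <listitem><para>blue</para></listitem>
      <listitem><para>yellow</para></listitem>
    </orderedlist>
  </para>
</section>

<!-- Loose paragraph: the list is a sibling of the para -->
<section>
  <title>Colours</title>
  <para>This is a list of primary colours:</para>
  <orderedlist>
    <listitem><para>red</para></listitem>
    <listitem><para>blue</para></listitem>
    <listitem><para>yellow</para></listitem>
  </orderedlist>
</section>
```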
3.3.3 DITA
DITA offers both
the grammatical paragraph and the loose paragraph model. It allows block level
content and other paragraph components to occur directly within the
body element or within the
p element. Thus, there is no way to enforce a
grammatical paragraph model.
3.3.4 XHTML 2.0
XHTML 2.0
works in the same way as DITA. It allows block level content and other
paragraph content to occur directly within the body or section
elements. It also provides a grammatical paragraph model using the
p element but the loose content models mean
that this cannot be enforced.
3.3.5 BNML
BNML provides a
grammatical paragraph element called block. It
does not permit the loose paragraph model.
3.4 Lists
3.4.1 Concepts
Broadly, a
list is a catalogue of things belonging to a class. In narrative content it can
be difficult to determine which parts of the narrative are lists and which are,
say, clauses or subclauses. Very often, the only distinction is the
numbering scheme desired by the author. If a schema defines a list structure,
it is necessary to apply that structure in a consistent, useful way to achieve
rendering and processing objectives.
In narrative
business documents, lists are commonly: (a) numbered; (b) unnumbered; or (c) bulleted.
Some schema
provide a number of other specialist lists for creating different types of
lists such as definition lists and variables lists. These types of lists will
not be considered in this document.
Some schema make
a distinction between a numbered list (ordered list) and a bulleted list
(unordered list). In narrative business documents, lists are always reproduced
in the order that was intended by the document author, whether numbered or
bulleted. The distinction between ordered and unordered lists is not a useful
distinction in these types of documents. In practice, the only difference
between the following two examples is the numbering style.
Example 6 A numbered list
The simple content of a list:
(a) text of a list item;
(b) text of another list item.

Example 7 A bulleted list
The simple content of a list:
• text of a list item;
• text of another list item.

Schema that
require the numbered list to be an ordered list and the bulleted list to be an
unordered list force the author to assign a vague semantic meaning to achieve a
desired numbering style. This is also a problem for the author when they wish
to change one list to another.
Some schema imply
the type of numbers used for numbering a list. It may be important for a
particular application that an author can control the types of numbers used for
numbering a list. It may also be important that an author can control the index
used for the first item in the list. This can be used to allow list numbering
to continue from a previous list, if necessary, or to create list
fragments.
3.4.2 DocBook
DocBook
provides three elements to mark up lists: (a) orderedlist – used for numbered
lists; (b) simplelist – used for unnumbered lists;
and (c) itemizedlist – used for bulleted
lists.
The orderedlist and itemizedlist can contain one or more listitem elements. The simplelist can contain one or more member elements.
The
orderedlist provides the author with control
over: (a) the type of list item number that
is assigned to list items; (b) whether list item numbering should continue from the previous
list; (c) whether the list
item number should inherit the parent list number (to achieve hierarchical
style numbering of lists).
The "mark"
or bullet for the items in an itemizedlist is
normally determined by processing applications. The itemizedlist provides the mark attribute to control the type of bullet but
specific values must be added by a customization layer to enable this
functionality.
The listitem element allows the author to override the number
for the particular item.
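
The controls described above correspond to the numeration, continuation and inheritnum attributes of orderedlist and the override attribute of listitem. The fragment below is a sketch showing three of them. The attribute values chosen and the list text are illustrative.

```xml
<!-- numeration selects the type of list item number -->
<orderedlist numeration="loweralpha">
  <listitem><para>text of a list item;</para></listitem>
  <listitem><para>text of another list item.</para></listitem>
</orderedlist>

<para>Intervening narrative text.</para>

<!-- continuation asks the processor to carry on from the previous list -->
<orderedlist numeration="loweralpha" continuation="continues">
  <listitem><para>text of a third list item;</para></listitem>
  <!-- override replaces the generated number for this item only -->
  <listitem override="z"><para>text of a final list item.</para></listitem>
</orderedlist>
```

How these attributes are honoured depends on the processing application.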
3.4.3 DITA
DITA similarly
provides three list elements: (a) ol (ordered list) – used for numbered
lists; (b) sl (simple list) – used for unnumbered lists;
and (c) ul (unordered list) – used for bulleted
lists.
Both ol and ul can contain
one or more li (list item) elements. The
sl can contain one or more
sli (simple list item) elements. The author is
given no control over list item numbering.
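
For illustration, the three DITA list types look like this. The list text is taken from Example 6.

```xml
<!-- Numbered list -->
<ol>
  <li>text of a list item;</li>
  <li>text of another list item.</li>
</ol>

<!-- Bulleted list -->
<ul>
  <li>text of a list item;</li>
  <li>text of another list item.</li>
</ul>

<!-- Unnumbered list -->
<sl>
  <sli>text of a list item;</sli>
  <sli>text of another list item.</sli>
</sl>
```

Changing a numbered list to a bulleted list means renaming the list element, because the markup has no place to state the numbering style or a starting number.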
3.4.4 XHTML 2.0
XHTML
provides the ol and ul lists. These have the same functionality as these
elements in DITA.
3.4.5 BNML
BNML does not have
a distinct list element. The item element is
used as the list item element. This is the same element that is used for the
document hierarchical structure, as described in
section 3.2 (redefined to prevent directly
recursive item elements). An
item is considered to be a list item when it
occurs within a block (grammatical
paragraph).
Attributes on the block element give the author control over how the list
items are numbered within a block. This can be
either numbered, bulleted, manually numbered or unnumbered. The
item element allows the author to specify a
number restart index for the item so authors
can control the first number in a list.
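
A rough sketch of a BNML list follows, using the text of Example 6. It assumes that block accepts a mixture of text and item children and that each list item holds its text in a nested block. The attributes that select the numbering style and the restart index are not named in this document, so they are omitted rather than guessed.

```xml
<!-- The numbering style would be set by an attribute on block;
     its name is not given here, so none is shown. -->
<block>The simple content of a list:
  <item>
    <block>text of a list item;</block>
  </item>
  <item>
    <block>text of another list item.</block>
  </item>
</block>
```

Under these assumptions, switching the list from numbered to bulleted would change only an attribute value on the enclosing block, not the element names.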
3.5 Do the schema provide for consistent markup?
3.5.1 Concepts
The
consistency of markup within a set of documents has implications for both
authors and developers. It is an issue that is often overlooked when evaluating
a schema, particularly where it is necessary to capture legacy data with a poor
structure.
A schema may provide many different ways
to mark up the same content. Authors may use the markup inconsistently within a
single document. Different users will apply different markup in similar
situations. An author's view of the best way to mark something up may change
over time.
Application developers may make
assumptions about how the markup should work and base application behaviour on
these assumptions. This may lead to unexpected or inconsistent behaviour by the
application. Some rendering applications are style based applications,
operating in a way that is consistent with word processor paragraph styles.
Consistency of markup produces a better result in these applications because
fewer contexts need to be considered for styling.
Schema that provide more than one way of marking something up also
have implications for authors and author training.
This section considers how each schema promotes the consistent use
of markup for the three basic patterns in structural markup discussed in
previous sections.
3.5.2 DocBook
DocBook
provides both the named level and recursive models for representing the
document hierarchy. Ideally an organization would disable one of the competing
hierarchical models and only provide one model, depending on the needs of the
organization. If this is not done the same underlying content structure may be
represented quite differently in different documents and even within the same
document.
The DocBook schema allows a paragraph to
have a heading using the formalpara element.
This could be used instead of a section for
leaf nodes in the narrative. It seems highly likely this will occur
inconsistently and that, frequently, such differences will be of no useful
semantic significance.
There are three different
paragraph elements provided in DocBook, para,
formalpara and simpara. The section
element can contain most of the block style elements that are found inside the
para and formalpara elements. This allows authors to create these
structures in different ways by mixing grammatical and loose paragraph models.
It would not be unexpected if different authors used the loose paragraph model
or the grammatical paragraph model or even a combination of both models in the
same document.
The markup of structures at the list
level could also be inconsistent. Both the itemizedlist and the orderedlist allow a large number of block level elements
to occur before the first list item, including para. This allows authors to place the introductory text
and other objects within the list element as well as outside the list
element.
Based on this partial analysis, it is
clear that the markup of basic hierarchical structures using DocBook may be
quite inconsistent between authors or between documents created by the same
author. It might be possible that in some document types these differences
could be meaningful. However, in common technical, legal and business
documents, this is extremely unlikely. It will not be possible for processing
applications or users to determine if these differences represent a subtle, yet
real difference in structure or are merely arbitrary variations that, somehow,
must be resolved into a common layout by processing
applications.
3.5.3 DITA
DITA only provides
the topic element for marking up the main
hierarchical structures. The markup of those structures using DITA should be
more consistent than DocBook.
The markup of
paragraph and list content in DITA is likely to be very inconsistent. The body
of DITA places very little constraint on how particular content is marked up
and is very similar to XHTML 2.0. This would result in a mixture of grammatical
and loose paragraphs, leading to poor layout consistency in rendered output,
particularly in print where greater precision is usually
required.
3.5.4 XHTML 2.0
The markup of hierarchical structures using
XHTML may be very inconsistent because authors can use both the recursive
section model and the named level model and mix these in a single
document.
The markup of paragraph and list content
may also be very inconsistent. The body of XHTML is very loose and places very
little constraint on how particular content is marked up. This would result in
a mixture of grammatical paragraphs, loose paragraphs and text data inside the
section element.
3.5.5 BNML
Using BNML, the
item element is the only element provided for
hierarchical markup. It is possible that the inclusion element could be abused in some situations
because it too is a recursive element. Its layout properties will usually be so
distinct that such abuse will be very limited.
The
markup of paragraph and list content ought to be highly consistent because
there is only one way to mark up this
content.
4 Component numbering and cross references
4.1 Introduction
As
discussed in section 3.2, most legal and many narrative
business documents have component numbering. Component numbering is typically
used for complex documents such as legal and technical documents that are to be
printed. Component numbering, except for lists, is less common for web based
documents. There are three related issues that will be addressed in this
section: (a) generation and storage of
component numbers; (b) controlling automatic component numbering;
and (c) managing document
cross references to numbered components.
This section will compare the basic approaches used in each
schema.
4.2 Generation and storage of component numbers
4.2.1 Concepts
There are two
common approaches to component numbering within XML documents: (a) the component numbers are written into
designated markup; or (b) the component numbers are not explicitly stored but are calculated
and implied when rendering the XML document.
This is an important issue for processing applications. When
component numbers are implied by applications, numbering functionality may need
to be reproduced in multiple applications. It would not be unusual for the
numbering functionality to be required in three applications, including the XML
editor display, the print output rendering and the web output rendering. If
these applications cannot share the numbering code, developers may need to
maintain separate numbering applications, leading to redundant development and
maintenance plus inconsistency between applications. This problem increases
when documents are exchanged with other organizations.
These problems are avoided when component numbers are written
directly into the XML markup. Application developers should only need to create
the numbering behaviour once.
The explicit
numbering approach also facilitates cross references within documents, as
described in section 4.4.
The
explicit numbering method may be preferable when component numbering is
significant to the application and numbers must be accurately reproduced from
rendition to rendition.
One of the problems with
component numbering is that there is no standard way to represent numbering
styles for the outline or lists. Similarly, off the shelf tools to generate
component numbers are not widespread.
This section
looks at the approaches used by DocBook and BNML only. DITA and XHTML do not
support explicit component numbering.
4.2.2 DocBook
The default
position is that component numbers should be implied by the processing system
and not included in the markup. Numbered elements such as section have a label
attribute. This is used to store a number when processing systems are incapable
of generating the number for a section. If the
label attribute is specified, the content of
the attribute must be used as the section
number.
List item (listitem) elements do not have a label
attribute, so they cannot be numbered explicitly in the same way as the section
element.
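The rule that a specified label must be used as the section number can be sketched as follows. This is a minimal illustration in Python, not part of DocBook or its stylesheets. It assumes that unlabelled sections are numbered by position and that a labelled section still advances the counter; a real processing system may apply different rules. The function name section_numbers and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def section_numbers(parent, prefix=""):
    """Yield (number, title) pairs, preferring an explicit label attribute."""
    count = 0
    for child in parent.findall("section"):
        count += 1
        # Assumption: a label, when present, replaces the implied number.
        number = child.get("label") or f"{prefix}{count}"
        yield number, child.findtext("title")
        yield from section_numbers(child, prefix=number + ".")

# Hypothetical sample document.
SAMPLE = """<article>
  <section><title>Introduction</title></section>
  <section label="A"><title>Definitions</title>
    <section><title>General terms</title></section>
  </section>
</article>"""

numbered = list(section_numbers(ET.fromstring(SAMPLE)))
for number, title in numbered:
    print(number, title)
```

Every application that renders the document would need to reproduce this logic, which is the duplication problem described in section 4.2.1.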
4.2.3 BNML
In BNML, numbers
may be stored in the child num element of all
numbered elements. This approach is convenient with an editing application that
can generate automatic numbers and write them into the markup during the
authoring process. If this is not available, an application can imply numbers
with BNML just as with any of the other schema.
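As a rough sketch of the explicit approach, the following Python fragment calculates outline numbers once and writes them into a num child of each item, so that later applications can read the stored value. The item and num element names come from BNML as described above; the document root, the title element, the decimal numbering style and the function name write_numbers are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

def write_numbers(parent, prefix=""):
    """Store an outline number in a <num> child of every <item> under parent."""
    count = 0
    for child in parent:
        if child.tag != "item":
            continue
        count += 1
        number = f"{prefix}{count}"
        num = child.find("num")
        if num is None:
            # Create the num element as the first child if it is missing.
            num = ET.Element("num")
            child.insert(0, num)
        num.text = number
        write_numbers(child, prefix=number + ".")

# Hypothetical sample; real BNML documents have more structure than this.
SAMPLE = """<document>
  <item><title>Scope</title>
    <item><title>Purpose</title></item>
    <item><title>Audience</title></item>
  </item>
  <item><title>Terms</title></item>
</document>"""

doc = ET.fromstring(SAMPLE)
write_numbers(doc)
numbers = [num.text for num in doc.iter("num")]
print(numbers)
```

Running the function again after items are added or moved overwrites the stored values, which is the behaviour an editing application would need when it renumbers during authoring.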
4.3 Controlling automatic component numbering
4.3.1 Concepts
Component
numbering is normally applied to the document hierarchy (outline), to lists, or
to other structures within the document such as tables and
figures.
Control of component numbering is important
where documents use complex component numbering or require different parts of
the document to be numbered in different ways. In these types of documents, the
author may require explicit control over the numbering to activate different
numbering styles or to restart numbering in certain contexts.
Control of component numbering may also be important in content
reuse applications where existing numbers may be preserved (e.g. quoted
content) or regenerated, depending on the reuse
requirements.
4.3.2 DocBook
DocBook
provides several elements that can be numbered. These include
section, appendix, table and
figure.
DocBook: The definitive
guide
states that a section "may" be numbered and that an
appendix is typically lettered. No instruction
is given for table or figure.
DocBook provides no
explicit way to control whether these elements are automatically or manually
numbered or how they are numbered by a processing application. It does provide
the ability to specify a number for a particular element using the
label attribute. Without additional
information, the author will not know if this will be used or overwritten when
the document is rendered. Numbering schemes and the coordination between
authors and application developers must be handled through business rules for
particular document types and element contexts.
In
DocBook, the approach taken to section numbering differs from that taken to list
numbering. Section number values can be stored, whereas the listitem element makes no such provision.
4.3.3 DITA
The
DITA specification makes no mention of topic
numbering. Topic numbering could be implied
from topic depth if numbering is required, but
it is not part of standard DITA. There is no explicit storage mechanism for
numbering. In rigidly topic based content, numbering can properly be treated as
a presentation issue and managed during content assembly or rendering
processes.
4.3.4 XHTML 2.0
The XHTML 2.0 specification makes no mention of
section numbering. Section numbering could be implied by processing
applications from section depth if numbering is
required, but it is not part of standard XHTML.
4.3.5 BNML
The BNML schema
deals with component numbering in a way that is closest to the DocBook section
element. However, BNML deals with all numbered elements in the same way, which
enables authors to explicitly apply particular numbering schemes and to control
whether numbers should be inserted manually or generated by an automatic
process.
BNML provides three elements which can be
numbered: item, inclusion and adjunct
(the equivalent of DocBook appendix). BNML also
provides a number of ways to control numbering within the document. The main
control is on the root level element of each of the BNML document types. These
elements have two attributes that control numbering:
(a) number-outline – this attribute allows the author to control the style of numbering that is used to number the document outline. Commonly, this may be a named numbering style, manual numbering or no numbering; and
(b) number-disable – this attribute allows the author to disable all automatic numbering within the document, including list numbering.
BNML
treats outline structure numbering and list numbering in the same way but
allows the author to separately control the application of manual or automatic
numbering to these structures.
The
adjunct element also has the
number-outline and number-disable attributes to control the numbering of
content within the adjunct independently from
the rest of the document. In addition to these attributes, the
adjunct has attributes to control the numbering
of the adjunct itself and an attribute to
specify a number restart index for the adjunct.
The
inclusion element can be used to number objects
such as tables, figures and examples, according to the value of the
class attribute on the inclusion element.
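One way an application might number inclusions according to their class is to keep a separate counter for each class value, as in the Python sketch below. The class values "table" and "figure", the fallback name "unclassified" and the function name number_inclusions are hypothetical; BNML leaves class enumerations to each customization.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def number_inclusions(root):
    """Return (class, sequence number) for each inclusion in document order."""
    counters = Counter()
    result = []
    for inclusion in root.iter("inclusion"):
        kind = inclusion.get("class", "unclassified")
        counters[kind] += 1  # one independent sequence per class value
        result.append((kind, counters[kind]))
    return result

# Hypothetical sample document.
SAMPLE = """<document>
  <item><inclusion class="table"/></item>
  <item><inclusion class="figure"/><inclusion class="table"/></item>
</document>"""

sequence = number_inclusions(ET.fromstring(SAMPLE))
print(sequence)
```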
The
number of an item element is controlled by its
parent element or another ancestor element, including the root element. These
are regarded as shield elements. An application may restart numbering sequences
from 1 in each shield element.
4.4 Component cross references
4.4.1 Concepts
Many narrative business documents include cross
references to other components of the document, similar to "See section 4.4" or
"in clause 3.1.1 (a)". Basic features required of applications in support of
cross reference functionality in narrative business documents include:
• The reference should be able to include multiple components of a target number ("clause 3.1.1 (a)").
• The reference should permit the reference information, such as the target number or title, to be displayed in line for author convenience.
• It should be possible for an application to update the content of the reference when the target information changes as it is re-numbered or re-located within the document, so the information is accurate for the author and when the document is rendered.
• It should not be necessary for multiple rendering applications to have to calculate cross references.
As explained in
connection with automatic numbering, the way in which component numbers are
stored will affect the way in which cross reference functionality is enabled.
If the editor calculates component numbers and writes them into the markup,
they will be available to a cross reference utility. However, if the numbers
are not written into the text, it will be necessary for a separate application
to calculate them for display purposes.
4.4.2 DocBook
DocBook
provides the xref element to enable cross
reference functionality. The cross reference text can be generated in three
different ways depending on how the xref
attributes are used:
(a) the author can specify the identifier of a target element whose content is taken for the cross reference text;
(b) the author can specify the identifier of a target element with an XRefLabel attribute which provides the cross reference information; and
(c) the author can specify a style on the xref element for a custom cross reference processing approach.
In DocBook, the
xref element is an empty element. It does not
provide any facility to store the calculated value of the cross reference. The
cross reference text must be generated by each application that renders the
output.
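The following Python sketch shows the kind of lookup that each rendering application must perform for an empty xref. It uses the attribute names linkend and xreflabel as they are written in DocBook markup. The fallback to the target title is a simplification of approach (a), the placeholder text for a missing target is an assumption, and the function name xref_text is hypothetical. Published DocBook stylesheets generate richer text than this.

```python
import xml.etree.ElementTree as ET

def xref_text(root, xref):
    """Work out display text for an empty xref at rendering time."""
    targets = {el.get("id"): el for el in root.iter() if el.get("id")}
    target = targets.get(xref.get("linkend"))
    if target is None:
        return "[unresolved reference]"  # assumed placeholder
    # Prefer the stored label; otherwise fall back to the target title.
    return target.get("xreflabel") or target.findtext("title") or ""

# Hypothetical sample document.
SAMPLE = """<article>
  <section id="defs" xreflabel="section 2"><title>Definitions</title></section>
  <section id="scope"><title>Scope</title>
    <para>See <xref linkend="defs"/> and <xref linkend="scope"/>.</para>
  </section>
</article>"""

root = ET.fromstring(SAMPLE)
texts = [xref_text(root, xref) for xref in root.iter("xref")]
print(texts)
```

Because the result is not stored in the markup, the editor display, the print rendering and the web rendering would each need equivalent code.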
If the target of the cross references uses
implied numbering (see section 4.2), the author or the XML editor
application must calculate the target number and write it into the
XRefLabel attribute on the target element, if
the cross reference is to be displayed in the text. This appears to be a
cumbersome requirement that would not be easily implemented for dynamic display
in an application.
DocBook permits compound
references through the ulink element which may
contain multiple xref elements.
4.4.3 DITA
DITA provides the xref element to
enable cross reference functionality. The specification states that if the
xref element is empty, the application "may"
generate cross reference text from the destination object. Once the element is
populated with data, it appears it cannot later be updated. No other mechanisms
are provided to control this functionality.
The
model provided by DITA does not appear to be suitable for the kinds of cross
references found in narrative business documents.
4.4.4 XHTML 2.0
In XHTML
2.0, any element may reference another element. However, there is no defined
mechanism to generate cross reference information from target elements for
display as internal cross references. XHTML 2.0 does not appear to contemplate
this requirement.
4.4.5 BNML
BNML provides the
reference and autovalue elements to enable this functionality. The
cross reference text can only be generated one way. The autovalue href attribute
contains the id of the target element. The autovalue display
attribute contains an XPath statement from the target element to the location
of text to be used as the cross reference text.
For example, to provide the functionality described in section 4.4.2 (b), the autovalue would be:
<autovalue href="somesectionid" display="@XRefLabel"></autovalue>
The autovalue element can contain #PCDATA. The intention is that a
processing application populates the content of the element with the cross
reference display value so that rendering applications do not have to reproduce
the cross reference behaviour.
A reference may
contain multiple autovalue elements and text to provide for compound references
with citation wording. The autovalue can access any information stored in a
target attribute value or element.
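A processing application that populates autovalue elements might work along the following lines. This Python sketch is illustrative only. It assumes that href holds the bare id of the target and it supports just two forms of display path: an attribute name prefixed with "@", or a simple child path understood by the ElementTree library. Full XPath support would need a different library. The function name populate_autovalues and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def populate_autovalues(root):
    """Write the display value of each target into its autovalue element."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    for auto in root.iter("autovalue"):
        target = by_id.get(auto.get("href"))
        path = auto.get("display")
        if target is None or not path:
            continue  # leave any existing content untouched
        if path.startswith("@"):
            value = target.get(path[1:])    # attribute on the target
        else:
            value = target.findtext(path)   # simple child element path
        if value is not None:
            auto.text = value

# Hypothetical sample document with a compound reference.
SAMPLE = """<document>
  <item id="c3"><num>3</num><title>Payment</title>
    <item id="c3a"><num>(a)</num><text>Invoices are due in 30 days.</text></item>
  </item>
  <item id="c4"><num>4</num>
    <text>See <reference>clause <autovalue href="c3" display="num"/> <autovalue href="c3a" display="num"/></reference>.</text>
  </item>
</document>"""

doc = ET.fromstring(SAMPLE)
populate_autovalues(doc)
citation = "".join(doc.find(".//reference").itertext())
print(citation)
```

Once the values are stored, a rendering application only needs to output the text content of the reference element.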
5 Other markup issues
5.1 Specialized container elements
5.1.1 Concepts
All the schema
provide other container elements in addition to the basic structural markup
elements discussed in section 3.
Some
schema come from a particular domain and may have a significant number of
specialized elements to cater to that domain. When used in their original
domain they may be an ideal choice for an application. When used in other
contexts, domain specific elements need to be removed from the schema so the
author is not presented with unnecessary element choices.
5.1.2 DocBook
DocBook provides a very large number of specialized container
elements. The section element has 59
specialized containers and the para has 139.
Many of the elements found in section are
repeated within para. Because DocBook was
originally designed for computer manuals, there is a wide variety of elements
specifically intended for computer hardware and software
documentation.
5.1.3 DITA
DITA provides a
smaller number of specialized container elements. The body element has 22 specialized container elements and
the p has 54. Once again, many of the elements
found in body are repeated within
p. Like DocBook, DITA has also come from the
technical documentation domain and has a significant number of specialized
container elements that are specific to that domain.
5.1.4 XHTML 2.0
XHTML
provides a small number of specialized container elements. The
section element has 32 specialized containers
and the p has 20 (not including XForms
controls). XHTML does not have the technical orientation of DocBook or DITA.
Because XHTML has its basis in HTML, specialized container elements are
normally applied to achieve particular formatting rather than to define a
generic structure.
5.1.5 BNML
The BNML schema
provides a very small number of specialized container elements. The
item element has one specialized container
element (apart from block and the recursive
item), while block
and text have a combined 21 elements. Like
XHTML, BNML does not have the technical orientation of DocBook or DITA. This
reduces the number of options significantly.
In
BNML, the inclusion element is a generic
container element for content that is distinct from the normal narrative. This
element was designed to replace a variety of the specialized container elements
that are found in the other schema. Specialization of the inclusion is achieved using a class attribute which
allows different styles to be applied when rendering the
element.
The BNML Schema also provides the
adjunct element (similar to DocBook
appendix) that is used for schedules,
attachments and annexures to documents. The adjunct may include a complete document or
item and block
content.
5.2 Metadata
5.2.1 Concepts
The basic
markup approach taken by each schema may be described as a generic structural
markup. Such markup provides very limited semantic information about the
content of a document. It may define clauses in a contract but it says nothing
about the function of particular clauses. This definition of structural
components serves common publishing needs. It also provides a skeleton that can
be decorated with additional semantic information using metadata where required
by an application. Such metadata may be required at the document or component
level.
Metadata can be used by application
developers for a variety of purposes such as searching, producing document
cover pages, document management and version identification.
Metadata is also highly organizational and application specific.
Some organizations may need to meet government or industry regulations for
metadata. It is rare for a schema to provide all necessary metadata
elements.
There are two common approaches to
specification of metadata:
(a) use specific named elements to store metadata; or
(b) use generic elements to store metadata.
Some organizations may
wish to use a particular metadata language such as RDF.
Specific named metadata elements allow the schema designer to
better control the type of metadata that is captured for a particular document.
This can be done by making certain metadata elements required and validating
the content of metadata when using XML schema or similar schema
languages.
Generic metadata allows the capture of
arbitrary metadata and allows easy extension of metadata. However the schema
designer cannot enforce its capture or validate it as easily as with named
metadata.
Due to the wide range of possible
approaches and application requirements, a schema should not be prescriptive
about the way metadata is managed.
This section
looks at the approaches used by the different schema to capture
metadata.
5.2.2 DocBook
DocBook
provides the articleinfo element for document
level metadata and the sectioninfo for section
level metadata. Both these elements contain a large number (64) of specific
named metadata elements. This provides a "bucket" of options that can be used
by organizations for metadata. There is no provision for generic
metadata.
The schema must be customized to add
organization-specific metadata.
5.2.3 DITA
DITA provides the
prolog element to contain topic metadata. This element contains a small number of
specific named elements: author, source, publisher, copyright, critdates, permissions, metadata
and resourceid.
It also provides the
metadata element, which contains: audience, category, keywords,
prodinfo and othermeta.
Additional
generic metadata is provided by the othermeta
element. This approach only allows the creation of name-value pairs. Record
based metadata structures cannot be created.
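The flat nature of this generic metadata can be seen when it is read by an application. The Python sketch below collects othermeta entries into a dictionary using the element's name and content attributes. The property names, the values and the function name read_othermeta are hypothetical.

```python
import xml.etree.ElementTree as ET

def read_othermeta(prolog):
    """Collect othermeta entries as a flat mapping of name to content."""
    pairs = {}
    for meta in prolog.iter("othermeta"):
        pairs[meta.get("name")] = meta.get("content")
    return pairs

# Hypothetical sample; the property names are invented for illustration.
SAMPLE = """<prolog>
  <author>A. Writer</author>
  <metadata>
    <othermeta name="matter-number" content="2005-0417"/>
    <othermeta name="review-status" content="draft"/>
  </metadata>
</prolog>"""

properties = read_othermeta(ET.fromstring(SAMPLE))
print(properties)
```

Each entry is a single string value, so a structured record, such as a reviewer with a name, a date and a decision, has no direct representation.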
5.2.4 XHTML 2.0
XHTML 2.0
provides the generic meta element for
representing all metadata. The meta element has a property attribute which contains the name of the
metadata property and the content of the meta
element is the value of the property. The content of meta can contain inline elements allowing metadata to be
formatted. However, as with DITA, this approach only allows the creation of
name-value pairs. Record based metadata structures cannot be
created.
5.2.5 BNML
BNML provides a
metadata element at the document level and for
the item, inclusion and adjunct
elements. BNML also provides a subset of standard Dublin core metadata elements
that can be used within the metadata element.
However, it is expected that users of the BNML schema will define their own
application specific metadata. The BNML Standard schema permits any content
within metadata for interchange
purposes.
6 Conditional output
6.1 Concepts
Conditional
output is often required for single source publishing. Output that may be
appropriate for print publishing may not be appropriate for web publishing.
Another common use is in international publishing, to cater for
variations in spelling or phrasing.
This
section looks at the features provided by each schema for conditional
output.
6.2 DocBook
DocBook provides
the condition attribute to
enable conditional processing and output. It is one of the common attributes,
so it occurs on all elements. The phrase
element can be used for inline conditional text. However, the semantics of
conditional processing are left to processing
applications.
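Because the semantics are left to the application, each implementation must define its own matching rule. The Python sketch below adopts one simple rule as an assumption: an element is kept if it has no condition attribute or if its condition equals the single active value. The condition values "print" and "web" and the function name apply_conditions are hypothetical.

```python
import xml.etree.ElementTree as ET

def apply_conditions(parent, active):
    """Remove descendants whose condition attribute does not equal active."""
    previous = None
    for child in list(parent):
        condition = child.get("condition")
        if condition is not None and condition != active:
            # Preserve the text that follows the removed element.
            if child.tail:
                if previous is None:
                    parent.text = (parent.text or "") + child.tail
                else:
                    previous.tail = (previous.tail or "") + child.tail
            parent.remove(child)
        else:
            apply_conditions(child, active)
            previous = child

# Hypothetical sample paragraph with inline conditional text.
SAMPLE = ('<para>Submit the form <phrase condition="print">by post</phrase>'
          '<phrase condition="web">online</phrase> before the deadline.</para>')

para = ET.fromstring(SAMPLE)
apply_conditions(para, active="web")
web_text = "".join(para.itertext())
print(web_text)
```

A production system would also need rules for multiple condition values and for combining conditions, which is where implementations are likely to diverge.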
6.3 DITA
DITA provides the
audience attribute on all elements to enable
conditional processing and output. The ph
(phrase) element is used for inline conditional text. Like DocBook, the
semantics of conditional processing are left to processing
applications.
6.4 XHTML 2.0
XHTML makes no explicit provision for
conditional output.
6.5 BNML
Conditional output
is an option that can be activated within BNML. When activated, BNML provides
the condition attribute on the
item and block elements.
The conditional element is used for inline
conditional text. Like DocBook, the rules for conditional processing are left
to processing applications.
7 Content re-use
7.1 Concepts
XML is ideally
suited to applications that require content re-use. If content re-use is a
requirement for a particular application, provision for re-use must be made in
the schema.
At one level, content re-use only
requires a mechanism to identify to a processing application an external
component that must be incorporated into a particular place in the document. In
practice, other aspects of the schema design are important with content such as
narrative business documents. The schema must define content components that
can be processed as discrete units of information. It has already been noted
that in narrative business documents, content components may be re-used at
different levels in the document hierarchy, depending on the overall context.
Thus the ability to insert components at arbitrary levels may be important.
Such components need to be marked up in a predictable and consistent way for
this to work effectively.
This section looks at the
specific features provided by each schema to facilitate content
reuse.
7.2 DocBook
DocBook makes no
explicit provision for content re-use. Content re-use must be added as a
customization.
7.3 DITA
DITA provides two
approaches to content re-use. Firstly, virtual documents can be defined using
map with topicref elements that point to topic elements.
The second
approach is that all DITA elements provide a conref attribute for content re-use applications. This
attribute can contain the id of a target element. The content of the target
element is rendered instead of the source element. The only qualifier for this
functionality is that the source and target elements must be of the same element type.
For example, only a topic element can point to
a topic.
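The conref behaviour described above can be sketched in Python as follows. For simplicity the sketch assumes that conref holds a bare id within the same document; the addressing syntax used in real DITA documents is more elaborate. The function name resolve_conrefs and the sample topic are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def resolve_conrefs(root):
    """Replace the content of each conref element with its target's content."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    for el in list(root.iter()):
        target = by_id.get(el.get("conref"))
        if target is None:
            continue
        # Source and target must be the same element type.
        if target.tag != el.tag:
            raise ValueError(f"conref from <{el.tag}> to <{target.tag}>")
        for child in list(el):
            el.remove(child)
        el.text = target.text
        el.extend([copy.deepcopy(child) for child in target])
        del el.attrib["conref"]

# Hypothetical sample topic.
SAMPLE = """<topic id="t1">
  <title>Install</title>
  <body>
    <p id="warn">Back up your data first.</p>
    <p conref="warn"/>
  </body>
</topic>"""

topic = ET.fromstring(SAMPLE)
resolve_conrefs(topic)
paragraphs = [p.text for p in topic.findall("./body/p")]
print(paragraphs)
```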
7.4 XHTML 2.0
All XHTML
elements provide a src attribute for content
re-use applications. This attribute can contain a URI. The resource at that URI
is rendered instead of the source element. The srctype attribute indicates the MIME type of the resource. This is the
same method that is used for including graphics and scripts in XHTML
documents.
7.5 BNML
BNML provides an
option to use XInclude (http://www.w3.org/TR/xinclude/)
for content re-use. By default, content re-use is only provided for
item and for block in contexts where the loose structure model is
used. Customization is required to allow other elements to be
re-used.
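XInclude processing is available in common XML libraries. The Python sketch below uses the standard library ElementInclude module with a custom loader that reads components from an in-memory dictionary rather than from files. The component path, the sample content and the loader name load_component are hypothetical, and the item structure shown is only an approximation of BNML.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical component library keyed by href.
LIBRARY = {
    "clauses/confidentiality.xml":
        "<item><title>Confidentiality</title>"
        "<block><text>Each party must keep the terms confidential.</text></block></item>",
}

def load_component(href, parse, encoding=None):
    """Loader used by ElementInclude in place of reading from disk."""
    if parse != "xml":
        raise ValueError("only XML inclusion is sketched here")
    return ET.fromstring(LIBRARY[href])

SAMPLE = """<document xmlns:xi="http://www.w3.org/2001/XInclude">
  <item><title>Term</title></item>
  <xi:include href="clauses/confidentiality.xml"/>
</document>"""

doc = ET.fromstring(SAMPLE)
ElementInclude.include(doc, loader=load_component)
titles = [title.text for title in doc.iter("title")]
print(titles)
```

After processing, the include element has been replaced by the referenced item, so downstream applications see an ordinary document.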
8 Customization and specialization
8.1 Concepts
Off-the-shelf
schema will normally require some level of customization before they are used
for an application. It is important to select a schema that is easily
customized. Customization usually takes one of the following approaches:
(a) The schema functionality is built up from a core set of elements by adding elements, attributes and attribute value enumerations until the desired level of functionality is reached; or
(b) The schema functionality is reduced from a large pool of elements by removing elements and attributes that are not applicable to the application. Elements, attributes and attribute value enumerations may be added to cover functionality not provided by the base element pool.
Any schema
can be modified to overcome perceived limitations. It is hoped this document
will provide information that will enable persons evaluating schema to work out
which of the four schema is the closest fit to their needs. When deciding
whether to modify an existing schema, there are two important considerations.
Firstly, will the result be confusing to people familiar with the original
schema? They may think they understand it but find it is now something
different. Secondly, will the changes retain the benefit of existing tool
support? If the changes invalidate available processing tools, little advantage
may be obtained.
This section explores the
customization approaches used by each schema.
8.2 DocBook
DocBook is a
very large schema with a large number of elements and loose content models. It
is likely that most authors of narrative business documents would have
difficulty using standard DocBook without extensive customization.
Customization and specialization are first achieved by reducing the large pool
of elements.
DocBook uses the traditional DTD
method of customization and specialization which builds up the DTD using
parameter entities. Parameter entities can be used to alter element content
models and attributes. It has been layered in a way that minimises the impact
of new versions on a customization.
There are three
classes of customization:
(a) Subset - the customization is a strict subset of DocBook;
(b) Extension - the customization is not a strict subset of DocBook (elements, attributes, attribute values have been added); and
(c) Variant - this is used when an organization doesn't want to use Subset or Extension.
When a customization is
made the schema can no longer be called DocBook.
Specialization of DocBook requires a detailed understanding of:
(a) the structure and layout of the DocBook DTD;
(b) the DocBook elements; and
(c) the way the parameter entities are structured and used.
8.3 DITA
DITA has a smaller
pool of elements and is generally customized by building up the base
functionality. DITA allows for a form of specialization and customization that
is different from DocBook. The topic element
and other elements model a particular pattern in a document. These patterns can
be reused by creating new elements from a source element. A new element must
have the same pattern as the source DITA element (a content model that is the
same or a more restrictive content model than the source). The model cannot be
"loosened". For example, if a specialization was created from a
topic, the specialization must have a required
heading.
There are two types of specialization in
DITA:
(a) topic specialization, in which new topic types are created. This is achieved by renaming the topic element and may include renaming child elements to suit the new topic type. The topic ancestry is recorded in attributes to allow processing applications to fall back on generic processing models.
(b) domain specialization, in which a new language or vocabulary is added for a particular domain (e.g. programming, user interface or hardware). New elements are created that can occur within p, body, section, etc. These are made available to all topics and specialized topics within the DTD.
Specialization of DITA requires
an understanding of the two types of specialization and how the specialization
ancestry is represented in the DTD.
The effect of
the DITA approach is to provide access to default processing rules for topic
content, thereby minimising the amount of application development when new
specializations are created.
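The fallback to default processing can be sketched as a lookup over the recorded ancestry. In DITA the ancestry is carried in the class attribute as a list of module/element tokens ordered from the most generic to the most specialized. The Python sketch below picks the handler for the most specialized type that has one. The handler names, the contractclause specialization and the function name select_handler are hypothetical.

```python
def select_handler(class_value, handlers):
    """Return the handler for the most specialized type that has one."""
    # Keep only "module/element" tokens, ordered from generic to specialized.
    ancestry = [token for token in class_value.split() if "/" in token]
    for token in reversed(ancestry):
        if token in handlers:
            return handlers[token]
    raise LookupError(f"no handler for {class_value!r}")

# Hypothetical handlers; a real application would map to templates or code.
HANDLERS = {
    "topic/topic": "generic topic template",
    "task/task": "task template",
}

known = select_handler("- topic/topic task/task ", HANDLERS)
fallback = select_handler("- topic/topic contractclause/contractclause ", HANDLERS)
print(known)
print(fallback)
```

A new specialization therefore renders with the generic topic handler until a developer chooses to supply something more specific.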
8.4 XHTML 2.0
XHTML 2.0 does
not explicitly support customization. Users can customize the schema for
specific uses but it is no longer XHTML.
8.5 BNML
BNML has a smaller
pool of elements than DITA and is customized by building up the base
functionality. The core elements provided by BNML model a small number of
common patterns that can occur throughout narrative business documents. These
patterns can be used to create most document content required by authors of
narrative business documents. This approach is intended to permit re-use of
processing applications based on the base patterns. In this respect it has a
similar aim to DITA but lacks the flexibility of DITA to create new element
names based on those patterns.
It is not expected
that BNML (BNML-standard) can be used without customization. The schema
provides several levels of customization. The minimum customization an
application developer must undertake is to define organization-specific
metadata and class attribute value enumerations for a number of generic element
types.
The application developer can then decide
whether to create a subset or a variant. When creating a subset the developer
is only allowed to add metadata, add attribute value enumerations to existing
attributes and remove optional elements or attributes. This allows BNML
documents to be exchanged with other organizations using the BNML Standard
schema.
If the application developer wants to make
further changes, such as to add new elements or document types, a variant of
BNML must be created. The only restriction on a variant is that the core
patterns for item and block must be maintained. If these are changed, the
application cannot be part of the BNML family of schema.
9 Application development effort
9.1 Concepts
Application development effort is an assessment of the time and
effort required to develop applications using the schema. Other things being
equal, schema that require less effort are likely to find a bigger market than
those that require a lot of development effort.
An
application developer should be able to find developed applications and support
within a community of schema users. Established schema are likely to have built
up a significant resource base over time. The value of this resource will
depend on the applicability of the schema to the enterprise needs. The
availability of already developed applications is not necessarily related to
the design of the schema and is not further considered in this
comparison.
9.2 DocBook
The design of
DocBook impacts on application development effort in two ways. Firstly, it
provides a very large number of elements for which it is difficult to define
precise usage. Secondly, the content models are loose, allowing a very large
number of element combinations. The effect is that developers may create
applications on the assumption of particular usage of particular elements, only
to find unexpected results caused by inconsistent usage or the occurrence of
unhandled contexts. In order to develop an application that will take account
of all possibilities provided by the schema, a massive development effort is
required. This makes DocBook applications brittle and unreliable. Except where
appropriate developed applications are already available from the DocBook
developer community, DocBook applications are very expensive to
develop.
9.3 DITA
DITA has fewer
elements and at the document structure level, it provides a slightly stricter
model than DocBook. However, the content of a topic uses very loose content
models that introduce application development complexity similar to DocBook in
some areas.
9.4 XHTML 2.0
XHTML 2.0 has fewer elements than DocBook but
the extreme looseness of the content models will make XHTML applications
unreliable and expensive to maintain, particularly if high quality rendering is
required.
9.5 BNML
BNML uses a very
small set of elements and tight content models compared to the other schema. In
this respect, BNML ought to permit lower cost and more reliable application
development.
9.6 Conclusions
All the
schema operate as recursive, generic structural schema. All such schema impose
demands on application developers to process information based on its
hierarchical context rather than its explicit naming.
DocBook and XHTML 2.0 are likely to be the most difficult schema
with which to create low cost, reliable applications due to the large number of
elements (particularly DocBook) or very loose content models. BNML ought to be
the easiest schema with which to create low cost, reliable
applications.
10 User training and support effort
10.1 Concepts
A key feature
of narrative business documents is that they are created by humans and they are
likely to require ongoing revision. It is assumed that the objective is for
authors to create these documents using an XML editor
application.
To date, XML authoring has not caught
on widely outside specialized content creation units. Experience indicates that
many organizations that have tried it have difficulty training and supporting
authors who are accustomed to common word processing software. The change
to XML content authoring involves authors learning new concepts.
If authors are to take up XML authoring, the
two critical objectives for the system are to:
• minimise the initial training effort so that a new system can be introduced with minimum disruption and cost; and
• obtain consistent markup that does not require costly data rectification in later stages of the publishing workflow.
The ease
with which application developers can create an easy to use authoring interface
will substantially depend on the design of the schema. Empirical evidence
suggests that it is easier to train and support authors to use an XML editing
application if:
• the new concepts that authors need to learn are clearly defined and few in number; and
• common authoring processes can be simplified so that authors do not have to choose from a list of elements and find the correct location to insert them while also trying to write the narrative.
Both objectives
can be achieved if the schema avoids asking authors to make semantic
distinctions that cannot be applied consistently and that serve no real purpose
in the application.
10.2 DocBook
As discussed in
earlier sections, DocBook provides a very large number of elements that can be
arranged in a variety of patterns. To use DocBook effectively, authors may need
to understand which of those elements are important to them, which are not and
which of several approaches to paragraph and list markup are applicable to
particular circumstances. It is difficult to reduce this to a few simple
principles for authors. Similarly, it is difficult for an application developer
to package this in a simple form for authors. To be effective, an authoring
application may need to impose a very tightly restricted version of the DocBook
schema.
10.3 DITA
DITA appears to
suffer from many of the same problems as DocBook, particularly at the paragraph
level. The topic structure may suit some kinds
of documents but it is not a natural fit for many business
documents.
10.4 XHTML 2.0
XHTML 2.0 suffers from many of the same problems
as DocBook. While it has fewer elements and the basic principles are simple, it
may be difficult for authors to understand how to mark up content consistently
within a section where almost anything is permissible.
10.5 BNML
The core of
BNML involves just three important elements, item, block and
text. The function of each can be quickly
explained, along with the distinction between clause item structures and list
item structures. Once these concepts are understood, an author can reliably
create the basic patterns in narrative documents.
The limited element options and the strict content models should
assist application developers to provide an interface that lets authors easily
create a hierarchical structure from item and
block elements.
10.6 Conclusions
It is
possible to mark up basic narrative structures using just a few elements from
each of the four schema. DocBook, DITA and XHTML 2.0 each require authors to
make element selections at the paragraph level that are unnecessary and
confusing. Due to the large number of choices offered, those schema do not
define simple patterns that can be easily explained to new authors and
consistently applied by them.