What is it?
The XHTML family is a reformulation
of HTML 4 in XML. It represents the next step in the evolution of the Internet.
It lets you enter the XML world, creating content that is both backward
and future compatible. XHTML 1.0 became an official W3C Recommendation on January
26, 2000, so when you read this you are on the cutting edge of it all.
Benefits and transition path
Old HTML sins to overcome
Old HTML sins (continued)
Backwards compatibility to HTML
Transition pains from HTML to XHTML
Modularization of XHTML
Extension Modules
Document Profiles and Conclusion
Why all this? What is the problem
with good old HTML?
We explored the shortcomings of
HTML a while ago, in a nutshell:
HTML has a fixed set of elements,
so it is not easily extended to different needs
HTML was designed with the PC in
mind, not taking into account the multitude of alternative platforms coming
to the Web, like TVs, mobile phones and digital
tablets.
HTML is defined relatively sloppily,
requiring parsers to be quite forgiving and intelligent in fixing problematic
markup on the fly. This intelligence adds considerable bulk to the space your
favorite browser takes up on your hard disk.
So what is there to gain?
Citing the W3C recommendation on
XHTML (in rearranged order of my humbly perceived importance):
"Document developers and user agent
designers are constantly discovering new ways to express their ideas through
new markup. In XML, it is relatively easy to introduce new elements or
additional element attributes. The XHTML family is designed to accommodate
these extensions through XHTML modules and techniques for developing new
XHTML-conforming modules (described in the forthcoming XHTML Modularization
specification). These modules will permit the combination of existing and
new feature sets when developing content and when designing new user agents.
Alternate ways of accessing the
Internet are constantly being introduced. Some estimates indicate that
by the year 2002, 75% of Internet document viewing will be carried out
on these alternate platforms. The XHTML family is designed with general
user agent interoperability in mind. Through a new user agent and document
profiling mechanism, servers, proxies, and user agents will be able to
perform best effort content transformation. Ultimately, it will be possible
to develop XHTML-conforming content that is usable by any XHTML-conforming
user agent.
XHTML documents conform to the XML
standard, so they are readily viewed, edited, and validated with standard
XML tools.
XHTML documents can still be written
to operate as well or better than they did before in existing HTML 4-conforming
user agents as well as in new, XHTML 1.0 conforming user agents.
XHTML documents can utilize applications
(e.g. scripts and applets) that rely upon either the HTML Document Object
Model or the XML Document Object Model [DOM].
As the XHTML family evolves, documents
conforming to XHTML 1.0 will be more likely to interoperate within and
among various XHTML environments."
How do HTML documents become XHTML?
An XHTML document must:
validate against one of the three
DTDs.
start with the root element <html>.
refer to the XHTML namespace http://www.w3.org/1999/xhtml
in its root element.
contain one of the following DOCTYPE
declarations prior to the root element:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "DTD/xhtml1-frameset.dtd">
Here is an example of a minimal XHTML
document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>My first XHTML page</title>
</head>
<body>
<p>Hello XHTML world!</p>
</body>
</html>
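As a rough sanity check, the minimal document can be fed to an ordinary XML parser. The sketch below uses Python's standard-library ElementTree and only tests two of the requirements listed above: the root element is html, and it sits in the XHTML namespace. It does not validate against the DTD, and the helper name root_is_xhtml_html is made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The minimal document from the text, kept as bytes so that the parser
# reads the encoding from the XML declaration itself.
MINIMAL_DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>My first XHTML page</title>
</head>
<body>
<p>Hello XHTML world!</p>
</body>
</html>
"""


def root_is_xhtml_html(data: bytes) -> bool:
    """Hypothetical helper: is the root <html> in the XHTML namespace?"""
    root = ET.fromstring(data)  # raises ET.ParseError if not well-formed
    # ElementTree spells namespaced names as "{namespace-uri}local-name".
    return root.tag == "{%s}html" % XHTML_NS


print(root_is_xhtml_html(MINIMAL_DOC))
```

A document that passes this check may still be invalid; a validating parser or the W3C validator is needed for the DTD requirement.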
=====================================================
Which old HTML sins to overcome?
The Ten Commandments of XHTML are:
Fix up all documents that are not
well-formed
Well-formedness is a new concept
introduced by XML. Essentially this means that all elements must be properly
nested. Although overlapping is illegal in SGML, it was widely tolerated
in existing browsers.
NOT: <p>this is <em>emphasized</p></em>
BUT: <p>this is <em>emphasized</em></p>
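One way to see the difference is to hand both variants to an XML parser, which refuses anything that is not well-formed. This is a minimal sketch using Python's standard-library ElementTree; the function name is_well_formed is an assumption of this example, not part of any specification.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Overlapping elements: rejected by the parser.
print(is_well_formed("<p>this is <em>emphasized</p></em>"))
# Properly nested elements: accepted.
print(is_well_formed("<p>this is <em>emphasized</em></p>"))
```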
Change all tags and attribute names
to lower case
XHTML documents must use lower case
for all HTML element and attribute names. XML is case-sensitive, so for
instance <li> and <LI> are different tags.
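The effect of case sensitivity shows up as soon as a program looks for an element by name. The following sketch, again with Python's ElementTree as a stand-in for any XML tool, parses the same list once in lower case and once in upper case and searches both for li.

```python
import xml.etree.ElementTree as ET

lower = ET.fromstring("<ul><li>item</li></ul>")
upper = ET.fromstring("<UL><LI>item</LI></UL>")

# Both fragments are well-formed XML, but only the first contains
# an element that is actually named "li".
print(lower.find("li") is not None)
print(upper.find("li") is not None)
```

The upper-case fragment is still well-formed XML; it simply is not XHTML, because no element named LI exists in the XHTML DTDs.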
Add end tags for non-empty elements
In SGML-based HTML 4 certain elements
were permitted to omit the end tag, with the elements that followed implying
closure. This omission is not permitted in XML-based XHTML. All elements
other than those declared in the DTD as EMPTY must have an end tag.
NOT: <p>one<p>two
BUT: <p>one</p><p>two</p>
Quote all attribute values
All attribute values must be quoted,
even those which appear to be numeric.
NOT: <td rowspan=3>
BUT: <td rowspan="3">
Unminimize attributes
XML does not support attribute minimization;
attribute-value pairs must be written in full. Attribute names such as
compact and checked cannot occur in elements without their value being
specified.
NOT: <dl compact>
BUT: <dl compact="compact">
Correctly tag empty elements
Empty elements must either have
an end tag or the start tag must end with />. For instance, <br/> or
<hr></hr>. See below for information on ways to ensure this is backward
compatible with HTML 4 user agents.
NOT: <br><hr>
BUT: <br/><hr/>
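The rules on end tags, quoting, minimization and empty elements are all well-formedness rules, so an XML parser catches violations directly. The sketch below runs NOT/BUT pairs modelled on the examples above through Python's ElementTree. Some snippets are wrapped in a div or given an end tag so that each one forms a single complete element; that wrapping, the pair labels and the helper name parse_error are assumptions of this illustration.

```python
import xml.etree.ElementTree as ET

# (rule, NOT form, BUT form); every snippet has exactly one root element.
PAIRS = [
    ("end tags", "<div><p>one<p>two</div>", "<div><p>one</p><p>two</p></div>"),
    ("quoting", "<td rowspan=3></td>", '<td rowspan="3"></td>'),
    ("minimization", "<dl compact></dl>", '<dl compact="compact"></dl>'),
    ("empty elements", "<div><br><hr></div>", "<div><br/><hr/></div>"),
]


def parse_error(markup):
    """Return the parser's complaint, or None if the markup is well-formed."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as exc:
        return str(exc)


for rule, bad, good in PAIRS:
    print(rule, "| NOT:", parse_error(bad), "| BUT:", parse_error(good))
```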
Pay attention to whitespace handling
in attribute values
User agents
will strip leading and trailing whitespace from attribute values and map
sequences of one or more whitespace characters (including line breaks)
to a single inter-word space (an ASCII space character for western scripts).
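Part of this normalization already happens inside the XML parser. The sketch below checks only the line-break mapping, using Python's standard-library parser as an example. Stripping and collapsing are deliberately not asserted, because a parser that has not read the DTD treats attributes as plain character data and is not required to perform them. The img element and its values are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# The alt value contains a real line break between "one" and "line".
elem = ET.fromstring('<img src="photo.jpg" alt="line one\nline two" />')
value = elem.get("alt")

# The application sees a space where the source had a line break.
print(repr(value))
```

Since the result can differ between tools, the safest course is not to rely on whitespace inside attribute values at all.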
Escape or externalize script and
style elements
In XHTML, the script and style elements
are declared as having parsed character content (#PCDATA). As a result, < and
& will be treated as the start of markup, and entities such as
&lt; and &amp; will be recognized as entity references by
the XML processor and expanded to < and & respectively. Wrapping the content
of the script or style element within a CDATA marked section avoids
the expansion of these entities.
<script>
<![CDATA[
... unescaped script content
...
]]>
</script>
CDATA sections are recognized by the XML processor and appear as nodes in the Document Object Model. An alternative is to use external script and style documents.
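To illustrate why the CDATA wrapper matters, the following sketch parses the same script content with and without it, using Python's ElementTree. The script body is an invented one-liner, and the helper name parses is an assumption of this example.

```python
import xml.etree.ElementTree as ET

WITH_CDATA = """<script type="text/javascript"><![CDATA[
if (a < b && b < c) { ok(); }
]]></script>"""

WITHOUT_CDATA = """<script type="text/javascript">
if (a < b && b < c) { ok(); }
</script>"""


def parses(markup: str) -> bool:
    """True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Inside CDATA, "<" and "&" reach the application untouched.
print(ET.fromstring(WITH_CDATA).text)
# Without CDATA the same characters are read as markup and rejected.
print(parses(WITHOUT_CDATA))
```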
Stick to the existing SGML exclusions
SGML gives the writer of a DTD the
ability to exclude specific elements from being contained within an element.
Such prohibitions (called "exclusions") are not possible in XML. For example,
the HTML 4 Strict DTD does not allow the nesting of an 'a' element within
another 'a' element. It is not possible to express this in XML. Even
though these restrictions cannot be defined in the DTD, certain elements
should not be nested.
Use id for fragment identifiers,
not name
HTML 4 defined the name attribute
for the elements a, applet, form, frame, iframe, img, and map. HTML 4 also
introduced the id attribute. Both of these attributes are designed to be
used as fragment identifiers.
In XML, fragment identifiers are
of type ID, and there can only be a single attribute of type ID per element.
Therefore, in XHTML 1.0 the id attribute is defined to be of type ID. In
order to ensure that XHTML 1.0 documents are well-structured XML documents,
XHTML 1.0 documents must use the id attribute when defining fragment
identifiers, even on elements that historically have also had a name attribute.
In XHTML 1.0, the name attribute of these elements is formally
deprecated, and will be removed in a subsequent version of XHTML.
=====================================================
How to ensure backwards compatibility?
Here are some design guidelines
to follow for XHTML documents to render correctly in existing HTML user
agents.
Properly format empty elements I
Include a space before the
trailing / and > of empty elements, e.g. <br />, <hr /> and
<img src="photo.jpg" alt="Photo" />. Also, use the minimized tag syntax
for empty elements, e.g. <br />, as the alternative syntax <br></br>
allowed by XML gives uncertain results in many existing user agents.
Properly format empty elements II
Given an empty instance of an element
whose content model is not EMPTY (for example, an empty title or paragraph)
do not use the minimized form (e.g. use <p> </p> and not <p />).
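Page generators can apply both empty-element guidelines when they write markup out. The sketch below is one possible approach with Python's ElementTree, whose serializer happens to emit a space before the trailing slash. It handles a single childless element only, and the helper name serialize is hypothetical. The set lists the elements declared EMPTY in the XHTML 1.0 DTDs.

```python
import xml.etree.ElementTree as ET

# Elements whose content model is EMPTY in the XHTML 1.0 DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


def serialize(elem: ET.Element) -> str:
    """Write one childless element: minimized only if declared EMPTY."""
    minimized = elem.tag in EMPTY_ELEMENTS
    return ET.tostring(elem, encoding="unicode",
                       short_empty_elements=minimized)


print(serialize(ET.Element("br")))  # minimized form with a space
print(serialize(ET.Element("p")))   # explicit start and end tag
```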
Eliminate embedded Style Sheets and
Scripts
Use external style sheets and
scripts if they use < or & or ]]> or --. XML parsers are permitted
to silently remove the contents of comments, so the historical practice
of "hiding" scripts and style sheets within comments to make the documents
backward compatible is likely to not work as expected in XML-based implementations.
Avoid Line Breaks within Attribute
Values
Avoid line breaks and multiple whitespace
characters within attribute values. These are handled inconsistently by
user agents.
Use only one isindex element
Don't include more than one isindex
element in the document head. It is deprecated in favor of the input element.
Use the lang and xml:lang
Attributes
Use both the lang and xml:lang attributes
when specifying the language of an element.
Fix up Fragment Identifiers
In XML, URIs [RFC2396] that end
with fragment identifiers of the form "#target" do not refer to elements
with an attribute name="target"; rather, they refer to elements with
an attribute defined to be of type ID, e.g., the id attribute
in HTML 4. Many existing HTML clients don't support the use of ID-type
attributes in this way, so identical values may be supplied for both of
these attributes to ensure maximum forward and backward compatibility (e.g.,
<a id="target" name="target">...</a>).
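The XML view of fragment identifiers can be sketched in a few lines: a "#target" fragment is matched against id only, never against name. The code below uses Python's standard library; the function name resolve_fragment and the sample markup are assumptions made for this illustration, and a real XML processor would rely on the DTD to know which attribute is of type ID.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

DOC = ('<body><a id="target" name="target">here</a>'
       '<a name="legacy">old</a></body>')


def resolve_fragment(root, uri):
    """Return the element whose id equals the URI's fragment, or None."""
    fragment = urldefrag(uri).fragment
    if not fragment:
        return None
    for elem in root.iter():
        # Only id counts; a matching name attribute is ignored.
        if elem.get("id") == fragment:
            return elem
    return None


root = ET.fromstring(DOC)
print(resolve_fragment(root, "page.html#target") is not None)
print(resolve_fragment(root, "page.html#legacy") is not None)
```

The second anchor carries only a name attribute, so it is found by old HTML clients but not by this XML-style lookup, which is why supplying both attributes is the compatible choice.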
Add the XML Character Encoding declaration
To specify a character encoding
in the document, use both the encoding attribute specification on the xml
declaration (e.g. <?xml version="1.0" encoding="EUC-JP"?>) and
a meta http-equiv statement (e.g. <meta http-equiv="Content-type"
content="text/html; charset=EUC-JP" />).
Use Ampersand entities in Attribute
Values
When an attribute value contains
an ampersand, it must be expressed as a character entity reference (e.g.
"&amp;"). For example, when the href attribute of the a element refers
to a CGI script that takes parameters, it must be expressed as
NOT: http://webref.com/cgi-bin/xml/demo1.pl?style=none&name=user
BUT: http://webref.com/cgi-bin/xml/demo1.pl?style=none&amp;name=user
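Templates and page generators should leave this escaping to a library rather than do it by hand. As a small sketch, Python's standard xml.sax.saxutils.quoteattr quotes a value and escapes the ampersand in one step. The surrounding link markup is made up for the example.

```python
from xml.sax.saxutils import quoteattr

url = "http://webref.com/cgi-bin/xml/demo1.pl?style=none&name=user"

# quoteattr adds the surrounding quotes and escapes the ampersand.
link = "<a href=%s>demo</a>" % quoteattr(url)
print(link)
```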
=====================================================
Any transition pains?
Unfortunately, yes. Some of the
subtle differences in HTML and XML encoding cause some difficulties:
Boolean Attributes
Some browsers cannot interpret boolean
attributes when these appear in their full, non-minimized form, as required
by XML 1.0. This problem doesn't affect user agents compliant with
HTML 4, though. The following attributes are involved: compact, nowrap,
ismap, declare, noshade, checked, disabled, readonly, multiple, selected,
noresize, defer.
Document Object Model and XHTML
The Document Object Model level
1 Recommendation defines document object model interfaces for XML and HTML
4. The HTML 4 document object model specifies that HTML element and
attribute names are returned in upper-case. The XML document
object model specifies that element and attribute names are returned in
the case they are specified. In XHTML 1.0, elements and attributes are
specified in lower-case. This apparent difference can be
addressed in two ways:
Applications that access XHTML documents
served as Internet media type text/html via the DOM can use the HTML DOM,
and can rely upon element and attribute names being returned in upper-case
from those interfaces.
Applications that access XHTML documents
served as Internet media types text/xml or application/xml can also use
the XML DOM. Elements and attributes will be returned in lower-case. Also,
some XHTML elements may or may not appear in the object tree because they
are optional in the content model (e.g. the tbody element within table).
This occurs because in HTML 4 some elements were permitted to be minimized
such that their start and end tags are both omitted (an SGML feature).
This is not possible in XML. Rather than require document authors to insert
extraneous elements, XHTML has made the elements optional. Applications
need to adapt to this accordingly.
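Both effects of the XML DOM are easy to observe. The sketch below uses Python's xml.dom.minidom as an example of an XML DOM implementation, not as the behavior of any particular browser. It parses a table written without tbody and inspects the resulting tree; the sample document is invented.

```python
from xml.dom import minidom

DOC = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table><tr><td>cell</td></tr></table>
</body></html>"""

dom = minidom.parseString(DOC)
table = dom.getElementsByTagName("table")[0]

# Names come back in the case in which they were written.
print(table.tagName)
# An upper-case lookup therefore finds nothing.
print(len(dom.getElementsByTagName("TABLE")))
# No tbody is inferred: the tr elements are direct children of table.
print(len(dom.getElementsByTagName("tbody")))
```

Scripts that walk from table to tbody to tr, as many HTML DOM scripts do, need a fallback for this case.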
XML Processing Instructions
Be aware that processing
instructions are rendered on some user agents. However, also
note that when the XML declaration is not included in a document, the document
can only use the default character encodings UTF-8 or UTF-16.
Cascading Style Sheets (CSS) and
XHTML
The Cascading Style Sheets
level 2 Recommendation [CSS2] defines style properties which are
applied to the parse tree of the HTML or XML document. Differences
in parsing will produce different visual or aural results, depending on
the selectors used. The following hints will reduce this effect for documents
which are served without modification as both media types:
CSS style sheets for XHTML should
use lower case element and attribute names. In tables, the tbody
element will be inferred by the parser of an HTML user agent, but not by
the parser of an XML user agent. Therefore you should always explicitly
add a tbody element if it is referred to in a CSS selector.
Within the XHTML namespace, user
agents are expected to recognize the "id" attribute as an
attribute of type ID. Therefore, style sheets should be able
to continue using the shorthand "#" selector syntax even if the
user agent does not read the DTD.
Within the XHTML namespace, user
agents are expected to recognize the "class" attribute. Therefore,
style sheets should be able to continue using the shorthand "." selector
syntax.
CSS defines different conformance
rules for HTML and XML documents; be aware that
the HTML rules apply to XHTML documents delivered as HTML, and
the XML rules apply to XHTML documents delivered as XML.
=====================================================
XHTML Modules
XHTML modules specify well-defined
sets of XHTML elements that can be combined and extended
to deliver content on a greater number and diversity of platforms.
Modularizing XHTML provides a means for product designers to specify which elements are supported by a device using standard building blocks and standard methods for specifying which building blocks are used. It is not economically feasible for content developers to tailor content to each and every permutation of XHTML elements. By specifying a standard, either software processes can autonomously tailor content to a device, or the device can automatically load the software required to process a module.
Modularization also allows for the extension of XHTML's layout and presentation capabilities, using the extensibility of XML, without breaking the XHTML standard. This development path provides a stable, useful, and implementable framework for content developers and publishers to manage the rapid pace of technological change on the Web.
The modules themselves are not yet finalized. Nevertheless, it is useful to get a feeling for the proposed granularity of those modules, and I personally do not expect significant changes in the final recommendation.
Basic Modules
The basic modules are modules that
are required to be present in any XHTML Family Conforming Document Type.
Structure Module
The Structure Module defines the
major structural elements for XHTML. These elements effectively act as
the basis for the content model of many XHTML family document types. The
elements and attributes included in this module are:
html
head
title
body
This module is the basic structural
definition for XHTML content. The html element acts as the root element
for all XHTML Family Document Types.
Basic Text Module
This module defines all of the basic
text container elements, attributes, and their content model. Some prominent
examples are the headings h1, h2, h3, h4, h5,
h6, block directives address, blockquote, div, p, pre, and inline tags
abbr, acronym, br, cite, code, dfn, em, kbd, q, samp, span, strong, var.
Hypertext Module
This module adds the a element to
the Inline content set of the Basic Text Module.
List Module
As its name suggests, the List Module
provides list-oriented elements. Specifically, the List Module supports
the elements:
dl, containing dt and dd elements
ol, containing li elements
ul, containing li elements
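To get a feeling for this granularity, the basic modules can be written down as plain sets and compared with the elements a document actually uses. The sketch below is purely illustrative: the element lists are copied from the descriptions above, the dictionary layout and function names are inventions of this example, and none of it reflects how the Modularization specification itself expresses modules.

```python
import xml.etree.ElementTree as ET

# Element lists as given in the text for the four basic modules.
BASIC_MODULES = {
    "structure": {"html", "head", "title", "body"},
    "basic text": {
        "h1", "h2", "h3", "h4", "h5", "h6",
        "address", "blockquote", "div", "p", "pre",
        "abbr", "acronym", "br", "cite", "code", "dfn", "em",
        "kbd", "q", "samp", "span", "strong", "var",
    },
    "hypertext": {"a"},
    "list": {"dl", "dt", "dd", "ol", "ul", "li"},
}


def local_name(tag: str) -> str:
    """Strip ElementTree's "{namespace}" prefix from an element name."""
    return tag.rsplit("}", 1)[-1]


def elements_outside_basic_modules(markup: str) -> list:
    """List the element names a document uses beyond the basic modules."""
    basic = set().union(*BASIC_MODULES.values())
    used = {local_name(e.tag) for e in ET.fromstring(markup).iter()}
    return sorted(used - basic)


DOC = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
       '</head><body><p>x <b>y</b></p>'
       '<table><tr><td>z</td></tr></table></body></html>')
print(elements_outside_basic_modules(DOC))
```

Anything this function reports would need one of the optional modules, such as the presentation or table modules.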
=====================================================
Extension Modules
While all user agents have to support
the basic modules, they may or may not support any of the following modules, many
of which contain only one or two elements:
Applet: supports the
applet and param elements
Text extension: defines
a variety of additional textual markup modules
Presentation: contains
character modifiers like b, big, i, small, sub, sup, tt, and the horizontal
rule hr.
Edit: deletions and
insertions with del and ins for citations, dates, or inline content.
BDO: can be used to
declare the bi-directional rules for the element's content.
Forms: There is one
module for the forms features found in HTML 3.2, and one for those in HTML
4.0.
Tables: One basic
module for the table-related elements table, td, tr, th and caption, and
a more advanced one for table-related elements that improve access with
non-visual user agents.
Images: basic image
embedding with the img tag; in some implementations it may be used independently
of client-side image maps.
Client-side Image Maps:
the area and map elements for client side image maps. It requires that
the Image Module (or another module that supports the img element) be included.
Server-side Image Maps:
provides support for image-selection and transmission of selection coordinates.
It requires that the Image Module (or another module that supports the
img element) be included. The Server-side Image Map Module adds the ismap
attribute to the img tag.
Objects: elements
for general-purpose object inclusion. Specifically, the Object Module supports
the object and param tags.
Frames: all frame-related
elements like frameset, frame, noframes.
Iframes: the iframe
element, which is used to define inline frames.
Events: all of the
well-known onXXX event handler attributes such as onload, onfocus.
Metainformation: the
meta element that describes information within the declarative portion
of a document (in XHTML within the head element).
Scripting: elements
that are used to contain information pertaining to executable scripts or
the lack of support for executable scripts, namely script and noscript.
Stylesheet Module:
enables style sheet processing with the style element.
Link Module: used
to define links to external resources with the link element.
URL base: the base
element that can be used to define a base URL against which relative URIs
in the document will be resolved.
Legacy: elements and
attributes that were deprecated in previous versions of HTML and XHTML,
namely font, s, strike, u, the body attributes background,
bgcolor, text, link, vlink, alink, and the br attribute clear.
=====================================================
Document Profiles
With the modularization of XHTML
we have solved one problem: The increasingly proliferating types of Web
clients can pick and choose specific subsets of the HTML standard to meaningfully
support them given their form factors and display capabilities. But we
created two new problems:
How does a client advertise its
rendering capabilities to a server?
How does a document express the
modules used for its content?
The latter will be addressed through
document profiles, which are not yet specified. But will the former
remain the art of mapping user-agent strings to capabilities? Who knows...
you?
Conclusion
The reformulation of HTML in XML
is an elegant way to bring both worlds together in a future-oriented but
compatible way. While it is unlikely that every hand-written HTML page
will be upgraded to XHTML, writing new pages in XHTML and improving the
templates of page generators like ASP and JSP will give your documents
a much wider audience in the brave new world of non-PC Web clients. Make
sure you do the best you can to make it happen:
If you are a Web master, start planning
the migration of your site to XHTML now!
If you are an HTML tool author,
upgrade your tool to support document creation in XHTML now!
If you write HTML documents, adhere
to the XHTML rules and DTDs now! Use the W3C validator to be sure you got
it right. (Yes, this document has been validated. The remaining errors in
lines 30, 84, and 92 are caused by server-side includes not under my direct
control.)
Thank you, in the name of all current
and future owners of Web-enabled PDAs, phones,
TVs, and toasters!
=====================================================