This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.


MIND
XPath, XSLT, and other XML Specifications
Aaron Skonnard 

Code for this article: XML0500.exe (31KB)
When I wrote the first installment of The XML Files for the premiere issue of MSDN™ Magazine, I dove right into a discussion of XML-based persistence behaviors. This month I’d like to give you a proper introduction to this column. The XML Files will focus primarily on XML and the technologies that support it. Upcoming columns will provide a study of the technology specifications and standards (as defined by W3C, IETF, and so on), and the support for these technologies in Microsoft® products. The main goal of this column is to distill XML’s undeniable merits from the noisy industry hype. 
      Over the past year, several XML specifications have reached the W3C recommendation status, and the development process has begun. This introduction wouldn’t be proper if I didn’t review what every XML developer needs to know about the basics of XML 1.0 and the family of supporting languages and technologies. Once I’ve established this foundation, I’ll dive deeper in future columns.

Keeping Syntax Simple

      It’s easy to fall in love with XML’s elegant simplicity. As you might know, XML is a simplified, fully conforming subset of the Standard Generalized Markup Language (SGML) understood by word processing software, printers, and other display devices. Like SGML, XML separates a document’s structure from its contents. But in the XML specification, the W3C XML working group was careful to exclude many of the optional and more complicated aspects of SGML. If you take a look at the W3C XML 1.0 recommendation found at http://www.w3.org/TR/REC-xml, you’ll see the specification isn’t as long as you might expect.
      Take a look at the following XML fragment that describes a contact from a contact manager database: 

<contact category="enemy of the state">
  <fullname>Will Smith</fullname>
  <phonenumbers>
    <home>801-555-2323</home>
    <cell>801-555-3232</cell>
  </phonenumbers>
  <email address='will@jiggy.com'/>
</contact>
This XML document illustrates most of what you need to know about XML syntax. Notice that every begin tag has either an end tag (properly nested) or a forward slash, to indicate that it’s a properly formatted empty element. And notice that all attribute values are enclosed within single or double quotes. When an XML document adheres to the syntactical rules outlined by XML 1.0, it’s considered well-formed. In fact, by definition all XML documents are well-formed. If a document is not well-formed, it’s not an XML document—it’s just a bunch of characters mixed together. 
      The following four requirements of a well-formed XML document are frequently violated:
  1. All attribute values must be enclosed in single or double quotation marks.
  2. All elements must have both begin and end tags (unless they’re empty).
  3. All empty elements must contain an empty element identifier (/) at the end of the begin tag.
  4. Elements must be nested properly.
      For example, consider the following XML element (which corresponds to the standard HTML image element): 

<IMG SRC="background.gif" ID=img1>
This element violates two XML syntax rules: it does not use single or double quotation marks around the ID attribute, and it doesn’t have an end tag or an empty element identifier. The well-formed version of this image element looks like this: 

<IMG SRC="background.gif" ID="img1" />
      In theory, HTML pages should also be well-formed XML, although in practice browsers are tolerant of things like unclosed tags. After you’ve been using XML for a while, though, you’ll find yourself inserting <p/> tags in your HTML.
      Elements must also nest properly to be considered well-formed. This XML fragment 

<foo><bar><baz></bar></baz></foo>
is not well-formed because the </baz> end tag is not contained within <bar>. In a well-formed document, every element must be completely contained within its parent element; that is, an element’s begin and end tags must appear at the same nesting level.
Figure 1 A Document that is not Well-formed

      To test whether a document is well-formed, simply load it into your favorite XML processor—it will tell you if something is wrong. Microsoft Internet Explorer 5.0 will tell you exactly where your syntax went wrong, as long as you give your file the .xml extension. I’ve provided a simple learning tool with this column that allows you to type in XML and find out instantly whether it’s well-formed (see Figures 1 and 2).
Figure 2 Well-formed Document
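
The same test can be run from a few lines of code. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser stands in for "your favorite XML processor"; the helper name is_well_formed is made up for this illustration. It feeds the processor the image element and the nesting example from the text and reports what the parser says.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return False, str(err)


samples = [
    '<IMG SRC="background.gif" ID=img1>',     # unquoted attribute, never closed
    '<IMG SRC="background.gif" ID="img1" />',  # the corrected version
    '<foo><bar><baz></bar></baz></foo>',       # improperly nested
]

for sample in samples:
    ok, message = is_well_formed(sample)
    print(sample, "->", "well-formed" if ok else "not well-formed: " + message)
```

Only the second sample is accepted; for the other two the parser raises an error that points at the offending position.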

Infoset: The Information Model

      As the XML standards continue to evolve, it has become evident that most developers—including those in the W3C XML working groups—tend to refer to XML content abstractions (like document or element) in a way that is independent of the concrete syntax. As a result, the W3C formed the XML Information Set working group to formally define the Infoset—a set of abstract objects and properties that define the abstract information model of a well-formed XML document. The Infoset helps promote a common vocabulary and abstract dataset throughout the supporting family of XML specifications and software. The Infoset is currently a W3C working draft found at http://www.w3.org/TR/xml-infoset.
      The Infoset doesn’t mandate any XML processing behavior or set of interfaces; it simply defines the abstract information model that an XML processor should make available to the consuming application. The act of formalizing the information items found in an XML document helps ensure that all XML processors and languages provide similar abstractions in their implementations.
      A document’s Infoset consists of two or more information items. All well-formed XML documents contain at least the document and element information items. For the smallest well-formed XML document, consisting of only <x/>, there are two abstractions. One abstraction is for the document information item and the other abstraction is for the element information item, which is the root element of the tree.
      In addition to the document and element information items just described, a document may also contain the following information items: attribute, processing instruction, reference to a skipped entity (an excluded external parsed entity), character, comment, document type declaration, entity, notation, entity start marker, entity end marker, CDATA start marker, CDATA end marker, and namespace declaration. 
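
To make the two-item minimum concrete, here is a small sketch in Python. It assumes that the DOM nodes produced by the standard library's xml.dom.minidom are an acceptable stand-in for information items; the Infoset itself is abstract and does not define this or any other API.

```python
from xml.dom.minidom import parseString

# The smallest well-formed document: a single empty element.
doc = parseString("<x/>")
root = doc.documentElement

# One node for the document itself...
print(doc.nodeType == doc.DOCUMENT_NODE)   # True
# ...and one for the element, which is the root of the tree.
print(root.nodeType == root.ELEMENT_NODE)  # True
print(root.nodeName)                       # x

# Nothing else is present: no attributes, no children, no siblings.
print(len(doc.childNodes), len(root.childNodes), root.attributes.length)
```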
      Because the Infoset makes it possible to refer to information found in an XML document in a way that’s independent of the exact syntax, most new XML developments—the Document Object Model (DOM), XPath, XPointer, XML Schema, and so on—refer to its information model.

Namespaces

      The X in XML is especially significant. XML is an extensible language that allows developers to create vocabularies for use throughout their systems. Developers are free to use element and attribute names that convey meaning to the particular processing application. A vocabulary can be thought of as a group of elements and attributes that make sense to a certain application type. Once software modules have been written to understand a given markup vocabulary, it’s typically best for other developers to reuse that vocabulary instead of creating a new one. This allows developers to use the software modules already in place. 
      It’s common for developers to reuse externally defined markup vocabularies in their XML documents. This makes document sharing easy. Whenever a document contains multiple markup vocabularies, however, collisions and recognition errors may occur. To satisfy the need for universally unique element and attribute names, the W3C developed the XML Namespaces recommendation found at http://www.w3.org/TR/REC-xml-names.
      Namespaces allow developers to qualify element and attribute names with unique URIs that the developer should typically control. For example, in this XML fragment 

<awl:book awl:ID="1-2323-23424"
    xmlns:awl="http://www.awl.com/cseng"
    xmlns:dm="http://www.develop.com/courses">
  <awl:title>Essential Legos</awl:title>
  <dm:related-course dm:ID="EMIND">
    <dm:title>Essential Mindstorm</dm:title>
  </dm:related-course>
</awl:book>
both the book and related-course elements have ID attributes, which could potentially confuse the consuming application. Also, there are two title elements that mean different things. This example illustrates how qualifying element and attribute names with a unique namespace (resource) identifier (a URI) ensures proper usage of the vocabulary. 
      The xmlns attributes (seen in the previous example) are called namespace declarations. A namespace declaration associates a prefix (unique to the document) with a URI. Prefixing an element or attribute name with the namespace prefix (for example, awl:book) automatically qualifies the element or attribute name with the associated namespace URI and guarantees its uniqueness. If the namespace declaration doesn’t contain a namespace prefix (that is, the attribute is named simply xmlns), it declares the default namespace, which applies to all unprefixed element names within its scope. (The default namespace doesn’t apply to attributes.)
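
A namespace-aware processor makes this qualification visible. The sketch below uses Python's xml.etree.ElementTree, which reports every qualified name in the form {URI}localname; that notation is a convention of this particular library, not part of the Namespaces recommendation. The document is the book fragment from the text, with related-course written as an ordinary start and end tag pair.

```python
import xml.etree.ElementTree as ET

BOOK = """<awl:book awl:ID="1-2323-23424"
    xmlns:awl="http://www.awl.com/cseng"
    xmlns:dm="http://www.develop.com/courses">
  <awl:title>Essential Legos</awl:title>
  <dm:related-course dm:ID="EMIND">
    <dm:title>Essential Mindstorm</dm:title>
  </dm:related-course>
</awl:book>"""

AWL = "http://www.awl.com/cseng"
DM = "http://www.develop.com/courses"

book = ET.fromstring(BOOK)
print(book.tag)  # {http://www.awl.com/cseng}book

# The two title elements no longer collide: each is looked up by URI + local name.
book_title = book.find(f"{{{AWL}}}title").text
course = book.find(f"{{{DM}}}related-course")
course_title = course.find(f"{{{DM}}}title").text

# The same holds for the two ID attributes.
book_id = book.get(f"{{{AWL}}}ID")
course_id = course.get(f"{{{DM}}}ID")

print(book_title, "/", course_title)
print(book_id, "/", course_id)
```

Note that the prefixes awl and dm never appear in the lookups. Only the URIs matter, which is why a prefix needs to be unique only within its own document.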

XML APIs

      Most XML developers don’t want to implement XML processors. It’s tedious, time-consuming, and difficult to get right. And with freely available advanced implementations like the Microsoft XML parser, MSXML 2.x (http://msdn.microsoft.com/downloads/tools/xmlparser/xmldl.asp), why would anyone want to go through that pain?
      What XML developers do need, however, is to work with processor implementations that conform to the Infoset and standard XML API specifications. This makes it possible for developers to write implementation-independent code against standard interfaces that is easier to port and upgrade.
      There are two major types of XML APIs in use today: event-based and tree-based. An event-based API attempts to formalize the XML parsing process by defining event interfaces that processors can use to serve up the document’s information items to the application as they are parsed. A tree-based API, on the other hand, defines an in-memory object model that represents the XML document’s logical structure, which is made available to the application after the document has been parsed and loaded.
      The Simple API for XML (SAX) is an event-based API specification that is rapidly gaining popularity because it allows developers to hook directly into the document parsing process. SAX is different from the rest of the technologies discussed here in that it was developed collaboratively by members of the XML-DEV mailing list without the intervention of any industry standards bodies. SAX has received industry acceptance and has influenced many of the current XML processor implementations, such as Apache.org’s Xerces, IBM’s XML4J, and Sun’s ProjectX.
      The original SAX proponents included Peter Murray-Rust, Tim Bray, David Megginson, and many others on the XML-DEV mailing list. David Megginson coordinated the development discussion that took place on the list, authored the first draft of SAX 1.0, and has taken responsibility for SAX’s progress ever since. In January 2000, Megginson released SAX 2.0, which includes support for namespaces and parser extensibility. Megginson provides a sample SAX driver for the Java-language version of MSXML, which can be downloaded from http://www.megginson.com/.
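
To give a feel for the event-based style, here is a minimal sketch that uses the SAX binding in Python's standard library (xml.sax) rather than the Java drivers mentioned above. The EventLogger class is a made-up name for this illustration. The processor calls the handler's methods as it encounters each item in the contact document from earlier in the column, and no tree is ever built.

```python
import xml.sax

CONTACT = b"""<contact category="enemy of the state">
  <fullname>Will Smith</fullname>
  <phonenumbers>
    <home>801-555-2323</home>
    <cell>801-555-3232</cell>
  </phonenumbers>
  <email address='will@jiggy.com'/>
</contact>"""


class EventLogger(xml.sax.ContentHandler):
    """Records each parsing event in the order the processor reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # A parser may deliver text in several chunks; whitespace-only
        # chunks are ignored here to keep the log readable.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


handler = EventLogger()
xml.sax.parseString(CONTACT, handler)

for kind, value in handler.events:
    print(kind, value)
```

Because the application sees each item as it is parsed, it can stop early or discard what it doesn't need, which is the main appeal of this style for large documents.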
      The XML document object model (DOM) is the standard tree-based API specification that has gained the greatest industry-wide acceptance. The DOM defines the logical structure of an XML document, which is simply a hierarchy of nodes, and the interfaces that must be made available to a consuming application. The DOM exposes elements of the document so they can be manipulated individually through script or other code. The DOM implies that the document will be loaded into memory for random access or traversal by the application. 
      The DOM has been around for some time now. DOM Level 1 (see http://www.w3.org/TR/REC-DOM-Level-1) is a W3C recommendation that defines the core XML functionality as well as functionality specific to HTML applications. DOM Level 2 is currently a W3C candidate recommendation (see http://www.w3.org/TR/DOM-Level-2) that adds complete namespace support as well as several other necessary features like traversal, stylesheets, events, and so on.
      MSXML 2.x conforms to DOM Level 1 and does a decent job of conforming to the DOM Level 2 namespace additions, even though DOM Level 2 is not yet a W3C recommendation.
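
For contrast with the event-based style, the following sketch shows the tree-based approach. It uses xml.dom.minidom from Python's standard library as a stand-in DOM implementation (it is not MSXML, although the interface names come from the same W3C specification). The whole contact document is loaded first, and the application then navigates and modifies the tree at will.

```python
from xml.dom.minidom import parseString

CONTACT = """<contact category="enemy of the state">
  <fullname>Will Smith</fullname>
  <phonenumbers>
    <home>801-555-2323</home>
    <cell>801-555-3232</cell>
  </phonenumbers>
  <email address='will@jiggy.com'/>
</contact>"""

# The entire document is parsed into an in-memory hierarchy of nodes.
doc = parseString(CONTACT)
contact = doc.documentElement

# Random access: read an attribute, then visit the children of <phonenumbers>.
category = contact.getAttribute("category")
phonenumbers = contact.getElementsByTagName("phonenumbers")[0]
numbers = {
    node.tagName: node.firstChild.data
    for node in phonenumbers.childNodes
    if node.nodeType == node.ELEMENT_NODE  # skip whitespace text nodes
}
print(category, numbers)

# The tree can also be changed in place and serialized again.
contact.setAttribute("category", "friend")
print(contact.getAttribute("category"))
```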

XPath, XPointer, and XLink

      A well-formed XML document consists of information items as well as structural metadata. The structural metadata defines the implicit relationships that exist between distinct information items. These implicit relationships can be used for addressing parts of an XML document. Using abstract relationship descriptions to identify document parts (as opposed to explicit traversal techniques) greatly simplifies document processing. XPath makes this possible.
      XPath is a comprehensive language for document addressing, which recently became a W3C recommendation (http://www.w3.org/TR/xpath). XPath got its name from its use of path notation (as with URLs and directories) for navigating through the hierarchy of an XML document. The precursor to XPath was XSL Patterns, supported in MSXML 2.0. XPath models an XML document as a tree of nodes (similar to the DOM) that map to the Infoset. For example, consider the following XML fragment: 

<contact category="enemy of the state">
  <fullname>Smith</fullname>
  <numbers>
    <home>801-555-2323</home>
    <cell>801-555-3232</cell>
  </numbers>
</contact>
Suppose it was necessary to locate Smith’s phone numbers contained within the document. While this could be accomplished using SAX or DOM manual processing, this type of query can be described with a simple XPath expression: 

/descendant::contact[fullname="Smith"]/child::numbers/child::*
      The MSXML 2.6 technology preview contains an implementation of the latest XPath specification. The technology preview and the accompanying documentation can be downloaded from MSDN Online at http://msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp. While MSXML 2.0 supported XSL Patterns, MSXML 2.6 supports both XSL Patterns and XPath, but it defaults to XSL Patterns for backward compatibility.
      MSXML 2.6 supports XPath queries through the IXMLDOMNode selectNodes and selectSingleNode methods, as well as through its implementation of XSLT (more on this shortly). To begin experimenting with XPath, simply specify the selection language through a call to setProperty: 

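' Switch the selection language from XSL Patterns (the default) to XPath
doc.setProperty "SelectionLanguage", "XPath"
' selectNodes returns an object (a node list), so the assignment needs Set
Set sel = doc.selectNodes("descendant::numbers")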
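
For readers without MSXML at hand, the same kind of query can be tried in Python. This is only a sketch: the standard library's xml.etree.ElementTree supports a limited subset of XPath in abbreviated syntax, so the expression below is the abbreviated counterpart of the query shown earlier rather than the axis-based form. The contacts wrapper element and the second contact (Jones) are assumptions added so that the predicate has something to filter out.

```python
import xml.etree.ElementTree as ET

CONTACTS = """<contacts>
  <contact category="enemy of the state">
    <fullname>Smith</fullname>
    <numbers>
      <home>801-555-2323</home>
      <cell>801-555-3232</cell>
    </numbers>
  </contact>
  <contact category="friend">
    <fullname>Jones</fullname>
    <numbers>
      <home>801-555-0000</home>
    </numbers>
  </contact>
</contacts>"""

root = ET.fromstring(CONTACTS)

# Every child of <numbers> inside the contact whose <fullname> is "Smith".
matches = root.findall(".//contact[fullname='Smith']/numbers/*")
smith_numbers = [(element.tag, element.text) for element in matches]

for tag, text in smith_numbers:
    print(tag, text)
```

One declarative expression replaces what would otherwise be a hand-written loop over contacts, a name comparison, and a second loop over the matching contact's numbers.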
      There are two languages layered on top of XPath that make it possible to define explicit relationships between documents: XPointer and XLink. XPointer extends XPath for use in URI fragment identifiers, which are useful for defining links between documents—or even between different elements of the same document. 

contacts.xml#xpointer(/descendant::numbers/child::*)
XPointer also extends XPath by introducing the notion of points and ranges in an XML document. XPointer is currently a W3C Working Draft (http://www.w3.org/TR/xptr).
      XLink defines the standard mechanism for using XPointer expressions to create links (or explicit relationships) between document instances or document elements. XLink is also a W3C working draft (http://www.w3.org/TR/xlink). MSXML 2.x does not currently support either XPointer or XLink.

Validation and Metadata

      A well-formed XML document is not necessarily considered a valid XML document. A well-formed XML document, as described earlier, meets all of the syntactical requirements defined by XML 1.0. A valid document, on the other hand, must also conform to additional vocabulary-level constraints often defined by a document’s Document Type Definition (DTD). Hence, all valid XML documents are well-formed, but not all well-formed documents are valid.
      Much of the XML 1.0 specification relates to DTDs, which provide document metadata. DTDs define constraints for a given vocabulary such as the child/parent relationships, attributes, attribute types, and so on that will be permitted in that type of document, as well as entities and notations. These constraints are often referred to as the document’s vocabulary or schema. 
      The concept of a DTD was borrowed from SGML, although in a somewhat simplified form. Nevertheless, DTDs remain the most complicated aspect of the XML language, which is why the W3C is working on a replacement for them. On top of this, DTD syntax is itself not XML-compliant. This requires developers to learn a new syntax to write documents with vocabulary constraints. It also places a much greater burden on XML processor developers since they need to support both XML and DTD syntax. 
      DTDs are also oblivious to namespaces. Using DTDs with namespace-aware documents requires hardcoding the namespace prefix into all DTD markup declarations. Hardcoding a namespace prefix into the DTD really goes against everything namespaces represent. Developers have tried to come up with some creative solutions to this problem using parameter entities, but despite their best efforts, namespaces and DTDs just don’t mix.
      DTDs also have very weak support for element and attribute type descriptions. In XML 1.0, an element’s type is simply derived from the element name. 
      The replacement for DTDs that the W3C has been working on is called the XML Schema Definition Language, or simply XML Schema. XML Schema is broken down into two separate specifications: one for describing the structure and constraining the contents of an XML document (http://www.w3.org/TR/xmlschema-1), and another for defining data types to be used in XML Schema (http://www.w3.org/TR/xmlschema-2). Both specifications are currently in the working draft phase.
      XML Schema improves upon DTDs as a metadata language. First and foremost, the XML Schema syntax is XML-compliant, which simplifies things for everyone involved. Second, XML Schema completely supports and exploits the power of XML namespaces throughout the language. Finally, XML Schema offers an improved content model that separates type from instance. In short, XML Schema will become the standard mechanism for defining XML document metadata in the near future.
      MSXML 2.x supports DTDs as well as a reduced set of XML Schema referred to as XML-Data Reduced (XDR), which is described in a W3C note at http://www.w3.org/TR/1998/NOTE-XML-data-0105, as well as in a more current document at http://www.ltg.ed.ac.uk/~ht/XMLData-Reduced.htm (and, of course, on the MSDN Online XML DevCenter at http://msdn.microsoft.com/xml/).
      An XML processor that supports validation against a DTD or XML Schema is referred to as a validating processor, while a processor that doesn’t support validation is referred to as a nonvalidating processor. Some XML processors can operate in both modes. MSXML 2.x provides the validateOnParse property on the IXMLDOMDocument interface for controlling such behavior.

Transformations

      If organizations and developers could agree on a single XML vocabulary or schema, data and document sharing would be much easier. There are, however, many industry-wide initiatives that promote the sharing of XML vocabularies, such as BizTalk.org, OASIS, and others. Organizations will benefit from using domain-specific vocabularies already in place when it’s feasible. However, most developers realize that the likelihood of a single vocabulary fitting the needs of all interested organizations is vanishingly small. It’s more likely that many organizations will end up using slight variations of a published vocabulary. 
      In situations like this, transformations are required to promote interoperability between distinct XML vocabularies. The W3C developed the XSL Transformations (XSLT) language for describing these transformations. XSLT makes it possible to transform an XML document into any other text-based document (XML, HTML, comma-separated, C++ header/source files, and so on).
      The XSLT language is just another XML vocabulary that defines a declarative, rules-based language for specifying the transformation process. XSLT builds upon XPath to define which portions of the source XML document should be transformed to the target document. The following XML code illustrates the structure of a simple XSLT stylesheet: 

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="<XPATH EXPRESSION>">
    <!-- transformation defined here -->
  </xsl:template>
</xsl:stylesheet>
      XSLT recently became a W3C recommendation (see http://www.w3.org/TR/xslt), and the technology preview release of MSXML 2.6 provides an implementation of the latest XSLT specification.

Character Encodings

      ISO/IEC 10646 is an international standard that defines the Universal Character Set (UCS). UCS defines a very large character repertoire and its corresponding character codes. UCS includes all major and commercially important languages, and has room to grow. The Unicode standard serves a similar purpose but was developed by the Unicode Consortium, a group of major American computer manufacturers. Although they are officially two separate standards, Unicode has kept in synch with UCS. Today, UCS and Unicode typically refer to the same character repertoire (although that could change in the future).
      In XML, a character is simply a number. The UCS standard defines the meaning of the number. These numbers can be stored digitally using a variety of character encoding algorithms. For character repertoires that contain no more than 256 characters, each character code can be mapped to a single octet (as with ASCII). For character repertoires that contain more than 256 characters, more sophisticated algorithms are required (as with UCS).
      Several character encoding algorithms are available for UCS. The most common are UTF-16 and UTF-8. UTF-16 maps most characters to 2 octets (16 bits), and uses surrogate pairs (4 octets) for character codes too large to fit in 16 bits. UTF-8, on the other hand, stores 7-bit ASCII characters as a single octet (as with ASCII), but uses 2 or more octets for all other characters (up to 6 in the original definition of UTF-8, and up to 4 as the encoding is defined today). If your documents are mostly ASCII, UTF-8 will save space; otherwise it may waste space.
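
The trade-off is easy to see by counting octets. The following Python sketch encodes four sample characters (chosen for this illustration) both ways; the helper name octets is made up. The utf-16-be codec is used so that no byte order mark is added to the count.

```python
def octets(character):
    """Return (UTF-8 length, UTF-16 length) in octets for one character."""
    return (len(character.encode("utf-8")),
            len(character.encode("utf-16-be")))  # big-endian, no BOM added


samples = [
    ("A", "U+0041, 7-bit ASCII"),
    ("\u00e9", "U+00E9, Latin small letter e with acute"),
    ("\u20ac", "U+20AC, euro sign"),
    ("\U0001d11e", "U+1D11E, musical G clef, beyond 16 bits"),
]

for character, description in samples:
    utf8_len, utf16_len = octets(character)
    print(f"{description}: UTF-8 = {utf8_len}, UTF-16 = {utf16_len}")
```

The ASCII letter takes one octet in UTF-8 and two in UTF-16, the euro sign takes three and two, and the character beyond the 16-bit range takes four in both (a surrogate pair in UTF-16).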
      According to the XML 1.0 recommendation, all XML processors are required to understand UTF-8 and UTF-16. For that reason, XML processors are automatically capable of processing ASCII documents because the UTF-8 encoding of an ASCII document is equivalent to the corresponding ASCII encoding. XML doesn’t exclude other character encodings, but processors are not officially required to support them.
      For an XML processor to read XML, it must be able to figure out which character encoding has been used with a given document. One way to achieve this is through information supplied by the transport layer, as with the MIME Content-Type header: 

Content-Type: text/xml; charset=iso-8859-1
This is probably the safest mechanism for identifying the character encoding used by an XML resource; unfortunately, such information is not always present. An XML document can also explicitly declare its character encoding within the XML declaration: 

<?xml version="1.0" encoding="UTF-8"?>
      But how can the XML processor deal with the text before the encoding declaration? If a document uses any character encoding other than UTF-8 or UTF-16, it’s required to have the XML declaration with an encoding attribute as shown here. Since every one of these XML documents must begin with "<?xml", it’s possible for processors to autodetect the character encoding family, which is enough to read the encoding definition and determine the specific character encoding in use. On the same note, the XML specification defines a mechanism for autodetecting the use of UTF-8 or UTF-16 in the absence of the encoding declaration. This is achieved through the UTF-16 Byte Order Mark (BOM).
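
The detection logic can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the full procedure from the XML specification: it checks for a byte order mark, then looks for an encoding declaration in a document whose first bytes are ASCII-compatible, and otherwise falls back to UTF-8. It ignores UTF-16 without a BOM, EBCDIC, and transport-layer information. The function name sniff_encoding is made up for this example.

```python
import codecs
import re

# Matches an XML declaration that carries an encoding pseudo-attribute.
ENCODING_DECL = re.compile(
    rb"<\?xml[^>]*encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_encoding(data):
    """Guess the character encoding of the XML document held in bytes."""
    # 1. A byte order mark settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # 2. Otherwise the declaration itself is readable as ASCII, which is
    #    enough to find the encoding name it announces.
    match = ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declaration: the document must be UTF-8.
    return "utf-8"


latin1 = '<?xml version="1.0" encoding="iso-8859-1"?><x/>'.encode("iso-8859-1")
print(sniff_encoding(latin1))
print(sniff_encoding("<x/>".encode("utf-16")))  # Python's utf-16 codec writes a BOM
print(sniff_encoding(b"<x/>"))
```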
      If an XML processor doesn’t support the particular character encoding identified by the document or the transport layer, it’s considered a fatal error. MSXML 2.x supports a range of standard character encodings, including UTF-8 and UTF-16. IXMLDOMDocument::load identifies the character encoding through the mechanisms described earlier. IXMLDOMDocument::loadXML, on the other hand, assumes UTF-16 since the XML is received as a BSTR. In this case, if the XML passed to loadXML contains an encoding declaration of something other than UTF-16, MSXML generates an error.
      Although most of the XML examples you’ll see today are ASCII-based, XML is capable of serving all languages contained within UCS and Unicode. This standard mechanism for dealing with characters forms the foundation of XML’s interoperability benefits. Luckily, developers are shielded from most of these issues since it’s the responsibility of the XML processor to hide most of the nasty details of dealing with character encoding issues.

XML Everywhere

      As more XML specifications develop and solidify, XML continues to mature into a powerful and flexible technology capable of serving many application domains. It’s evident that XML is finding its way into every facet of the software industry. It’s becoming an integral part of database technologies (such as DBMS and ADO), remote procedure call mechanisms such as SOAP, and business-to-business integration and messaging software such as BizTalk™. XML is showing up in Web browsers and servers such as Internet Explorer 5.0 and Internet Information Services 5.0, and many other domain-specific applications.
      Sometimes it seems hard to stay afloat in the sea of XML technologies and specifications. The technologies covered in this column represent the important aspects of XML that every XML developer should be familiar with (see Figure 3). In future issues I’ll delve deeper into many of these areas, as well as other new topics. If you have an XML topic that you would like me to cover, send your ideas to xmlfiles@microsoft.com.


Aaron Skonnard is an instructor and researcher at DevelopMentor, where he co-manages the XML curriculum. Aaron wrote Essential WinInet (Addison-Wesley Longman, 1998) and coauthored Essential XML, due out from Addison-Wesley Longman in June 2000. Get in touch with Aaron at http://www.skonnard.com/.

From the May 2000 issue of MSDN Magazine.