Java Explorer - XML Overview

XML Overview

In this article we will take a high-level view of XML, from about 36,000 feet. We will of course stoop to observe certain details, but only where occasion warrants it. The intent is that the reader will benefit from an orbital view of the XML space before homing in on the finer aspects of this rather intriguing technology.

XML is intriguing in many respects. It concerns every designer, developer and architect whose job primarily involves moving data between interfaces, applications and systems. Here are some of the areas where XML is the enabling technology:

Provide multiple views of data
Provide structure to data while keeping to the text format
Define languages that express data in a domain-specific vocabulary
Enable data interchange between disparate, often incompatible systems, such as those found in businesses
Build data repositories not unlike current database systems
Publish databases
Enable push publishing
Distribute software
Automate Web development
Enable internationalization & localization
Unified Messaging
Multimedia

The list is by no means complete; on the contrary, it is growing every year. We are already witnessing a rash of new XML-based technologies such as SOAP and XML-RPC, to mention only two. How can text-based XML data work this kind of magic? Why is there such a rush to use XML for a myriad of purposes? As we go orbiting in the XML space, we will try to find answers to these questions.

What is XML?
XML is a three-letter acronym (TLA) for eXtensible Markup Language. It is a subset of the Standard Generalized Markup Language (SGML), the granddaddy of all markup languages. It resembles HTML in the way it is used to mark up text, but there the resemblance ends. HTML, itself an application of SGML, is used to present data using a set of pre-defined tags. XML, on the other hand, concerns itself with providing structure to data using tags defined by the user. The presentation of the data is left to tools such as a style sheet written in a style language; an XSL style sheet is itself an XML document.

Let us say you wanted to represent data relating to books. If you have the data as you would normally have in a relational database table named book, it might look like so:

title                     subject   author             pages   publication    date_of_pub
------------------------------------------------------------------------------------------
Web Markup                HTML      Tim Berners Lee    200     Weblications   May 2002
The New Markup Language   XML       Charles Goldfarb   1200    Weblications   October 2001
------------------------------------------------------------------------------------------

Now, to hold this data in an XML document, you define tags and wrap the data in them like so:

<?xml version="1.0"?>
<books>
  <book>
    <title>Web Markup</title>
    <subject>HTML</subject>
    <author>Tim Berners Lee</author>
    <pages>200</pages>
    <publication>Weblications</publication>
    <date-of-pub>May 2002</date-of-pub>
  </book>
  <book>
    <title>The New Markup Language</title>
    <subject>XML</subject>
    <author>Charles Goldfarb</author>
    <pages>1200</pages>
    <publication>Weblications</publication>
    <date-of-pub>October 2001</date-of-pub>
  </book>
</books>

You may give any name to the tags you use in an XML document, but it makes more sense to use names from the data domain itself. The naming scheme and the data hierarchy we used to wrap the data are shown below:

book [name of the table]
- title [ field names...]
- subject
- author
- pages
- publication
- date-of-pub

The XML document gives the data a tree structure, thus opening up possibilities for querying the data in a language-independent way. All you need is an XML processor (a.k.a. parser) to navigate the tree-like hierarchy and locate the data points within the document.

Why XML?
You may argue that SQL provides similar facilities, and is also independent of any programming language. So why XML? It must be clear from the outset that XML is in no competition with a relational database. On the contrary, XML complements it. Data points in a database are related indirectly, via the relations we assign to the tables that hold the data. XML provides a logical, hierarchical structure to the data. Such structured data is easy to process because the document holds the relationships directly, in its nesting, and not indirectly as in a database. It is also possible to process the same data multiple times, in multiple environments and in multiple ways.

In the example above, once the data is fetched from a database and serialized to XML, it may be sent to an online book store for presentation in a browser or on a mobile device, to a retail store to fulfil an order, or to another application on a different system and platform for further processing. In short, any application that has a processor to read an XML document will be able to use that data. Today we have XML processors in almost all languages, including scripting languages like JavaScript and ActionScript for Macromedia Flash.
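As a rough sketch of the "serialize to XML" step, the Java fragment below turns a couple of rows into a books document using the DOM and an identity transformer from the JDK. The class name RowsToXml, the toXml method and the hard-coded rows are made up for illustration; in practice the rows would come from a database query, and the code assumes a reasonably recent JDK.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: wraps table-like rows in <books>/<book> elements.
class RowsToXml {

    // columns supply the tag names; each row supplies the matching values.
    static String toXml(String[] columns, String[][] rows) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();
        Element books = doc.createElement("books");
        doc.appendChild(books);
        for (String[] row : rows) {
            Element book = doc.createElement("book");
            books.appendChild(book);
            for (int i = 0; i < columns.length; i++) {
                Element field = doc.createElement(columns[i]);
                field.setTextContent(row[i]); // escaping happens on output
                book.appendChild(field);
            }
        }
        // A transformer created without a stylesheet simply copies its input.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String[] columns = {"title", "subject", "pages"};
        String[][] rows = {
            {"Web Markup", "HTML", "200"},
            {"The New Markup Language", "XML", "1200"}
        };
        System.out.println(toXml(columns, rows));
    }
}
```

Building the document through the DOM rather than by concatenating strings leaves the escaping of characters such as & and < to the serializer.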

XML is currently among the most popular technologies in the computing world. With the industry solidly behind it, businesses are adopting it at a frenetic pace. The popularity of XML probably rests on two things: text processing is simple, and XML processors are ubiquitous and free.

Define your own markup language
We have seen how data is captured in an XML document using domain-specific tags. The tags can be nested to any level, but beyond a certain level (say three or four) the document becomes too complex to handle. The tags can have attributes too. The XML declaration at the start of the document has a 'version' attribute. Attributes are simple name-value pairs like those in an HTML document. The alternative to using attributes is to nest tags. So it is really a trade-off between the complexity of attributes and the complexity of nested tags. You may use both judiciously to solve a particular problem.

Data in an XML document is wrapped between opening and closing tags. In the example above, the title tag encloses the book title between the start tag <title> and the end tag </title>. This is no different from HTML, if you are familiar with it. In HTML, however, you may open a tag and not close it; some browsers permit you to do that. Also, some HTML tags like <br> and <hr> do not have a corresponding closing tag. In XML, if you open a tag, you must close it (an empty element may use the shorthand form <br/>); if you don't, you violate the rules of well-formedness.

XML, like any self-respecting language, has rules governing how documents are written. An XML document should begin with the XML declaration. The declaration specifies the version number of the XML specification the document is based on; within the declaration, the version attribute is mandatory. Another attribute that you may specify is the character encoding the document uses. If it is not specified, the processor assumes UTF-8 (or UTF-16, when the file begins with a byte order mark). Both are encodings of the Unicode character set, on which XML is built.

The XML standard defines two types of documents:

well-formed document
valid document

A document is well-formed if every start tag has a corresponding end tag and the tags are properly nested. It is also required that there is only one root element in the document. The root element in our example is books. All standard-compliant XML processors check for well-formedness. A well-formed document allows a processor to build a tree representation of the data.
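One simple way to see the well-formedness check at work is to hand a document to a parser and observe whether it is accepted. The sketch below does this with the JAXP DocumentBuilder that ships with the JDK; the class name WellFormedCheck and the method isWellFormed are invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a piece of text is well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested, single root element.
        System.out.println(isWellFormed(
            "<books><book><title>Web Markup</title></book></books>"));
        // The title tag is never closed.
        System.out.println(isWellFormed(
            "<books><book><title>Web Markup</book></books>"));
        // Two root elements.
        System.out.println(isWellFormed("<book/><book/>"));
    }
}
```

Run as is, this prints true for the first document and false for the other two.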

A valid XML document is one that contains tags as defined in another document called a Document Type Definition, or simply a DTD. A DTD is the basis for a new markup language. The DTD specifies which tags are valid and how they are nested, constrains the allowable number and type of tags, and states whether they may be empty or omitted altogether. Such constraints are expressed using a notation similar to wildcard expressions. What does a DTD look like? As an example, we provide a DTD for the document listed above.

<!ELEMENT book (title, subject, author+, pages, publication, date-of-pub)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT pages (#PCDATA)>
<!ELEMENT publication (#PCDATA)>
<!ELEMENT date-of-pub (#PCDATA)>
<!ELEMENT books (book+)>

A DTD does not by itself specify a root element, but the standard mandates that every XML document contain one and only one root element. Each element in the DTD is a tag in the XML document. The first line says that the book element has nested elements title, subject etc. Each of the nested elements holds a data point as parsed character data (#PCDATA). The plus (+) sign after author indicates that there can be one or more author elements in the XML document.

How many author elements, say, can a document have? Can the nested elements be left out of a document, or can a book element have more than one title element? Is it possible to nest each of the nested elements further? A well-defined DTD specifies these constraints. If you want your XML document to be checked for these constraints, you define a DTD and reference it in your XML document immediately after the XML declaration. Such a reference is known as the Document Type Declaration (don't confuse this with the Document Type Definition, the DTD we have been discussing).

<!DOCTYPE ...

The presence of the document type declaration lets the processor check the XML document for validity against a DTD. A processor that performs a validation check is called a validating processor, or validating parser. An XML document that passes a validation check is called a valid document. A valid XML document is also a well-formed document, though the reverse may not always be true. According to the XML standard, a DTD for an XML document is not mandatory.
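To make the validation step concrete, here is a small sketch using the validating mode of the JDK's DocumentBuilderFactory. To keep it self-contained, a cut-down DTD (title and author only) is embedded in the document type declaration as an internal subset; that cut-down DTD, the class name DtdValidation and the validate method are assumptions made for this example, not part of the books sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the validity errors a parser reports.
class DtdValidation {

    // Illustrative DTD carried inside the document type declaration.
    static final String DOCTYPE =
        "<!DOCTYPE books [\n"
        + "<!ELEMENT books (book+)>\n"
        + "<!ELEMENT book (title, author+)>\n"
        + "<!ELEMENT title (#PCDATA)>\n"
        + "<!ELEMENT author (#PCDATA)>\n"
        + "]>\n";

    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // ask for a validating parser
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validity errors are recoverable, so parsing continues.
                errors.add(e.getMessage());
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String valid = DOCTYPE + "<books><book><title>Web Markup</title>"
            + "<author>Tim Berners Lee</author></book></books>";
        // Well-formed, but the book has no author element.
        String invalid = DOCTYPE
            + "<books><book><title>Web Markup</title></book></books>";
        System.out.println("valid:   " + validate(valid));
        System.out.println("invalid: " + validate(invalid));
    }
}
```

The second document is still well-formed; it only fails the extra constraints that the DTD imposes, which is exactly the difference between the two document types.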

How do you process an XML Document?
XML documents are processed by what are called parsers. Today there are parsers for virtually every programming language, and often several competing parsers for the same language. The parser checks a document for well-formedness, optionally validates it against a DTD, and provides an interface through which applications collect the data held in the tags. Parsers are of two types: the SAX type and the DOM type.

SAX stands for Simple API for XML. Designed and developed by David Megginson, SAX soon became a de facto standard among developers. The API is simple and easy to use. A SAX parser reads an XML document sequentially, from start to end, and makes event callbacks to the application that runs the parser on the document. Every start tag, end tag and data point in the XML document triggers an event that allows the application to capture the name of the tag and the data associated with it.
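The event-callback style can be sketched in a few lines of Java with the SAX parser bundled in the JDK. The handler below picks out the text of every title element; the class name SaxTitles and the titles method are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers <title> text from SAX events.
class SaxTitles {

    static List<String> titles(String xml) throws Exception {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    // Text may arrive in more than one chunk.
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    found.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books>"
            + "<book><title>Web Markup</title><pages>200</pages></book>"
            + "<book><title>The New Markup Language</title><pages>1200</pages></book>"
            + "</books>";
        System.out.println(titles(xml)); // [Web Markup, The New Markup Language]
    }
}
```

Note that the application keeps its own state (here the text buffer) between callbacks; the parser itself remembers nothing about what it has already read.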

The World Wide Web Consortium (W3C) defined a Document Object Model (DOM) for processing an XML document. Based on the DOM structure originally designed for DHTML, DOM represents XML data as an in-memory tree data structure. A DOM-compliant parser provides an API for traversing a DOM tree. Each tag and its corresponding data points become nodes in the DOM tree hierarchy. You traverse the tree to isolate data points for your application.
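By way of contrast with SAX, the following sketch loads a document into a DOM tree and then walks it recursively, printing each element indented by its depth and, for leaf elements, the data point it holds. The class name DomOutline and its methods are made up for illustration; the parser is the JAXP one included in the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: prints an indented outline of a DOM tree.
class DomOutline {

    static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    // Depth-first traversal: one output line per element node.
    private static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName());
            // An element whose only child is text is treated as a data point.
            Node first = node.getFirstChild();
            if (first != null && first.getNodeType() == Node.TEXT_NODE
                    && first.getNextSibling() == null) {
                out.append(" = ").append(first.getNodeValue());
            }
            out.append('\n');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline("<books><book><title>Web Markup</title>"
            + "<pages>200</pages></book></books>"));
    }
}
```

Because the whole tree is in memory, the application can revisit any node at will, at the cost of holding the entire document at once.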

Presenting XML data
We said that XML data can be manipulated in multiple ways and in multiple environments.

Data in XML can be easily published as a table in a web page. This is accomplished using what is called a style sheet. A style sheet tells the browser or processor how to present the XML data, in the format the style sheet specifies.

The XML data can be presented in different ways simply by replacing one stylesheet with another. This flexibility opens up a host of applications that depend on the same data but need it formatted differently for different display devices. For example, the books.xml sample shown earlier may be presented using one stylesheet designed for the web, and another stylesheet designed for a mobile device. While the former contains embedded HTML tags, the latter has embedded WML content.

Today XSL, the eXtensible Stylesheet Language, is the preferred choice for presenting XML data. It is easy, flexible and, in use, a lot like a scripting language such as JavaScript.
Taking the books.xml file described above, we apply a generic XSL file, createTable.xsl (I found it on the web, and it is a good one to start with). Here is the stylesheet.

// createTable.xsl
A generic stylesheet for transforming a table-like structured XML document into an HTML table

The expected structure is of the form
<table-marker>
<row-marker>
<column-marker1>
column1-data
</column-marker1>
<column-marker2>
column2-data
</column-marker2>
...
</row-marker>
...
</table-marker>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head><title>Data table</title></head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="/*">
    <font face="Verdana" size="2">
      <table border="1" cellpadding="2" cellspacing="3">
        <tr>
          <xsl:for-each select="*[position() = 1]/*">
            <th bgcolor="#C9C9C9">
              <xsl:value-of select="local-name()"/>
            </th>
          </xsl:for-each>
        </tr>
        <xsl:apply-templates/>
      </table>
    </font>
  </xsl:template>

  <xsl:template match="/*/*">
    <tr>
      <xsl:apply-templates/>
    </tr>
  </xsl:template>

  <xsl:template match="/*/*/*">
    <td bgcolor="#C9FF00">
      <xsl:apply-templates/>
    </td>
  </xsl:template>

</xsl:stylesheet>

This is the result of applying this stylesheet to the books.xml above.

title	subject	author	pages	publication	date-of-pub
Web Markup	HTML	Tim Berners Lee	200	Weblications	May 2002
The New Markup Language	XML	Charles Goldfarb	1200	Weblications	October 2001

How does it work?
Let's look at the stylesheet code in detail. Each template tag has a match attribute that refers to a pattern in the XML file. The pattern is actually a path expression corresponding to the location of elements in the XML file.

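As in all path expressions, a single forward slash (/) represents the root. Following it with the wildcard character, as in /*, selects the root element, and each further /* descends one level: /*/* matches the row elements and /*/*/* the column elements. The path expression syntax is now standardized as XPath, which is beyond the scope of this article.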
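Without going into XPath proper, you can try the three paths used by the stylesheet on their own. The sketch below evaluates them with the javax.xml.xpath package, which is part of the JDK from Java 5 onwards (so not the J2SE 1.4 mentioned elsewhere in this article). The class name PathDemo and the select method are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the element names a path expression selects.
class PathDemo {

    static String select(String xml, String path) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            path, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        StringBuilder names = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) {
                names.append(' ');
            }
            names.append(nodes.item(i).getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>Web Markup</title>"
            + "<pages>200</pages></book></books>";
        System.out.println(select(xml, "/*"));     // books
        System.out.println(select(xml, "/*/*"));   // book
        System.out.println(select(xml, "/*/*/*")); // title pages
    }
}
```

Each extra /* step selects the elements one level further down, which is how the stylesheet tells the table, its rows and its cells apart without knowing any tag names.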

A recursive descent from the root path is accomplished by the tag <xsl:apply-templates/>. Each such tag has a corresponding <xsl:template> tag, the one whose pattern matches the node being visited, that specifies what to do.

Play around with the path expressions and the HTML markup to get a feel for it. There are two ways to associate a stylesheet with an XML file: programmatically, or by embedding a stylesheet reference in the XML file.

To associate the stylesheet createTable.xsl with the books.xml file by embedding, just insert the following line in the XML file right after the XML declaration, like so.

<?xml-stylesheet href="createTable.xsl" type="text/xsl"?>

If your browser is capable of executing the associated xsl instructions, then you will see the table displayed above. Otherwise, all you will see is the xml file parsed into a tree-structure.

Another way to associate XSL with XML is to use an XSLT (XSL Transformations) processor, now available in several versions for almost all programming languages. Being a Java buff, I use the following code to convert XML data to HTML (again, from the same web source where I took this XSL example).

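import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Applies createTable.xsl to an XML file and writes the result as HTML.
public class xml2html {
    public static void main(String[] args) throws TransformerException {
        if (args.length != 2) {
            System.err.println("Usage: java xml2html <input.xml> <output.html>");
            System.exit(1);
        }
        // Compile the stylesheet into a Transformer.
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer =
            tFactory.newTransformer(new StreamSource(new File("createTable.xsl")));
        // Passing File objects lets the transformer open and close
        // the input and output itself.
        transformer.transform(
            new StreamSource(new File(args[0])),
            new StreamResult(new File(args[1])));
        System.out.println("** The output is written in " + args[1] + " **");
    }
}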

Just compile and run this code as below, with createTable.xsl in the current directory. I am using J2SE 1.4.

javac xml2html.java
java xml2html books.xml books.html

Open books.html in a browser and you will see the table shown above. Enjoy.