
Extensible Markup Language (XML)

Monterey Peninsula College - CSIS78

Randy Smith


Note: You should view this document in Internet Explorer 5 or later, or Netscape 6.
The document contains examples of XML and XSL, which many browser versions will not display.


Contents

1. Some Benefits of XML
2. Turning existing HTML Documents into XML
3. XML Is a Meta-Language
What does a Web Browser do with an XML Doc?
Exercise 1:  What If an XML Doc is Missing an End Tag?
4. How Do We Create an HTML Doc from the XML Doc?
5.  What if You Modify XML Doc, But Insert a Typo?
Exercise 2:  Add Style to the HTML Doc
References


1. Some Benefits of XML

Data files become self-describing

XML permits standard tools - like parsers and structured text editors

The application author no longer needs to write a custom parser.

XML can be represented by a tree - making traversal easy

This idea is embodied in the Document Object Model (DOM).
DOM returns 26630 when asked
" What is the price of the FIREBIRD-TRANSAM? "

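As a rough illustration of that kind of tree lookup, here is a minimal Python sketch using the standard library's xml.dom.minidom. The sample document is an assumption, since autos.xml is not reproduced in these notes: the tag names follow the DTD in section 5, storing the price as element text is a guess, the year and the second price are placeholders, and the helper price_of is hypothetical.

```python
import xml.dom.minidom

# Hypothetical sample data; the real autos.xml is not reproduced here.
SAMPLE = """<autos year="2000">
  <auto name="FIREBIRD-TRANSAM">
    <doors>2</doors>
    <price>26630</price>
  </auto>
  <auto name="BMW M5">
    <doors>4</doors>
    <price>50000</price>
  </auto>
</autos>"""


def price_of(doc, name):
    """Return the price text of the <auto> whose name attribute matches, or None."""
    for auto in doc.getElementsByTagName("auto"):
        if auto.getAttribute("name") == name:
            price = auto.getElementsByTagName("price")[0]
            # The price is the text node inside <price>...</price>.
            return price.firstChild.data
    return None


doc = xml.dom.minidom.parseString(SAMPLE)
print(price_of(doc, "FIREBIRD-TRANSAM"))
```

The lookup walks the tree by tag name and attribute rather than by character position, which is what makes a standard parser reusable across applications.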
XML can be used for any data interchange!

XML may become the language for all data interchange among applications on computers in the Internet - whether for web pages, data base queries, remote procedure calls, or whatever.

You can automatically populate a document template with XML

Here's an example of applying this html template to the above xml document:

[Screen capture: autos.htm bound to the XML document]

Here are 10 more points describing XML


2.  Turning existing HTML Documents into XML

    Because HTML and XML are closely related, it is not difficult to make an HTML document XML-compliant. You basically have to make certain that your HTML is "well-formed" (this is a requirement for all XHTML documents as well).

     

  • Replace the DOCTYPE declaration and any internal subset with the XML declaration. Replace:

    <!DOCTYPE HTML ...>
    with

    <?xml version="1.0" standalone="yes"?>
  • Change any empty elements such as <isindex>, <base>, <meta>, <img>, <br> or <hr> so they end with />, for example:

    <img src="logo.jpg" alt="logo drawing" />

    These elements may require some experimentation. For instance, some browsers treat <br /> or <hr /> the same as <br> or <hr>.
    Others will accept /> only if a space precedes it. The browsers are not yet compliant with the standards.
    HTML browsers may not accept XML-style empty elements with a trailing slash (e.g., <hr/>), so such markup is not backward compatible.
    If you want your XML document to display in browsers that do not parse XML, you can add dummy end tags to empty elements, so <hr> becomes <hr></hr>.

  • Make sure that every nonempty element has a correctly matched end tag (container); every <p> must have a </p>.
  • Escape all markup characters. (< and & should be written as &lt; and &amp;).
  • Make certain all attribute values are in quotes.
  • Ensure all element names match with respect to upper- and lowercase characters in both start and end tags and are consistent throughout the file.
  • Ensure all attribute names are similarly in a consistent case throughout the file.
  • Make sure there are no overlapping tags. Each tag should completely contain any tags within it.
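The escaping and quoting rules in this list can be sketched in Python with the standard library's xml.sax.saxutils. This is only an illustration: the helper make_paragraph, the class attribute, and the sample text are made up for the example.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape, quoteattr


def make_paragraph(text, css_class):
    """Build a well-formed <p> element from raw text and an attribute value."""
    # escape() rewrites &, < and >; quoteattr() escapes the value and adds the quotes.
    return "<p class=%s>%s</p>" % (quoteattr(css_class), escape(text))


fragment = make_paragraph("Fish & chips cost < 5", "menu")
print(fragment)

# Parsing raises an error unless the fragment is well-formed.
xml.dom.minidom.parseString(fragment)
```

Letting a library do the escaping avoids the most common conversion mistake, a bare & or < left in the text.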

3. XML is a Meta-Language

    • You cannot use XML "raw".

    • You must define a vocabulary of tags for your use.

    • You may optionally create a Document Type Definition (DTD) - a grammar telling a parser how to parse a document using your tags.

    • This "meta-language" approach makes the language extensible.  People with common interests can define their own markup languages and standardize just within their own community (e.g., chemists). However, a browser may need an extension to render such a vocabulary; for example, Internet Explorer needs a plug-in such as Design Science's MathPlayer to render MathML markup.

What does a Web Browser do with an XML Doc?

  1. Start IE5 or IE6 or NS6.
  2. View the xml doc above.
  3. Click items and the + and - symbols in the IE5 display to see what happens.
  4. Note that IE5 nicely formats the xml tags for you.

Exercise 1:  What If an XML Doc is Missing an End Tag?

  • An XML parser by default checks that a document is well-formed.

    • Tags must have a start and a matching end.  There are two syntaxes; either

        <auto>...</auto>
      or
        <auto/>
    • Tags must nest.  This is ok, and can be mapped to a tree:

        <auto><doors>...</doors></auto>
      but this is a syntax error:
        <auto><doors>...</auto></doors>
  • Try this:

    1. Save the xml doc above to a local file.

    2. Edit the file to reverse the order of two tags.

    3. Display the result to see what happens.

    4. Edit the file to delete one of the tags you reordered.

    5. Display the result to see what happens.
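The same experiment can be tried outside a browser. The following is a minimal Python sketch, assuming the standard library's expat-based parser as a stand-in for the browser's XML parser; the helper name is_well_formed is made up for the example.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


print(is_well_formed("<auto><doors>2</doors></auto>"))  # properly nested
print(is_well_formed("<auto><doors>2</auto></doors>"))  # overlapping tags
print(is_well_formed("<auto><doors>2</auto>"))          # missing end tag
print(is_well_formed("<auto/>"))                        # empty-element syntax
```

Only the first and last documents are accepted. The parser stops at the first error instead of guessing what was meant.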

4. How Do We Create an HTML Doc from the XML Doc?

    This is done by creating an HTML file with missing parts (to be filled in from the XML), then doing one of two things...

      1. Insert scripting into the HTML file that populates the missing parts.

      2. Create an XSL (Extensible Stylesheet Language) file, which transforms the XML to HTML.


      Here is part of an html file for our xml example:


       
        <html>
        <head>

        <title>Automobile Buyer's Guide</title>
        </head>
        <body>

        <table width="100%" border="1">
        <tr>
        <td>Automobile Buyer's Guide for </td>
        <td>Model: </td>
        </tr>

        <tr>

        <td>Body style</td>
        <td>-door</td>
        </tr>

        <tr>
        <td>Engine displacement</td>
        <td> liters</td>
        </tr>
        </table>

        </body>
        </html>


      View this file in an XML-aware browser

      Now we need a way to populate it.  So we do this:

       
        <html>
        <head>
        <title>Automobile Buyer's Guide</title>
        <script>
        function init() {
          var xml = new ActiveXObject("Microsoft.XMLDOM");
          xml.async = false;   // load synchronously so the tree is ready on return
          xml.load("autos.xml");

          // Some stuff to fill in the missing HTML parts

        }

        </script>

        </head>
        <body onload="init()">

        ...
        </body>
        </html>

      • The BODY tag says to call function init() once the page has loaded.  The JavaScript init() function then creates (via new ActiveXObject("Microsoft.XMLDOM")) an instance of the XML Document Object Model object.

      • Then the XMLDOM parses the xml file (via xml.load("autos.xml")) and allows later access to parts of the object tree using dotted notation.
      • The rest of function init() will pick out the various fields of the xml file and insert them into the <BODY> section.
      But how do we insert "variables" into the BODY section so we can refer to them in the script?  Here's how:
       
        <tr>
        <td>Automobile Buyer's Guide for <span id="year"></span></td>
        <td>Model: <span id="name"></span></td>
        </tr>

        <tr>

        <td>Body style</td>
        <td><span id="doors"></span>-door</td>
        </tr>

        <tr>

        <td>Engine displacement</td>
        <td><span id="displacement"></span> liters</td>
        </tr>


      So the <SPAN> with an empty body is a placeholder, and its ID gives it a name that can be used in the script.
      We then make the script's init() function this...
       

        function init() {
          var xml = new ActiveXObject("Microsoft.XMLDOM");
          xml.async = false;   // load synchronously so the tree is ready on return
          xml.load("autos.xml");

          // Get the autos object (the root element)
          var autos = xml.documentElement;
          document.getElementById("year").innerHTML = autos.getAttribute("year");

          // Get the first auto object
          var auto = xml.getElementsByTagName("auto").item(0);
          document.getElementById("name").innerHTML = auto.getAttribute("name");

          // Get the number of doors
          document.getElementById("doors").innerHTML =
              xml.getElementsByTagName("DOORS").item(0).text;

          // Get engine info
          var engine = xml.getElementsByTagName("ENGINE").item(0);
          document.getElementById("displacement").innerHTML =
              engine.getAttribute("displacement") + " ";
        }

      Some notes:

      • The <span id="foo"> tags in the html body act like variables.  The JavaScript replaces those variables with strings.
      • Two useful DOM functions are
        • getElementsByTagName, which finds something like <auto>
        • getAttribute, which finds something like name="BMW M5" in the <auto> tag.
      • innerHTML is a predefined property of an element that refers to the inside of its tag; the script sets it on the element whose ID is "year".  So if that innerHTML is set to "xyz", then later in the HTML document
          Automobile Buyer's Guide for <span id="year"></span>
        becomes
          Automobile Buyer's Guide for xyz
      • You can construct new strings to insert into the <span> tags by using things like concatenation ("+").
      • X.item(0) refers to the first item in the node list X.
      • Y.text refers to the text body of Y.
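For readers without Internet Explorer, the same fill-in-the-placeholders idea can be sketched in Python. This is an illustration under stated assumptions, not the page's actual mechanism: the sample XML is a hypothetical stand-in for autos.xml, string.Template fields play the role of the <span> placeholders, fill_rows is a made-up helper, and minidom has no .text property, so the text node is read with firstChild.data instead.

```python
import xml.dom.minidom
from string import Template

# Hypothetical stand-in for autos.xml; the values are placeholders.
SAMPLE = """<autos year="2000">
  <auto name="BMW M5">
    <doors>4</doors>
    <engine displacement="5.0" horsepower="400"/>
  </auto>
</autos>"""

# $year, $name, ... play the role of the <span id="..."> placeholders.
ROW_TEMPLATE = Template(
    "<tr><td>Automobile Buyer's Guide for $year</td><td>Model: $name</td></tr>\n"
    "<tr><td>Body style</td><td>$doors-door</td></tr>\n"
    "<tr><td>Engine displacement</td><td>$displacement liters</td></tr>"
)


def fill_rows(xml_text):
    """Parse the XML and substitute its fields into the table rows."""
    doc = xml.dom.minidom.parseString(xml_text)
    autos = doc.documentElement                          # root element
    auto = doc.getElementsByTagName("auto").item(0)      # first <auto>
    doors = doc.getElementsByTagName("doors").item(0)    # first <doors>
    engine = doc.getElementsByTagName("engine").item(0)  # first <engine>
    return ROW_TEMPLATE.substitute(
        year=autos.getAttribute("year"),
        name=auto.getAttribute("name"),
        doors=doors.firstChild.data,                     # text inside <doors>
        displacement=engine.getAttribute("displacement"),
    )


print(fill_rows(SAMPLE))
```

The calls mirror the script one for one: documentElement, getElementsByTagName(...).item(0), and getAttribute behave the same way in both DOM implementations.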

      Result...

        The result is this html file.  (Use the View->Source menu item to see the full html file!)

        Hint:  If you want to create xml and html files like this on your server, be sure that your server's configuration files return a MIME type of "text/xml" for files ending with the .xml extension.  Otherwise, when your browser interprets your html file, xml.load("autos.xml") will either cause an access violation or return an illegal object.

5.  What if You Modify XML Doc, But Insert a Typo?

    Suppose that you have a bad memory for tag names, and try to create the xml file from memory, and you use <CARS> instead of <AUTOS>.

    Will an XML parser like that in IE complain?  (Try it!)
    To catch illegal tag names, attribute names, or illegal data type of bodies inside tags, you need a grammar.
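A small Python sketch makes the point: a parser that checks only well-formedness accepts the misspelled document, so something else has to know the vocabulary. Here a plain set of allowed names stands in for a real grammar; the set (lowercase names taken from the DTD in this section) and the helper unknown_tags are assumptions made for the example.

```python
import xml.dom.minidom

# Hypothetical vocabulary, using the lowercase names from the DTD in this section.
ALLOWED_TAGS = {
    "autos", "auto", "doors", "engine",
    "performance", "zeroto60", "quartermile", "price",
}


def unknown_tags(xml_text):
    """Return the sorted element names that are not in ALLOWED_TAGS."""
    # parseString raises only if the text is not well-formed.
    doc = xml.dom.minidom.parseString(xml_text)
    names = {node.tagName for node in doc.getElementsByTagName("*")}
    return sorted(names - ALLOWED_TAGS)


# Well-formed, so the parser does not complain, but the root tag is wrong.
typo_doc = '<CARS year="2000"><auto name="BMW M5"><doors/></auto></CARS>'
print(unknown_tags(typo_doc))
```

A name check like this catches the typo but says nothing about order, nesting, or attributes, which is what a grammar adds.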
     

    How Grammars are Specified in XML

    The grammar of our XML example can be specified something like this:


    In XML the grammar is represented by a Document Type Definition (DTD):

    <!ELEMENT autos (auto)>

    <!ATTLIST autos
              year CDATA #REQUIRED>
    <!ELEMENT auto (doors, engine, performance, price)>
    <!ATTLIST auto
              name CDATA #REQUIRED>
    <!ELEMENT doors EMPTY>
    <!ELEMENT engine EMPTY>
    <!ATTLIST engine
              displacement CDATA #REQUIRED
              horsepower CDATA #REQUIRED>
    <!ELEMENT performance (zeroto60, quartermile)>
    <!ELEMENT zeroto60 EMPTY>
    <!ELEMENT quartermile EMPTY>
    <!ATTLIST quartermile
              second CDATA #REQUIRED
              mpg CDATA #REQUIRED>
    <!ELEMENT price EMPTY>

    [Above DTD is in this file...]

    The final step is to add a line naming the DTD as the second line of the XML document. The name after DOCTYPE must match the document's root element, here autos:

    <?xml version="1.0"?>
    <!DOCTYPE autos SYSTEM "autos.dtd">

    Notes on the example DTD:

    • <! opens a markup declaration -- it marks something that is not part of the xml doc's content itself.
      • Think of <!-- ... --> to mark a comment in HTML!
    • !ELEMENT defines the tags in the language -- "autos", "engine", and so on.
    • Syntax like
        <!ELEMENT auto (doors, engine, performance, price)>
      means the children of auto are a doors tag followed by an engine tag, and so on.
      • doors+ means one or more instances of the doors tag
      • doors? means zero or one instance of the doors tag
      • doors* means zero or more instances of the doors tag
    • !ATTLIST says what attributes are legal for an element
    • !ENTITY introduces identifiers like %AutoName, which are made equivalent to some primitive XML type (like NMTOKEN - a "name token").
    • Keywords like #REQUIRED say whether an attribute is needed or not, or has a fixed value, and so on.
      • #IMPLIED = optional and no default value exists
      • #REQUIRED = attribute must be included in every element
      • #FIXED = what follows is a value in quotes that the attribute always has
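The +, ? and * suffixes mean the same thing as in regular expressions, so a flat content model can be checked with a short Python sketch. This is a toy under stated assumptions, not a DTD validator: it handles only a non-empty comma-separated sequence of names with optional suffixes (no parentheses, no | choices), and the helpers model_to_regex and children_match are made up for the example.

```python
import re
import xml.dom.minidom


def model_to_regex(model):
    """Turn a flat DTD sequence like 'doors+, engine?, price*' into a regex."""
    pieces = []
    for item in model.split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "+?*" else ""
        name = item[:-1] if suffix else item
        # Each child name is followed by a space so names cannot run together.
        pieces.append("(?:%s )%s" % (re.escape(name), suffix))
    return re.compile("".join(pieces))


def children_match(element, model):
    """Check the element's child tags, in order, against the content model."""
    names = "".join(
        child.tagName + " "
        for child in element.childNodes
        if child.nodeType == child.ELEMENT_NODE
    )
    return model_to_regex(model).fullmatch(names) is not None


auto = xml.dom.minidom.parseString(
    "<auto><doors/><doors/><performance/></auto>"
).documentElement

print(children_match(auto, "doors, engine, performance, price"))
print(children_match(auto, "doors+, engine?, performance, price*"))
```

The sample element fails the strict model from the DTD (two doors tags, no engine, no price) but passes the relaxed one, where doors may repeat and engine and price are optional.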

    Exercise 2:  Add Style to the HTML Doc

    1. Modify the html doc to link to a CSS style sheet that dresses up the doc.
    2. Add a few more rows to the table with more info from the XML doc (0-60 time, etc.)

References

    • Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998, http://www.w3.org/TR/REC-xml.
    • The Annotated XML 1.0 Specification, http://www.xml.com/axml/testaxml.htm
    • Stephen Mohr, et al., Professional XML, Wrox Press, 2000. [One of the most complete guides]
    • Dino Esposito, " XML Languages, " Microsoft Internet Developer (MIND), June 1999.  [Gives concrete example with complete code for using XML, DTD, and CSS without XSL and with Javascript to generate an HTML doc in Internet Explorer 5.]
    • Dino Esposito, " XML Server Pages," Microsoft Internet Developer (MIND), July 1999.  [Gives concrete example with complete code for using XML, DTD, and CSS with XSL to generate an HTML doc in Internet Explorer 5.]
    • Microsoft Corporation, XMLDOMDocument Interface. [Documents the API for ActiveX object microsoft.xmldom.]
    • Brian Randell, A Beginner's Guide to the XML DOM, MSDN Online Web Workshop, Oct. 1999.  ["This article discusses how to access and manipulate XML documents via the XML DOM implementation, as exposed by the Microsoft® XML Parser."]

    Send e-mail to randysmith@mpc.edu