Seminar 5

Markup Languages for the Future - XML, XHTML, ***ML

© M. Firebaugh

I. Introduction

The Future of WWW Markup Languages


A. Meta-languages & XML

What is the relationship between SGML, HTML, and XML?

SGML = Standard Generalized Markup Language

All right, then, what is a meta-language?

Visions inspiring XML:

The grand vision of XML is the creation of a worldwide collection of data objects that are fully addressable and fully open to borrowing, reuse, and repackaging by anybody on the net--in short, everything that the copyright laws have strived for centuries to prevent.

XML is a standard for the creation of tagging languages. It sets out a collection of rules that govern how a parser is to behave. An XML parser that follows these rules can parse any document tagged in an XML compatible language. This means you can make up your own language and not have to write any code to parse it. You can concentrate on writing code that processes the information in useful ways. . . .

The advantage of XML is that it allows you to define your own data structures. When you use any particular language defined in XML, you are no longer enjoying the advantages of XML, you are enjoying the advantages of the particular markup language you have chosen.

What XML is:

What XML is NOT:

A very good introduction to XML has been written by the W3 Consortium with commentary by Tim Bray

B.  Design Goals of XML

XML was developed by an SGML Editorial Review Board (ERB) formed under the auspices of the World Wide Web Consortium (W3C) in 1996 and chaired by Jon Bosak of Sun Microsystems, with the very active participation of an SGML Working Group also organized by the W3C. Dan Connolly served as the ERB's contact with the W3C.

The design goals for XML are:

"XML is primarily intended to meet the requirements of large-scale Web content providers for industry-specific markup, vendor-neutral data exchange, media-independent publishing, one-on-one marketing, workflow management in collaborative authoring environments, and the processing of Web documents by intelligent clients. It is also expected to find use in certain metadata applications. XML is fully internationalized for both European and Asian languages, with all conforming processors required to support the Unicode character set in both its UTF-8 and UTF-16 encodings. The language is designed for the quickest possible client-side processing consistent with its primary purpose as an electronic publishing and data interchange format." [971208 W3C press release]

"XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form the character data in the document, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure. A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application." [adapted from the Proposal]

From The SGML/XML Web Page by OASIS

C. The Syntax of XML

The Syntax of XML is Defined in XML Part 1. Syntax

XML supports two levels of syntax:

  1. Well-formed XML documents [lower level]
  2. Valid XML documents specified by Document Type Definitions (DTD) [higher level]

An XML document is a systematic set of containers called elements

General Syntax for well-formed documents

Purpose of Document Type Definitions (DTDs)

Advantages of Document Type Definitions (DTDs)

 <!ENTITY  me  "Dmitry Kirsanov,
  St. Petersburg, Russia">


This document was created by &me;
 on October 4, 2000
   <!-- your DTD goes here -->

Summary of the differences between XML and HTML:

HTML | XML
Defines Page Layout | Defines Page Content
Concerned with appearance of Web Pages | Concerned with meaning of Web objects
Data for display only | Abstract data concepts given form and structure
Analogous to Spreadsheet | Analogous to DataBase
Uses fixed set of Tags (theoretically) | Uses customized set of author-defined Tags
Tag attributes specify appearance of objects | Tag attributes specify behavior of objects
Appearance of elements can be modified by CSS | Appearance of elements can be modified by CSS
Tag functionality is fixed | Tag functionality is defined by DTD and modifiable
Supported by All Browsers | Supported only by IE 4.0, 5.0
Requires bloated browser to parse bad HTML code | Requires lean & mean parser for strict syntax

D. Examples of XML

Example 1: A Well-formed XML Document

     <NOUN> dog</NOUN>
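A quick way to see what "well-formed" means in practice is to hand a fragment to any XML parser. The sketch below is illustrative only: it uses Python's standard xml.etree.ElementTree module, and the helper name is_well_formed is an assumption of this example, not part of XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<NOUN> dog</NOUN>"))  # start and end tags match
print(is_well_formed("<NOUN> dog</noun>"))  # tag names are case-sensitive
print(is_well_formed("<NOUN> dog"))         # end tag is missing
```

Only the first fragment is accepted; the parser rejects the other two without knowing anything about what a NOUN is.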

Example 2: A DTD Document

The Root Document Element Definition from play.dtd

<!ELEMENT play (title, fm, personae, scndescr, playsubt, induct?,prologue?,act+,epilogue?)>

Note: the "?" means "optional item", and the "+" means "one or more occurrences".
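The "?" and "+" markers behave much like the same operators in regular expressions. As a rough sketch (not a real DTD validator), the content model of play can be transcribed into a Python regular expression over the sequence of child element names. The helper name matches_play_model is hypothetical.

```python
import re

# Content model of <play>, transcribed from the declaration above.
# Each child name is followed by a comma so names cannot run together.
PLAY_MODEL = re.compile(
    r"title,fm,personae,scndescr,playsubt,"
    r"(induct,)?(prologue,)?(act,)+(epilogue,)?"
)


def matches_play_model(children):
    """Check a list of child element names against the play content model."""
    sequence = "".join(name + "," for name in children)
    return PLAY_MODEL.fullmatch(sequence) is not None


required = ["title", "fm", "personae", "scndescr", "playsubt"]
print(matches_play_model(required + ["act"]))                       # minimal play
print(matches_play_model(required + ["prologue", "act", "act"]))    # optional item, repeated act
print(matches_play_model(required))                                 # no act at all
```

A validating parser does this kind of check for every element in the document, using the DTD instead of hand-written patterns.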

The play.dtd also shows the relationship between tags:

<!ELEMENT  speech   (speaker+, (line|stagedir|subhead)+)>
<!ELEMENT  speaker  (#PCDATA)>
<!ELEMENT  line     (#PCDATA | stagedir)*>
<!ELEMENT  stagedir (#PCDATA)>
<!ELEMENT  subhead  (#PCDATA)>

This is a SPEECH element using the DTD definitions above:

<SPEECH>
<SPEAKER>PROSPERO</SPEAKER>
<LINE><STAGEDIR>Aside</STAGEDIR>  The Duke of Milan  </LINE>
<LINE>And his more braver daughter could control thee,  </LINE>
<LINE>If now 'twere fit to do't.  At the first sight</LINE>
<LINE>They have changed eyes.  Delicate Ariel,  </LINE>
<LINE>I'll set thee free for this.  </LINE>
<LINE>A  word, good sir;</LINE>
<LINE>I fear you have done yourself some wrong;  a word. </LINE>
</SPEECH>

Another excellent resource on XML Files is the XML Magazine

Example 3: An XML Catalog

The XML Code:

<?xml version="1.0" encoding="ISO-8859-1" ?>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>

  <TITLE>Hide your heart</TITLE>
  <ARTIST>Bonnie Tyler</ARTIST>

. . .




Note: Internet Explorer 5.0 will display this, but Netscape 4.7 will not.
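Even without browser support, a few lines of code can pull the data out of such a catalog once it is wrapped in a single root element. The sketch below assumes wrapper elements named CATALOG and CD, which do not appear in the listing above, and uses Python's standard xml.etree.ElementTree module. The function name list_cds is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# CATALOG and CD are assumed wrapper names; only TITLE and ARTIST
# come from the listing in these notes.
CATALOG_XML = b"""<?xml version="1.0" encoding="ISO-8859-1" ?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
  </CD>
</CATALOG>
"""


def list_cds(xml_bytes):
    """Return (title, artist) pairs for every CD element in the catalog."""
    root = ET.fromstring(xml_bytes)
    return [(cd.findtext("TITLE"), cd.findtext("ARTIST"))
            for cd in root.findall("CD")]


for title, artist in list_cds(CATALOG_XML):
    print(f"{artist}: {title}")
```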

E.  Linking Capabilities of XML

One of the greatest strengths of XML is its more general and abstract linking capabilities

Prospects for XML

Problems of XML

Additional Considerations on XML

Authoring Tools:

XML Browsers:


F. Evolution of other ***ML Web Languages

Perhaps the greatest contribution of XML to date has been the emergence of XML-based special application languages. Among these are:


The representation of mathematical expressions has been one of the weakest features of HTML. Our attempts to do so (as in Computer Graphics by M. Firebaugh) have been to compose the equations with MS-Word, SNAP the image, and save it as a transparent GIF. The results may look nice (see Chapter 10), but they are static images, with no access to the symbols for editing, and no mathematical meaning or content.

Now there is an XML-based markup language called MathML.

Incoherent Thompson Scattering Theory


Implementation of MathML in Mozilla

MathML is an XML application for describing mathematical notation and capturing both its structure and content. The goal of MathML is to enable mathematics to be served, received, and processed on the Web, just as HTML has enabled this functionality for text.

MathML is formally specified by the W3C at Mathematical Markup Language (MathML™) 1.01 Specification

Display Problems. Consider the equation [equation image]. This equation is sized to match the surrounding line in 14pt type on the system where it was authored. Of course, on other systems, or for other font sizes, the equation is too small or too large. A second point to observe is that the equation image was generated against a white background. Thus, if a reader or browser resets the page background to another color, the anti-aliasing in the image results in white "halos." Next, consider the equation [equation image]. This equation has a descender which places the baseline for the equation at a point about a third of the way from the bottom of the image. One can pad the image like this: [padded equation image], so that the centerline of the image and the baseline of the equation coincide, but this causes problems with the inter-line spacing, which also makes the equation difficult to read. Moreover, center alignment of images is handled in slightly different ways by different browsers, making it impossible to guarantee proper alignment for different clients.

Design Goals of MathML

In order to meet the diverse needs of the scientific community, MathML has been designed with the following ultimate goals in mind.

MathML should:

  • encode mathematical material suitable for teaching and scientific communication at all levels.
  • encode both mathematical notation and mathematical meaning.
  • facilitate conversion to and from other math formats, both presentational and semantic. It is recognized that conversion to and from other notational systems or media may entail loss of information in the process. Output formats should include:
    • graphical displays
    • speech synthesizers
    • computer algebra systems' input
    • other math layout languages, such as TeX
    • plain text displays, e.g. VT100 emulators
    • print media, including braille
  • allow the passing of information intended for specific renderers and applications.
  • support efficient browsing for lengthy expressions.
  • provide for extensibility.
  • be well suited to template and other math editing techniques.
  • be human legible, and simple for software to generate and process.

An extensive set of MathML test documents is available.

MathML Example 1:

MathML Code:

  <mfrac linethickness='0.2 cm'><mn>1</mn>
  <mfrac linethickness='0.3 ex'><mn>1</mn>


Try it:

1 y+ 3 1 y+ 3

MathML Example 2:

MathML Code:

  <mstyle background='#88cc88'>
  <mstyle fontweight='bold'>
  <mstyle fontfamily='Helvetica'>  
    <mfenced open='['>


Let's link to the HTML description.

The Problem with MathML:

Neither Internet Explorer nor Netscape supports it.


XHTML 1.0 is the first step toward a modular and extensible web based on XML (Extensible Markup Language). It provides the bridge for web designers to enter the web of the future, while still being able to maintain compatibility with today's HTML 4 browsers. It is the reformulation of HTML 4 as an application of XML. It looks very much like HTML 4, with a few notable exceptions, so if you're familiar with HTML 4, XHTML will be easy to learn and use. XHTML 1.0 was released on January 26, 2000, as a Recommendation by the W3C.

  • In 1998 W3C published the draft "XHTML 1.0"
  • XHTML is HTML 4.0 re-created as an XML application
  • Constraints on this transformation include:
    • An XHTML document MUST be well-formed XML
    • <html> MUST be the top-level element
    • Element and attribute names MUST be in lower case
    • Attribute values MUST be quoted
    • End tags are required for non-empty elements
    • All empty elements must use the XML "empty tag" syntax
    • XML does not allow attribute minimization
    • Whitespace handling in attribute values is different in XML
  • Dave Raggett (co-author of the HTML 3.0, 3.2, and 4.0 specs) has created a free program, HTML TIDY, that converts an HTML page to XHTML.
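Several of the constraints listed above can be checked mechanically. The sketch below is a rough illustration, not a substitute for HTML TIDY or a validator: it uses Python's standard xml.etree.ElementTree module, and the function name xhtml_problems is an assumption of this example. The well-formedness test alone already catches unquoted attribute values, missing end tags, bad empty-element syntax, and minimized attributes.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Strip a '{namespace}' prefix that ElementTree adds to qualified names."""
    return name.rsplit("}", 1)[-1]


def xhtml_problems(text):
    """Return a list of violations of some of the XHTML constraints above."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if local_name(root.tag) != "html":
        problems.append("top-level element is not <html>")
    for element in root.iter():
        name = local_name(element.tag)
        if name != name.lower():
            problems.append(f"element name not lower case: {name}")
        for attribute in element.attrib:
            attr_name = local_name(attribute)
            if attr_name != attr_name.lower():
                problems.append(f"attribute name not lower case: {attr_name}")
    return problems


print(xhtml_problems("<html><body><p>ok<br /></p></body></html>"))
print(xhtml_problems("<html><body><P>upper case</P></body></html>"))
print(xhtml_problems("<html><body><p>no end tag</body></html>"))
```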

Advantages of XHTML

Portability: By the year 2002 as much as 75% of Internet access could be carried out on non-PC platforms such as palm computers, televisions, fridges, automobiles, telephones, etc. In most cases these devices will not have the computing power of a desktop computer, and will not be designed to accommodate ill-formed HTML as do current browsers (bloated with code to handle sloppy or proprietary HTML).

An excellent introduction to XHTML is given in Introduction to XHTML, with eXamples

XHTML Example 1:

The XHTML Code:

<!DOCTYPE html PUBLIC
	"-//W3C//DTD XHTML 1.0 Transitional//EN"
	"DTD/xhtml1-transitional.dtd"	>
<html	xmlns	= "">
<head>
	<title>Quick Example</title>
</head>
<body>
<h1>	Quick Example</h1>
<a	href	= "">
<img	src	= ""
	height	= "31"
	width	= "88"
	border	= "0"
	hspace	= "16"
	align	= "left"
	alt	= "Valid XHTML 1.0!" /></a>
<p>	Note that the layout (with tabs and alignment) is
	purely for readability - XHTML doesn't require it.</p>
</body>
</html>


The differences between HTML and XHTML are beautifully summarized in Introduction to XHTML: Differences with HTML

F. Synchronized Multimedia Integration Language (SMIL 2.0)

SMIL, pronounced "smile", has the following two design goals:

1) Define an XML-based language that allows authors to write interactive multimedia presentations. Using SMIL 2.0, an author can describe the temporal behavior of a multimedia presentation, associate hyperlinks with media objects and describe the layout of the presentation on a screen.

2) Allow reusing of SMIL syntax and semantics in other XML-based languages, in particular those that need to represent timing and synchronization. For example, SMIL 2.0 components are used for integrating timing into XHTML [XHTML10] and into SVG [SVG].

The specification for SMIL is published by the W3C.

Example 1: Simple timing within a Parallel time container

Note: In the examples below, the additional syntax related to layout and other issues specific to individual document types is omitted for simplicity.

All the children of a par begin by default when the par begins. For example:

<par>
  <img id="i1" dur="5s"  src="img.jpg" />
  <img id="i2" dur="10s" src="img2.jpg" />
  <img id="i3" begin="2s" dur="5s" src="img3.jpg" />
</par>

Elements "i1" and "i2" both begin immediately when the par begins, which is the default begin time. The active duration of "i1" ends at 5 seconds into the par. The active duration of "i2" ends at 10 seconds into the par. The last element "i3" begins at 2 seconds since it has an explicit begin offset, and has a duration of 5 seconds which means its active duration ends 7 seconds after the par begins.
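The arithmetic in this example is simple enough to script. The sketch below is only a model of the rule just described (each child starts at its begin offset, measured from the start of the par, and ends one duration later). The function name par_schedule and the tuple layout are assumptions of this example.

```python
def par_schedule(children):
    """Map each child id to its (begin, end) time in seconds from par start.

    `children` is a list of (id, begin_offset_s, dur_s) tuples; a child
    without an explicit begin attribute has offset 0.
    """
    return {child_id: (begin, begin + dur) for child_id, begin, dur in children}


schedule = par_schedule([("i1", 0, 5), ("i2", 0, 10), ("i3", 2, 5)])
for child_id, (begin, end) in schedule.items():
    print(f"{child_id}: active from {begin}s to {end}s")
```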

Example 2: end specifies end of active dur

There is an important difference between the semantics of end and dur. The dur attribute, in conjunction with the begin time, specifies the simple duration for an element. This is the duration that is repeated when the element also has a repeat specified. The attribute end on the other hand overrides the active duration of the element. If the element does not have repeat specified, the active duration is the same as the simple duration. However, if the element has repeat specified, then the end will override the repeat, but will not affect the simple duration. For example:

<seq repeat="10" end="">
  <img src="img1.jpg" dur="2s" />
  <img src="img2.jpg" dur="2s" />
  <img src="img3.jpg" dur="2s" />
</seq>

The sequence will play for 6 seconds on each repeat iteration. It will play through 10 times, unless the user clicks on a "stopBtn" element before 60 seconds have elapsed.

Example 3: Cascading Time Model

The time manipulations are based upon a model of cascading time. That is, each element defines its active and simple time as transformations of the parent simple time. This recurses from the root time container to each "leaf" in the time graph. If a time container has a time manipulation defined, this will be reflected in all children of the time container, since they define their time in terms of the parent time container. In the following example a sequence time container is defined to run twice as fast as normal (i.e. twice as fast as its respective time container).

<seq speed="2.0">
  <video src="movie1.mpg" dur="10s" />
  <video src="movie2.mpg" dur="10s" />
  <img src="img1.jpg" begin="2s" dur="10s">
    <animateMotion from="-100,0" to="0,0" dur="10s" />
  </img>
  <video src="movie4.mpg" dur="10s" />
</seq>

The entire contents of the sequence will be observed to play (i.e., to progress) twice as fast. Each video child will be observed to play at twice the normal rate, and so will only last for 5 seconds. The image child will be observed to delay for 1 second (half of the specified begin offset). The animation child of the image will also "inherit" the speed manipulation from the sequence time container, and so will run the motion twice as fast as normal, leaving the image in the final position after only 5 seconds. The simple duration and the active duration of the sequence will be 21 seconds (42 seconds divided by 2).
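The figures quoted here can be checked with a few lines of arithmetic. The sketch below is a simplified model, assuming that the children of a seq play one after another and that a speed factor divides every begin offset and duration. The name observed_seq_duration is hypothetical.

```python
def observed_seq_duration(children, speed=1.0):
    """Observed length in seconds of a sequence of (begin_offset_s, dur_s) children."""
    unscaled = sum(begin + dur for begin, dur in children)
    return unscaled / speed


# Two videos, the image with its 2 s begin offset, and the last video.
CHILDREN = [(0, 10), (0, 10), (2, 10), (0, 10)]

print(observed_seq_duration(CHILDREN))        # at normal speed
print(observed_seq_duration(CHILDREN, 2.0))   # with speed="2.0"
```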

Example 4: Acceleration Attributes

These attributes define a simple acceleration and deceleration of element time, within the simple duration. The values are expressed as a proportion of the simple duration (i.e. between 0 and 1), and are defined such that the length of the simple duration is not changed by the use of these attributes. The normal play speed within the simple duration is increased to compensate for the periods of acceleration and deceleration (this is how the simple duration is preserved). The modified speed is termed the run rate. As the simple duration progresses (i.e., plays back), acceleration causes the rate of progress to increase from a rate of 0 up to the run rate. Progress continues at the run rate until the deceleration phase, when progress slows from the run-rate down to a rate of 0. The SMIL code:

<animation dur="10s" accelerate="0.3" decelerate="0.3" .../>

produces the trajectory:
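The trajectory can also be computed. Because the simple duration must be preserved, the area under the speed curve has to equal 1 (in units of the simple duration), which gives a run rate r = 1 / (1 - a/2 - d/2) for accelerate a and decelerate d. The sketch below works under that reasoning, assuming linear speed ramps; the function names are hypothetical.

```python
def run_rate(accelerate, decelerate):
    """Speed during the constant phase, relative to normal play speed."""
    return 1.0 / (1.0 - accelerate / 2.0 - decelerate / 2.0)


def progress(t, accelerate, decelerate):
    """Fraction of the simple duration completed at normalized time t in [0, 1]."""
    r = run_rate(accelerate, decelerate)
    if accelerate > 0 and t < accelerate:
        # Speed ramps linearly from 0 up to r.
        return r * t * t / (2.0 * accelerate)
    if decelerate > 0 and t > 1.0 - decelerate:
        # Speed ramps linearly from r down to 0.
        remaining = 1.0 - t
        return 1.0 - r * remaining * remaining / (2.0 * decelerate)
    return r * (t - accelerate / 2.0)


print(run_rate(0.3, 0.3))
for tenth in range(0, 11):
    t = tenth / 10.0
    print(f"t = {t:.1f}  progress = {progress(t, 0.3, 0.3):.3f}")
```

For the 10-second animation above, multiply t by 10 to get seconds; the element still finishes at exactly 10 seconds.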

Example 5: Shape Animation of a Rectangle

As a simple example, the following defines an animation of an SVG rectangle shape.  The rectangle will change from being tall and thin to being short and wide.

<rect ...>
  <animate attributeName="width"  from="10px"  to="100px" 
     begin="0s" dur="10s" />
  <animate attributeName="height" from="100px" to="10px"
     begin="0s" dur="10s" />
</rect>

The rectangle begins with a width of 10 pixels and increases to a width of 100 pixels over the course of 10 seconds. Over the same ten seconds, the height of the rectangle changes from 100 pixels to 10 pixels.
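The in-between values can be sketched as a straight-line interpolation. This assumes the animation changes at a constant rate between its from and to values, which the example does not state explicitly. The function name animated_value is hypothetical.

```python
def animated_value(from_value, to_value, dur, t):
    """Attribute value at time t (seconds), moving linearly over dur seconds."""
    fraction = min(max(t / dur, 0.0), 1.0)  # clamp to the active interval
    return from_value + (to_value - from_value) * fraction


for t in (0, 5, 10):
    width = animated_value(10, 100, 10, t)
    height = animated_value(100, 10, 10, t)
    print(f"t = {t:2d}s  width = {width:.0f}px  height = {height:.0f}px")
```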

Example 6: SMIL Transitions

Consider a simple still image slideshow of four images, each displayed for 5 seconds. Using SMIL Timing, this slideshow might look like the following:

<seq>
  <img src="butterfly.jpg" dur="5s" ... />
  <img src="eagle.jpg"     dur="5s" ... />
  <img src="wolf.jpg"      dur="5s" ... />
  <img src="seal.jpg"      dur="5s" ... />
</seq>

Currently when this presentation plays, we see a straight "cut" from one image to another, as shown in this animated image. However, what we would like to see are three left-to-right wipes in between the four images. We can get these by modifying the code as follows:

<transition id="wipe1" type="barWipe" subtype="leftToRight" dur="1s"/>
<seq>
  <img src="butterfly.jpg" dur="5s" fill="transition" ... />
  <img src="eagle.jpg"     dur="5s" fill="transition" transIn="wipe1" ... />
  <img src="wolf.jpg"      dur="5s" fill="transition" transIn="wipe1" ... />
  <img src="seal.jpg"      dur="5s"                   transIn="wipe1" ... />
</seq>

Now the presentation plays as follows, as illustrated by this animated image.

  • At 0 seconds, we cut directly to butterfly.jpg.
  • At 5 seconds we begin a 1-second left-to-right wipe from butterfly.jpg to eagle.jpg.
  • At 6 seconds, eagle.jpg is fully displayed and remains displayed for 4 more seconds until 10 seconds.
  • At 10 seconds, we begin a 1-second left-to-right wipe from eagle.jpg to wolf.jpg.
  • At 11 seconds, wolf.jpg is fully displayed for 4 more seconds until 15 seconds.
  • At 15 seconds we begin a 1-second left-to-right wipe from wolf.jpg to seal.jpg.
  • At 16 seconds, seal.jpg is fully displayed for 4 more seconds until 20 seconds. At 20 seconds the presentation ends.
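The timeline above follows a regular pattern, so it can be generated for any number of slides. The sketch below is just that pattern written out, assuming every image has the same duration and every image after the first wipes in. The function name slideshow_timeline is hypothetical.

```python
def slideshow_timeline(images, dur, wipe):
    """Return (time_s, event) pairs for a slideshow with wipe-in transitions."""
    events = []
    for index, name in enumerate(images):
        start = index * dur
        if index == 0:
            events.append((start, f"cut to {name}"))
        else:
            previous = images[index - 1]
            events.append((start, f"begin wipe from {previous} to {name}"))
            events.append((start + wipe, f"{name} fully displayed"))
    events.append((len(images) * dur, "presentation ends"))
    return events


SLIDES = ["butterfly.jpg", "eagle.jpg", "wolf.jpg", "seal.jpg"]
for time_s, event in slideshow_timeline(SLIDES, dur=5, wipe=1):
    print(f"{time_s:2d}s  {event}")
```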

G. Internationalizing the HTML Character Set

Consider the following Trends in WWW Page Design

Trend 1: Internationalization of the WWW

The Problem:

  • The potential growth of the WWW is even greater internationally than domestically. See, for instance, the Summary by Matthew Gray
  • Most of the world's citizens do not speak English
  • English has one of the smallest alphabets
  • n bits are capable of encoding 2^n characters

The Solution:

  • Add more bits


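# of Bits | # of Characters | Name of Encoding Scheme | Standard
7 | 128 | ASCII | ISO 646
8 | 256 | 8-bit ASCII extensions | ISO 8859-1, ISO 8859-7
16 | 65,536 | Unicode | Unicode Consortium, ISO 10646 BMP
32 | 4.295 x 10^9 | Universal Multiple-Octet Coded Character Set (UCS) | ISO 10646
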
  • Note: Even 7-Bit ASCII can't represent 128 Characters
    • First 32 codes represent (invisible) control characters
    • Most of the rest produce visible shapes called glyphs
    • Only English, Latin, and Swahili can be represented by 7-Bit ASCII

  • Extending the number of bits accommodates more language character sets
    • 8-Bit ASCII and its extensions cover most Western languages
    • Unicode 16-bit representation includes 20,000 ideographs of Asian languages (CJK = Chinese, Japanese, and Korean)
    • Unicode remains only about 70% assigned
    • ISO 10646 code of 31 bits spans over 2 billion possible characters
    • A subset of ISO 10646 called Basic Multilingual Plane is identical to Unicode

  • Automatic Machine Translation
  • Ray Kurzweil has made great progress in this area
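The "add more bits" argument is easy to verify numerically. The sketch below simply evaluates 2^n for the code sizes discussed above and tests whether a string fits in 7 bits. Both function names are assumptions of this example.

```python
def capacity(bits):
    """Number of distinct codes that `bits` binary digits can represent."""
    return 2 ** bits


def fits_in_ascii(text):
    """True if every character has a code below 128, i.e. fits in 7 bits."""
    return all(ord(ch) < capacity(7) for ch in text)


for bits in (7, 8, 16, 31, 32):
    print(f"{bits:2d} bits -> {capacity(bits):,} codes")

print(fits_in_ascii("comme ca"))   # unaccented Latin letters only
print(fits_in_ascii("comme ça"))   # the cedilla needs more than 7 bits
```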

MIME --> Multipurpose Internet Mail Extensions

In 1992, the Network Working Group published RFC 1341 extending the ability of Internet e-mail to handle various non-text file formats

Examples of MIME Types and corresponding Extensions include:

Content-Type | Extensions
application/octet-stream | bin, exe
text/html | html, htm
text/plain | txt
text/richtext | rtx
video/mpeg | mpeg, mpg, mpe
video/quicktime | qt, mov
video/x-msvideo | avi
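A server or mail program uses a table like this to choose the Content-Type from a file's extension. The sketch below builds the lookup from the table above. Falling back to application/octet-stream for unknown extensions is an assumption of this example, and the function name content_type is hypothetical.

```python
import os

# Content types and extensions taken from the table above.
MIME_TYPES = {
    "application/octet-stream": ("bin", "exe"),
    "text/html": ("html", "htm"),
    "text/plain": ("txt",),
    "text/richtext": ("rtx",),
    "video/mpeg": ("mpeg", "mpg", "mpe"),
    "video/quicktime": ("qt", "mov"),
    "video/x-msvideo": ("avi",),
}

# Invert the table: one entry per extension.
EXTENSION_TO_TYPE = {
    extension: mime_type
    for mime_type, extensions in MIME_TYPES.items()
    for extension in extensions
}


def content_type(filename, default="application/octet-stream"):
    """Look up the Content-Type for a file name, ignoring letter case."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return EXTENSION_TO_TYPE.get(extension, default)


for name in ("index.HTML", "notes.txt", "clip.mov", "archive.zip"):
    print(f"{name}: {content_type(name)}")
```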

Language Identification & Markup

HTML 4.0 introduced the LANG attribute:

    Ce paragraphe est en Français

<P LANG="fr">Ce paragraphe est en Français</P>


The code:

<P LANG="en"> The English language always uses quotes <Q>like this</Q>,
French has <Q LANG="fr">comme &ccedil;a</Q>
and German prefers <Q LANG="de">wie hier</Q>.</P>


    The English language always uses quotes like this, French has comme ça and German prefers wie hier.

So the language-specific quotation marks of HTML 4.0 do not seem to be implemented in current browsers.

H.  Pure HTML

We should distinguish "Pure HTML" from HTML "Extensions" 

  • Pure HTML is what the W3C says it is.
  • Extensions are what Netscape, Microsoft, Mosaic, etc. say they are.

Why are Extensions created?

  • Obvious missing features in existing HTML
  • Sometimes based on proposed standards
  • Desire to be different (unique, innovative, state-of-the-art?)

Benefits of Pure HTML

  • Advantages of being part of a standard
  • Everyone will support your page
  • Platform Independence
  • Feedback - more people look at your page

Disadvantages of Using Pure HTML

  • "Standard" Web pages are pretty tame
  • Standards Committees stifle Creativity
  • New Standard Tags take a long time to arrive

 Advantages of Extensions

  • Provide immediate access to cutting-edge technology
  • De facto standards become accepted standards

Disadvantages of Extensions

  • Syntax can change, leaving you hanging
  • Some extensions never get supported

Author Recommends:

  • If it sounds like I'm discouraging the use of extensions, that's because I am.
  • Frames is one of the most widely used extensions around.
  • ..some corporate sites that initially had multimedia files as part of their Web pages soon took them out.
  • Perhaps the best piece of advice is to not work on the bleeding edge.

I.  HTML Beyond the Web

The key point is: HTML is a Universal Language

  • The essential role of computers has shifted from calculation to communication
  • Information Technology (IT) is getting easier and less expensive
  • The distinction between an individual computer, a local network, and the WWW is blurring [cf. Sherlock]
  • Physical location is no longer the primary factor in the market, education, commerce, or the workplace.

HTML has become the New User Interface - the Lingua Franca

  • It provides a painless route to integrating text, graphics, sound, video, and interactive programs
  • It provides the natural route to digital, electronic communication - video teleconferencing
  • It provides the generality and abstraction of programming languages - Java

HTML Applications of the Future

  • Kiosks with HTML-based interactive content
  • Sucking up content from CD-ROMs - Encyclopedia Britannica
  • Delivery of Academic Journals and Proto-TextBooks
  • Corporate Intranet applications - Newsletters
  • Delivery of Education - Remote and Distance Learning
  • Vehicle for conducting E-Commerce
  • Interactive, VR Game Delivery
  • Medium for creation, presentation, and dissemination of Art (see above)



Web Design in a Nutshell, Jennifer Niederst, O'Reilly & Associates, Sebastopol, CA (1999)

Just XML, John E. Simpson, Prentice Hall PTR, Upper Saddle River, NJ (1999)
