Kbase 19502: XML 101 - A Brief Technical Introduction
Author: Progress Software Corporation - Progress
Access: Public
Published: 8/21/2003
Status: Technically Reviewed
GOAL:
XML 101 - A Brief Technical Introduction
FIX:
XML
---
The Extensible Markup Language (XML) is a data format for structured
document interchange on the Web. It is hardware architecture neutral,
application-independent, flexible, yet simple and powerful.
XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996.
The World Wide Web Consortium's official recommendation for XML and a
variety of related materials can be found at the following URL:
http://www.w3.org/XML/.
XML is a subset of another markup language called SGML, which was adopted as an international standard in 1986 [ISO 8879]. SGML is based on a markup language called GML, which was developed by researchers at IBM in 1969. SGML is quite complex and the XML subset was created to eliminate the complexity while keeping the value.
XML describes a class of data objects called "XML documents" and partially describes the behavior of computer programs that process them. XML is an "application profile" or restricted form of SGML. By construction, XML documents are conforming SGML documents.
XML documents are made up of storage units called "entities", which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form "markup". Markup encodes a description of a document's content and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
A software module called an "XML processor" is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of an application. The XML specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.
XML Documents
-------------
XML documents are made up of two parts: a prologue and a body. The optional prologue may contain the XML version the document conforms to, information about the character encoding used to encode the contents of the document, and a "document type definition" (DTD) which describes the grammar and vocabulary of the document. The body may contain elements, entity references, and other markup information.
Elements represent the logical components of documents. They can contain data or other elements. For example, a customer element can contain a number of column (field) elements and each column element a data value.
Here is an example of an element:
<name>Clyde</name>
Note that the element begins with the construct <name> and ends with
</name>. These are delimiters called "tags" that specify the beginning and end of the element and what its name is. The characters between the delimiters form the element's contents or data. XML does not have any predefined tags. We are free to use whatever tags we wish, as long as the names abide by a few simple restrictions imposed by the XML recommendation.
Elements can have additional information called "attributes" attached to them. Attributes are used to describe properties of elements.
Here is an example of an element with an attribute:
<name emp-num="1">Mary</name>
Here is an example of elements that contain other elements nested within them:
<phonelist>
<entry>
<name>Chip</name>
<extension>3</extension>
</entry>
</phonelist>
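The following is a minimal sketch of how an application might read the elements and attributes shown above through an XML processor. It assumes Python and its standard-library xml.etree.ElementTree module, neither of which is part of the original article; the function name read_entries is hypothetical.

```python
import xml.etree.ElementTree as ET

# The nested phone list from the example above, held in a string.
PHONELIST_XML = """
<phonelist>
<entry>
<name>Chip</name>
<extension>3</extension>
</entry>
</phonelist>
""".strip()


def read_entries(xml_text):
    """Return one (name, extension) tuple per <entry> element."""
    root = ET.fromstring(xml_text)  # root is the <phonelist> element
    entries = []
    for entry in root.findall("entry"):
        # findtext returns the character data of the named child element.
        entries.append((entry.findtext("name"), entry.findtext("extension")))
    return entries


entries = read_entries(PHONELIST_XML)
print(entries)  # [('Chip', '3')]

# An element with an attribute: the data and the attribute are read separately.
mary = ET.fromstring('<name emp-num="1">Mary</name>')
print(mary.tag, mary.text, mary.get("emp-num"))  # name Mary 1
```

The application never deals with the tags themselves; it asks the processor for elements by name and receives their contents.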
Document Type Definitions
--------------------------
There is an infinite variety of possible kinds of documents, such as the repair manual for a vehicle, a dictionary, a telephone directory, an order for equipment, an invoice, and so forth. Each kind of document can have a unique structure and organization that can be used over and over.
The descriptions of classes of documents are called "Document Type
Definitions" or DTDs. DTDs are sets of rules that define the required and optional elements that can be used in a document, and what the relationships among the various elements are. A DTD can be included as part of the content of an XML document, or it can be separate from it and referred to by the document.
Here is an example of a small document that includes a DTD in its prologue.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE customer
[
<!ELEMENT customer (name, cust-num)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT cust-num (#PCDATA)>
]>
<customer>
<name>Lift Line Skiing</name><cust-num>1</cust-num>
</customer>
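As a rough illustration, the document above can be loaded with a DOM parser to inspect its prologue and body. This sketch assumes Python's standard-library xml.dom.minidom module, which is not part of the original article. Note that the parser behind it is non-validating: it reads the DTD but does not check the body against it, so the comparison at the end is only a simple consistency check, not DTD validation.

```python
from xml.dom import minidom

# The customer document from the example above, including its internal DTD.
CUSTOMER_XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE customer
[
<!ELEMENT customer (name, cust-num)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT cust-num (#PCDATA)>
]>
<customer>
<name>Lift Line Skiing</name><cust-num>1</cust-num>
</customer>
"""

doc = minidom.parseString(CUSTOMER_XML)

# The prologue: the document type named in the <!DOCTYPE ...> declaration.
doctype_name = doc.doctype.name

# The body: the root element and the data of its two child elements.
root_name = doc.documentElement.tagName
name = doc.getElementsByTagName("name")[0].firstChild.data
cust_num = doc.getElementsByTagName("cust-num")[0].firstChild.data

print(doctype_name, root_name)  # customer customer
print(name, cust_num)           # Lift Line Skiing 1

# The root element must carry the name given in the DOCTYPE declaration.
root_matches_doctype = (doctype_name == root_name)
```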
The DOM
-------
The Document Object Model (DOM) is an application programming interface (API) for HTML and XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated. The DOM is an object model that represents XML documents in a platform-neutral and application-independent form as a "tree" of objects of various types.
The W3C's official recommendation for the DOM and a variety of related
materials may be found at the following URL: http://www.w3.org/DOM/.
In the DOM specification, the term "document" is used in the broad sense - increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems. Much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manipulate this data.
With the Document Object Model, programmers can build documents, navigate their structure, and add, modify, or delete elements and content. Anything found in an HTML or XML document can be accessed, changed, deleted, or added using the Document Object Model, with a few exceptions - in particular, the DOM interfaces for the XML internal and external subsets have not yet been specified.
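The operations described above (building, navigating, modifying, and deleting) can be sketched as follows. This assumes Python's standard-library xml.dom.minidom, one implementation of the DOM interfaces that is not mentioned in the original article. The helper add_entry and the second entry's extension value are hypothetical and used only for illustration.

```python
from xml.dom import minidom

# Build: create an empty document and give it a <phonelist> root element.
doc = minidom.Document()
phonelist = doc.createElement("phonelist")
doc.appendChild(phonelist)


def add_entry(doc, parent, name, extension):
    """Append <entry><name>..</name><extension>..</extension></entry> to parent."""
    entry = doc.createElement("entry")
    for tag, value in (("name", name), ("extension", extension)):
        child = doc.createElement(tag)
        child.appendChild(doc.createTextNode(value))
        entry.appendChild(child)
    parent.appendChild(entry)
    return entry


chip = add_entry(doc, phonelist, "Chip", "3")
mary = add_entry(doc, phonelist, "Mary", "7")  # hypothetical second entry

# Navigate and modify: find Chip's <extension> and change its text node.
extension = chip.getElementsByTagName("extension")[0]
extension.firstChild.data = "4"

# Delete: remove Mary's <entry> subtree from the tree.
phonelist.removeChild(mary)

result = phonelist.toxml()
print(result)
# <phonelist><entry><name>Chip</name><extension>4</extension></entry></phonelist>
```

Every change is made on the in-memory tree; the XML text is produced only when the tree is serialized at the end.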
The SAX API
-----------
The DOM API allows manipulation of an XML document after it has been
completely parsed. Once the entire document has been converted into its "DOM tree" representation, an application can use it. There is another API called the "Simple API for XML" (SAX) under development.
With the SAX API, the XML parser generates events via a callback mechanism. As the document is being parsed, events are generated for the start and end of the document and for the start and end of the elements contained within it. SAX provides an alternative programming model for working with XML documents. It is simpler than the DOM API, but does not include all of the functionality provided by the DOM.
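The callback model can be sketched as below, assuming Python's standard-library xml.sax module (not part of the original article). The handler class EventRecorder is a hypothetical name; it simply records each event in the order the parser reports it.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Record every parser callback as a short string, in order of arrival."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        # Ignore whitespace-only character data between elements.
        if content.strip():
            self.events.append("text " + content.strip())


recorder = EventRecorder()
xml.sax.parseString(
    b"<phonelist><entry><name>Chip</name>"
    b"<extension>3</extension></entry></phonelist>",
    recorder,
)

for event in recorder.events:
    print(event)
```

No tree is built here: the application sees each event once, as it happens, and must keep any state it needs itself. That is the trade-off against the DOM described above.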
HTML and XML
------------
HTML is useful for describing the visualization of text documents and
related images on the World Wide Web. It has a number of deficiencies:
- HTML tags describe only the visualization of the data. There is no
facility for describing the data themselves. And the visualization is only approximate.
- HTML is not extensible. The tags defined by the HTML specification are the only ones that can be used.
- Many HTML documents that exist today do not conform to the HTML
specification. Browsers tolerate many variations and structural errors as well as allowing proprietary extensions.
XML addresses these deficiencies, and others. Since XML is extensible, one of the natural extensions is to describe HTML in terms of XML. This is relatively straightforward because both are derived from SGML. It is expected that the HTML 4.0 specification will be superseded by XHTML 1.0 in the near future. Once this occurs, there will be gradual conversion of many HTML documents to XHTML. However, HTML in its current form will probably be supported for many years to come.
Summary
-------
XML is:
* Not new
* Hardware architecture neutral
* Application independent
* Extensible
* Flexible
* Simple
* Powerful
* Human-readable
XML is useful for many things but it is not the solution to every problem. It will make solving certain problems easier than it might be without it. Data interchange among applications is an area where XML can be tremendously useful. This is so because when XML is used to encode messages, exchanging data becomes simpler than when messages are encoded in some binary form.