Title

XML reader syntax

Author

Per Bothner <per@bothner.com>

Status

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 107 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

The Kawa Scheme implementation has working support for this reader extension. (Some details are still in flux, depending on how this specification ends up.)

Abstract

We specify a reader extension that reads data in a superset of XML/HTML format, and produces conventional S-expressions. We also suggest a possible semantic interpretation of how these forms may be evaluated to produce XML-node values, but this is non-normative (???).

Rationale

While XML may be a poor re-invention of S-expressions, many people are familiar with it. Furthermore, when working with XML or HTML data, using XML syntax may be preferable to S-expressions. This specification defines a Scheme reader extension matching XML syntax with expression escapes (unquote), a translation into standard S-expressions, and a semantics for the latter.

Some other programming languages also define a syntax for XML literals. Examples include ECMAScript for XML (E4X), Visual Basic, and XQuery.

Here is a simple example:

#<p>The result is <b>final</b>!</p>

Actually, these are really quasi-literals since they can contain enclosed expressions, which are unquoted:

#<em>The result is &[result].</em>

The value of result is substituted into the output, in a similar way to quasi-quotation. Notice the use of &, which is used in XML for character and entity references, but we use it as a multi-purpose prefix character to avoid adding extra special characters that might need escaping.

The specification does not define a Scheme API for working with XML data. It assumes there is some data type which we here call an XML-node. This specification does not require the XML-node type to be distinct from other types. Many Scheme XML libraries just use lists to encode XML-nodes. However, newer Schemes that have an extensible type system are encouraged to make XML-node a distinct type. This follows the W3C Document Object Model (DOM).

The XML data model distinguishes between a document node and a document element. A document element is just an XML element node that is the top-level element in a document. A document node is a special kind of node whose primary child is the document element, but may have other children (comments and processing instructions). This specification provides a syntax for creating XML elements, but does not have any special provisions for creating document nodes.

Discussion: It has been suggested that this specification is over-large, and that it should focus on just the reader syntax and on how it is mapped to S-expressions, leaving the semantics for other specification(s). Alternatively, re-organization is suggested. (See Cowan 2012-11-18.)

Discussion: (Not necessarily part of this specification, but perhaps a future specification.) It seems useful to specify a syntax for document nodes. One solution is to use a SRFI-108 named literal, whose body is the XML text (with optional enclosed expressions). For example:

#xml{<!DOCTYPE HTML>
<html>
<body>Hello &[name]!</>
</html>}

One could also support more structured prefix arguments:

#&xml[version: 1.1 encoding: "UTF-8" standalone: #t
  doctype: "HTML"
  public: "-//W3C//DTD HTML 4.01 Transitional//EN"]
{
<html>...</>
}

Specification

An xml-literal is usually an element constructor. We'll cover later the less common processing-instruction, comment, and CDATA-section forms.

xml-literal ::= #xml-constructor
xml-constructor ::= xml-element-constructor
  | xml-PI-constructor
  | xml-comment-constructor
  | xml-CDATA-constructor

Qualified names

The names of elements and attributes are qualified names (QNames). The lexical syntax for a QName is either a simple identifier, or a (prefix,local-name) pair:

QName ::= xml-local-part
   | xml-prefix:xml-local-part
xml-local-part ::= identifier
xml-prefix ::= identifier

Sometimes one needs to calculate the QName at runtime, evaluating an expression instead of using a literal QName:

xml-name-form ::= QName
  | xml-enclosed-expression
xml-enclosed-expression ::=
    [expression]
  | (expression...)

The first variant is the general case; the second variant (expression...) is just syntactic sugar for: [(expression...)]. For example, the following forms are equivalent:

#<[(if be-bold 'strong 'em)]>important</>
#<(if be-bold 'strong 'em)>important</>

When evaluating the expression (in the first variant), the result is a QName value. While this specification does not define an API or representation for QName values, it is an object with three string components: the local name part, the prefix part, and the namespace URI part. The local name and the prefix parts match the parts in a literal QName, while the namespace URI part is an arbitrary globally unique string. Two QNames are considered equivalent if they have the same local name part and namespace URI part, even if the prefix parts are different. The prefix is used for input and output; it can be considered a local nickname for a namespace URI. The binding from a prefix to a namespace URI can be defined using an xml-namespace-declaration-attribute. An implementation may also define such bindings using Scheme code; for example Kawa has a define-namespace form.

A symbol is considered equivalent to a QName whose local name part is the string name of the symbol, and whose prefix and namespace URI are both empty, as long as the name of the symbol matches the syntax of identifier and does not contain a colon. The result is implementation-defined if a symbol's name contains a colon.
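
The equivalence rules above can be illustrated with a minimal sketch in portable R7RS Scheme. The record type and the names make-qname, ->qname, and qname-equivalent? are assumptions made for this illustration only; this specification does not define a QName API. The sketch does not check that a symbol's name matches the syntax of identifier, and it signals an error in the implementation-defined colon case.

```scheme
(import (scheme base))

;; Hypothetical QName representation with three string components.
(define-record-type <qname>
  (make-qname local-name prefix namespace-uri)
  qname?
  (local-name qname-local-name)
  (prefix qname-prefix)
  (namespace-uri qname-namespace-uri))

;; Coerce a symbol to the QName it is equivalent to: empty prefix and
;; empty namespace URI.  A colon in the name is implementation-defined
;; in the specification; this sketch chooses to signal an error.
(define (->qname x)
  (cond ((qname? x) x)
        ((symbol? x)
         (let ((name (symbol->string x)))
           (if (memv #\: (string->list name))
               (error "symbol name contains a colon" x)
               (make-qname name "" ""))))
        (else (error "not a QName or a symbol" x))))

;; Two QNames are equivalent when local name and namespace URI agree;
;; the prefix is deliberately ignored.
(define (qname-equivalent? a b)
  (let ((a (->qname a))
        (b (->qname b)))
    (and (string=? (qname-local-name a) (qname-local-name b))
         (string=? (qname-namespace-uri a) (qname-namespace-uri b)))))
```

With these definitions, two QNames that share a namespace URI but use the prefixes "h" and "xhtml" compare as equivalent, and the symbol em is equivalent to a QName with local name "em" and empty prefix and namespace URI.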

Element constructors

xml-element-constructor ::=
    <QName xml-attribute...>xml-element-datum...</QName>
  | <xml-name-form xml-attribute...>xml-element-datum...</>
  | <xml-name-form xml-attribute.../>

The first xml-element-constructor variant uses a literal QName, and looks like a standard non-empty XML element, where the starting QName and the ending QName must match exactly:

#<a href="next.html">Next</a>

As a convenience, you can leave out the name in the ending tag(s):

#<para>This is a paragraph in <emphasis>DocBook</> syntax.</>

You can use an expression to compute the element tag at runtime; in that case you must leave out the name in the ending tag:

#<p>This is <[(if be-bold 'strong 'em)]>important</>!</p>

The third xml-element-constructor variant above is an XML “empty element”; it is equivalent to the second variant when there are no xml-element-datum items.

(Note that every well-formed XML element, as defined in the XML specifications, is a valid xml-element-constructor, but not vice versa.)

Element contents (children)

The “contents” (children) of an element are a sequence of character (text) data, nested nodes, and enclosed (unquoted) expressions. The latter are discussed later. The characters &, <, and > are special, and need to be escaped.

xml-element-datum ::=
    any character except & or <.
  | xml-constructor
  | xml-escaped

A nested xml-constructor is equivalent to an xml-literal (i.e. the xml-constructor prefixed by a #) inside an enclosed expression. For example:

#<p>This is <em>important</em>!</p>

is equivalent to:

#<p>This is &[#<em>important</em>]!</p>

xml-escaped ::=
    &xml-enclosed-expression
  | &xml-entity-name;
  | xml-character-reference
xml-character-reference ::=
    &# digit digit... ;
  | &#x hex-digit hex-digit... ;

Here is an example with both hex and decimal character references:

#<p>A&#66;C&#x44;E</p>  ⟹  <p>ABCDE</p>

xml-entity-name ::= identifier

Currently, the only supported values for xml-entity-name are the builtin XML names lt, gt, amp, quot, and apos, which stand for the characters <, >, &, ", and ', respectively. The following two expressions are equivalent:

#<p>&lt; &gt; &amp; &quot; &apos;</p>
#<p>&["< > & \" '"]</p>
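
As an illustration of how a reader might resolve the character references and built-in entity names described above, here is a minimal sketch in portable R7RS Scheme. The names builtin-entities and decode-reference are hypothetical; the argument is assumed to be the text between the & and the ; of a reference. The sketch does not validate that the number is a legal XML character.

```scheme
(import (scheme base))

;; The five built-in XML entity names and the characters they stand for.
(define builtin-entities
  '(("lt" . #\<) ("gt" . #\>) ("amp" . #\&) ("quot" . #\") ("apos" . #\')))

;; Decode the text between "&" and ";":
;;   "#66"  decimal character reference
;;   "#x44" hexadecimal character reference
;;   "lt"   built-in entity name
;; Returns a character, or #f if the reference is not recognized.
(define (decode-reference body)
  (define (code->char n)
    (and n (exact-integer? n) (>= n 0) (integer->char n)))
  (let ((len (string-length body)))
    (cond ((and (> len 2) (string=? (substring body 0 2) "#x"))
           (code->char (string->number (substring body 2 len) 16)))
          ((and (> len 1) (char=? (string-ref body 0) #\#))
           (code->char (string->number (substring body 1 len) 10)))
          (else
           (let ((entry (assoc body builtin-entities)))
             (and entry (cdr entry)))))))
```

For example, (decode-reference "#66") yields the character B and (decode-reference "#x44") yields D, matching the character-reference example earlier in this section.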

Attributes

An attribute associates an attribute name with an attribute value. This is done using an xml-true-attribute form, which is an xml-attribute that does not have the form of xml-namespace-declaration-attribute. I.e. in an xml-true-attribute the attribute name may not be the special reserved name xmlns, nor may it be a QName whose prefix is the special reserved name xmlns.

xml-attribute ::=
    xml-true-attribute
  | xml-namespace-declaration-attribute

A true attribute has the form name=value. It can also be an enclosed expression that evaluates to an attribute node value.

xml-true-attribute ::=
    xml-name-form=xml-attribute-value
  | xml-enclosed-expression
xml-attribute-value ::=
    " quot-attribute-datum* "
  | ' apos-attribute-datum* '
quot-attribute-datum ::=
    any character except ", &, or <.
  | xml-escaped
apos-attribute-datum ::=
    any character except ', &, or <.
  | xml-escaped

Discussion: When an attribute-value is specified by an expression, having to write an xml-escaped inside string quotes seems clumsy. We could allow the much simpler:

xml-attribute-value ::= ...
  | [ expression ]

Handling of enclosed expressions

Both element content and attribute values may contain xml-enclosed-expressions. These are expressions evaluated at runtime, where the evaluated result becomes part of the element content or the attribute value.

If the expression evaluates to an element, comment, or processing-instruction node, and the context is element content, then the node is added as a child of the element. It is unspecified if the node is copied or shared. It is also unspecified if the expression result is some other kind of XML-node, or the context is an attribute value.

If the expression evaluates to a string, the result is pasted as a text (child) content of an element or a substring of an attribute value, respectively.

If the expression evaluates to a CDATA section, the result is equivalent to the string value of the section.

If the expression evaluates to some other scalar value (including numbers, booleans, and characters) the value is converted to a string according to implementation-specified rules. An implementation MAY convert a value as if using display. Alternatively, an implementation MAY convert a value to yield a canonical representation according to the XML Schema specification. (In the latter case, Booleans #f and #t should yield false and true, respectively.)

If the expression evaluates to a list or vector, then each element is inserted into the element or attribute content. Spaces are inserted between two elements if neither element is an XML-node.

Note that some XML specifications (including XML Schema and the XQuery and XPath data model) have the concept of the typed value of a node. The typed value may be a number, a string, or another atomic type. The typed value may also be a sequence of strings, numbers, or other atomic values. Some implementations may optionally store the typed value instead of or in addition to the text value. For example:

#<prices>&(vector 230 599 98 763)</prices>

It is unspecified whether the XML-node stores the contents as a sequence of 4 integers or as the string "230 599 98 763", as long as the result prints the same way.
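
The conversion rules in this section can be sketched in portable R7RS Scheme as follows. This is only an illustration under stated assumptions: the <xml-node> record and the procedure names are hypothetical, scalars are converted as if by display (one of the two permitted choices), and the result is a flat list of strings and nodes rather than a real XML-node.

```scheme
(import (scheme base) (scheme write))

;; Hypothetical stand-in for an implementation's XML-node type.
;; kind is a symbol such as element, comment, pi, or cdata.
(define-record-type <xml-node>
  (make-xml-node kind data)
  xml-node?
  (kind xml-node-kind)
  (data xml-node-data))

;; Convert a scalar (number, boolean, character, ...) as if by display.
(define (scalar->string value)
  (let ((port (open-output-string)))
    (display value port)
    (get-output-string port)))

;; Return the list of content items (strings and nodes) that the
;; result of an enclosed expression contributes.
(define (value->content value)
  (cond ((and (xml-node? value) (eq? (xml-node-kind value) 'cdata))
         (list (xml-node-data value)))      ; CDATA acts like its string
        ((xml-node? value) (list value))    ; other nodes become children
        ((string? value) (list value))
        ((vector? value) (sequence->content (vector->list value)))
        ((list? value) (sequence->content value))
        (else (list (scalar->string value)))))

;; Flatten a sequence, inserting a space between two adjacent items
;; when neither of them is an XML-node.  previous-node? starts as #t
;; so that no space is inserted before the first item.
(define (sequence->content items)
  (let loop ((items items) (previous-node? #t) (result '()))
    (if (null? items)
        (reverse result)
        (let* ((item (car items))
               (separator (if (or previous-node? (xml-node? item))
                              '()
                              '(" "))))
          (loop (cdr items)
                (xml-node? item)
                (append (reverse (value->content item))
                        separator
                        result))))))
```

Applied to (vector 230 599 98 763), value->content returns the strings "230", "599", "98", and "763" separated by single-space strings, which concatenate to "230 599 98 763".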

Namespace declarations

An xml-prefix is an alias for a namespace-uri, and the mapping between them is defined by a namespace declaration attribute, which has the form of an xml-attribute where either the QName or the prefix is the special identifier xmlns:

xml-namespace-declaration-attribute ::=
    xmlns:xml-prefix=xml-attribute-value
  | xmlns=xml-attribute-value

The first form declares xml-prefix as a namespace alias for the namespace-uri specified by xml-attribute-value (which must be a compile-time constant). The second form declares that xml-attribute-value is the default namespace for simple (unprefixed) element tags. (A default namespace declaration is ignored for attribute names.)

Processing instructions

An xml-PI-constructor can be used to create an XML processing instruction, which can be used to pass instructions or annotations to an XML processor or tool.

xml-PI-constructor ::= <?xml-PI-target xml-PI-content?>
xml-PI-target ::= NCName (i.e. a simple (non-compound) identifier)
xml-PI-content ::= any characters, not containing ?>.

For example, the DocBook XSLT stylesheets can use the dbhtml instructions to specify that a specific chapter should be written to a named HTML file:

#<chapter><?dbhtml filename="intro.html" ?>
<title>Introduction</title>
...
</chapter>

XML comments

You can cause XML comments to be emitted in the XML output document. Such comments can be useful for humans reading the XML document, but are usually ignored by programs.

xml-comment-constructor ::= <!--xml-comment-content-->
xml-comment-content ::= any characters, not containing --.

CDATA sections

A CDATA section can be used to avoid excessive quoting in element content.

xml-CDATA-constructor ::= <![CDATA[xml-CDATA-content]]>
xml-CDATA-content ::= any characters, not containing ]]>.

A CDATA section is semantically equivalent to text consisting of the xml-CDATA-content, though some XML-node representations may record that the text came from a CDATA section so it can be written out the same way. (Kawa does this.)

The following are equivalent:

#<p>Special characters <![CDATA[< > & ' "]]> here.</p>
#<p>Special characters &lt; &gt; &amp; &apos; &quot; here.</p>

Output of XML nodes

If XML-node is a separate data-type, implementations are encouraged to use this XML-literal format when writing to an output port, since this provides input-output round-tripping. Specifically, calling write on an XML-node SHOULD write an xml-literal (with an initial #). The xml-constructor SHOULD be in standard XML syntax without using any of the extensions in this specification, such as an unnamed end tag. Calling display on an XML-node SHOULD write an xml-constructor (without an initial #). Alternatively, if the output port is an extended port that can handle rich text then an implementation MAY instead display a styled representation. For example if the XML-node is compatible with HTML, and the output port is inserting text into a browser, then the implementation may copy the DOM into the browser, perhaps resulting in styled text.
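
As a rough illustration of the write and display behavior recommended above, the following R7RS sketch serializes a hypothetical list-based element of the shape (element name ((attribute-name . value) ...) child ...), where names are symbols and each child is a string or another element. The representation and all procedure names are assumptions for this example; a real implementation would work on its own XML-node type and would also handle namespaces, comments, and processing instructions.

```scheme
(import (scheme base))

;; Replace characters that are special in XML by entity references.
;; The double quote only needs escaping inside attribute values.
(define (escape-text text in-attribute?)
  (let ((port (open-output-string)))
    (string-for-each
     (lambda (c)
       (cond ((char=? c #\&) (write-string "&amp;" port))
             ((char=? c #\<) (write-string "&lt;" port))
             ((char=? c #\>) (write-string "&gt;" port))
             ((and in-attribute? (char=? c #\"))
              (write-string "&quot;" port))
             (else (write-char c port))))
     text)
    (get-output-string port)))

;; display-style output: an xml-constructor in standard XML syntax,
;; always with a named end tag.
(define (write-constructor node port)
  (if (string? node)
      (write-string (escape-text node #f) port)
      (let ((name (symbol->string (cadr node)))
            (attributes (car (cddr node)))
            (children (cdr (cddr node))))
        (write-string "<" port)
        (write-string name port)
        (for-each (lambda (attribute)
                    (write-string " " port)
                    (write-string (symbol->string (car attribute)) port)
                    (write-string "=\"" port)
                    (write-string (escape-text (cdr attribute) #t) port)
                    (write-string "\"" port))
                  attributes)
        (write-string ">" port)
        (for-each (lambda (child) (write-constructor child port)) children)
        (write-string "</" port)
        (write-string name port)
        (write-string ">" port))))

;; write-style output: the same constructor with an initial #.
(define (xml-literal->string node)
  (let ((port (open-output-string)))
    (write-string "#" port)
    (write-constructor node port)
    (get-output-string port)))
```

For instance, (xml-literal->string '(element a ((href . "next.html")) "Next")) produces the text #<a href="next.html">Next</a>, which the reader extension can read back.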

Translation

The following specifies how the reader syntax is translated by the reader into standard S-expressions. These basically create macro invocations; the implementation is responsible for implementing those macros as described in the Translated forms section. As an example:

#<a class="title">Result: &[sum].</a>

is read as if it were:

($xml-element$ () ($resolve-qname$ a)
  ($xml-attribute$ 'class "title")
  "Result: " sum ".")

The () in the result is the translation of any namespace declaration attributes - in this case none.

The translation is defined in terms of a recursive read-time translation function Tr which maps an xml-constructor to an S-expression.

Note: This translation is preliminary. It may need to be tweaked (and debugged) a bit.

Tr[<QName xml-attribute...>xml-element-datum...</QName>]
  ⟾ Tr[<QName xml-attribute...>xml-element-datum...</>]
Tr[<xml-name-form xml-attribute...>xml-element-datum...</>]
  ⟾ ($xml-element$ (TrNamespaceDecl[xml-attribute]...) TrElementName[xml-name-form] TrAttr[xml-attribute]... TrContent[xml-element-datum]...)
TrAttr[xml-namespace-declaration-attribute]
  ⟾ #|nothing|#
TrAttr[xml-name-form=xml-attribute-value]
  ⟾ ($xml-attribute$ TrAttrName[xml-name-form] TrContent[xml-attribute-value])
TrChar[any character except & or <]
  ⟾ any character except & or <
TrChar[&#x hex-digit hex-digit... ;]
  ⟾ \xhex-digit hex-digit... ;
TrChar[&# digit digit... ;]
  ⟾ \xcorresponding hex-digits;
TrContent[simple-char...]
  ⟾ "TrChar[simple-char]..."
TrContent[&xml-entity-name;]
  ⟾ ($entity-reference$ xml-entity-name)
TrContent[[expression]]
  ⟾ expression
TrContent[[string-literal]]
  ⟾ (quote string-literal)
TrContent[&[expression...]]
  ⟾ expression...
TrContent[(expression...)]
  ⟾ (expression...)

Note that a string literal in an enclosed expression is handled specially by enclosing it in a quote form. This allows a macro to distinguish an enclosed expression from literal content; that may sometimes be useful.

TrNamespaceDecl[xml-true-attribute]
  ⟾ #|nothing|#
TrNamespaceDecl[xmlns:xml-prefix=xml-attribute-value]
  ⟾ (TrContent[xml-attribute-value] xml-prefix)
TrNamespaceDecl[xmlns=xml-attribute-value]
  ⟾ (TrContent[xml-attribute-value])

Element (tag) names are translated by TrElementName, while attribute names are translated by TrAttrName. Both delegate to TrElementOrAttrName when the name is not a simple identifier. However, if there is no namespace prefix, then attribute names default to the empty namespace, but element names default to the current default element namespace (indicated by the prefix $default-element-namespace$).

TrElementName[identifier]
  ⟾ ($resolve-qname$ identifier)
TrAttrName[identifier]
  ⟾ (quote identifier)
TrAttrName[other-form]
  ⟾ TrElementOrAttrName[other-form]
TrElementName[other-form]
  ⟾ TrElementOrAttrName[other-form]
TrElementOrAttrName[prefix:local-name]
  ⟾ ($resolve-qname$ local-name prefix)
TrElementOrAttrName[(expression)]
  ⟾ (expression)
TrElementOrAttrName[[expression]]
  ⟾ expression

The special node constructors are translated similarly. (Note: This is not quite right, since these forms should not handle escape characters the way element and attribute content does.)

Tr[<![CDATA[xml-CDATA-content]]>]
  ⟾ ($xml-CDATA$ "xml-CDATA-content")
Tr[<!--xml-comment-content-->]
  ⟾ ($xml-comment$ "xml-comment-content")
Tr[<?xml-PI-target xml-PI-content?>]
  ⟾ ($xml-processing-instruction$ xml-PI-target TrContent[xml-PI-content])

Translated forms

The above translation maps the new reader syntax to S-expressions using macros specified in this section. Of course it is possible to write these macro forms directly, though they are less human-readable. However, code generators and macros may target these macros. This format can also be used as an interchange format.

($xml-element$ (namespace-binding...) name attribute... content...)

Creates an element node.

Each namespace-binding is a one- or two-element list (namespace-uri [prefix]). The prefix is a literal symbol that represents a namespace prefix; if missing it defaults to the symbol $default-element-namespace$. The namespace-uri is a literal string. (A possible extension is to allow an expression that evaluates to a string.)

The name is an expression that evaluates to a symbol or a QName, most commonly a quoted symbol or a $resolve-qname$ form.

The binding for $xml-element$ must be a macro, not a function, because each namespace-binding adds a (prefix,URI)-binding in the lexical context. That binding is used to evaluate QNames in the remaining parameters, which are all expressions.

Each attribute is usually an $xml-attribute$ form, but an implementation may support other expressions that evaluate to attribute nodes. Each content is an expression that evaluates to element content, handled as described in the Handling of enclosed expressions section.

($xml-attribute$ name content... ) 

Creates an attribute node from the parameters. The name is an expression that evaluates to a symbol or QName value. The content arguments are concatenated to produce the attribute value.

($resolve-qname$ local-name [prefix]) 

Resolves the given prefix/local-name pair to a QName value. Both arguments are literal unquoted symbols. If prefix is missing it defaults to $default-element-namespace$.

($xml-comment$ content...)
($xml-CDATA$ content...)

Create a comment node and a CDATA-section node, respectively.

($xml-processing-instruction$ xml-PI-target content...)

Creates a processing-instruction (PI) node. The xml-PI-target should be a symbol (or a QName in the empty namespace). The content arguments should be strings, which are concatenated.

($entity-reference$ xml-entity-name)

The argument is an unquoted symbol. Returns a string value matching the entity name. For example:

($entity-reference$ lt) ==> "<"

The standard XML entity names (lt, gt, amp, quot, and apos) are defined at a minimum. Standard Scheme character names should also be supported. An implementation may support other names, for example variables bound in the lexical namespace.
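
To make the translated forms concrete, here is a deliberately simplified sketch of them in portable R7RS Scheme. It is not a conforming implementation: namespace bindings are accepted but ignored, QNames are represented by plain symbols (or a prefix/local-name pair), attribute content is assumed to consist of strings, and the result is a hypothetical list of the shape (element name attributes children). The helper names make-element and make-attribute are assumptions for this example.

```scheme
(import (scheme base))

;; Hypothetical attribute-node representation.
(define-record-type <attribute>
  (make-attribute name value)
  attribute?
  (name attribute-name)
  (value attribute-value))

;; Split the evaluated items into attributes and children.
(define (make-element name items)
  (let loop ((items items) (attributes '()) (children '()))
    (cond ((null? items)
           (list 'element name (reverse attributes) (reverse children)))
          ((attribute? (car items))
           (loop (cdr items)
                 (cons (cons (attribute-name (car items))
                             (attribute-value (car items)))
                       attributes)
                 children))
          (else
           (loop (cdr items) attributes (cons (car items) children))))))

;; $xml-element$ is a macro, as the specification requires; this
;; sketch simply discards the namespace bindings.
(define-syntax $xml-element$
  (syntax-rules ()
    ((_ (namespace-binding ...) name item ...)
     (make-element name (list item ...)))))

;; The content arguments are concatenated; strings are assumed here.
(define ($xml-attribute$ name . content)
  (make-attribute name (apply string-append content)))

;; Without namespace support, a QName is just the quoted name.
(define-syntax $resolve-qname$
  (syntax-rules ()
    ((_ local-name) 'local-name)
    ((_ local-name prefix) '(prefix . local-name))))

;; Only the five standard XML entity names are handled.
(define-syntax $entity-reference$
  (syntax-rules (lt gt amp quot apos)
    ((_ lt) "<")
    ((_ gt) ">")
    ((_ amp) "&")
    ((_ quot) "\"")
    ((_ apos) "'")))

;; The translated example from the Translation section, with sum
;; bound to 42.
(define example
  (let ((sum 42))
    ($xml-element$ () ($resolve-qname$ a)
      ($xml-attribute$ 'class "title")
      "Result: " sum ".")))
```

Under these assumptions example is the list (element a ((class . "title")) ("Result: " 42 ".")). Converting the number 42 to text is left to the rules in the Handling of enclosed expressions section.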

Implementation

The implementation is necessarily non-portable, though the Translation section provides a template for the reader part.

Handling namespaces

Implementing the translated forms should mostly be obvious: Just call an appropriate function to create the XML-node. Handling namespace definitions is non-obvious, however. The form:

($xml-element$ ((namespace-uri prefix)...) name attribute... content...)

can be translated into something like:

(let ()
  (define-namespace prefix namespace-uri)
  ...
  (make-element name attribute... content...))

The initial environment has pre-defined:

(define-namespace $default-element-namespace$ "")

One way to implement define-namespace is to expand:

(define-namespace prefix namespace-uri)

to:

(define $namespace$:prefix namespace-uri)

In that case:

($resolve-qname$ local-name prefix)

could be implemented as:

(make-qname local-name prefix $namespace$:prefix)

assuming a 3-argument make-qname function that creates a QName with the given local-name, prefix, and namespace-uri. Implementations SHOULD provide a custom error message in the case $namespace$:prefix is undefined, rather than depend on a generic error message.
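
The scheme above relies on lexical definitions named $namespace$:prefix, which cannot be generated by plain syntax-rules. As a portable illustration of the same lookup and of the recommended custom error message, the following R7RS sketch swaps in a different technique: an explicit association list from prefix symbols to namespace URIs that is passed around at run time. The vector-based make-qname and the names initial-bindings, bind-namespace, and resolve-qname are assumptions for this example only.

```scheme
(import (scheme base))

;; Hypothetical 3-argument make-qname; a QName is a vector here.
(define (make-qname local-name prefix namespace-uri)
  (vector local-name prefix namespace-uri))

;; Association list from prefix symbols to namespace-URI strings.
;; The initial environment binds the default element namespace to "".
(define initial-bindings
  '(($default-element-namespace$ . "")))

;; Return a new environment in which prefix maps to namespace-uri.
;; Newer bindings shadow older ones, like nested lexical scopes.
(define (bind-namespace bindings prefix namespace-uri)
  (cons (cons prefix namespace-uri) bindings))

;; Look up prefix and build the QName, reporting an unknown prefix
;; with a specific message rather than a generic unbound-variable
;; error.
(define (resolve-qname bindings local-name prefix)
  (let ((entry (assq prefix bindings)))
    (if entry
        (make-qname local-name prefix (cdr entry))
        (error "unknown XML namespace prefix" prefix))))
```

For example, after (bind-namespace initial-bindings 'h "http://www.w3.org/1999/xhtml"), resolving local name p with prefix h yields a QName carrying that URI, while resolving an unbound prefix signals the custom error.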

Testsuite

There is a test suite in the Kawa source tree.

Copyright

Copyright (C) Per Bothner 2012

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Author: Per Bothner
Editor: Mike Sperber