Per Bothner <per@bothner.com>
This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 107 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.
The Kawa Scheme implementation has working support for this reader extension. (Some details are still in flux, depending on how this specification ends up.)
We specify a reader extension that reads data in a superset of XML/HTML format, and produces conventional S-expressions. We also suggest a possible semantic interpretation of how these forms may be evaluated to produce XML-node values, but this is non-normative (???).
While XML may be a poor re-invention of S-expressions, many people are familiar with it. Furthermore, when working with XML or HTML data, using XML syntax may be preferable to S-expressions. This specification defines a Scheme reader extension matching XML syntax with expression escapes (unquote), a translation into standard S-expressions, and a semantics for the latter.
Some other programming languages also define a syntax for XML literals. Examples include EcmaScript for XML (E4X), Visual Basic, and XQuery.
Here is a simple example:
#<p>The result is <b>final</b>!</p>
Actually, these are really quasi-literals since they can contain enclosed expressions, which are unquoted:
#<em>The result is &[result].</em>
The value of result is substituted into the output,
in a similar way to quasi-quotation.
Notice the use of &, which is used in XML for character and entity references, but we use it as a multi-purpose prefix character to avoid adding extra special characters that might need escaping.
The specification does not define a Scheme API for working with XML data. It assumes there is some data type which we here call an XML-node. This specification does not require the XML-node type to be distinct from other types. Many Scheme XML libraries just use lists to encode XML-nodes. However, newer Schemes that have an extensible type system are encouraged to make XML-node a distinct type. This follows the W3C Document Object Model (DOM).
The XML data model distinguishes between a document node and a document element. A document element is just an XML element node that is the top-level element in a document. A document node is a special kind of node whose primary child is the document element, but which may have other children (comments and processing instructions). This specification provides a syntax for creating XML elements, but does not have any special provisions for creating document nodes.
Discussion: It has been suggested that this specification is overly large, and that it should focus on just the reader syntax and on how it is mapped to S-expressions, leaving the semantics for other specification(s).
Alternatively, re-organization is suggested. (See Cowan 2012-11-18.)
Discussion: (Not necessarily part of this specification, but perhaps a future specification.) It seems useful to specify a syntax for document nodes. One solution is to use a SRFI-108 named literal, whose body is the XML text (with optional enclosed expressions). For example:
#xml{<!DOCTYPE HTML> <html> <body>Hello &[name]!</> </html>}
One could also support more structured prefix arguments:
#&xml[version: 1.1 encoding: "UTF-8" standalone: #t doctype: "HTML" public: "-//W3C//DTD HTML 4.01 Transitional//EN"] { <html>...</> }
An xml-literal
is usually an element constructor.
We'll cover later the less common processing-instruction,
comment, and CDATA-section forms.
xml-literal ::= #xml-constructor
xml-constructor ::= xml-element-constructor
  | xml-PI-constructor
  | xml-comment-constructor
  | xml-CDATA-constructor
The names of elements and attributes are qualified names (QNames). The lexical syntax for a QName is either a simple identifier, or a (prefix,local-name) pair:
QName ::= xml-local-part
  | xml-prefix:xml-local-part
xml-local-part ::= identifier
xml-prefix ::= identifier
Sometimes one needs to calculate the QName at runtime, evaluating an expression instead of using a literal QName:
xml-name-form ::= QName | xml-enclosed-expression
xml-enclosed-expression ::= [expression]
  | (expression...)
The first variant is the general case; the second variant (expression...) is just syntactic sugar for [(expression...)]. For example, the following forms are equivalent:
#<[(if be-bold 'strong 'em)]>important</>
#<(if be-bold 'strong 'em)>important</>
When evaluating the expression (in the first variant), the result is a QName value. While this specification does
not define an API or representation for QName values, it is an object
with three string components: The local name part,
the prefix part,
and the namespace URI part.
The local name and the prefix parts match the parts in a literal QName,
while the namespace URI part is an arbitrary globally unique string.
Two QNames are considered equivalent if they have the same
local name part and namespace URI part, even if the prefix parts are
different. The prefix is used for input and output;
it can be considered a local nickname for a namespace URI.
The binding from a prefix to a namespace URI can be defined using an xml-namespace-declaration-attribute. An implementation may also define such bindings using Scheme code; for example Kawa has a define-namespace form.
A symbol is considered equivalent to a QName whose local name part is the string name of the symbol, and whose prefix and namespace URI are both empty, as long as the name of the symbol matches the syntax of identifier and does not contain a colon. The result is implementation-defined if a symbol's name contains a colon.
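The equivalence rule for QNames can be illustrated with a small sketch in portable R7RS Scheme. The record type <qname> and the procedures ->qname and qname-equivalent? are hypothetical names chosen for this illustration; the specification itself defines no QName API.

```scheme
(import (scheme base))

;; Hypothetical QName record with the three string components
;; described above. This is an illustration, not a specified API.
(define-record-type <qname>
  (make-qname local-name prefix namespace-uri)
  qname?
  (local-name qname-local-name)
  (prefix qname-prefix)
  (namespace-uri qname-namespace-uri))

;; A symbol (assumed to contain no colon) stands for a QName whose
;; prefix and namespace URI are both empty.
(define (->qname x)
  (if (symbol? x)
      (make-qname (symbol->string x) "" "")
      x))

;; Equivalence compares the local name and the namespace URI only;
;; the prefix is deliberately ignored.
(define (qname-equivalent? a b)
  (let ((qa (->qname a))
        (qb (->qname b)))
    (and (string=? (qname-local-name qa) (qname-local-name qb))
         (string=? (qname-namespace-uri qa) (qname-namespace-uri qb)))))
```

With these definitions, two QNames that share a local name and namespace URI but use the prefixes "h" and "xhtml" compare as equivalent, and the symbol em compares as equivalent to (make-qname "em" "" "").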
xml-element-constructor ::= <QName xml-attribute...>xml-element-datum...</QName>
  | <xml-name-form xml-attribute...>xml-element-datum...</>
  | <xml-name-form xml-attribute.../>
The first xml-element-constructor variant uses a literal QName, and looks like a standard non-empty XML element, where the starting QName and the ending QName must match exactly:
#<a href="next.html">Next</a>
As a convenience, you can leave out the name in the ending tag(s):
#<para>This is a paragraph in <emphasis>DocBook</> syntax.</>
You can use an expression to compute the element tag at runtime; in that case you must leave out the name in the ending tag:
#<p>This is <[(if be-bold 'strong 'em)]>important</>!</p>
The third xml-element-constructor
variant above is an XML
“empty element”; it is equivalent to the second variant
when there are no xml-element-datum
items.
(Note that every well-formed XML element, as defined in the XML specifications, is a valid xml-element-constructor, but not vice versa.)
The “contents” (children) of an element
are a sequence of character (text) data,
nested nodes, and enclosed (unquoted) expressions.
The latter are discussed later.
The characters &, <, and > are special, and need to be escaped.
xml-element-datum ::= any character except & or <
  | xml-constructor
  | xml-escaped
A nested xml-constructor is equivalent to an xml-literal (i.e. the xml-constructor prefixed by a #) inside an enclosed expression. For example:
#<p>This is <em>important</em>!</p>
is equivalent to:
#<p>This is &{#<em>important</em>}!</p>
xml-escaped ::= &xml-enclosed-expression
  | &xml-entity-name;
  | xml-character-reference
xml-character-reference ::= &# digit digit... ;
  | &#x hex-digit hex-digit... ;
Here is an example with both hex and decimal character references:
#<p>A&#66;C&#x44;E</p> ⟹ <p>ABCDE</p>
xml-entity-name ::= identifier
Currently, the only supported values for xml-entity-name are the builtin XML names lt, gt, amp, quot, and apos, which stand for the characters <, >, &, ", and ', respectively.
The following two expressions are equivalent:
#<p>&lt; &gt; &amp; &quot; &apos;</p>
#<p>&{"< > & \" '"}</p>
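As a rough illustration of how a reader might decode the two kinds of escapes, the following R7RS sketch maps the builtin entity names to strings and converts the body of a character reference to the character it denotes. The names builtin-entities, entity->string, and char-reference->string are hypothetical helpers, not part of this specification.

```scheme
(import (scheme base))

;; The five builtin XML entity names and the characters they stand for.
(define builtin-entities
  '((lt . "<") (gt . ">") (amp . "&") (quot . "\"") (apos . "'")))

;; Hypothetical helper: look up an entity name given as a symbol.
(define (entity->string name)
  (let ((entry (assq name builtin-entities)))
    (if entry
        (cdr entry)
        (error "unknown XML entity name" name))))

;; Hypothetical helper: decode the body of a character reference,
;; i.e. the text between "&#" and ";", such as "66" or "x44".
;; A leading x selects hexadecimal; otherwise the body is decimal.
(define (char-reference->string body)
  (let ((code (if (char=? (string-ref body 0) #\x)
                  (string->number (substring body 1 (string-length body)) 16)
                  (string->number body 10))))
    (string (integer->char code))))
```

For instance, (char-reference->string "66") yields "B" and (char-reference->string "x44") yields "D", while (entity->string 'amp) yields "&". The sketch assumes a well-formed body and performs no validation of the digits.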
An attribute associates an attribute name with an attribute value. This is done using an xml-true-attribute form, which is an xml-attribute that does not have the form of xml-namespace-declaration-attribute. I.e. in an xml-true-attribute the attribute name may not be the special reserved name xmlns, nor may it be a QName whose prefix is the special reserved name xmlns.
xml-attribute ::= xml-true-attribute | xml-namespace-declaration-attribute
A true attribute has the form name=value.
It can also be an enclosed expression that evaluates to an attribute node value.
xml-true-attribute ::= xml-name-form=xml-attribute-value
  | xml-enclosed-expression
xml-attribute-value ::= "quot-attribute-datum*"
  | 'apos-attribute-datum*'
quot-attribute-datum ::= any character except ", &, or <
  | xml-escaped
apos-attribute-datum ::= any character except ', &, or <
  | xml-escaped
Discussion: When an attribute value is specified by an expression, having to write an xml-escaped inside string quotes seems clumsy. We could allow the much simpler:
xml-attribute-value ::= ... | [expression]
Both element content and attribute values may contain xml-enclosed-expressions. These are expressions evaluated at runtime, where the evaluated result becomes part of the element content or the attribute value.
If the expression evaluates to an element, comment, or processing-instruction node, and the context is element content, then the node is added as a child of the element. It is unspecified whether the node is copied or shared. The result is also unspecified if the expression evaluates to some other kind of XML-node, or if the context is an attribute value.
If the expression evaluates to a string, the result is pasted as a text (child) content of an element or a substring of an attribute value, respectively.
If the expression evaluates to a CDATA section, the result is equivalent to the string value of the section.
If the expression evaluates to some other scalar value (including numbers, booleans, and characters) the value is converted to a string according to implementation-specified rules. An implementation MAY convert a value as if using display. Alternatively, an implementation MAY convert a value to yield a canonical representation according to the XML Schema specification. (In the latter case, Booleans #f and #t should yield false and true, respectively.)
If the expression evaluates to a list or vector, then each element is inserted into the element or attribute content. Spaces are inserted between two elements if neither element is an XML-node.
Note that some XML specifications (including XML Schema and the XQuery and XPath data model) have the concept of the typed value of a node. The typed value may be a number, a string, or another atomic type. The typed value may also be a sequence of strings, numbers, or other atomic values. Some implementations may optionally store the typed value instead of or in addition to the text value. For example:
#<prices>&(vector 230 599 98 763)</prices>
It is unspecified whether in the XML-node the content is stored as a sequence of 4 integers, or as the string "230 599 98 763", as long as the result prints the same way.
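The conversion rules for strings, other scalars, and sequences can be sketched as follows in R7RS Scheme. This is only one possible reading: it takes the display-based option for scalars, assumes that no element of a list or vector is an XML-node, and passes each sequence element to the scalar conversion without flattening nested sequences. The procedure names scalar->string, join-with-spaces, and content->string are hypothetical.

```scheme
(import (scheme base) (scheme write))

;; Convert a scalar as if printed by `display`; strings pass through.
(define (scalar->string x)
  (if (string? x)
      x
      (let ((port (open-output-string)))
        (display x port)
        (get-output-string port))))

;; Join a list of strings, inserting one space between neighbours.
(define (join-with-spaces strings)
  (if (null? strings)
      ""
      (let loop ((rest (cdr strings))
                 (acc (car strings)))
        (if (null? rest)
            acc
            (loop (cdr rest) (string-append acc " " (car rest)))))))

;; Text content for a string, scalar, list, or vector.
;; Assumption: no element of a sequence is an XML-node.
(define (content->string x)
  (cond ((vector? x) (content->string (vector->list x)))
        ((list? x) (join-with-spaces (map scalar->string x)))
        (else (scalar->string x))))
```

Under these assumptions, (content->string (vector 230 599 98 763)) yields the string "230 599 98 763". An implementation that follows the XML Schema option instead would need a different scalar->string, for example one that maps #t to "true".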
An xml-prefix is an alias for a namespace-uri, and the mapping between them is defined by a namespace declaration attribute, which has the form of an xml-attribute where either the QName or the prefix is the special identifier xmlns:
xml-namespace-declaration-attribute ::= xmlns:xml-prefix=xml-attribute-value
  | xmlns=xml-attribute-value
The former declares xml-prefix
as a namespace alias for
the namespace-uri specified by xml-attribute-value
(which must be a compile-time constant).
The second declares that xml-attribute-value
is the default
namespace for simple (unprefixed) element tags.
(A default namespace declaration is ignored for attribute names.)
An xml-PI-constructor
can be used to create an XML
processing instruction, which can be used to pass
instructions or annotations to an XML processor or tool.
xml-PI-constructor ::= <?xml-PI-target xml-PI-content?>
xml-PI-target ::= NCname (i.e. a simple (non-compound) identifier)
xml-PI-content ::= any characters, not containing ?>
For example, the DocBook XSLT stylesheets can use the dbhtml
instructions to specify that a specific chapter should be
written to a named HTML file:
#<chapter><?dbhtml filename="intro.html" ?> <title>Introduction</title> ... </chapter>
You can cause XML comments to be emitted in the XML output document. Such comments can be useful for humans reading the XML document, but are usually ignored by programs.
xml-comment-constructor ::= <!--xml-comment-content-->
xml-comment-content ::= any characters, not containing --
A CDATA
section can be used to avoid excessive
quoting in element content.
xml-CDATA-constructor ::= <![CDATA[xml-CDATA-content]]>
xml-CDATA-content ::= any characters, not containing ]]>
A CDATA section is semantically equivalent to text consisting of the xml-CDATA-content, though some XML-node representations may record that the text came from a CDATA section so it can be written out the same way. (Kawa does this.)
The following are equivalent:
#<p>Special characters <![CDATA[< > & ' "]]> here.</p>
#<p>Special characters &lt; &gt; &amp; &apos; &quot; here.</p>
If XML-node is a separate data-type, implementations
are encouraged to use this XML-literal format when writing to an output port,
since this provides input-output round-tripping.
Specifically, calling write on an XML-node SHOULD write an xml-literal (with an initial #). The xml-constructor SHOULD be in standard XML syntax without using any of the extensions in this specification, such as an unnamed end tag.
Calling display on an XML-node SHOULD write an xml-constructor (without an initial #). Alternatively, if the output port is an extended port that can handle rich text then an implementation MAY instead display a styled representation.
For example if the XML-node is compatible with HTML, and the
output port is inserting text into a browser, then the implementation
may copy the DOM into the browser, perhaps resulting in styled text.
The following specifies how the reader syntax is translated by the reader into standard S-expressions. These basically create macro invocations; the implementation is responsible for implementing those macros as described in the Translated forms section. As an example:
#<a class="title">Result: &{sum}.</a>
is read as if it were:
($xml-element$ () ($resolve-qname$ a) ($xml-attribute$ 'class "title") "Result: " sum ".")
The () in the result is the translation of any namespace declaration attributes; in this case there are none.
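To make the translated form concrete, here is a deliberately minimal R7RS sketch under which the S-expression above can be evaluated. It rests on several assumptions that are not part of this specification: XML-nodes are represented as plain lists, an attribute node is the list (@ name value), namespace bindings are accepted but ignored, attribute content consists of strings only, and a prefixed name is folded into a single symbol containing a colon (which the specification leaves implementation-defined).

```scheme
(import (scheme base))

;; Assumption: QNames are plain symbols. A prefix, if present, is
;; folded into the symbol name as prefix:local-name.
(define-syntax $resolve-qname$
  (syntax-rules ()
    ((_ local-name) 'local-name)
    ((_ local-name prefix)
     (string->symbol
      (string-append (symbol->string 'prefix)
                     ":"
                     (symbol->string 'local-name))))))

;; Assumption: an attribute node is the list (@ name value), and all
;; content arguments are strings, which are concatenated.
(define ($xml-attribute$ name . content)
  (list '@ name (apply string-append content)))

;; Assumption: an element node is the list (name part ...). The
;; namespace bindings in the first operand are matched but not used.
(define-syntax $xml-element$
  (syntax-rules ()
    ((_ (namespace-binding ...) name part ...)
     (list name part ...))))
```

With sum bound to 42, evaluating ($xml-element$ () ($resolve-qname$ a) ($xml-attribute$ 'class "title") "Result: " sum ".") then produces the list (a (@ class "title") "Result: " 42 "."). A fuller implementation would resolve namespaces and apply the enclosed-expression rules to the content instead of storing the values unchanged.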
The translation is defined in terms of a recursive read-time
translation function
Tr which maps
an xml-constructor to an S-expression.
Note: This translation is preliminary. It may need to be tweaked (and debugged) a bit.
Tr[<QName xml-attribute...>xml-element-datum...</QName>]
  ⟾ Tr[<QName xml-attribute...>xml-element-datum...</>]
Tr[<xml-name-form xml-attribute...>xml-element-datum...</>]
  ⟾ ($xml-element$ (TrNamespaceDecl[xml-attribute]...) TrElementName[xml-name-form] TrAttr[xml-attribute]... TrContent[xml-element-datum]...)
TrAttr[xml-namespace-declaration-attribute] ⟾ #|nothing|#
TrAttr[xml-name-form=xml-attribute-value] ⟾ ($xml-attribute$ TrAttrName[xml-name-form] TrContent[xml-attribute-value])
TrChar[any character except & or <] ⟾ the same character
TrChar[&#x hex-digit hex-digit... ;] ⟾ \x hex-digit hex-digit... ;
TrChar[&# digit digit... ;] ⟾ \x corresponding hex-digits ;
TrContent[simple-char...] ⟾ "TrChar[simple-char]..."
TrContent[&xml-entity-name;] ⟾ ($entity-reference$ xml-entity-name)
TrContent[{expression}] ⟾ expression
TrContent[{string-literal}] ⟾ (quote string-literal)
TrContent[&{expression...}] ⟾ expression...
TrContent[(expression...)] ⟾ (expression...)
Note that a string literal in an enclosed expression is handled specially by enclosing it in a quote form. This allows a macro to distinguish an enclosed expression from literal content; that may sometimes be useful.
TrNamespaceDecl[xml-true-attribute] ⟾ #|nothing|#
TrNamespaceDecl[xmlns:xml-prefix=xml-attribute-value] ⟾ (TrContent[xml-attribute-value] xml-prefix)
TrNamespaceDecl[xmlns=xml-attribute-value] ⟾ (TrContent[xml-attribute-value])
Element (tag) names are translated by TrElementName, while attribute names are translated by TrAttrName. Both defer to TrElementOrAttrName for any name that is not a simple identifier. However, if there is no namespace prefix, then attribute names default to the empty namespace, but element names default to the current default element namespace (indicated by the prefix $default-element-namespace$).
TrElementName[identifier] ⟾ ($resolve-qname$ identifier)
TrAttrName[identifier] ⟾ (quote identifier)
TrAttrName[other-form] ⟾ TrElementOrAttrName[other-form]
TrElementName[other-form] ⟾ TrElementOrAttrName[other-form]
TrElementOrAttrName[prefix:local-name] ⟾ ($resolve-qname$ local-name prefix)
TrElementOrAttrName[(expression)] ⟾ (expression)
TrElementOrAttrName[{expression}] ⟾ expression
The special node constructors are translated similarly. (Note: this is not quite right, since these forms should not handle escape characters the way element and attribute content does.)
Tr[<![CDATA[xml-CDATA-content]]>] ⟾ ($xml-CDATA$ "xml-CDATA-content")
Tr[<!--xml-comment-content-->] ⟾ ($xml-comment$ "xml-comment-content")
Tr[<?xml-PI-target xml-PI-content?>] ⟾ ($xml-processing-instruction$ xml-PI-target TrContent[xml-PI-content])
The above translation maps the new reader syntax to S-expressions using macros specified in this section. Of course it is possible to write these macro forms directly, though they are less human-readable. However, code generators and macros may target these macros. This format can also be used as an interchange format.
($xml-element$ (namespace-binding...) name attribute... content...)
Creates an element node. Each namespace-binding is a one- or two-element list (namespace-uri [prefix]). The prefix is a literal symbol that represents a namespace prefix; if missing it defaults to the symbol $default-element-namespace$. The namespace-uri is a literal string. (A possible extension is to allow an expression that evaluates to a string.) The name is an expression that evaluates to a symbol or a QName, most commonly a quoted symbol or a $resolve-qname$ form.
The binding for $xml-element$ must be a macro, not a function, because each namespace-binding adds a (prefix,URI)-binding in the lexical context. That binding is used to evaluate QNames in the remaining parameters, which are all expressions. Each attribute is usually an $xml-attribute$ form, but an implementation may support other expressions that evaluate to attribute nodes. Each content is an expression that evaluates to element content, handled as described in the Handling of enclosed expressions section.
($xml-attribute$ name content...)
Creates an attribute node from the parameters. The name is an expression that evaluates to a symbol or QName value. The content arguments are concatenated to produce the attribute value.
($resolve-qname$ local-name [prefix])
Resolves the given prefix/local-name pair to a QName value. Both arguments are literal unquoted symbols. If prefix is missing it defaults to $default-element-namespace$.
($xml-comment$ content...)
($xml-CDATA$ content...)
($xml-processing-instruction$ xml-PI-target content...)
Creates a processing-instruction (PI) node. The xml-PI-target should be a symbol (or a QName in the empty namespace). The content arguments should be strings, which are concatenated.
($entity-reference$ xml-entity-name)
The argument is an unquoted symbol. Returns a string value matching the entity name. For example:
($entity-reference$ lt) ==> "<"
The standard XML entity names (lt, gt, amp, quot, and apos) are defined at a minimum. Standard Scheme character names should also be supported. An implementation may support other names, for example variables bound in the lexical namespace.
The implementation is necessarily non-portable, though the Translation section provides a template for the reader part.
Implementing the translated forms should mostly be obvious: just call an appropriate function to create the XML-node. Handling namespace definitions is non-obvious, however. The form:
($xml-element$ ((namespace-uri prefix)...) name attribute... content...)
can be translated into something like:
(let () (define-namespace prefix namespace-uri) ... (make-element name attribute... content...))
The initial environment has pre-defined:
(define-namespace $default-element-namespace$ "")
One way to implement define-namespace
is to expand:
(define-namespace prefix namespace-uri)
to:
(define $namespace$:prefix namespace-uri)
In that case:
($resolve-qname$ local-name prefix)
could be implemented as:
(make-qname local-name prefix $namespace$:prefix)
assuming a 3-argument make-qname function that creates a QName with the given local-name, prefix, and namespace-uri. Implementations SHOULD provide a custom error message in the case $namespace$:prefix is undefined, rather than depend on a generic error message.
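Portable syntax-rules macros cannot construct an identifier such as $namespace$:prefix, so the following R7RS sketch swaps in a different technique: a run-time association list from prefix symbols to namespace URIs. It is meant only to show the lookup and the dedicated error for an unbound prefix. The names initial-namespaces, extend-namespaces, and resolve-qname are hypothetical, and a three-element list stands in for a QName value.

```scheme
(import (scheme base))

;; Hypothetical run-time namespace environment: an association list
;; from prefix symbols to namespace URI strings. The default element
;; namespace is pre-bound to the empty string.
(define initial-namespaces
  '(($default-element-namespace$ . "")))

;; Return a new environment in which prefix is bound to uri.
(define (extend-namespaces env prefix uri)
  (cons (cons prefix uri) env))

;; Return the list (local-name prefix namespace-uri), standing in for
;; a QName, or signal a dedicated error if the prefix has no binding.
(define (resolve-qname env local-name prefix)
  (let ((binding (assq prefix env)))
    (if binding
        (list local-name prefix (cdr binding))
        (error "XML namespace prefix is not bound" prefix))))
```

For example, after (define env (extend-namespaces initial-namespaces 'h "http://www.w3.org/1999/xhtml")), the call (resolve-qname env 'p 'h) returns (p h "http://www.w3.org/1999/xhtml"), whereas resolving an undeclared prefix raises the custom error rather than a generic unbound-variable error. Because the environment is consulted at run time, this sketch does not capture the lexical scoping that the macro-based approach provides.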
There is a test suite in the Kawa source tree.
Copyright (C) Per Bothner 2012
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.