SRFI 108: Named quasi-literal constructors

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-108 @nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This specifies an extensible reader syntax for named value constructors. A reader prefix is followed by a tag (an identifier), and then expressions and literal text parameters. The tag can be thought of as a class name, and the expressions and literal text are arguments to an object constructor call. The reader translates &tag{...} to a list ($construct$:tag ...), where $construct$:tag is normally bound to a predefined macro.

Rationale

Named (quasi-)literals

When adding new datatypes it is useful to add new literals of that type, or at least a compact readable notation for creating instances. SRFI-10 provided one solution. Here is an example, which assumes a URI type for representing encoded Uniform Resource Identifiers (URIs - or generalized URLs):

SRFI-10 has a number of problems. One issue is that SRFI-10 conflicts with syntax-case and R6RS. More fundamentally, SRFI-10 resolves the tag name to a constructor function at read time, which requires managing those names using a distinct mechanism. It seems better to use normal scope rules (including library import) to manage this mapping. The reader is also responsible for calling the constructor, which means the format handled by read is extensible. That is good, but you have to be careful about the security implications, making sure only safe constructor functions are called. Finally, SRFI-10 doesn't integrate with quasi-quotation, and some of us find its syntax a bit ugly.

Enclosed unquoted expressions

Note that & is used both to introduce the top-level quasi-literal, and as an escape character before the unquoted expression.

Note that enclosed expressions are commonly strings but not always. This example executes an SQL query with a numeric parameter:

Consider what happens if name is constructed by a malicious user and has the value: smith' or ''='. In that case the effective condition would be:
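To make the risk concrete, assume (hypothetically, since the query template is not reproduced here) that the query embeds the value as name='&[name]'. Naive concatenation would then yield the condition name='smith' or ''='', which holds for every row. Because the translation marks enclosed values with $<<$ and $>>$, a constructor can quote those values instead of pasting them in verbatim. The following is a minimal sketch of that idea rather than a recommended technique: the tag sql-text, the helper names, and the binding of the two markers to unique pairs are all assumptions made for illustration, and real code should prefer the database's own parameterized queries.

```scheme
(import (scheme base))

;; Assumption: the markers are bound to unique objects so a procedure
;; can recognize them with eq?.
(define $<<$ (list 'enclosed-start))
(define $>>$ (list 'enclosed-end))

;; Wrap a string in single quotes, doubling any embedded single quote.
(define (sql-quote-string s)
  (let loop ((chars (string->list s)) (out (list #\')))
    (cond ((null? chars) (list->string (reverse (cons #\' out))))
          ((char=? (car chars) #\')
           (loop (cdr chars) (cons #\' (cons #\' out))))
          (else (loop (cdr chars) (cons (car chars) out))))))

;; Render an enclosed value as SQL text; only strings and numbers here.
(define (sql-value->text v)
  (cond ((string? v) (sql-quote-string v))
        ((number? v) (number->string v))
        (else (error "unsupported SQL value" v))))

;; Hypothetical constructor: literal text passes through unchanged,
;; values between $<<$ and $>>$ are rendered by sql-value->text.
(define ($construct$:sql-text . args)
  (let loop ((args args) (enclosed? #f) (parts '()))
    (cond ((null? args) (apply string-append (reverse parts)))
          ((eq? (car args) $<<$) (loop (cdr args) #t parts))
          ((eq? (car args) $>>$) (loop (cdr args) #f parts))
          (enclosed?
           (loop (cdr args) #t (cons (sql-value->text (car args)) parts)))
          (else (loop (cdr args) #f (cons (car args) parts))))))

;; List form corresponding to &sql-text{... where name=&[name]}
(define name "smith' or ''='")
($construct$:sql-text "select * from emp where name=" $<<$ name $>>$)
```

With this sketch the malicious value ends up inside a single quoted SQL string, so it is compared as data rather than parsed as part of the condition.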

Initial arguments

These are almost the same, but there is a conceptual difference: The latter variant is typically used for options or XML-style attributes. The former variant is used to list components of the result object, or children in the XML sense. Initial expressions can be used for keyword arguments - or general non-string arguments. Here is an example (converted from the Scribble documentation):

Consider objects that are normally constructed from a string representation. In that case one might want to concatenate the non-initial enclosed expressions along with the literal text to yield the string, while using initial arguments for keywords or non-string arguments. Therefore the $construct$:cname implementation needs to be able to unambiguously select the initial arguments. To do this, the first example in this section is read as
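As an illustration of how an implementation might select the initial arguments, the sketch below relies on the list form given in the Translation section, where the bracketed initial expressions are followed by a $>>$ that has no preceding $<<$. It works on the unevaluated list the reader produces, as a macro implementer would see it. The procedure names and the tag section are hypothetical.

```scheme
(import (scheme base))

;; Split the operands of a reader-produced form into initial arguments
;; and content.  A top-level $>>$ seen before any $<<$ closes the
;; initial arguments; otherwise there are none.
(define (split-initial-arguments form)
  (let ((operands (cdr form)))          ; drop the $construct$:cname head
    (let loop ((rest operands) (seen '()))
      (cond ((null? rest) (values '() operands))
            ((eq? (car rest) '$>>$) (values (reverse seen) (cdr rest)))
            ((eq? (car rest) '$<<$) (values '() operands))
            (else (loop (cdr rest) (cons (car rest) seen)))))))

;; Convenience wrapper returning both parts as one two-element list.
(define (split->list form)
  (call-with-values (lambda () (split-initial-arguments form)) list))

(split->list '($construct$:section 2 $>>$ "Hello " $<<$ name $>>$))
;; => ((2) ("Hello " $<<$ name $>>$))

(split->list '($construct$:section "Hello " $<<$ name $>>$))
;; => (() ("Hello " $<<$ name $>>$))
```

The second call shows why the marker matters: without brackets the first marker encountered is $<<$, so the leading literal text is not mistaken for an initial argument.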

Document processing

Markup is commonly nested, which suggests that a & in text can be used as an abbreviated extended-datum-literal. Specifically:

This nesting of markup motivates using the same escape character for both top-level and enclosed forms.

Translation to list form

As shown, the Scheme reader translates a named quasi-literal to a list, which is then subject to regular macro-expansion and evaluation:

The choice of the translation $construct$:tag is somewhat arbitrary. We want it to be easy for programmers to write, to be readable, and thus not excessively verbose. We want the symbol to include the actual tag as part of the name, but using just tag by itself is likely to lead to awkward name clashes. (Of course it is perfectly reasonable to implement $construct$:tag using a tag function.) Using colon to delimit the tag part seems readable and clean. Note there may be some complication in a Scheme variant that uses colon as a package or namespace separator, as for example Kawa does. However, the problem is easily solved (at least in Kawa) by defining $construct$ as a predefined namespace prefix.

Translating enclosed expressions

The translation uses a pair of special symbols to mark the start and end of the enclosed expressions:

This translation scores highly on information-preservation. It also scores highly on implementation-ease in the simple case where we can just ignore which expressions are enclosed and which are literal. For example if $construct$:foo is defined in the simplest way possible:
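A minimal sketch of such a definition follows. It is not the SRFI's own text: it assumes that $<<$ and $>>$ are bound to unique marker objects (here, freshly allocated pairs), and make-foo is a hypothetical stand-in for a real constructor.

```scheme
(import (scheme base))

;; Assumption: markers are ordinary values that a procedure can skip.
(define $<<$ (list 'enclosed-start))
(define $>>$ (list 'enclosed-end))

;; Remove both markers, keeping everything else in order.
(define (drop-markers args)
  (cond ((null? args) '())
        ((or (eq? (car args) $<<$) (eq? (car args) $>>$))
         (drop-markers (cdr args)))
        (else (cons (car args) (drop-markers (cdr args))))))

;; Hypothetical underlying constructor.
(define (make-foo . parts) (cons 'foo parts))

;; The constructor ignores which arguments were enclosed.
(define ($construct$:foo . args)
  (apply make-foo (drop-markers args)))

;; List form corresponding to &foo{a &[(+ 1 2)] b}
($construct$:foo "a " $<<$ (+ 1 2) $>>$ " b")   ; => (foo "a " 3 " b")
```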

When you do a more complex translation, you may have to write a macro, and dealing with $<<$ and $>>$ is not completely trivial. Still, this seems a reasonable tradeoff; we later provide a helper macro define-simple-constructor to simplify some common cases.

Extra text features

Resolving to constructor

The reader creates $construct$:cname invocations, so the application or library programmer must provide a definition of $construct$:cname. It seems useful to provide some utility functions or syntax to simplify these. As a start, this specification proposes:

The default for str-maker is $string$, as specified in SRFI-109. This combines all the non-prefix arguments and treats them as a string quasi-literal. This makes it easy to implement:
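The behavior described above can be approximated at the procedure level. The sketch below is not define-simple-constructor itself: make-simple-constructor, text-maker (a simplified stand-in for $string$ that handles only strings and numbers), make-uri, and the marker bindings are all hypothetical, and passing the initial arguments before the combined string is an assumption about argument order.

```scheme
(import (scheme base))

;; Assumption: markers are unique objects recognizable with eq?.
(define $<<$ (list 'enclosed-start))
(define $>>$ (list 'enclosed-end))

;; Simplified stand-in for a str-maker: concatenates strings and numbers.
(define (text-maker . parts)
  (let loop ((parts parts) (out '()))
    (cond ((null? parts) (apply string-append (reverse out)))
          ((or (eq? (car parts) $<<$) (eq? (car parts) $>>$))
           (loop (cdr parts) out))
          ((string? (car parts)) (loop (cdr parts) (cons (car parts) out)))
          ((number? (car parts))
           (loop (cdr parts) (cons (number->string (car parts)) out)))
          (else (error "text-maker: unsupported value" (car parts))))))

;; A $>>$ seen before any $<<$ closes the initial (prefix) arguments.
(define (split-arguments args)
  (let loop ((rest args) (seen '()))
    (cond ((null? rest) (values '() args))
          ((eq? (car rest) $>>$) (values (reverse seen) (cdr rest)))
          ((eq? (car rest) $<<$) (values '() args))
          (else (loop (cdr rest) (cons (car rest) seen))))))

;; Build a constructor that hands object-maker the initial arguments
;; followed by one string made from the remaining arguments.
(define (make-simple-constructor object-maker str-maker)
  (lambda args
    (call-with-values
        (lambda () (split-arguments args))
      (lambda (initial content)
        (apply object-maker
               (append initial (list (apply str-maker content))))))))

(define (make-uri . fields) (cons 'uri fields))   ; hypothetical URI type
(define $construct$:URI (make-simple-constructor make-uri text-maker))

;; List form corresponding to &URI{http://example.com/&[dir]/}
(define dir "docs")
($construct$:URI "http://example.com/" $<<$ dir $>>$ "/")
;; => (uri "http://example.com/docs/")
```

A macro-based version would do the same split at expansion time; the procedural form is used here only because it runs in any R7RS system without reader support.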

Possible extensions

This section discusses some ideas that seem worthwhile, but need more thought, so are deferred for now.

Read-time literals

A possible extension is to support SRFI-10 style read-time literals in certain restricted cases, when all the expressions are literal, and the transformers are available to the reader. This should probably not be the default (for consistency and because of security concerns), but could be supported in an implementation that has programmable read-tables.

Splicing of lists and vectors

Expecting each $construct$:foo implementation to desugar the $splice$ forms is unfriendly, but it could be handled by define-simple-constructor. This seems easy enough when the implementation rewrites to a function call, since we can handle the splicing by rewriting to an apply call. It gets trickier when macros are involved.

Handling splicing seems cleaner if the Scheme compiler handles splicing natively - i.e. as a general feature of function application. This seems worth exploring, but is obviously beyond the scope of this SRFI.

Discussion: Delimiter options

Different or same escape characters in literal-text? There are multiple different escape character roles: first we have escapes in string-literal-part. Then in a named-literal-part we have escaped strings and characters (same as in string-literal-part), plus we have nested extended-datum-body. For the latter we prefer a single escape character for both uses, to avoid a proliferation of escape characters. Also, for consistency it seems better to use the same escape character and syntax for string and character escapes in both string-literal-part and named-literal-part. The conclusion seems to be we should use the same escape character in all roles (at least within a literal-part). As to which character to use, the most plausible choices seem to be &, @, or \.

Use & as escape character: Using & is compatible with XML, HTML, SGML, and also "XML literals" embedded in programming languages, including SRFI-107 (XML reader syntax).

Use \ as escape character: Using \ is of course compatible with standard Scheme string literals. Backslash has also been used as an escape in many languages, for string literals, regular expressions, shells, TeX, and more. If using \ as an escape for SRFI-109 strings, it would be tempting to enhance standard string literals with some of the same features, such as enclosed expressions. However, traditional C-style single-letter escapes, such as \n, cause a problem: You either don't allow them in the literal-part of this specification (in which case the latter is not a super-set of standard string escapes), or you need some non-letter prefix character in front of a cname, which is tedious.

Use @ as escape character: Using @ as the escape character goes back to Scribe, Texinfo, and Scribble. These are all markup languages, not programming languages. However, Scribble allows nested Racket Scheme expressions, and (if you select the at-exp Racket parser) you can also nest Scribble forms in a top-level Scheme program.

Braces vs brackets: The specification uses {curly braces} for quoted (literal) text, and uses [square brackets] to delimit unquoted expressions. This is compatible with Scribble, with BRL's use of square brackets, and with Tcl's use of brackets and braces. On the other hand, JavaFX Script used {curly braces} for escaped expressions. So did Kawa's XML literals. (However, Kawa XML literals support brackets, with braces kept as a deprecated alternative.)

Use braces only: Another option is to drop the escape character and just use brackets to enclose expressions, without a prefix character, as in:

Use implicit concatenation instead of enclosed expressions: Finally, it is possible to not have any support for expression escapes, but instead have a more compact format for concatenation. For example a string literal right next to an expression, with no space in between, could be defined as concatenation. Thus:

Single character to start quasi-literals: Next, when it comes to the Scheme expression level, we need an unambiguous character or sequence of characters to mark the start of a quasi-literal. If we use a single character, it makes sense for that character to match the literal-part escape character, since it eases nested named-literal-part forms.

Using \ as the start character does not appear to conflict with (draft-)R7RS, but it would be a conflict for many Scheme implementations that use \ as a single-escape character as in Common Lisp.

Using @ as the start character does not seem to conflict with standard Scheme, because it is not a valid identifier-start character. However, it might conflict with implementation extensions. (For example Kawa uses @ to name Java-style annotations.)

Using & as the start character may cause compatibility problems, since & is a valid <initial> character in standard Scheme, thus it might be difficult to disambiguate from an identifier. Some R6RS-based naming conventions use such names for record types or exception types. The sequence & followed by a name followed by brackets or braces is effectively non-conflicting: In a Scheme that defines brackets as equivalent to parentheses, the following is technically well-defined:

Starting quasi-literals with # and a dispatch character: Starting quasi-literals with #\ conflicts with character literals. Neither #& nor #@ appears problematic. However, starting a string literal such as #&{text} with 3 delimiter characters is rather ugly and easily mistyped.

Specification

Syntax

expression ::= ...
  | extended-datum-literal

extended-datum-literal ::=
    extended-datum-body
extended-datum-body ::=
    & cname { initial-ignored? named-literal-part* }
  | & cname [ expression* ]{ initial-ignored? named-literal-part* }
cname ::= tagname

An implementation may allow leaving out the braces if empty, i.e.:

extended-datum-body ::= ... as above ...
    | & cname [ expression* ]

However, note that according to R6RS &foo[abc] should be read as the symbol &foo followed by a list [abc] - i.e. as if it were &foo (abc). Implementations may handle this ambiguity differently, so portable programs should not leave out the empty braces.

For the definition and discussion of tagname see SRFI-109 (tagname).

The non-terminal named-literal-part is the same as string-literal-part in SRFI-109 (extended string quasi-literals), except for the support for a nested extended-datum-body.

named-literal-part ::=
    any character except &, { or }
  | { named-literal-part⁺ }
  | char-ref
  | entity-ref
  | special-escape
  | enclosed-part
  | extended-datum-body

The remaining non-terminals match those of SRFI-109 (extended string quasi-literals).

initial-ignored ::=
    intraline-whitespace line-ending intraline-whitespace &|
special-escape ::=
    intraline-whitespace &|
  | & nested-comment
  | &- intraline-whitespace line-ending
char-ref ::=
    &# digit⁺ ;
  | &#x hex-digit⁺ ;
entity-ref ::=
    & char-or-entity-name ;
opt-format-specifier ::= empty
  | ~ format-specifier-after-tilde
  | % format-specifier-after-percent
enclosed-part ::=
    & enclosed-modifier [ expression* ]
  | & enclosed-modifier ( expression⁺ )

An enclosed-modifier is normally empty, but implementations may support extensions (for example format specifiers); see discussion in SRFI-109.

enclosed-modifier ::= empty
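To illustrate the char-ref production, the following sketch decodes a complete reference such as &#955; or &#x3bb; into a character. It is a hedged example rather than part of the specification: the procedure names are hypothetical, the input is assumed to be exactly one char-ref held in a string, and integer->char may signal an error for values that are not valid Unicode scalar values.

```scheme
(import (scheme base))

;; Convert the digits in str[start, end) to an integer in the given
;; radix; return #f if the range is empty or holds a non-digit.
(define (digits->integer str start end radix)
  (and (< start end)
       (let loop ((i start) (acc 0))
         (if (= i end)
             acc
             (let ((d (string->number (string (string-ref str i)) radix)))
               (and d (loop (+ i 1) (+ (* acc radix) d))))))))

;; Decode "&#" digits ";" or "&#x" hex-digits ";"; return #f on any
;; input that does not match one of those two shapes.
(define (char-ref->char ref)
  (let ((len (string-length ref)))
    (and (>= len 4)
         (char=? (string-ref ref 0) #\&)
         (char=? (string-ref ref 1) #\#)
         (char=? (string-ref ref (- len 1)) #\;)
         (let* ((hex? (char=? (string-ref ref 2) #\x))
                (n (digits->integer ref (if hex? 3 2) (- len 1)
                                    (if hex? 16 10))))
           (and n (integer->char n))))))

(char-ref->char "&#955;")    ; Greek small letter lambda
(char-ref->char "&#x41;")    ; => #\A
(char-ref->char "&#x;")      ; => #f (no digits)
```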

Translation

Tr[&name [ expression* ]{ initial-ignored? content-piece* }]
   ⟾ ($construct$:name expression* $>>$ TrContent[content-piece]* )
Tr[&name { initial-ignored? content-piece* }]
   ⟾ ($construct$:name TrContent[content-piece]* )

TrContent is as in SRFI-109, except we add this rule:

TrContent[extended-datum-body]
  ⟾ Tr[extended-datum-body]
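The rules above can be sketched as a procedure over an already-parsed form. This is an illustration under stated assumptions, not a reader implementation: the parsed representation is hypothetical (a string for literal text, (enclosed expr ...) for an enclosed part, (datum tag initial content) for a nested extended-datum-body, and #f for initial when the bracket part is absent), and the handling of enclosed parts follows the $<<$ ... $>>$ marking described earlier.

```scheme
(import (scheme base))

;; Build the symbol $construct$:tag from a tag symbol.
(define (constructor-symbol tag)
  (string->symbol (string-append "$construct$:" (symbol->string tag))))

;; Tr: initial is #f (no bracket part) or a list of expressions.
;; When brackets are present, the expressions are followed by $>>$.
(define (tr tag initial content)
  (cons (constructor-symbol tag)
        (append (if initial (append initial '($>>$)) '())
                (tr-content content))))

;; TrContent over a list of content pieces.
(define (tr-content pieces)
  (if (null? pieces)
      '()
      (let ((piece (car pieces))
            (rest (tr-content (cdr pieces))))
        (cond ((string? piece) (cons piece rest))
              ((eq? (car piece) 'enclosed)
               (append '($<<$) (cdr piece) '($>>$) rest))
              ((eq? (car piece) 'datum)       ; nested: translate recursively
               (cons (apply tr (cdr piece)) rest))
              (else (error "unknown content piece" piece))))))

(tr 'URI #f '("http://example.com/"))
;; => ($construct$:URI "http://example.com/")

(tr 'sql '(conn) '("id=" (enclosed id)))
;; => ($construct$:sql conn $>>$ "id=" $<<$ id $>>$)

(tr 'p #f '("see " (datum b #f ("this"))))
;; => ($construct$:p "see " ($construct$:b "this"))
```

Note that a nested form becomes a single sub-list with no surrounding markers, which matches the added TrContent rule.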

Definitions

Implementation

Since this specification changes the reader format, and there is no standard Scheme way to do that, there is no portable implementation. However, this specification is being implemented in Kawa. (Check out the development version using Subversion.)

Test suite

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
