Title

Extended string quasi-literals

Author

Per Bothner <per@bothner.com>

Status

This SRFI is currently in "draft" status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 109 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Received: 2012/11/03
Revised: 2013/02/04
Revised: 2013/03/26
Revised: 2013/04/19
Draft: 2012/11/10-2013/01/10

Abstract

This specifies a reader extension for extended string quasi-literals, including nicer multi-line strings, and enclosed unquoted expressions.

This proposal is related to SRFI-108 (named quasi-literal constructors) and SRFI-107 (XML reader syntax), as they share quite a bit of syntax.

Rationale

This proposal aims to help with a number of related problems involving string literals.

Multi-line string literals

Standard Scheme literals are awkward for multi-line strings. One problem is that the same delimiter (double-quote) is used for both the start and end of the string. This is error-prone and not robust: adding or removing a single character changes the meaning of the entire rest of the program. A related problem is that if the delimiter appears in the string it needs to be quoted using an escape character, which can get hard to read. If we have distinct start and end delimiters, then we only need to escape unbalanced use of the delimiters.

A common solution is a here document, where distinct multi-character start and end delimiters are used. For example the Unix shell uses << followed by an arbitrary token as the start delimiter, and then the same token as the end delimiter:

tr a-z A-Z <<END_TEXT
one two three
uno dos tres
END_TEXT

This proposal uses just &{ and } as the default start and end delimiters, respectively:

(string-upcase &{
one two three
uno dos tres
})

Enclosed (unquoted) expressions

Commonly one wants to construct a string as a concatenation of literal text and evaluated expressions. Using explicit string concatenation (Scheme string-append or Java's + operator) is verbose and can be error-prone. Using format is an alternative, but it is also a bit verbose. Worse, the format specifier and the expression it controls are non-adjacent, which is awkward and error-prone. Nicer is to be able to use variable interpolation, as in Unix shells:

echo "Hello ${name}!"

This proposal uses the syntax:

&{Hello &[name]!}

Note that & is used both as part of the prefix &{ to mark the entire string, and as an escape character within the string. See the discussion of delimiter options in SRFI-108.

Template processing

Going one step further, a template processor has many uses. Examples include BRL and JSP, which are both used to generate web pages.

The simple solution is to allow general Scheme expressions in substitutions:

&{Hello &[(string-capitalize name)]!}

You can also leave out the square brackets when the expression is a parenthesized expression:

&{Hello &(string-capitalize name)!}

Note that this syntax for unquoted expressions matches that used in SRFI-107 (XML reader syntax).
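To make the meaning of these literals concrete, the following sketch writes out by hand the forms that the Translation section below says the reader produces. It uses only portable R7RS Scheme, so it runs without the reader extension. The definitions of $string$, $<<$ and $>>$ are the basic ones suggested under "Implementing the translated forms"; the variable name and the use of string-upcase (in place of the non-standard string-capitalize) are assumptions made for illustration.

```scheme
(import (scheme base) (scheme char) (scheme write))

;; Basic definitions, following "Implementing the translated forms".
(define $<<$ (make-string 0))
(define $>>$ (make-string 0))

(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each (lambda (arg) (display arg port)) args)
    (get-output-string port)))

;; `name` is an example variable, not part of the specification.
(define name "world")

;; Hand-written equivalent of &{Hello &[name]!}
(define plain ($string$ "Hello " $<<$ name $>>$ "!"))

;; Hand-written equivalent of &{Hello &(string-upcase name)!}
(define shouted ($string$ "Hello " $<<$ (string-upcase name) $>>$ "!"))

(display plain) (newline)
(display shouted) (newline)
```

Because $<<$ and $>>$ are empty strings, they contribute nothing to the result in this basic implementation; they only mark where an enclosed expression starts and ends.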

Indentation and line-endings

By default there is a one-to-one mapping between whitespace in the literal and the resulting string (except that line-ending is normalized to the newline character), but it is often convenient (or at least prettier) for them to be different.

You can of course easily add extra newline characters beyond those in the literal:

&{a&newline;b} ⟹ "a\nb"

Conversely, the line-continuation marker &- is used to suppress a newline:

&{abc&-
  def} ⟹ "abc  def"

The marker also suppresses any intraline whitespace between the &- and the newline, but it does not suppress intraline whitespace following the newline. In the latter respect it differs from the \ at the end of a line in an R6RS string literal.

Suppressing initial whitespace is more generally useful than just for continuation lines. For example it is important for properly indenting source code to match the program structure. The indentation marker &| is used to mark the end of insignificant initial whitespace, typically to indent strings inside a function. The &| characters and all the preceding whitespace are removed:

(display (string-upcase &{
     &|one two three
     &|uno dos tres
}) out)

As a matter of style, all of the indentation lines should line up: An implementation may warn if indentation is inconsistent. It is an error if there are any non-whitespace characters between the previous newline and the indentation marker. It is also an error to write an indentation marker before the first newline in the literal.

One does not normally want an initial newline in a multi-line string. However, as in the above example, the natural way to write this is with the left brace on the previous line; otherwise either the source is wrongly indented, or the matching columns in the result don't line up in the source. For that reason &| also suppresses an initial newline. Specifically, this applies when the initial left brace is followed by optional (invisible) intraline-whitespace, then a newline, then optional intraline-whitespace (the indentation), and finally the indentation marker &|; all of these are removed from the output. Otherwise the &| only removes initial intraline-whitespace on the same line (and itself).
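The effect of the indentation marker can be approximated by a small string-processing procedure. The following is only a sketch in portable R7RS Scheme: strip-indentation and its helpers are hypothetical names, the input is assumed to be the raw text between &{ and the closing brace with line endings already normalized to newline, and none of the error conditions described above are diagnosed.

```scheme
(import (scheme base) (scheme write))

;; Split a string at newline characters.
(define (split-lines str)
  (let loop ((chars (string->list str)) (cur '()) (acc '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse cur)) acc)))
          ((char=? (car chars) #\newline)
           (loop (cdr chars) '() (cons (list->string (reverse cur)) acc)))
          (else (loop (cdr chars) (cons (car chars) cur) acc)))))

(define (intraline-space? c) (or (char=? c #\space) (char=? c #\tab)))

;; Index just past a leading "<whitespace>&|", or #f if the line has none.
(define (marker-end line)
  (let ((n (string-length line)))
    (let loop ((i 0))
      (cond ((>= (+ i 1) n) #f)
            ((intraline-space? (string-ref line i)) (loop (+ i 1)))
            ((and (char=? (string-ref line i) #\&)
                  (char=? (string-ref line (+ i 1)) #\|))
             (+ i 2))
            (else #f)))))

;; Remove the marker and the whitespace before it, if present.
(define (strip-line line)
  (let ((end (marker-end line)))
    (if end (substring line end (string-length line)) line)))

;; True if the line contains only intraline whitespace (or nothing).
(define (blank? line)
  (let loop ((i 0))
    (or (= i (string-length line))
        (and (intraline-space? (string-ref line i)) (loop (+ i 1))))))

(define (join-lines lines)
  (let loop ((rest (cdr lines)) (acc (car lines)))
    (if (null? rest)
        acc
        (loop (cdr rest) (string-append acc "\n" (car rest))))))

(define (strip-indentation body)
  (let* ((lines (split-lines body))
         ;; Drop the initial newline when the first marker follows it.
         (lines (if (and (pair? (cdr lines))
                         (blank? (car lines))
                         (marker-end (cadr lines)))
                    (cdr lines)
                    lines)))
    (join-lines (map strip-line lines))))

(write (strip-indentation "\n     &|one two three\n     &|uno dos tres\n"))
(newline)
```

Applied to the body of the example above, the sketch drops the initial newline and the indentation but keeps the final newline.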

However, traditionally there should be a final newline in a multi-line string. So the following styles are suggested. If the text is at top-level, or more generally, the closing brace is in the first column, then write it like this:

(define help-message &{
   &|This is the first of 2 lines.
   &|This last line is followed by a final newline.
})

When the text is nested such that the closing brace should not be in the left column, then you can use an extra indentation marker, like this:

(display
  (string-upcase &{
     &|This is the first of 2 lines.
     &|This last line is followed by a final newline.
     &|})
  out)

Note that in the above there are 3 indentation markers, but the resulting string has 2 lines and a total of 2 newline characters, because the first indentation marker suppresses the initial newline.

If you do not want to end the final line with a newline, you can either use a line-continuation marker, or end the line with the closing brace:

(display (string-upcase &{
   &|This is the first of 2 lines.
   &|This last line is not followed by a final newline.}) out)

Embedded comments

For long strings it may be useful to embed comments, even though this is redundant since it could be done using enclosed expressions:

&{preamble &[#|ignore this part|#] postamble}

However, this seems clumsy, so this specification has a comment syntax:

&{preamble &#|ignore this part|# postamble}

For example for line numbers:

(display (string-upcase &{
     &|&#|line 1|#one two
     &|&#|line 2|# three
     &|&#|line 3|#uno dos tres
  }) out)

(It is tempting to allow comments before a &| indentation marker, but it entails more complexity than seems justified.)

Character escapes

We support the standard XML syntax for character references, using either decimal or hexadecimal values. The following string has two instances of the ASCII escape character, as either decimal 27 or hex 1B:

&{&#27;&#x1B;}
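As a small illustration of how the two spellings denote the same character, the following R7RS sketch decodes a character reference to a character. The procedure char-ref->char is a hypothetical helper, not part of this specification, and it assumes its argument is well-formed.

```scheme
(import (scheme base) (scheme write))

;; Decode a character reference such as "&#27;" or "&#x1B;" to a character.
;; Assumes `ref` is well-formed; no error checking is attempted.
(define (char-ref->char ref)
  (let* ((hex? (char=? (string-ref ref 2) #\x))
         (start (if hex? 3 2))
         (digits (substring ref start (- (string-length ref) 1)))
         (code (string->number digits (if hex? 16 10))))
    (integer->char code)))

;; Both references decode to the character with code 27.
(write (map (lambda (ref) (char->integer (char-ref->char ref)))
            '("&#27;" "&#x1B;")))
(newline)
```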

You can also use the pre-defined XML entity names:

&{&amp; &lt; &gt; &quot; &apos;} ⟹ "& < > \" '"

In addition, &lbrace; and &rbrace; can be used for the left and right curly brace:

&{&rbrace;_&lbrace;}  ⟹ "}_{"

Note that these are only needed for unbalanced braces:

&{A left brace '{' followed by a right brace '}' is ok.}
  ⟹ "A left brace '{' followed by a right brace '}' is ok."

An implementation must support the character names amp, lt, gt, quot, apos, lbrace, and rbrace. An implementation should support the standard XML entity names (though resource-limited or non-Unicode-based implementations are not required to). For example:

&{L&aelig;rdals&oslash;yri}
  ⟹ "Lærdalsøyri"

An implementation should also support the standard R7RS character names null, alarm, backspace, tab, newline, return, escape, space, and delete. For example:

&{&escape;&space;}

The reader translates the entity reference &name; to the variable reference $entity$:name. Therefore user-defined entity names are possible:

(define $entity$:crnl "\r\n")
&{&crnl;} ⟹ "\r\n"
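Since entity references become ordinary variable references, one way an implementation might provide the required names is simply to bind them as variables. The sketch below does this in portable R7RS Scheme and then writes out by hand the form that a literal with entity references would be read as. The $string$ definition is the basic one given later in this document; binding the predefined names this way is an assumption, not something the specification mandates.

```scheme
(import (scheme base) (scheme write))

(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each (lambda (arg) (display arg port)) args)
    (get-output-string port)))

;; The seven names every implementation must support.
(define $entity$:amp "&")
(define $entity$:lt "<")
(define $entity$:gt ">")
(define $entity$:quot "\"")
(define $entity$:apos "'")
(define $entity$:lbrace "{")
(define $entity$:rbrace "}")

;; A user-defined entity, as in the example above.
(define $entity$:crnl "\r\n")

;; Hand-written equivalent of &{a &lt; b &amp;&amp; c&crnl;}
(define text
  ($string$ "a " $entity$:lt " b " $entity$:amp $entity$:amp " c"
            $entity$:crnl))

(write text)
(newline)
```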

Possible extensions

This section discusses some ideas that seem worthwhile, but need more thought, so are deferred for now.

Special characters

Only the characters '{', '}', and '&' are reserved and thus need special escaping. Braces only need escaping when unbalanced, which is likely to be rare in both text and quoted programs, thus the only real problem is &. A common solution in other languages is doubling. That is one could read && as a single &. However, doubling is not otherwise used in Scheme, so it may not be worth adding as a special case.

It might be convenient to support standard string single-character slash escapes in some form. For example:

&{Hello!&\r&\n} ⟹ "Hello!\r\n"

Maybe not really needed, since one could just write:

&{Hello!&["\r\n"]}

Formatting

Many Scheme implementations use format for finer-grained control of the output. A problem with format is that the association between format specifiers and data expressions is positional, which is hard to read and error-prone. A better solution places the specifier adjacent to the data expression:

&{The response was &~,2f(* 100.0 (/ responses total))%.}

The reader would map this to:

($string$ "The response was " ($format$ "~,2f" (* 100.0 (/ responses total))) "%.")

A simple definition of $format$ :

(define ($format$ fmt . args) (apply format #f fmt args))

Implementations that support printf-style formatting can also optionally support those:

&{The response was &%.2f(* 100.0 (/ responses total))%.}

This would be read as:

($string$ "The response was " ($sprintf$ "%.2f" (* 100.0 (/ responses total))) "%.")

(The JavaFX Script language provided similar functionality.)

Internationalized strings

Internationalization refers to a framework so that text messages can be emitted in multiple (human) languages, depending on the user's preferred locale. See SRFI-29. Strings that may need to be translated are marked specially. For the sake of discussion we can use the prefix ^ followed by a key:

&^hello{Hello!}

Here the key is the string hello. At runtime this key is combined with the current language to produce a translated string. If no translation is found, then the string in the literal Hello! is used.

If there is no explicit key, the string is used as the key. In the following, "Hello!" is used as the key.

&^{Hello!}

Complex formats and internationalization

A simple implementation of $format$ as a call to the format function does not handle format specifiers that change the argument order. These are primarily useful for localizing messages, since one might want to change the argument order when translating from one language to another. Consider this warning message:

&^{'&[partition]' has only &[avail] bytes free.}

A translation might want to re-order the arguments, as if it were:

&^{Only &[avail] bytes free on '&[partition]'.}

That could be done if the translation database provides for a format that re-orders the arguments, perhaps using the tilde-asterisk format specifier forms. For example (to pick some hypothetical translation database syntax):

"'&[]' has only &[] bytes free." => "Only &~1@*~d[] bytes free on '&~0@*~s[]'."

It follows that we can't use a one-to-one translation from a format-specifier ($format$) to a call to the format function. Instead we need to work with a single format string constructed from the entire text to be localized. This complicates the implementation. The basic algorithm should be something like:

Construct a text-part by taking the literal text, format specifiers, and expanded entity-references. Leave out all the enclosed expressions. The exact translation format is to be specified, but one idea is to represent each enclosed expression by &[] if there is no format-specifier, and &[specifier] if there is one.
If translation is specified, create a translation-key: Either use an explicit translation-key given in the quasi-literal, or use the text-part as an implicit translation-key (GNU gettext-style). Look for a translation in the translation database. If one is found, use that as the translated text-part; otherwise use text-part as-is.
Convert the text-part to a format-string by escaping stand-alone ~ characters. Replace each &[] by ~a, and each &[specifier] by the specifier.
Invoke format with the resulting format string and the enclosed expressions as the arguments.
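The third step above, converting a text-part to a format string, can be sketched in portable R7RS Scheme as follows. The procedure name text-part->format-string is hypothetical, and the sketch assumes the tentative placeholder convention described above (&[] and &[specifier]), which the text itself says is not yet settled. A ~ inside a specifier is kept as-is; any other ~ is treated as stand-alone and doubled.

```scheme
(import (scheme base) (scheme write))

;; Convert a text-part to a format string:
;;   &[]           becomes ~a
;;   &[specifier]  becomes the specifier itself
;;   any other ~   is escaped as ~~
(define (text-part->format-string text)
  (let ((n (string-length text))
        (out (open-output-string)))
    (let loop ((i 0))
      (cond ((>= i n) (get-output-string out))
            ;; "&[" starts a placeholder for an enclosed expression.
            ((and (char=? (string-ref text i) #\&)
                  (< (+ i 1) n)
                  (char=? (string-ref text (+ i 1)) #\[))
             (let scan ((j (+ i 2)))
               (cond ((>= j n) (error "unterminated placeholder" text))
                     ((char=? (string-ref text j) #\])
                      (let ((spec (substring text (+ i 2) j)))
                        (write-string (if (string=? spec "") "~a" spec) out)
                        (loop (+ j 1))))
                     (else (scan (+ j 1))))))
            ;; A stand-alone tilde must be escaped for format.
            ((char=? (string-ref text i) #\~)
             (write-string "~~" out)
             (loop (+ i 1)))
            (else
             (write-char (string-ref text i) out)
             (loop (+ i 1)))))))

(write (text-part->format-string "'&[]' has only &[] bytes free."))
(newline)
```

The resulting string would then be passed to format together with the values of the enclosed expressions, as described in the final step.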

User-defined end token

Many languages, including the Bourne shell, allow for a user-defined end token. We could allow this as an option following a marker character, for example !:

(string-upcase &!END-TEXT{
one two three
uno dos tres
}!END-TEXT)

Specification

Syntax

expression ::= ...
  | extended-string-literal

extended-string-literal ::= &{ initial-ignored? string-literal-part* }
string-literal-part ::=
    any character except &, { or }
  | { string-literal-part* }
  | char-ref
  | entity-ref
  | special-escape
  | enclosed-part
char-ref ::=
    &# digit⁺ ;
  | &#x hex-digit⁺ ;
entity-ref ::=
    & char-or-entity-name ;
char-or-entity-name ::= tagname
initial-ignored ::=
    intraline-whitespace line-ending intraline-whitespace &|
special-escape ::=
    intraline-whitespace &|
  | & nested-comment
  | &- intraline-whitespace line-ending
enclosed-part ::=
    & enclosed-modifier [ expression* ]
  | & enclosed-modifier ( expression⁺ )

tagname ::= tagname-initial tagname-subsequent*
tagname-initial ::= letter
tagname-subsequent ::= tagname-initial | digit | - (hyphen) | _ (underscore) | . (period)

If we allowed tagname to be an arbitrary Scheme identifier there would be parsing difficulties. One problem is that we use &| to skip indentation, but R7RS identifier syntax uses | as a delimiter for symbols with special characters. Another conflict is if an implementation uses &~ or &% to indicate format specifiers, since these are allowed as R7RS identifier initial characters.

An implementation may extend tagname to match Name as defined by the XML 1.1 specification.
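The tagname grammar is simple enough to check with a short predicate. The following R7RS sketch is illustrative only: tagname? is a hypothetical name, and it uses char-alphabetic? and char-numeric? as stand-ins for letter and digit, which on Unicode-aware implementations accepts more characters than the R7RS letter and digit productions.

```scheme
(import (scheme base) (scheme char) (scheme write))

;; tagname-initial ::= letter
(define (tagname-initial? c) (char-alphabetic? c))

;; tagname-subsequent ::= tagname-initial | digit | - | _ | .
(define (tagname-subsequent? c)
  (or (tagname-initial? c)
      (char-numeric? c)
      (and (memv c '(#\- #\_ #\.)) #t)))

;; tagname ::= tagname-initial tagname-subsequent*
(define (tagname? str)
  (and (> (string-length str) 0)
       (tagname-initial? (string-ref str 0))
       (let loop ((i 1))
         (or (= i (string-length str))
             (and (tagname-subsequent? (string-ref str i))
                  (loop (+ i 1)))))))

(write (map tagname? '("crnl" "x-1.b_c" "1abc" "a|b" "")))
(newline)
```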

The following are defined by R7RS: nested-comment, intraline-whitespace, line-ending, letter, digit, and hex-digit.

enclosed-modifier ::= empty

An enclosed-modifier is normally empty. However, implementations or future extensions may support non-empty modifiers. For example, Kawa supports both format-style and printf-style specifiers, so the syntax is:

enclosed-modifier ::= empty
  | ~ format-specifier-after-tilde (optional feature)
  | % format-specifier-after-percent (optional feature)

Translation

When the Scheme reader reads an extended-string-literal it returns a list whose first element is the symbol $string$, and whose remaining elements are the translations of the string-literal parts. The literal content (including each char-ref but excluding each entity-ref) is translated to literal strings. An entity-ref &ename; is translated to a symbol $entity$:ename. Enclosed expressions are prefixed by a $<<$ symbol, and followed by a $>>$.

The translation is defined by a conceptual read-time rewrite function Tr which maps an extended-string-literal in the input stream to an equivalent $string$ list, which is then (conceptually) re-read. (A real reader would generate S-expression forms directly, but this way we can express the translation more concisely.)

Tr[&{ initial-ignored? content-segment* }]
   ⟾ ($string$ TrContent[content-segment]* )

Each segment corresponds to a string-literal-part in the syntax, except that a run of multiple plain characters and char-refs are combined to a single string literal. In addition the special-escape forms are dropped without appearing in the result.

TrContent[simple-text⁺]
   ⟾ " TrText[simple-text]⁺ "
TrText[any character except &, \, line-ending, or final (unbalanced) }]
  ⟾ that character as-is
TrText[line-ending]
  ⟾ \n
TrText[\]
  ⟾ \\
TrText[&#x hex-digit⁺ ;]
  ⟾ \x hex-digit⁺ ;
TrText[&# digit⁺ ;]
  ⟾ \x corresponding hex-digits ;
TrText[& nested-comment]
  ⟾ 
TrText[intraline-whitespace &|]
  ⟾ 
TrText[&- intraline-whitespace line-ending]
  ⟾

Translations for the other segment kinds are straightforward:

TrContent[&ename;]
   ⟾ $entity$:ename
TrContent[&( expression⁺ )]
   ⟾  $<<$ ( expression⁺ ) $>>$
TrContent[&[ expression* ]]
   ⟾  $<<$ expression* $>>$
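The rules above can be mirrored by a small procedure that works on already-parsed segments rather than on characters. This is only a sketch in portable R7RS Scheme: the segment representation (a string for literal text, (entity "name") for an entity-ref, and (enclosed expr ...) for an enclosed part) and the names tr and tr-content are assumptions made for illustration, and the character-level TrText rules are taken as already applied.

```scheme
(import (scheme base) (scheme write))

;; Hypothetical pre-parsed segment representation (not part of the SRFI):
;;   "text"                 literal text
;;   (entity "name")        an entity-ref &name;
;;   (enclosed expr ...)    an enclosed part &[expr ...] or &(expr ...)
(define (tr-content segment)
  (cond ((string? segment) (list segment))
        ((eq? (car segment) 'entity)
         (list (string->symbol
                (string-append "$entity$:" (cadr segment)))))
        ((eq? (car segment) 'enclosed)
         (append '($<<$) (cdr segment) '($>>$)))
        (else (error "unknown segment" segment))))

;; Build the ($string$ ...) list for a whole literal.
(define (tr segments)
  (cons '$string$ (apply append (map tr-content segments))))

;; Segments corresponding to &{Hello &(string-capitalize name)!&crnl;}
(write (tr '("Hello "
             (enclosed (string-capitalize name))
             "!"
             (entity "crnl"))))
(newline)
```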

The following are optional and/or for a future specification:

TrContent[&~ format ( expression⁺ )]
   ⟾  ($format$ "format" ( expression⁺ ))
TrContent[&~ format [ expression⁺ ]]
   ⟾  ($format$ "format" expression⁺ )
TrContent[&% format ( expression⁺ )]
   ⟾  ($sprintf$ "format" ( expression⁺ ))
TrContent[&% format [ expression⁺ ]]
   ⟾  ($sprintf$ "format" expression⁺ )

Implementing the translated forms

The reader translation:

($string$ form ...)

evaluates approximately to an immutable string created by concatenating each form. A basic implementation could be:

(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each
     (lambda (arg) (display arg port))
     args)
    (get-output-string port)))

The string created by a $string$ form is immutable, and need not have a unique identity. E.g. if the operands are constant then an implementation is allowed to constant-fold the expression to a string literal.

In addition $<<$ and $>>$ are constant unique zero-length strings. These bindings should preferably be non-assignable if an implementation has a mechanism for that (for example using identifier macros).

(define $<<$ (make-string 0))
(define $>>$ (make-string 0))

Note that R6RS and R7RS allow eq? to return #t for distinct calls to (make-string 0). A hypothetical implementation that does so needs to initialize $<<$ and $>>$ some other way.

If $format$ is supported, a minimal implementation is:

(define-syntax $format$
  (syntax-rules ()
    (($format$ fmt arg ...)
     (format #f fmt arg ...))))

Implementation

Since this specification changes the reader format, and there is no standard Scheme way to do that, there is no portable implementation. However, this specification is being implemented in Kawa. (Check out the development version using Subversion.)

A more sophisticated implementation of the $string$ macro which maps to a single format call is at the time of writing in syntax.scm.

Test suite

There is a test suite in the Kawa source tree. There are also tests of malformed literals.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Author: Per Bothner

Editor: Mike Sperber