Title

Extended string quasi-literals

Author

Per Bothner <per@bothner.com>

Status

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 109 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Abstract

This specifies a reader extension for extended string quasi-literals, including nicer multi-line strings, enclosed unquoted expressions, and formatting.

This proposal is related to SRFI-108 (named quasi-literal constructors) and SRFI-107 (XML reader syntax), as they share quite a bit of syntax.

Rationale

This proposal aims to address several related problems with string literals.

Note that the section Discussion: Delimiter options in SRFI-108 discusses alternative delimiter characters. The syntax examples show two plausible syntax choices: what I call xml-style (the #&[...] forms) and scribble-style (the @{...} forms). Where both are shown, the xml-style example comes first.

Multi-line string literals

Standard Scheme literals are awkward for multi-line strings. One problem is that the same delimiter (double-quote) is used for both the start and end of the string. This is error-prone and not robust: adding or removing a single character changes the meaning of the entire rest of the program. A related problem is that if the delimiter appears in the string it needs to be quoted using an escape character, which can become hard to read. If we have distinct start and end delimiters, then we only need to escape unbalanced use of the delimiters.

A common solution is a here document, where distinct multi-character start and end delimiters are used. For example, the Unix shell uses << followed by an arbitrary token as the start delimiter, and then the same token as the end delimiter:

tr a-z A-Z <<END_TEXT
one two three
uno dos tres
END_TEXT

This proposal uses just #&[ and ] as the default start and end delimiters, respectively:

(string-upcase #&[
one two three
uno dos tres
])

or, if the consensus prefers Scribble-style syntax:

(string-upcase @{
one two three
uno dos tres
})

Discussion: It may be useful to allow an option to use a user-defined token, following a marker character, for example !:

(string-upcase #&!END-TEXT[
one two three
uno dos tres
!END-TEXT])
(string-upcase @!END-TEXT{
one two three
uno dos tres
!END-TEXT})
Perhaps the end delimiter should be like this instead:
]END-TEXT!)
}END-TEXT!)

Indentation and line-endings

It is nice to have a feature where continuation lines can be indented relative to the surrounding expression context. The characters &| or @| are only allowed following initial whitespace on a (source) line. Those characters and all the preceding whitespace are removed:

(display (string-upcase #&[
     &|one two three
     &|uno dos tres
]) out)
(display (string-upcase @{
     @|one two three
     @|uno dos tres
}) out)
(An implementation is encouraged to warn if indentation is inconsistent.)
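
As an illustration only, the marker-stripping rule can be modelled outside the reader. The following Python sketch handles just the xml-style &| marker; the helper name strip_indent_markers is hypothetical and not part of this proposal. It leaves the newline after the start delimiter untouched.

```python
import re
import warnings

# Hypothetical helper, not part of the proposal: models removal of the
# xml-style indentation marker "&|" plus the whitespace that precedes it.
_INDENT_MARKER = re.compile(r"^([ \t]*)&\|", re.MULTILINE)

def strip_indent_markers(body: str) -> str:
    """Return body with every leading 'whitespace + &|' prefix removed."""
    # Collect the width of the whitespace before each marker so that
    # inconsistent indentation can be reported, as the text encourages.
    widths = {len(m.group(1)) for m in _INDENT_MARKER.finditer(body)}
    if len(widths) > 1:
        warnings.warn("inconsistent indentation before &| markers")
    return _INDENT_MARKER.sub("", body)

# The text between "#&[" and "]" in the example above.
body = "\n     &|one two three\n     &|uno dos tres\n"
print(repr(strip_indent_markers(body)))  # '\none two three\nuno dos tres\n'
```

A marker that does not follow initial whitespace is left alone by this sketch; a real reader would presumably report it as an error.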

Discussion: How about initial and final newlines? Normally, for a multi-line string you want each line to be ended by a newline. In a multi-line literal, you want to start the first line on a fresh line following the start delimiter [. This suggests the following: If a start delimiter is followed immediately by a newline (with possible spaces between the delimiter and the newline), then that initial newline is ignored. The alternative is to require use of a continuation (line-join) marker on the first line following the delimiter. That may be more logical but less friendly.

On the other hand, a final newline should not be suppressed. The above example gives the desired result: two lines, each terminated by a newline. However, the code looks prettier if the closing delimiter ] can be indented, without adding extra final whitespace:

(display (string-upcase #&[
     &|one two three
     &|uno dos tres
  ]) out)
(display (string-upcase @{
     @|one two three
     @|uno dos tres
  }) out)
This would create the undesired extra spaces unless there is a rule to suppress them. Instead of adding such a rule, people could use indentation markers, though that is rather ugly:
(display (string-upcase #&[
     &|one two three
     &|uno dos tres
     &|]) out)
(display (string-upcase @{
     @|one two three
     @|uno dos tres
     @|}) out)
It might be useful to allow comments at the end of each line. For example, this facility could be used for line numbers:
(display (string-upcase #&[
     &|one two three &;; 1
     &|uno dos tres &;;  2
  ]) out)
(display (string-upcase @{
     @|one two three @;; 1
     @|uno dos tres @;;  2
  }) out)

Here &; can be followed by horizontal whitespace or comments; both the &; and the whitespace and comments are ignored. (It is an error if there is anything else following.) This is useful to add per-line comments. It is also useful to indicate that the line ends with whitespace; adding &; after the included whitespace makes that clear.

The marker &- is used to suppress a newline:

(display (string-upcase #&[
     &|one two&-
     &| three
     &|uno dos tres
  ]) out)
(display (string-upcase @{
     @|one two@-
     @| three
     @|uno dos tres
  }) out)
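
The interaction of the three line markers can be sketched as follows. This is a rough Python model of the xml-style forms only, with a hypothetical function name. It assumes that indentation markers are removed before lines are joined, and it does not check the rule that only whitespace or comments may follow &;.

```python
import re

def apply_line_markers(body: str) -> str:
    """Model the &| , &; and &- markers on the raw text of a literal."""
    # 1. Remove "&|" together with the whitespace that precedes it.
    body = re.sub(r"^[ \t]*&\|", "", body, flags=re.MULTILINE)
    # 2. Drop "&;" and the rest of that line; the newline is kept, and
    #    so is any whitespace written before the "&;".
    body = re.sub(r"&;[^\n]*", "", body)
    # 3. "&-" at the end of a line suppresses the newline that follows.
    body = re.sub(r"&-\n", "", body)
    return body

joined = apply_line_markers("     &|one two&-\n     &| three\n     &|uno dos tres\n")
print(repr(joined))  # 'one two three\nuno dos tres\n'
```

Doing step 1 first matters: once two lines are joined, the second line's &| no longer follows initial whitespace.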

Character escapes

(This discussion assumes "xml style". I haven't decided on how best to do character escapes in Scribble style.)

We support the standard XML syntax for character references, using either decimal or hexadecimal values. The following string has two instances of the ASCII escape character, as either decimal 27 or hex 1B:

#&[&#27;&#x1B;]

Design note: Note we use #& to introduce a literal, and &# for a character escape. This could be confusing, but we assume numeric character escapes will be rare.
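
For readers unfamiliar with XML character references, here is a small Python sketch of how the two numeric forms could be decoded. The function name expand_char_refs is an assumption made for illustration; named references are not handled.

```python
import re

# Matches &#digits; or &#xhex-digits; as in XML character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_char_refs(text: str) -> str:
    """Replace each numeric character reference by the character it names."""
    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits, 10))
    return _CHAR_REF.sub(replace, text)

# Both references in the example above denote the escape character (U+001B).
print(expand_char_refs("&#27;&#x1B;") == "\x1b\x1b")  # True
```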

You can also use the pre-defined XML entity names:

#&[&amp; &lt; &gt;] ==> "& < >"
You can also use the standard Scheme character names, for example:
#&[&esc;&space;]
Similar to Scheme character literals, a single-character name names that character. This provides a reasonable solution to escaping the special characters:
#&[&{; &}; &[; &]; &&;] ==> "{ } [ ] &"
Discussion: Alternatively we could introduce names for these:
#&[&lcurly;&rcurly;&lsquare;&rsquare;] ==> "{}[]"

Discussion: Should we also support the standard string single-character slash escapes in some form? For example:

#&[Hello&\r&\n] ==> "Hello\r\n"
Maybe not really needed, since one could just write:
#&[Hello&{"\r\n"}]

Discussion: Instead of &lcurly; &rcurly; &lsquare; &rsquare; would other names be better? For example: &lbrace; &rbrace; &lbracket; &rbracket;. If we support slash-forms perhaps we don't need them, but instead one could write: &\{ &\} &\[ &\].

Discussion: Should we allow user-defined "entity" strings? For example:

(define crnl "\r\n")
#&[&crnl;] ==> "\r\n"

Enclosed (unquoted) expressions

Commonly one wants to construct a string as a concatenation of literal text with evaluated expressions. Using explicit string concatenation (Scheme string-append or Java's + operator) is verbose and can be error-prone. Using format is an alternative, but it is also a bit verbose, and has the problem that the format specifier in the string is widely separated from the expression. It is nicer to be able to use variable interpolation, as in Unix shells:

echo "Hello ${name}!"

This proposal uses the syntax:

#&[Hello &{name}!]
@{Hello @[name]!}

Note that & is used for two different but related purposes: as part of the prefix #&[ that marks the entire string, and as the escape character for variable interpolation. This will be justified shortly.

Template processing

Going one step further, a template processor has many uses. Examples include BRL and JSP, which are both used to generate web pages.

The simple solution is to allow general Kawa expressions in substitutions:

#&[Hello &{(string-capitalize name)}!]
@{Hello @[(string-capitalize name)]!}

You can also leave out the curly braces (square brackets in Scribble style) when the expression is a parenthesized expression:

#&[Hello &(string-capitalize name)!]
@{Hello @(string-capitalize name)!}

Note that this syntax for unquoted expressions matches that used in SRFI-107 (XML reader syntax).
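
To make the two enclosed forms concrete, the following Python sketch splits the body of an xml-style literal into literal text and enclosed expression source. It is a simplified model under stated assumptions: the name split_parts is hypothetical, only &{...} and &(...) are recognized, and delimiters inside Scheme strings or comments in the enclosed expression are not handled.

```python
def split_parts(body: str):
    """Split a literal body into str items and ("expr", source) tuples."""
    closers = {"{": "}", "(": ")"}
    parts, text, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "&" and i + 1 < len(body) and body[i + 1] in closers:
            opener = body[i + 1]
            closer = closers[opener]
            # Scan to the matching closer, counting nested delimiters.
            depth, j = 1, i + 2
            while j < len(body) and depth > 0:
                if body[j] == opener:
                    depth += 1
                elif body[j] == closer:
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unbalanced enclosed expression")
            if text:
                parts.append("".join(text))
                text = []
            source = body[i + 2 : j - 1]
            if opener == "(":
                # &(...) is shorthand: the parentheses belong to the expression.
                source = "(" + source + ")"
            parts.append(("expr", source))
            i = j
        else:
            text.append(ch)
            i += 1
    if text:
        parts.append("".join(text))
    return parts

print(split_parts("Hello &{name}!"))
# ['Hello ', ('expr', 'name'), '!']
print(split_parts("Hello &(string-capitalize name)!"))
# ['Hello ', ('expr', '(string-capitalize name)'), '!']
```

Both spellings of the capitalized greeting yield the same expression source, which is the point of the shorthand.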

Formatting

Many Scheme implementations use format for finer-grained control of the output. A problem with format is that the association between format specifiers and data expressions is positional, which is harder to read and error-prone. A better solution places the specifiers adjacent to the data expressions. This proposal allows an enclosed expression to be prefixed by a format specifier:

#&[The response was &~,2f(* 100.0 (/ responses total))%.]
@{The response was @~,2f(* 100.0 (/ responses total))%.}

This is an optional extension for implementations that support format. Implementations that support printf-style formatting can also optionally support those:

#&[The response was &%.2f(* 100.0 (/ responses total))%.]
@{The response was @%.2f(* 100.0 (/ responses total))%.}

(The JavaFX Script language provided similar functionality.)

Internationalized strings

(Probably this should be a separate SRFI.)

See SRFI-29.

((motivation))

An internationalized string is marked with the prefix ^ followed by a key:

#&^hello[Hello!]
@^hello{Hello!}
Here the key is the string hello. At runtime this key is combined with the current language to produce a translated string. If no translation is found, the string in the literal, Hello!, is used.

(Formatting makes this more complicated. Note that a translation might want to use different formats, and may also want to re-order the arguments, so positional formats must be supported and be part of the translation.)

If there is no explicit key, the string is used as the key. In the following, "Hello!" is used as the key.

#&^[Hello!]
@^{Hello!}

Specification

Syntax - if using & as escape (XML-style)

expression ::= ...
  | extended-string-literal
extended-string-literal ::= #&[string-literal-part...]
string-literal-part ::=
    any character except &, [ or ]
  | [string-literal-part...]
  | char-or-entity-ref
  | special-escape
  | enclosed-part
special-escape ::=
    ignored-whitespace&|
  | TBD (at least line pasting and comments)
char-or-entity-ref ::=
    &char-or-entity-name;
  | &#digits;
  | &#xhex-digits;
opt-format-specifier ::= empty
  | ~format-specifier-after-tilde
  | %format-specifier-after-percent
enclosed-part ::=
    &opt-format-specifier{expression ...}
  | &opt-format-specifier(expression...)

Syntax - if using @ as escape (Scribble-style)

expression ::= ...
  | extended-string-literal
extended-string-literal ::= @{string-literal-part...}
string-literal-part ::=
    any character except @, { or }
  | {string-literal-part...}
  | char-or-entity-ref
  | special-escape
  | enclosed-part
special-escape ::=
    ignored-whitespace@|
  | TBD (at least line pasting and comments)
char-or-entity-ref ::=
    @char-or-entity-name;
  | @#digits;
  | @#xhex-digits;
opt-format-specifier ::= empty
  | ~format-specifier-after-tilde
  | %format-specifier-after-percent
enclosed-part ::=
    @opt-format-specifier[expression ...]
  | @opt-format-specifier(expression...)

Translation

Tr[#&[content-piece...]]
  ⟾ ($quasi-string$ TrContent[content-piece]...)
TrChar[any character except &, or <]
  ⟾ any character except &, or <
TrChar[&#x hex-digit hex-digit... ;]
  ⟾ \xhex-digit hex-digit... ;
TrChar[&# digit digit... ;]
  ⟾ \xcorresponding hex-digits;
TrContent[simple-char...]
  ⟾ "TrChar[simple-char]..."
TrContent[&cname;]
  ⟾ ($entity-reference$ cname )
TrContent[&~format (expression )]
  ⟾ ($format$ "format" (expression))
TrContent[&~format { expression... }]
  ⟾ ($format$ "format" expression...)

Translated forms

($quasi-string$ form ...)
evaluates approximately to an immutable string created by concatenating each form. Thus a basic implementation would be to expand to:
(string-append (format #f "~a" form) ...)

This assumes we translate:

($format$ format-specifier form ...)
to the obvious call to format:
(format #f format-specifier form ...)

However, this doesn't work as desired if the format-specifier contains specifiers that change the argument order. This would be rare as entered by a programmer, but it can happen for localized (translated) text. Consider:

#&^['&{partition}' has only &{avail} bytes free.]

A translation might want to re-order the arguments, as if it were:

#&^[Only &{avail} bytes free on '&{partition}'.]

That could be done if the translation database provides for a format that re-orders the arguments, perhaps using the tilde-asterisk format specifier forms. For example (to pick some hypothetical translation database syntax):

"'&{}' has only &{} bytes free." => "Only &~1@*~d{} bytes free on '&~0@*~s{}'."

Supporting localization and complex formats requires a more complex implementation:

  1. Construct a text-part by taking the literal text, format specifiers, and expanded entity-references. Leave out all the enclosed expressions. (Exact translation format to be specified.)
  2. If translation is specified, create a translation-key: Either use an explicit translation-key given in the quasi-literal, or use the text-part as an implicit translation-key (GNU gettext-style). Look for a translation in the translation database. If one is found, use that as the translated text-part; otherwise use text-part as-is.
  3. Convert the text-part to a format-string by escaping stand-alone ~ characters. Remove {} pairs if they're preceded by a format specifier, or replace the pairs with ~a if there is no format specifier.
  4. Invoke format with the resulting format string and the enclosed expressions as format arguments.
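
Steps 2 and 3 above can be sketched in Python as follows. This is only a model under stated assumptions: the text-part syntax is the hypothetical one used in the translation example (each enclosed expression written as & plus an optional format specifier plus {}), the function names are invented here, and the translation database is a plain dictionary. Step 4, the call to format, is not modelled.

```python
import re

# Assumed text-part syntax (the proposal leaves the exact format open):
# "&" + optional format specifier + "{}" marks one enclosed expression.
_PLACEHOLDER = re.compile(r"&([^&{}]*)\{\}")

def to_format_string(text_part: str) -> str:
    """Step 3: turn a text-part into a format string."""
    out, pos = [], 0
    for match in _PLACEHOLDER.finditer(text_part):
        # Literal text: a stand-alone ~ is escaped as ~~.
        out.append(text_part[pos:match.start()].replace("~", "~~"))
        spec = match.group(1)
        # Keep an explicit specifier and drop the {}; otherwise use ~a.
        out.append(spec if spec else "~a")
        pos = match.end()
    out.append(text_part[pos:].replace("~", "~~"))
    return "".join(out)

def localized_format_string(text_part, key=None, translations=None):
    """Steps 2 and 3: look up a translation, then build the format string."""
    translations = translations or {}
    # Without an explicit key the text-part itself is the key (gettext-style).
    lookup_key = key if key is not None else text_part
    translated = translations.get(lookup_key, text_part)
    return to_format_string(translated)

source = "'&{}' has only &{} bytes free."
database = {source: "Only &~1@*~d{} bytes free on '&~0@*~s{}'."}
print(localized_format_string(source))
# '~a' has only ~a bytes free.
print(localized_format_string(source, translations=database))
# Only ~1@*~d bytes free on '~0@*~s'.
```

The resulting string, together with the enclosed expressions in their original order, is what step 4 would pass to format.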

This procedure can be optimized at compile time if there is no localization and there are no format specifiers. Unfortunately, if translation is required, then we basically have to convert the $quasi-string$ back to a text-part string, translate it, and then parse the translated string. Only the first part can be done at compile-time. This unparsing and reparsing is hard to avoid as long as the translation mappings are in text form, and anything else seems difficult to work with.

Implementation

Syntax

Since this specification changes the reader format, and there is no standard Scheme way to do that, there is no portable implementation. However, this specification is being implemented in Kawa.

Translation

Test suite

Copyright

Copyright (C) Per Bothner 2012

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Author: Per Bothner
Editor: Mike Sperber