Per Bothner
<per@bothner.com>
This SRFI is currently in "draft" status. To see an explanation of each status that a SRFI can hold, see here.
To provide input on this SRFI, please mail to <srfi minus 109 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.
This specifies a reader extension for extended string quasi-literals, including nicer multi-line strings, enclosed unquoted expressions, and formatting.
This proposal is related to SRFI-108 (named quasi-literal constructors) and SRFI-107 (XML reader syntax), as they share quite a bit of syntax.
This proposal addresses several related problems with string literals.
Note that the section "Discussion: Delimiter options" in SRFI-108 discusses alternative delimiter characters.
The syntax examples show two plausible syntax choices: what I call xml-style (the #&[...] forms, in red) or scribble-style (the @{...} forms, in green).
Standard Scheme literals are awkward for multi-line strings.
One problem is that the same delimiter (double-quote) is used for both
the start and end of the string. This is error-prone and not robust:
adding or removing a single character changes the meaning of the entire
rest of the program.
A related problem is that if the delimiter appears in the string it needs to be quoted using an escape character, which can become hard to read. If we have distinct start and end delimiters, then we only need to escape unbalanced uses of the delimiters.
A common solution is a here document, where distinct multi-character start and end delimiters are used. For example the Unix shell uses << followed by an arbitrary token as the start delimiter, and then the same token as the end delimiter:
tr a-z A-Z <<END_TEXT
one two three
uno dos tres
END_TEXT
This proposal uses just #&[ and ] as the default start and end delimiters, respectively:
(string-upcase #&[
one two three
uno dos tres
])
or, if the consensus prefers Scribble-style syntax:
(string-upcase @{
one two three
uno dos tres
})
Discussion: It may be useful to allow an option to use a user-defined token, following a marker character, for example !:
(string-upcase #&!END-TEXT[
one two three
uno dos tres
!END-TEXT])
(string-upcase @!END-TEXT{
one two three
uno dos tres
!END-TEXT})
Perhaps the end delimiter should be like this instead:
]END-TEXT!)
}END-TEXT!)
It is nice to have a feature where continuation lines can be indented relative to the surrounding expression context. The characters &| or @| are only allowed following initial whitespace on a (source) line. Those characters and all the preceding whitespace are removed:
(display (string-upcase #&[
    &|one two three
    &|uno dos tres
]) out)
(display (string-upcase @{
    @|one two three
    @|uno dos tres
}) out)
(An implementation is encouraged to warn if indentation is inconsistent.)
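The indentation-marker rule can be modelled outside a Scheme reader. The following is a minimal sketch in Python, not part of the proposal: the helper name strip_indent_markers is hypothetical, it handles only the xml-style &| marker, and it assumes the text after the start delimiter's line has already been isolated as a string.

```python
import re

# Hypothetical helper, for illustration only: it models just the "&|" rule.
# A marker counts only when nothing but spaces or tabs precedes it on the line.
_INDENT_MARKER = re.compile(r"^[ \t]*&\|", re.MULTILINE)


def strip_indent_markers(body: str) -> str:
    """Remove each leading-whitespace run plus its "&|" marker."""
    return _INDENT_MARKER.sub("", body)


# The two content lines of the example, as they appear in the source file.
body = "    &|one two three\n    &|uno dos tres\n"
print(strip_indent_markers(body), end="")
```

A marker that appears after other text on a line is left alone by this sketch; a real reader would presumably report it as an error, since the marker is only allowed after initial whitespace.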
Discussion: How about initial and final newlines? Normally, for a multi-line string you want each line to be ended by a newline. In a multi-line literal, you want to start the first line on a fresh line following the start delimiter [. This suggests the following: if a start delimiter is followed immediately by a newline (with possible spaces between the delimiter and the newline), then that initial newline is ignored. The alternative is to require use of a continuation (line-join) marker on the first line following the delimiter. That may be more logical but less friendly.
On the other hand, a final newline should not be suppressed. The above example gives the desired result: two lines, each terminated by a newline. However, the code looks prettier if the closing delimiter ] can be indented, without adding extra final whitespace:
(display (string-upcase #&[
    &|one two three
    &|uno dos tres
    ]) out)
(display (string-upcase @{
    @|one two three
    @|uno dos tres
    }) out)
This would create the undesired extra spaces unless there is a rule to suppress them. Instead of adding such a rule, people could use indentation markers, though that is rather ugly:
(display (string-upcase #&[
    &|one two three
    &|uno dos tres
    &|]) out)
(display (string-upcase @{
    @|one two three
    @|uno dos tres
    @|}) out)
It might be useful to allow comments at the end of each line. For example, this facility could be used for line numbers:
(display (string-upcase #&[
    &|one two three &;; 1
    &|uno dos tres &;; 2
]) out)
(display (string-upcase @{
    @|one two three @;; 1
    @|uno dos tres @;; 2
}) out)
Here &; can be followed by horizontal whitespace or comments; the &;, the whitespace, and the comments are all ignored. (It is an error if there is anything else following.) This is useful to add per-line comments. It is also useful to indicate that the line ends with whitespace; adding &; after the included whitespace makes that clear.
The marker &- is used to suppress a newline:
(display (string-upcase #&[
    &|one two&-
    &| three
    &|uno dos tres
]) out)
(display (string-upcase @{
    @|one two@-
    @| three
    @|uno dos tres
}) out)
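As a rough illustration of the two end-of-line markers, here is a Python sketch that applies them to text whose indentation markers have already been removed. It is not the proposal's algorithm. The name apply_line_end_markers is hypothetical, only a single ; line comment is recognised after &;, &- is assumed to be the last thing on its line, and other escapes such as &&; are not handled.

```python
import re

# Assumption: "&;" may be followed by spaces/tabs and at most one ";" comment.
_LINE_COMMENT = re.compile(r"&;[ \t]*(?:;[^\n]*)?")
# Assumption: "&-" is immediately followed by the newline it suppresses.
_LINE_JOIN = re.compile(r"&-\n")


def apply_line_end_markers(body: str) -> str:
    """Drop "&;" markers with their trailing comment, then join "&-" lines."""
    without_comments = _LINE_COMMENT.sub("", body)
    return _LINE_JOIN.sub("", without_comments)


text = "one two&-\n three\nuno dos tres &;; 2\n"
print(repr(apply_line_end_markers(text)))
```

Note that whitespace before &; is kept, which is what makes the marker useful for showing that a line deliberately ends in whitespace.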
(This discussion assumes "xml style". I haven't decided on how best to do character escapes in Scribble style.)
We support the standard XML syntax for character references, using either decimal or hexadecimal values. The following string has two instances of the ASCII escape character, as either decimal 27 or hex 1B:
#&[&#27;&#x1B;]
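To make the two numeric forms concrete, the sketch below decodes decimal (&#digits;) and hexadecimal (&#xhex-digits;) character references in Python. The function name decode_char_refs is hypothetical, and named entities are deliberately left out.

```python
import re

# Matches "&#x<hex>;" (group 1) or "&#<decimal>;" (group 2).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_char_refs(text: str) -> str:
    """Replace each numeric character reference with the character it names."""

    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))

    return _CHAR_REF.sub(replace, text)


# Decimal 27 and hex 1B both name the ASCII escape character.
decoded = decode_char_refs("&#27;&#x1B;")
print([hex(ord(ch)) for ch in decoded])
```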
Design note: We use #& to introduce a literal, and &# for a character escape. This could be confusing, but we assume numeric character escapes will be rare.
You can also use the pre-defined XML entity names:
#&[&amp; &lt; &gt;] ==> "& < >"
You can also use the standard Scheme character names, for example:
#&[&esc;&space;]
Similar to Scheme character literals, a single-character name names that character. This provides a reasonable solution to escaping the special characters:
#&[&{; &}; &[; &]; &&;] ==> "{ } [ ] &"
Discussion: Alternatively we could introduce names for these:
#&[&lcurly;&rcurly;&lsquare;&rsquare;] ==> "{}[]"
Discussion: Should we also support the standard string single-character slash escapes in some form? For example:
#&[Hello!&\r&\n] ==> "Hello!\r\n"
Maybe not really needed, since one could just write:
#&[Hello!&{"\r\n"}]
Discussion: Instead of &lcurly; &rcurly; &lsquare; &rsquare; would other names be better? For example: &lbrace; &rbrace; &lbracket; &rbracket;. If we support slash-forms perhaps we don't need them, but instead one could write: &\{ &\} &\[ &\].
Discussion: Should we allow user-defined "entity" strings? For example:
(define crnl "\r\n")
#&[&crnl;] ==> "\r\n"
Commonly one wants to construct a string as a concatenation of literal text with evaluated expressions. Using explicit string concatenation (Scheme string-append or Java's + operator) is verbose and can be error-prone. Using format is an alternative, but it is also a bit verbose, and has the problem that the format specifier in the string is widely separated from the expression.
Nicer is to be able to use variable interpolation, as in Unix shells:
echo "Hello ${name}!"
This proposal uses the syntax:
#&[Hello &{name}!]
@{Hello @[name]!}
Note that & is used for two different but related purposes: as part of the prefix #&[ to mark the entire string, and as an escape character for the variable interpolation. This will be justified shortly.
The simple solution is to allow general Kawa expressions in substitutions:
#&[Hello &{(string-capitalize name)}!]
@{Hello @[(string-capitalize name)]!}
You can also leave out the curly braces when the expression is a parenthesized expression:
#&[Hello &(string-capitalize name)!]
@{Hello @(string-capitalize name)!}
Note that this syntax for unquoted expressions matches that used in SRFI-107 (XML reader syntax).
Many Scheme implementations use format for finer-grained control of the output. A problem with format is that the association between format specifiers and data expressions is positional, which is harder to read and error-prone. A better solution would place these adjacent to the data expressions. The proposal allows an enclosed expression to be prefixed by a format specifier:
#&[The response was &~,2f(* 100.0 (/ responses total))%.]
@{The response was @~,2f(* 100.0 (/ responses total))%.}
This is an optional extension for implementations that support format. Implementations that support printf-style formatting can also optionally support those specifiers:
#&[The response was &%.2f(* 100.0 (/ responses total))%.]
@{The response was @%.2f(* 100.0 (/ responses total))%.}
(The JavaFX Script language provided similar functionality.)
See SRFI-29.
((motivation))
An internationalized string is marked with the prefix ^ followed by a key:
#&^hello[Hello!]
@^hello{Hello!}
Here the key is the string hello. At runtime this key is combined with the current language to produce a translated string. If no translation is found, the string in the literal, Hello!, is used.
(Formatting makes this more complicated. Note that a translation might want to use different formats, and may also want to re-order the arguments, so positional formats must be supported and be part of the translation.)
If there is no explicit key, the string is used as the key. In the following, "Hello!" is used as the key.
#&^[Hello!]
@^{Hello!}
expression ::= ... | extended-string-literal
extended-string-literal ::= #&[ string-literal-part... ]
string-literal-part ::= any character except &, [ or ]
    | [ string-literal-part... ]
    | char-or-entity-ref
    | special-escape
    | enclosed-part
special-escape ::= ignored-whitespace &|
    | TBD (at least line pasting and comments)
char-or-entity-ref ::= & char-or-entity-name ;
    | &# digits ;
    | &#x hex-digits ;
opt-format-specifier ::= empty
    | ~ format-specifier-after-tilde
    | % format-specifier-after-percent
enclosed-part ::= & opt-format-specifier { expression ... }
    | & opt-format-specifier ( expression... )
expression ::= ... | extended-string-literal
extended-string-literal ::= @{ string-literal-part... }
string-literal-part ::= any character except @, { or }
    | { string-literal-part... }
    | char-or-entity-ref
    | special-escape
    | enclosed-part
special-escape ::= ignored-whitespace @|
    | TBD (at least line pasting and comments)
char-or-entity-ref ::= @ char-or-entity-name ;
    | @# digits ;
    | @#x hex-digits ;
opt-format-specifier ::= empty
    | ~ format-specifier-after-tilde
    | % format-specifier-after-percent
enclosed-part ::= @ opt-format-specifier [ expression ... ]
    | @ opt-format-specifier ( expression... )
Tr[#&[ content-piece... ]] ⟾ ($quasi-string$ TrContent[content-piece]...)
TrChar[any character except &, or <] ⟾ any character except &, or <
TrChar[&#x hex-digit hex-digit... ;] ⟾ \x hex-digit hex-digit... ;
TrChar[&# digit digit... ;] ⟾ \x corresponding hex-digits ;
TrContent[simple-char...] ⟾ " TrChar[simple-char]... "
TrContent[& cname ;] ⟾ ($entity-reference$ cname)
TrContent[&~ format ( expression )] ⟾ ($format$ "format" (expression))
TrContent[&~ format { expression... }] ⟾ ($format$ "format" expression...)
($quasi-string$ form ...) evaluates approximately to an immutable string created by concatenating each form. Thus a basic implementation would be to expand to:
(string-append (format #f "~a" form) ...)
This assumes we translate:
($format$ format-specifier form ...)
to the obvious call to format:
(format #f format-specifier form ...)
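The basic expansion can be imitated in Python to show what the pieces evaluate to. This is only an analogy under stated assumptions: quasi_string and format_piece are hypothetical names, str() stands in for (format #f "~a" form), and a printf-style specifier is used because Python has no tilde-style format.

```python
def quasi_string(*forms):
    """Concatenate the display form of each piece, like the basic expansion."""
    return "".join(str(form) for form in forms)


def format_piece(spec, value):
    """Stand-in for ($format$ spec value) with a printf-style specifier."""
    return spec % value


responses, total = 1, 4
message = quasi_string(
    "The response was ",
    format_piece("%.2f", 100.0 * responses / total),
    "%.",
)
print(message)  # prints: The response was 25.00%.
```

Each formatted piece is rendered on its own and then concatenated, which is why the specifier can sit next to its expression instead of being matched by position.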
However, this doesn't work as desired if the format-specifier contains specifiers that change the argument order. This would be rare as entered by a programmer, but it can happen for localized (translated) text. Consider:
#&^['&{partition}' has only &{avail} bytes free.]
A translation might want to re-order the arguments, as if it were:
#&^[Only &{avail} bytes free on '&{partition}'.]
That could be done if the translation database provides for a format that re-orders the arguments, perhaps using the tilde-asterisk format specifier forms. For example (to pick some hypothetical translation database syntax):
"'&{}' has only &{} bytes free." => "Only &~1@*~d{} bytes free on '&~0@*~s{}'."
Supporting localization and complex formats requires a more complex implementation:
1. [...] (gettext-style). Look for a translation in the translation database. If one is found, use that as the translated text-part; otherwise use text-part as-is.
2. [...] ~ characters. Remove {} pairs if they're preceded by a format specifier, or replace the pairs with ~a if there is no format specifier.
3. Call format with the resulting format string and the enclosed expressions as format arguments.
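The lookup-then-format idea, including argument re-ordering by a translation, can be sketched in Python. This is an illustration only. The TRANSLATIONS table and the localize function are hypothetical, Python's str.format with explicit positional fields ({0}, {1}) stands in for the tilde-asterisk specifiers, and the escaping of special characters in literal text is not modelled.

```python
# Hypothetical translation database: source text-part -> translated text-part.
TRANSLATIONS = {
    "'{}' has only {} bytes free.": "Only {1} bytes free on '{0}'.",
}


def localize(text_part, *args, translations=TRANSLATIONS):
    """Look up a translation for text_part, then format it with the arguments.

    Falls back to text_part itself when no translation is found.
    """
    template = translations.get(text_part, text_part)
    return template.format(*args)


# The translation re-orders the two enclosed expressions.
print(localize("'{}' has only {} bytes free.", "/dev/sda1", 1024))
```

Passing an empty table shows the fallback: the original text-part is used as the format string unchanged.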
This procedure can be optimized at compile time if there is no localization and there are no format specifiers. Unfortunately, if translation is required, then we basically have to convert the $quasi-string$ back to a text-part string, translate it, and then parse the translated string. Only the first part can be done at compile-time. This unparsing and reparsing is hard to avoid as long as the translation mappings are in text form, and anything else seems difficult to work with.
Since this specification changes the reader format, and there is no standard Scheme way to do that, there is no portable implementation. However, this specification is being implemented in Kawa.
Copyright (C) Per Bothner 2012
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.