
SRFI withdrawn; comments on the possible future



Unless it has already happened by the time you read this message,
SRFI-75 will be withdrawn. We would like to thank everyone once again
for the feedback, and we note that discussion can continue on the list
even after the SRFI is withdrawn.

Your feedback pointed out several places where the R6RS editors needed
to work harder, so R6RS will certainly not match the SRFI exactly. Due
to organizational changes within the editors group, we can't update the
SRFI under the banner of the R6RS process. We can, however, comment on
the things that will likely change, and we can offer our (combined)
opinion on likely changes. Those comments appear below.

If you have any opinion on these changes please speak up ASAP so that
the R6RS editors can take your comments into account.

Matthew and Marc

----------------------------------------

Straightforward additions
-------------------------

    * `char-general-category', which accepts a character and returns
       its Unicode general category as one of 'lu, 'll, ...

    * `string-normalize-nfd', `string-normalize-nfkd',
      `string-normalize-nfc', and `string-normalize-nfkc', which each
      accept a string and produce its normalization according to Unicode
      normalization form D, KD, C, or KC, respectively.
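
As a rough illustration of what these procedures compute (not part of
the proposal itself), the same operations can be sketched in Python
with the standard unicodedata module. The function names below merely
echo the proposed Scheme names, and the lowercase category spelling is
an assumption made to resemble the symbols listed above.

```python
import unicodedata

def char_general_category(ch):
    """Return the general category of one character as a lowercase name.

    Illustrative only: the name echoes the proposed Scheme procedure,
    and the result is a Python string rather than a Scheme symbol.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return unicodedata.category(ch).lower()

def string_normalize(s, form):
    """Normalize s; form is one of "NFD", "NFKD", "NFC", "NFKC"."""
    return unicodedata.normalize(form, s)

print(char_general_category("A"))                      # lu
print(string_normalize("e\u0301", "NFC") == "\u00e9")  # True
```

Forms D and KD decompose characters, while C and KC recompose them
after decomposition; the K forms additionally replace compatibility
characters such as ligatures.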

The #\newline character
-----------------------

It is likely that #\newline will be removed from Scheme leaving only
#\linefeed. Since R6RS will pin down characters to Unicode scalar
values, the right name for the character is #\linefeed.

This change will break compatibility with some R5RS code that might
have worked otherwise. One view is that removing #\newline will be
healthier in the long run. Another view is that #\newline can serve as
an abstraction of the end-of-line character sequence: it is what
read-char returns when an end-of-line sequence is read (be it
#\linefeed, or #\return, or #\return followed by #\linefeed). So even
though #\newline and #\linefeed are the same character, Scheme
programs might use #\newline to highlight that the character is being
used to denote the end-of-line sequence. The name #\newline would also
reinforce the link with the escape sequence "\n" in strings.
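
The end-of-line abstraction described in the second view can be
sketched as follows. This is a minimal Python illustration, assuming
that the three sequences named above are all folded into a single
linefeed; the function name is hypothetical and does not come from the
SRFI.

```python
def fold_line_endings(text):
    """Fold linefeed, return, and return+linefeed into one linefeed each."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\r":
            out.append("\n")
            # A linefeed directly after a return belongs to the same
            # end-of-line sequence, so skip it.
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)

print(fold_line_endings("a\r\nb\rc\nd") == "a\nb\nc\nd")  # True
```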

Escape sequences
----------------

The \x, \u, and \U variations for hexadecimal escapes are compatible
(to various degrees) with other languages, including C and Java, but
the feedback on this list favored a single escape form with an
explicit terminator.

For characters, since R6RS requires each character literal to be
followed by a delimiter, we could just allow any number of hexadecimal
characters after #\x (perhaps limiting the number of characters to six):

    #\xX...X

The use of a terminator within a string is probably clearer, although
it often requires more typing.  One possible terminator is the semicolon:

   with semicolon terminator           without terminator

   "A\x42;C" = "ABC"                   "A\x42\x43" = "ABC"
   "\x41;\x42;\x43;" = "ABC"           "\x41\x42\x43" = "ABC"
   "\x03BB;x.x" = "λx.x"               "\x03BBx.x" = "λx.x"

An alternate possibility is to use some form of brackets around the
hexadecimal digits. However, since parentheses and brackets tend to be
delimiters, this choice interacts somewhat badly with the character
syntax:

    #\x(03BB) = #\x followed by a list, or #\λ ?

Using less-than and greater-than characters, which are not actual
brackets, avoids this problem:

    #\x<03BB> = #\λ

However, these escapes become somewhat more difficult to read when
several appear in a string:

   "\x<41>\x<42>\x<43>" = "ABC"

Also, this is arguably an abuse of less-than and greater-than (as
opposed to angle-bracket characters).

In either case, the trade-off is that Scheme strings are unlikely to be
compatible with any other language's string syntax. A consequence is
an additional burden on the programmer, who must learn yet another
string and character syntax.
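
To make the semicolon-terminated form concrete, here is a minimal
Python sketch of how a reader might decode such escapes in a string
body. It is an illustration under stated assumptions, not the R6RS
reader: the function name is hypothetical, only \x...; escapes are
handled, and other escapes (including an escaped backslash) are
deliberately ignored.

```python
import re

# One or more hexadecimal digits between "\x" and the ";" terminator.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]+);")

def decode_hex_escapes(body):
    """Replace each \\xH...H; escape in body with the character it names."""
    def replace(match):
        value = int(match.group(1), 16)
        # Only Unicode scalar values are characters: reject surrogates
        # and anything beyond U+10FFFF.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %X" % value)
        return chr(value)
    return _HEX_ESCAPE.sub(replace, body)

print(decode_hex_escapes(r"A\x42;C"))  # ABC
```

Because the terminator ends the escape explicitly, a hexadecimal digit
such as C can follow the escape directly without being absorbed into it.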

Symbol characters
-----------------

To ensure that every symbol has an external representation while also
enabling a 1-to-1 correspondence between symbols and immutable strings,
the SRFI specified a syntax for symbols based on quoting vertical bars.
At the same time, the SRFI was very liberal in the set of characters
allowed in unquoted symbols.

To tighten up the set of characters allowed in a symbol, those with
Unicode general category Ps, Pe, Pi, Pf, Zs, Zp, Zl, Cc, or Cf will be
disallowed in a symbol's external unquoted representation. That is,
paired punctuation, whitespace, controls, and format characters will be
disallowed.
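
The category rule above can be checked mechanically. The following
Python sketch uses the standard unicodedata module; the names are
hypothetical, and it covers only the general-category restriction, not
any other lexical rule for identifiers.

```python
import unicodedata

# General categories excluded from unquoted symbols, as listed above.
DISALLOWED_CATEGORIES = frozenset(
    ["Ps", "Pe", "Pi", "Pf", "Zs", "Zp", "Zl", "Cc", "Cf"]
)

def allowed_unquoted(ch):
    """True if ch passes the general-category rule for unquoted symbols."""
    return unicodedata.category(ch) not in DISALLOWED_CATEGORIES

def offending_characters(name):
    """Return the characters of name that the category rule excludes."""
    return [ch for ch in name if not allowed_unquoted(ch)]

print(allowed_unquoted("\u03bb"))   # True  (a lowercase letter)
print(allowed_unquoted("("))        # False (open punctuation, Ps)
print(offending_characters("a b"))  # [' ']
```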

Moreover, it is likely that the vertical-bar notation will be dropped.
To achieve the same functionality, hexadecimal escapes will be allowed
in symbols using the same notation as for characters and strings. A
backslash not followed by x is an error (i.e. only hexadecimal escapes
are allowed). In that case,

  with the new hexadecimal escapes     with the old vertical-bar notation

  'a\x20;b = (string->symbol "a b")    '|a b| = (string->symbol "a b")
  'a\x0a;b = (string->symbol "a\nb")   '|a\nb| = (string->symbol "a\nb")
  \x03BB; = λ                          |\u03BB| = λ
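
As an illustration of how the escape-only notation gives every symbol
an external representation, here is a minimal Python sketch of a writer
and a reader for symbol names. It assumes the semicolon-terminated
escape and the general-category rule described above; the function
names are hypothetical, and the writer emits uppercase digits without
padding, which the reader treats the same as the spellings in the table.

```python
import re
import unicodedata

DISALLOWED_CATEGORIES = frozenset(
    ["Ps", "Pe", "Pi", "Pf", "Zs", "Zp", "Zl", "Cc", "Cf"]
)
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]+);")

def write_symbol(name):
    """Spell a symbol name, escaping characters that cannot appear bare."""
    parts = []
    for ch in name:
        # The backslash itself must be escaped, since it starts an escape.
        if ch == "\\" or unicodedata.category(ch) in DISALLOWED_CATEGORIES:
            parts.append("\\x{:X};".format(ord(ch)))
        else:
            parts.append(ch)
    return "".join(parts)

def read_symbol(text):
    """Recover a symbol name; a backslash not starting \\xH...H; is an error."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        match = _ESCAPE.match(text, i)
        if match is None:
            raise ValueError("backslash not followed by a hexadecimal escape")
        out.append(chr(int(match.group(1), 16)))
        i = match.end()
    return "".join(out)

print(write_symbol("a b"))              # a\x20;b
print(read_symbol(r"a\x0a;b") == "a\nb")  # True
```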

On the one hand, the vertical-bar notation supports a symbol syntax
that is analogous to strings, is easy to remember, and has a clear
precedent in existing Lisps and Schemes.

On the other hand, the analogy to strings doesn't entirely hold, in
that the vertical bars are optional for symbols while quotes are not
optional for strings. The alternative of just allowing hexadecimal
escapes is arguably more consistent with Scheme's current symbol
syntax, and it avoids the (arguable) abuse of whitespace within
program identifiers that the vertical-bar notation makes possible.

Meanwhile, the symbol escapes are similar yet not identical to the
escapes in strings and characters, so there is a potential for mistakes
if the programmer is not careful. For example, one might expect a\nb to
be a valid symbol, but it is an error. Also, \x03BB; (that is, #\x03BB;
without the leading hash) may surprise a programmer by reading as a
symbol, rather than producing a lexical error. Finally, syntax
highlighting and cursor motion commands (such as M-C-b in Emacs) may be
difficult to arrange in some editors, because the semicolon used as the
escape terminator otherwise starts a comment.