
Re: SRFI withdrawn; comments on the possible future

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.




On Thu, 25 May 2006, Matthew Flatt wrote:

>Straightforward additions
>-------------------------
>
>    * `char-general-category', which accepts a character and returns one
>       of 'lu, 'll, ...

Requires big tables.  Library rather than core, I hope.

>    * `string-normalize-nfd', `string-normalize-nfkd',
>      `string-normalize-nfc', and `string-normalize-nfkc', which each
>      accept a string and produce its normalization according to normal
>      form D, KD, C, or KC, respectively.

Buh.  I'd really rather keep the normalization form as
part of the port abstraction.  A string, internally, is
characters, full stop.  You produce a normalization form
by writing it on a port that is defined in terms of that
normalization form.  Or you produce whatever the internal
representation is by reading it from a port defined in
terms of that normalization form.

These routines might be useful for converting strings to
particular kinds of binary blobs or vice versa, but IMO
they're not necessary.
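
To make the port-based view concrete, here is a minimal sketch in Python, chosen only because its standard `unicodedata` module ships the normalization tables. `NormalizingPort` and `write_normalized` are hypothetical names, not anything proposed in the SRFI. The port wraps a text stream and applies one normalization form to everything written through it. It normalizes each write independently, so it assumes a combining sequence is never split across two writes.

```python
import io
import unicodedata


class NormalizingPort:
    """Hypothetical output port that normalizes text as it is written."""

    def __init__(self, stream, form="NFC"):
        if form not in ("NFC", "NFD", "NFKC", "NFKD"):
            raise ValueError(f"unknown normalization form: {form}")
        self.stream = stream
        self.form = form

    def write(self, text):
        # Assumption: each write ends on a normalization boundary, i.e. a
        # combining sequence is never split across two calls.
        return self.stream.write(unicodedata.normalize(self.form, text))


def write_normalized(text, form):
    """Write text through a port of the given form and return the result."""
    out = io.StringIO()
    NormalizingPort(out, form).write(text)
    return out.getvalue()
```

Under this arrangement the string itself carries no normalization form; the form is a property of the port, and the standalone `string-normalize-*' procedures reduce to "write to a string port of that form".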

>The #\newline character
>-----------------------
>
>It is likely that #\newline will be removed from Scheme leaving only
>#\linefeed. Since R6RS will pin down characters to Unicode scalar
>values, the right name for the character is #\linefeed.

This seems sudden.  I'd rather see #\newline deprecated for one
report version before removal, or ...

>Another view is that #\newline can serve as
>an abstraction of the end-of-line character sequence which is returned
>by read-char when the end-of-line character sequence is read (be it
>#\linefeed, or #\return, or #\return followed by #\linefeed).
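
The "abstraction" reading of #\newline quoted above can be sketched as a character reader that folds each end-of-line sequence into one newline. This is an illustration in Python, not proposed R6RS behaviour; `read_chars` is a hypothetical name, and it works on an in-memory string rather than a real port.

```python
def read_chars(text):
    """Yield characters, folding LF, CR, and CR LF into a single newline."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\r":
            # CR followed by LF is one end-of-line; a lone CR is also one.
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            yield "\n"
        else:
            yield ch
        i += 1
```

With such a reader the program never sees which convention the file used, which is exactly what is lost if #\newline is dropped in favour of #\linefeed alone.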

>To tighten up the set of characters allowed in a symbol, those with
>Unicode general category Ps, Pe, Pi, Pf, Zs, Zp, Zl, Cc, or Cf will be
>disallowed in a symbol's external unquoted representation. That is,
>paired punctuation, whitespace, controls, and format characters will be
>disallowed.

Reasonable, I think.  Paired punctuation in particular creates a
class of lexemes that can be used for external representation of
user-defined or implementation-defined types.
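
As a rough illustration of the category rule, the check below uses the general-category table in Python's standard `unicodedata` module. `allowed_in_symbol` is a hypothetical helper; it tests only the nine categories listed in the quoted text and ignores every other lexical rule for symbols.

```python
import unicodedata

# General categories the quoted proposal disallows in unquoted symbols.
DISALLOWED = frozenset({"Ps", "Pe", "Pi", "Pf", "Zs", "Zp", "Zl", "Cc", "Cf"})


def allowed_in_symbol(ch):
    """True if ch falls outside the disallowed general categories."""
    return unicodedata.category(ch) not in DISALLOWED
```

For example, a lambda (category Ll) and a hyphen (Pd) pass, while an opening parenthesis (Ps), a left guillemet (Pi), and a space (Zs) do not.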

>Meanwhile, the symbol escapes are similar yet not identical to the
>escapes in strings and characters, so there is a potential for mistakes
>if the programmer is not careful. For example one might expect a\nb to
>be a valid symbol, but it is an error. Also, #\x03BB; without the
>leading hash may surprise a programmer by reading as a symbol, rather
>than producing a lexical error. Finally, syntax-highlighting and cursor
>motion commands (such as M-C-b in emacs) may be difficult to arrange in
>some editors, due to the semicolon escape terminator.

I think the semicolon is a bad choice for an escape terminator
given its use as a comment initiator.  It makes programming
syntax-highlighting modes very context-dependent and hairy,
and lexical handling of programs similarly context-dependent
and hairy.  I would suggest the colon instead.

Whatever format is chosen for escaping Unicode characters into
source, I hope that it is possible to run ASCII source through
a "dumb" formatter that just finds instances of valid Unicode
escape sequences and replaces them with Unicode codepoints when
outputting source.  By "dumb" I mean that it shouldn't have to
keep track of context.  The same escape sequence, if possible,
should mean exactly the same thing whether it appears in a string,
a symbol, a comment, or a character constant.  This also gives
an advantage in parsing, because Unicode source can be run through
a dumb formatter to become ASCII source, even if the programmer
never sees the ASCII source - which gives you the ability to use
fast lexers with small tables and fast algorithms.

So, assuming we pick something like \XXX: as a Unicode escape,
where XXX is a variable-length hexadecimal string and colon is a
terminator, this becomes easy.  The ASCII representation of Unicode
character syntax is a bit funny-looking (#\\XXX:), but the \XXX:
sequence escapes into strings, comments, and symbols in exactly
the same way, you don't have to lex first in order to know which
bits are Unicode escapes, and you can easily convert ASCII to
Unicode source and back again with a dumb context-insensitive
formatting program that does nowhere near as much work as lexing.
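
A minimal sketch of such a dumb formatter, in Python, assuming the \XXX: shape described above (backslash, one or more hex digits, colon). The function names are made up for illustration, and the escape syntax itself is only the suggestion in this message, not anything standardized. Neither direction looks at whether it is inside a string, symbol, comment, or character constant.

```python
import re

# Assumed escape shape: backslash, one or more hex digits, then a colon.
ESCAPE = re.compile(r"\\([0-9A-Fa-f]+):")


def ascii_to_unicode(source):
    """Replace every \\XXX: escape with the character it names."""
    def decode(match):
        code = int(match.group(1), 16)
        # Leave anything that is not a Unicode scalar value untouched.
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    return ESCAPE.sub(decode, source)


def unicode_to_ascii(source):
    """Replace every non-ASCII character with a \\XXX: escape."""
    return "".join(
        ch if ord(ch) < 128 else "\\%X:" % ord(ch) for ch in source
    )
```

One caveat the sketch does not solve: ASCII text that already happens to contain something shaped like \41: is decoded too, so a real design would also need a way to write a literal backslash.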

					Bear