This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On Thu, 25 May 2006, Matthew Flatt wrote:

> Straightforward additions
> -------------------------
>
>  * `char-general-category', which accepts a character and returns one
>    of 'lu, 'll, ...

Requires big tables. Library rather than core, I hope.

>  * `string-normalize-nfd', `string-normalize-nfkd',
>    `string-normalize-nfc', and `string-normalize-nfkc', which each
>    accept a string and produce its normalization according to normal
>    form D, KD, C, or KC, respectively.

Buh. I'd really rather keep the normalization form as part of the port
abstraction. A string, internally, is characters, full stop. You produce
a normalization form by writing it on a port that is defined in terms of
that normalization form. Or you produce whatever the internal
representation is by reading it from a port defined in terms of that
normalization form.

These routines might be useful for converting strings to particular
kinds of binary blobs or vice versa, but IMO they're not necessary.

> The #\newline character
> -----------------------
>
> It is likely that #\newline will be removed from Scheme, leaving only
> #\linefeed. Since R6RS will pin down characters to Unicode scalar
> values, the right name for the character is #\linefeed.

This seems sudden. I'd rather see #\newline deprecated for one report
version before removal, or ...

> Another view is that #\newline can serve as an abstraction of the
> end-of-line character sequence which is returned by read-char when the
> end-of-line character sequence is read (be it #\linefeed, or #\return,
> or #\return followed by #\linefeed).

> To tighten up the set of characters allowed in a symbol, those with
> Unicode general category Ps, Pe, Pi, Pf, Zs, Zp, Zl, Cc, or Cf will be
> disallowed in a symbol's external unquoted representation. That is,
> paired punctuation, whitespace, controls, and format characters will be
> disallowed.

Reasonable, I think.
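As a concrete illustration of what the proposed procedures would compute, here is a sketch in Python, whose `unicodedata` module exposes the same Unicode general-category and normalization tables. The Scheme names (`char-general-category', `string-normalize-nfd', and friends) are from the proposal; everything else here is just Python standing in for them.

```python
import unicodedata

# char-general-category: two-letter general category from the Unicode
# tables -- Lu for uppercase letter, Ps for open (paired) punctuation.
print(unicodedata.category('A'))   # Lu
print(unicodedata.category('('))   # Ps

# The normalization forms on a precomposed e-acute (U+00E9):
# NFD/NFKD decompose it into "e" plus a combining acute accent (U+0301);
# NFC/NFKC compose it back into the single scalar value U+00E9.
s = '\u00e9'
print([hex(ord(c)) for c in unicodedata.normalize('NFD', s)])  # ['0x65', '0x301']
print([hex(ord(c)) for c in unicodedata.normalize('NFC', s)])  # ['0xe9']
```

Note that these are whole-string transformations over big per-codepoint tables, which is exactly why they feel more like library routines (or port semantics) than core string operations.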
Paired punctuation in particular creates a class of lexemes that can be
used for external representation of user-defined or
implementation-defined types.

> Meanwhile, the symbol escapes are similar yet not identical to the
> escapes in strings and characters, so there is a potential for mistakes
> if the programmer is not careful. For example, one might expect a\nb to
> be a valid symbol, but it is an error. Also, #\x03BB; without the
> leading hash may surprise a programmer by reading as a symbol, rather
> than producing a lexical error. Finally, syntax-highlighting and cursor
> motion commands (such as M-C-b in emacs) may be difficult to arrange in
> some editors, due to the semicolon escape terminator.

I think the semicolon is a bad choice for an escape terminator, given
its use as a comment initiator. It makes programming syntax-highlighting
modes very context-dependent and hairy, and lexical handling of programs
similarly context-dependent and hairy. I would suggest the colon instead.

Whatever format is chosen for escaping Unicode characters into source, I
hope that it is possible to run ASCII source through a "dumb" formatter
that just finds instances of valid Unicode escape sequences and replaces
them with Unicode codepoints when outputting source. By "dumb" I mean
that it shouldn't have to keep track of context: the same escape
sequence, if possible, should mean exactly the same thing whether it
appears in a string, a symbol, a comment, or a character constant.

This also gives an advantage in parsing, because Unicode source can be
run through a dumb formatter to become ASCII source, even if the
programmer never sees the ASCII source -- which gives you the ability to
use fast lexers with small tables and fast algorithms.
So, assuming we pick something like \XXX: as a Unicode escape, where XXX
is a variable-length hexadecimal string and colon is a terminator, this
becomes easy. The ASCII representation of Unicode character syntax is a
bit funny-looking, #\\XXX:, but the \XXX: sequence escapes into strings,
comments, and symbols exactly the same; you don't have to lex first in
order to know what bits are Unicode escapes, and you can easily convert
ASCII source to Unicode source and back again with a dumb,
context-insensitive formatting program that does nowhere near as much
work as lexing.

				Bear
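To make the "dumb" claim concrete, here is a minimal sketch of such a formatter, written in Python for illustration. The \XXX: escape syntax is the proposal from this message, not anything standardized, and the function names are invented for the sketch; the point is that a single context-free pass suffices in each direction.

```python
import re

# One pattern for the whole program text: backslash, hex digits, colon.
# No tracking of whether we are inside a string, symbol, or comment.
ESCAPE = re.compile(r'\\([0-9A-Fa-f]+):')

def ascii_to_unicode(src: str) -> str:
    # Replace every \XXX: escape with the codepoint it names,
    # in a single context-insensitive substitution pass.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), src)

def unicode_to_ascii(src: str) -> str:
    # The inverse pass: escape everything outside printable ASCII
    # (plus whitespace) as \XXX:, again with no context tracking.
    return ''.join(c if ' ' <= c <= '~' or c in '\t\n\r'
                   else '\\%X:' % ord(c)
                   for c in src)

# The same escape means the same thing in a symbol and in a string:
print(ascii_to_unicode(r'(define \3BB: "\3BB:")'))  # (define λ "λ")
```

Because neither direction needs a lexer, the round trip is cheap, and an implementation could lex the ASCII form with a small-table ASCII lexer even when the programmer only ever sees the Unicode form.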