This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

>  * `string-normalize-nfd', `string-normalize-nfkd',
>    `string-normalize-nfc', and `string-normalize-nfkc', which each
>    accept a string and produce its normalization according to normal
>    form D, KD, C, or KC, respectively.

If the basic concept of the SRFI - a string being a sequence of code
points - does not change, I do think these procedures are useful
(contrary to bear and Alex Shinn). An implementation can still
normalize internally in the "usual case", and if the programmer
enforces a different normalization, that's eir problem.

STRING=? and similar procedures need to define which kind of
normalization they work on (or just "the same normalization for all
arguments"). STRING-DOWNCASE, STRING-APPEND etc. need to define
whether they may normalize their arguments, and if so, which
normalization they return.

If the normalization shouldn't be prescribed, another procedure,
STRING-NORMALIZE (or similar), needs to be added to return the
normalization the implementation prefers.

A higher-level string API can (and should) be built on top of the
strings defined in this SRFI.

> The #\newline character
> -----------------------
>
> It is likely that #\newline will be removed from Scheme leaving only
> #\linefeed. Since R6RS will pin down characters to Unicode scalar
> values, the right name for the character is #\linefeed.

I'm always in favor of breaking stuff to get a clean result.

> Another view is that #\newline can serve as an abstraction of the
> end-of-line character sequence which is returned by read-char
> when the end-of-line character sequence is read (be it
> #\linefeed, or #\return, or #\return followed by #\linefeed).
> So even though #\newline and #\linefeed are the same characters,
> Scheme programs might use #\newline to highlight that the
> character is being used to denote the end-of-line sequence. The
> name #\newline would also reinforce the link with the escape
> sequence "\n" in strings.
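To make the normalization point from the start of this message concrete, here is a small sketch in Python rather than Scheme, using the standard `unicodedata` module. It shows two code point sequences that differ as raw strings but compare equal once both are brought to the same normal form, which is the choice a STRING=? definition would have to settle. The helper name `nf_equal` is made up for this illustration and is not part of the SRFI.

```python
import unicodedata

# Two spellings of "é": precomposed U+00E9, and "e" followed by
# U+0301 COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"


def nf_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare after normalizing both arguments to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# As plain code point sequences the two strings differ ...
raw_equal = composed == decomposed

# ... but they agree under "the same normalization for all arguments".
nfc_equal = nf_equal(composed, decomposed, "NFC")
nfd_equal = nf_equal(composed, decomposed, "NFD")

# The compatibility forms (KD, KC) additionally fold characters such as
# U+FB01 LATIN SMALL LIGATURE FI; the canonical forms leave it alone.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

Whether a comparison procedure should normalize at all, and to which form, is exactly the open question raised above; the sketch only shows that the answer changes the result.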
If #\newline is considered to be some kind of abstraction of the
end-of-line character sequence, please remember that Unicode defines
U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR as canonical new
line code points, to finally get rid of all these distinctions.

> Escape sequences
> ----------------
>    with semi-colon terminator       without terminator
>
>    "A\x42;C" = "ABC"                "A\x42\x43" = "ABC"
>    "\x41;\x42;\x43;" = "ABC"        "\x41\x42\x43" = "ABC"
>    "\x03BB;x.x" = "λx.x"            "\x03BBx.x" = "λx.x"

I agree with bear that the semicolon is a bad choice - why not use the
colon?

  "A\x42:C"
  "\x41:\x42:\x43:"
  "\x03BB:x.x"

> Using less-than and greater-than characters, which are not actual
> brackets, avoids this problem:
>
>    #\x<03BB> = #\λ

Braces have been offered as an alternative:

  #\x{03BB}

> However, they become somewhat more difficult to read when multiple
> escapes appear in a string:
>
>    "\x<41>\x<42>\x<43>" = "ABC"

  "\x{41}\x{42}\x{43}"

> In either case, the trade-off is that Scheme strings are unlikely to be
> compatible with any other language's string syntax. A consequence is
> that there is additional burden on the programmer which must learn yet
> another string and character syntax.

I do think it's good that we don't go with bad decisions made by other
languages just because the decision has been made by them.

> Symbol characters
> -----------------
> [...]
> Meanwhile, the symbol escapes are similar yet not identical to the
> escapes in strings and characters, so there is a potential for mistakes
> if the programmer is not careful. For example one might expect a\nb to
> be a valid symbol, but it is an error.

Why not allow the same escapes in symbols and in strings?

All in all I like the changes you propose (modulo the comments above).
Thanks for the good work!

Regards,
        -- Jorgen

--
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))
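As an illustration of the escape-terminator discussion in this message, the following Python sketch decodes `\x` escapes with a configurable terminator (semicolon as quoted, or colon as suggested in the reply). The function names are invented for this example. The "greedy" rule used for the unterminated variant is an assumption of the sketch; the message does not say how an unterminated escape would be delimited.

```python
import re


def decode_terminated(text: str, terminator: str = ";") -> str:
    """Decode escapes of the form backslash, x, hex digits, terminator."""
    pattern = re.compile(r"\\x([0-9A-Fa-f]+)" + re.escape(terminator))
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


def decode_greedy(text: str) -> str:
    """Assumed unterminated rule: take every hex digit that follows the x."""
    pattern = re.compile(r"\\x([0-9A-Fa-f]+)")
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


# Terminated forms from the message, with semicolon and with colon.
semicolon_abc = decode_terminated(r"A\x42;C")
colon_abc = decode_terminated(r"\x41:\x42:\x43:", ":")
lam = decode_terminated(r"\x03BB;x.x")

# Without a terminator, "A" followed by the literal letters "BC" cannot be
# written as \x41BC: under the greedy rule all four hex digits are
# consumed and a single code point U+41BC comes out.
greedy = decode_greedy(r"\x41BC")
terminated = decode_terminated(r"\x41;BC")
```

Under these assumptions the terminator is what lets hex-digit letters follow an escape directly; which character serves as the terminator does not affect the decoding logic.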