This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
The topic of Unicode support in a language is very difficult. I'm not
sure the array of codepoints idea is the best solution, but I won't
question it in this mail (mainly because I don't have a better
proposal).

Integers and Codepoints
=======================
The SRFI defines the procedures CHAR->INTEGER and INTEGER->CHAR, but
also defines the return value to be a Unicode codepoint. So it would
be better to name them

  char->codepoint
  codepoint->char

instead.

The newline character
=====================
#\NEWLINE always has been a problem, because a new line is a
system-dependent sequence of octets. #\LINEFEED is the correct term.
We also have (newline), which is the right thing to do, so we can
just drop #\NEWLINE.

x, u and U
==========
The SRFI defines x, u and U for two-digit, four-digit and eight-digit
hexadecimal codepoints in character literals and in strings.

First of all, for character literals, this is unnecessary. It would
be much more elegant to have

  (char=? #\xA #\d10 #\o12 #\b1010)

analogous to

  (= #xA #d10 #o12 #b1010)

Introducing characters which mark fixed-width tokens in strings
strikes me as problematic as well. The obvious alternative is using
delimiters, as already proposed on this list. Delimiters improve
readability, and since using explicit codepoints in strings is rare,
the extra length is not a problem.

It is also not clear that hexadecimal encoding is always preferable.
So I would propose a delimiter which allows for different bases.
There are different approaches. The main goal is to make it readable.

  "A\(#x42)C"  - This is _very_ readable, though verbose
  "A\x42;C"
  "A\x42:C"    - This is also very readable
  "A\x42#C"
  "A\#x42#C"   - This provides some consistency

Analogous to the character syntax described above, the following
could be possible:

  (apply char=? (string->list "\xA:\d10:\o12:\b1010:"))
  => #t

Quoted Strings
==============
I'm a bit confused as to why we need all those character shorthands.
The rationale "it's what is provided in other places" doesn't sound
right. Specifically, I have never seen \b or \v in use. I also wonder
why \? and \' got added there - those are equivalent to the characters
without the backslash, and neither the question mark nor the quote are
ever used in a context where they have to be quoted.

Newlines in strings
===================
I like the \<newline><intraline-whitespace> syntax, as it allows for
correctly-indented strings. I would dislike being prevented from using
newlines in a string, though, and I don't see a reason to do so.

Here Strings
============
The introduction of here strings poses a few problems. Allowing for
any character in the delimiter does not seem useful (except for the
Obfuscated Scheme Code Contest, of course :-)), so I would think it to
be the correct choice to limit the number of allowed characters. For
consistency, a symbol could be used there.

After the delimiter, only whitespace may follow until the newline.
This allows for here-strings to be normal tokens (instead of being
possibly split up over several tokens), and doesn't lend itself to
hiding errors as easily.

Case-Insensitivity
==================
Unicode defines case folding for case-insensitive comparisons. This
works by mapping characters to specific case-folded characters - not
necessarily upper-case or lower-case, but a special case-folded
version. This allows, for example, the Greek sigma - which has two
forms in the lower-case variant - to match correctly, as well as the
German eszett to match with the double-s in uppercase form.

The procedures that deal with case insensitivity in this SRFI - i.e.
*-CI* - should use case folding, not downcasing.

Normalization
=============
This SRFI lacks a notion of normalization, which is important for any
kind of string comparison. I don't see an easy way to integrate this
besides providing STRING-NORMALIZE-NF{C,D,KC,KD}, though.
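
To make the case folding and normalization points above concrete, here
is a minimal sketch. It assumes an R6RS implementation: the procedures
STRING-FOLDCASE and STRING-NORMALIZE-NFC and the "\x...;" string escape
are taken from R6RS, not from this SRFI, so treat them as stand-ins for
whatever the SRFI might end up providing.

```scheme
#!r6rs
;; Sketch only: assumes an R6RS system. The procedures used here come
;; from (rnrs unicode), not from SRFI 75.
(import (rnrs))

(define eszett-word  "Stra\xDF;e")  ; "Strasse" written with eszett (U+00DF)
(define final-sigma  "\x3C2;")      ; lower-case final sigma (U+03C2)
(define medial-sigma "\x3C3;")      ; lower-case sigma (U+03C3)

;; Downcasing keeps the eszett, so this comparison fails.
(display (string=? (string-downcase eszett-word)
                   (string-downcase "STRASSE")))       ; #f
(newline)

;; Case folding expands the eszett to "ss", so this one succeeds.
(display (string=? (string-foldcase eszett-word)
                   (string-foldcase "STRASSE")))       ; #t
(newline)

;; Both lower-case sigma forms fold to the same character.
(display (string=? (string-foldcase final-sigma)
                   (string-foldcase medial-sigma)))    ; #t
(newline)

;; Normalization: precomposed e-acute vs. "e" plus combining acute.
(define precomposed "\xE9;")
(define decomposed  "e\x301;")

(display (string=? precomposed decomposed))            ; #f
(newline)
(display (string=? (string-normalize-nfc precomposed)
                   (string-normalize-nfc decomposed))) ; #t
(newline)
```

The first two comparisons show why *-CI* procedures built on
downcasing and on case folding can disagree, and the last two show
that strings which look identical compare unequal until both are
brought to the same normalization form.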

It's What Others Do
===================
The discussions in this thread seem to reiterate one argument from
time to time which I find problematic. The argument is "This Is What
Others Do", or even "This Is What People Coming From Other Languages
Expect". Since when was that a good argument against a sensible
solution in Scheme?

I think it would be useful to stop pondering about how to copy
questionable design decisions from other languages, and try to find
good solutions - it wouldn't be the first time Scheme does something
no one else did, because it is the right thing to do.

Greetings,
        -- Jorgen

--
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg . "1024D/028AF63C") (irc . "nick forcer on IRCnet"))