This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
John.Cowan writes:
> All other undefined codepoints are potentially definable: they correspond
> to Unicode scalar values. Surrogate codepoints are not definable and
> don't correspond to any Unicode scalar value. The difference is
> architectural.

FFFE is never (by architectural design) going to be defined either.

Surrogate codepoints have a character property. They should be usable
in a string, and individually can be considered a character. Most
implementations won't see them: only library code that is
reading/writing UTF-16 needs to worry about them in any significant
way. Application code should not see them. They will see U+20069 as
having the value 0x20069, not 0xD840DC69.

In other words, I guess I'm saying that surrogates don't need to be
special cased, because the existing Unicode property model accounts
for them, and the generation/interpretation of them should be handled
at a lower level. Special casing them just complicates everything for
everyone.

> > One question I've had: how are 8-bit (i.e., byte) strings handled
> > here? Is there no distinction between operations on raw bytes and
> > operations on characters?
>
> Those things are not strings: they are vectors of unsigned 8-bit integers.

Of course. My Python hat is still on where 8-bit strings and Unicode
strings are different beasts, and 8-bit strings are used for
byte-strings.

--
Tom Emerson                                   Basis Technology Corp.
Software Architect                          http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
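
The U+20069 example in the message can be checked with a short sketch. This is an illustration only, written in Python 3 rather than Scheme, and the helper names `to_surrogates` and `from_surrogates` are hypothetical, not part of SRFI 75 or any library. It shows the split the message describes: code that reads or writes UTF-16 deals with the pair D840 DC69, while application code sees the single scalar value 0x20069.

```python
def to_surrogates(scalar):
    """Split a supplementary-plane scalar value into a UTF-16 surrogate pair.

    Hypothetical helper for illustration; the arithmetic follows the
    UTF-16 encoding form.
    """
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a high/low surrogate pair back into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Library-level view: the UTF-16 code units for U+20069.
high, low = to_surrogates(0x20069)
print(f"{high:04X} {low:04X}")

# Application-level view: one character whose value is 0x20069.
ch = chr(0x20069)
print(len(ch), hex(ord(ch)))

# The surrogates only appear once the string is serialized as UTF-16.
print(ch.encode("utf-16-be").hex())
```

Under these assumptions the surrogate arithmetic lives entirely in the two helpers (or in the codec behind `encode`), which is the lower level the message argues it belongs to.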