This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Just US$0.02 worth from the lurking depths.

Surrogates are no more than an elegant hack to extend the original 16-bit codespace to a 32-bit codespace. This talk of blocking out the surrogate blocks in the range of character values is silly, IMHO.

The implementation should be concerned with codepoints, in the range 0x000000 to 0x10FFFF. How these get mapped to bytes or words is an issue for whatever transcoder you have in place to generate a printable form of the abstract character.

Looking at characters this way, any codepoint in the range 0xD800 through 0xDFFF is considered an invalid character. This conforms with section 3.8 of TUS, D26a and D27. These characters only show up when dealing with UTF-16. UCS-4, UTF-32, UTF-8, etc. don't use them.

If you treat the surrogates as undefined within the character range, then you must (for consistency) treat all of the other undefined abstract characters as holes. This just complicates processing.

From the programmer's perspective, I just want to deal with characters as single entities (combining forms aside for the moment). It is up to me to know whether my string has been normalized or not, and to deal with that situation. For most uses it doesn't matter.

Using Unicode as the underlying character rep while using glyph semantics at the program level is, to me, a recipe for complete confusion. Iteration over strings, and random string access, then become difficult: <0054 0073 0068 0075 0308 00DF> would have physical character indices at 0, 1, 2, 3, 5.

One question I've had: how are 8-bit (i.e., byte) strings handled here? Is there no distinction between operations on raw bytes and operations on characters?

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
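
For illustration only, here is a rough Python sketch of two points from the message above: that surrogates appear only in the UTF-16 encoded form, and that glyph-level indexing of <0054 0073 0068 0075 0308 00DF> skips codepoint index 4. The helper names (is_surrogate, is_scalar_value, cluster_starts) are made up for this sketch, and cluster_starts is a deliberate simplification that merely treats any character with a nonzero combining class as attached to the preceding base character. It is not the full grapheme cluster algorithm.

```python
import unicodedata

def is_surrogate(cp):
    """True if the codepoint lies in the surrogate range 0xD800..0xDFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp):
    """True for codepoints in 0..0x10FFFF that are not surrogates."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)

def cluster_starts(s):
    """Codepoint indices at which a glyph-level character begins.

    Simplification (an assumption of this sketch): a character with a
    nonzero canonical combining class belongs to the preceding base.
    """
    return [i for i, ch in enumerate(s) if unicodedata.combining(ch) == 0]

# The sequence from the message: T s h u COMBINING-DIAERESIS sharp-s
s = "\u0054\u0073\u0068\u0075\u0308\u00DF"
starts = cluster_starts(s)

# A codepoint above 0xFFFF needs a surrogate pair only in UTF-16;
# UTF-8 and UTF-32 encode the codepoint directly.
supplementary = "\U00010000"
utf16 = supplementary.encode("utf-16-be")
utf8 = supplementary.encode("utf-8")
utf32 = supplementary.encode("utf-32-be")
```

With these definitions, len(s) counts six codepoints while starts lists only five glyph-level positions, which is the mismatch that makes random access awkward under glyph semantics.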