This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings
> internally using UTF-8. This is convenient in some contexts as Scheme
> strings can be trivially converted to UTF-8 C strings.

You can create surrogate values in UTF-8; the result is just ill-formed.
A conformant (Unicode) implementation shouldn't generate these, though
one could argue that if you get garbage in, you get garbage out.

Scenario 1: You have a text stream encoded in UTF-16. It contains a
valid surrogate pair, <D840,DD9B>. This is converted to the USV
#x0002019B. If you represent Unicode strings internally as UTF-8, this
gets converted to the byte sequence #xF0 #xA0 #x86 #x9B. When writing
the text stream you pick the encoding and the USV gets written out
appropriately.

Scenario 2: You have a text stream encoded in UTF-16. It contains a
lone surrogate, <D840>. This is an invalid string. You have several
options:

2a: Reject the input as invalid.

2b: Replace the surrogate value with the replacement character U+FFFD
    (#xEF #xBF #xBD in the UTF-8 representation).

2c: Keep the character, encoding it internally in UTF-8 as
    #xED #xA1 #x80. On output this gets converted back.

2d: Ignore the value completely, not preserving it on input.

Of these, 2c is non-conforming and not recommended, but it avoids data
loss in cases where that is important.

Representing strings internally in UTF-8 is a loss, though, since you
lose random access to the string: code points occupy one to four bytes,
so finding the nth character requires a linear scan from the start. For
some applications this isn't a big deal, but in general using UTF-8 as
an internal representation is a bad idea.

-tree

-- 
Tom Emerson
Software Architect, Basis Technology Corp.
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
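[Editor's note: the two scenarios above can be checked mechanically. The
following is an illustrative sketch, not part of the original message; it
uses Python's codecs to reproduce the byte sequences discussed.]

```python
# Scenario 1: a valid surrogate pair in UTF-16 decodes to one USV,
# which re-encodes in UTF-8 as four bytes.
utf16 = bytes.fromhex("D840DD9B")           # the pair <D840,DD9B>
text = utf16.decode("utf-16-be")            # valid pair -> USV #x0002019B
assert ord(text) == 0x2019B
assert text.encode("utf-8") == bytes.fromhex("F0A0869B")

# Scenario 2: a lone surrogate is ill-formed UTF-16.
lone = bytes.fromhex("D840")

# Option 2a: reject the input as invalid (strict decoding raises).
try:
    lone.decode("utf-16-be")
    raise AssertionError("should have been rejected")
except UnicodeDecodeError:
    pass

# Option 2b: substitute the replacement character U+FFFD,
# which is #xEF #xBF #xBD in UTF-8.
replaced = lone.decode("utf-16-be", errors="replace")
assert replaced == "\ufffd"
assert replaced.encode("utf-8") == bytes.fromhex("EFBFBD")

# Option 2c: keep the surrogate and encode it as if it were a scalar
# value.  Python's strict "utf-8" codec refuses to do this, so we build
# the (ill-formed) three-byte sequence by hand: ED A1 80.
cp = 0xD840
ill_formed = bytes([0xE0 | (cp >> 12),
                    0x80 | ((cp >> 6) & 0x3F),
                    0x80 | (cp & 0x3F)])
assert ill_formed == bytes.fromhex("EDA180")
```

Note that option 2a corresponds to Python's default strict error handling
and 2b to its "replace" handler; 2c has no standard-codec equivalent
precisely because it produces ill-formed UTF-8.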
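[Editor's note: the loss of random access mentioned above can be made
concrete. This sketch is illustrative and not from the original message;
the helper name utf8_ref is invented for the example. It assumes
well-formed UTF-8 input.]

```python
def utf8_ref(data: bytes, n: int) -> int:
    """Return the byte offset of the nth code point in well-formed
    UTF-8 `data` -- an O(n) scan, unlike O(1) array indexing."""
    offset = 0
    for _ in range(n):
        b = data[offset]
        if b < 0x80:
            offset += 1          # 1-byte sequence (ASCII)
        elif b < 0xE0:
            offset += 2          # 2-byte sequence
        elif b < 0xF0:
            offset += 3          # 3-byte sequence
        else:
            offset += 4          # 4-byte sequence
    return offset

# "a" (1 byte) + U+00E9 (2 bytes) + U+2019B (4 bytes) = 7 bytes total.
s = "a\u00e9\U0002019b".encode("utf-8")
assert len(s) == 7
assert utf8_ref(s, 1) == 1       # second code point starts at byte 1
assert utf8_ref(s, 2) == 3       # third code point starts at byte 3
```

With a fixed-width internal representation (e.g. an array of USVs), the
same lookup is a constant-time index; that is the trade-off being
described.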