This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
Per Bothner scripsit:

> A little knowledge is a dangerous thing ...

	A little Learning is a dang'rous Thing;
	Drink deep, or taste not the Pierian Spring:
	There shallow Draughts intoxicate the Brain,
	And drinking largely sobers us again.
		--Pope, "Essay on Criticism"

> You're contradicting yourself: I asked about a use-case for *character*
> as a separate *data type*.

The advantage of mapping the "character" datatype to Unicode default
grapheme clusters is that it insulates the programmer from the issues
around Unicode normalization. The disadvantage is that there are a
countable infinity of possible DGCs.

Not all languages have a distinct character datatype, however, and this
has real advantages in a Unicode world: you do not have to think about
just how strings are represented, any more than you have to think about
how bignums are.

> We know that. However, there is still no need for "character" [in the
> Unicode sense] as a separate data type:

As I noted in my previous posting, "characters in the Unicode sense"
is not a well-defined notion.

> Code that works on compound characters *as a unit* can and should use a
> string type. Code that needs to look *inside* a compound character,
> needs to works with codepoints.
>
> In Java, "character" is actually a Unicode code-point. This is how it
> should be in Scheme, though we might want to replace the 16-bit size
> by a 20-bit size to avoid the complexities of surrogate characters.

Java uses 16-bit code units (not code points), not because the architects
didn't foresee the eventual use of the Astral Planes, but because the
benefits of uniform width were deemed by them to outweigh the necessity
of dealing with surrogate characters by hand. Java now has some standard
library routines that hide surrogate characters.

However, there are ways to keep uniform-width strings without sacrificing
the codepoint view, provided you are willing to give up on string
mutability (which Java does not have).
One well-known approach is to store 8-bit code units for strings that
contain no codepoint above U+00FF, 16-bit code units for strings that
contain no codepoint above U+FFFF, and 32-bit code units for all other
strings.

-- 
Si hoc legere scis, nimium eruditionis habes.
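
The width-selection approach described in the message above can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from the thread or from any Scheme or Java implementation: the names `code_unit_width` and `FixedWidthString` are hypothetical, and the little-endian byte layout is an arbitrary choice. CPython has used a similar flexible representation for its own `str` type since version 3.3 (PEP 393).

```python
def code_unit_width(text: str) -> int:
    """Return the code unit width in bytes (1, 2, or 4) needed for text."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1   # every codepoint fits in an 8-bit code unit
    if highest <= 0xFFFF:
        return 2   # every codepoint fits in a 16-bit code unit
    return 4       # at least one codepoint lies above U+FFFF


class FixedWidthString:
    """Immutable string stored with one uniform-width code unit per codepoint.

    Hypothetical class for illustration only. Immutability matters here:
    replacing a codepoint with a wider one would force the whole buffer
    to be re-encoded at a larger width.
    """

    def __init__(self, text: str) -> None:
        self.width = code_unit_width(text)
        # Byte order is an assumption of this sketch (little-endian).
        self._data = b"".join(
            ord(ch).to_bytes(self.width, "little") for ch in text
        )

    def __len__(self) -> int:
        # Length in codepoints, computed without scanning the buffer.
        return len(self._data) // self.width

    def codepoint_at(self, index: int) -> int:
        # Constant-time indexing: no surrogate pairs to step over.
        if not 0 <= index < len(self):
            raise IndexError("codepoint index out of range")
        start = index * self.width
        return int.from_bytes(self._data[start:start + self.width], "little")
```

With this layout, `FixedWidthString("abc")` occupies one byte per codepoint, while a string containing a single codepoint above U+FFFF occupies four bytes per codepoint throughout. Indexing by codepoint stays constant-time in both cases, which is the trade the message describes.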