This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
John Cowan wrote:
Per Bothner scripsit:
A little knowledge is a dangerous thing ...
A little Learning is a dang'rous Thing; --Pope, "Essay on Criticism"
We know that. However, there is still no need for "character" [in the Unicode sense] as a separate data type:
As I noted in my previous posting, "characters in the Unicode sense" is not a well-defined notion.
Yes - and that's why I'm arguing against trying to model anything except codepoints in Scheme,
Java uses 16-bit code units (not code points), not because the architects didn't foresee the eventual use of the Astral Planes, but because the benefits of uniform width were deemed by them to outweigh the necessity of dealing with surrogate characters by hand. Java now has some standard library routines that hide surrogate characters.
Unfortunately, the end result is somewhat complex, especially since 99% of the time programmers can and will get away with ignoring non-basic-plane characters.
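To make the code-unit versus code-point distinction concrete, here is a minimal sketch in Python (not Java, and not part of the original exchange). The helper names `utf16_code_units` and `is_surrogate` are hypothetical, chosen for illustration. It shows that one codepoint outside the Basic Multilingual Plane occupies two 16-bit code units, which is the case a programmer can ignore most of the time and then get wrong.

```python
def utf16_code_units(s: str) -> list[int]:
    """Return the 16-bit code units of s as encoded in UTF-16."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
text = "a\U0001D11E"
units = utf16_code_units(text)

print(len(text))                      # count of codepoints
print(len(units))                     # count of 16-bit code units
print([hex(u) for u in units])        # the second and third units are surrogates
```

Indexing by code unit is constant-time but can land in the middle of a surrogate pair; indexing by codepoint avoids that but needs either a scan or a wider representation.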
However, there are ways to keep uniform-width strings without sacrificing the codepoint view, provided you are willing to give up on string mutability (which Java does not have). One well-known approach is to store 8-bit code units for strings that contain no codepoint above U+00FF, 16-bit code units for strings that contain no codepoint above U+FFFF, and 32-bit code units for all other strings.
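The width-selection scheme described above can be sketched as follows. This is an illustration under stated assumptions, not anyone's actual implementation: the functions `code_unit_width`, `pack`, and `code_point_at` are hypothetical names, and the packed string is modelled as an immutable `bytes` object. CPython's own string representation (PEP 393) follows the same 1/2/4-byte idea.

```python
def code_unit_width(s: str) -> int:
    """Bytes per code unit so that every codepoint in s fits in one unit."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def pack(s: str) -> bytes:
    """Store s as fixed-width code units of the narrowest sufficient width."""
    width = code_unit_width(s)
    return b"".join(ord(c).to_bytes(width, "little") for c in s)


def code_point_at(packed: bytes, width: int, index: int) -> int:
    """Constant-time codepoint lookup: one multiplication, one slice."""
    start = index * width
    return int.from_bytes(packed[start:start + width], "little")


s = "a\u03bb"                          # 'a' and GREEK SMALL LETTER LAMDA
width = code_unit_width(s)
packed = pack(s)
print(width, len(packed), hex(code_point_at(packed, width, 1)))
```

The sketch also shows why immutability matters: replacing one codepoint with a wider one would force the whole string to be repacked at a larger width.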
Personally, if I didn't have any compatibility constraints, I would just store everything as UTF-8 strings, and allow indexing by code unit (bytes). How often does non-library code need to deal with characters?
Instead, the data types should be (immutable) "string" and "buffer". The latter allows insertions and deletions in addition to replacement.
How often are strings in the sense of mutable fixed-length character arrays useful to application programmers, except as a low-level "chunk of memory" to implement other data types? Basically never, or as close to never as to render them unsuitable for Scheme.
(Even parsers don't need to deal with characters, if you have regular-expression lexing. I.e. try to match the current input position against a regular expression. On success, return the matched string, and move the position forwards.)
--
--Per Bothner
email@example.com http://per.bothner.com/
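The regular-expression lexing idea, combined with UTF-8 storage indexed by byte, can be sketched like this. It is a rough illustration only: the token grammar, the `TOKEN` pattern, and the functions `next_token` and `tokenize` are all assumptions made up for the example. The position is a byte offset into the UTF-8 buffer, and the lexer never inspects an individual character; it only receives matched substrings.

```python
import re

# Hypothetical token grammar over UTF-8 bytes. Any byte >= 0x80 is treated as
# an identifier constituent, so multi-byte sequences are never split.
TOKEN = re.compile(
    rb"(?P<space>[ \t\r\n]+)"
    rb"|(?P<number>[0-9]+)"
    rb"|(?P<ident>[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*)"
    rb"|(?P<punct>[-+*/()])"
)


def next_token(buf: bytes, pos: int):
    """Match TOKEN at byte offset pos; return (kind, text, new_pos) or None."""
    m = TOKEN.match(buf, pos)
    if m is None:
        return None
    return m.lastgroup, m.group().decode("utf-8"), m.end()


def tokenize(source: str):
    """Repeatedly match at the current position and move it forwards."""
    buf = source.encode("utf-8")
    pos = 0
    tokens = []
    while pos < len(buf):
        result = next_token(buf, pos)
        if result is None:
            raise ValueError(f"no token at byte offset {pos}")
        kind, text, pos = result
        if kind != "space":
            tokens.append((kind, text))
    return tokens


print(tokenize("na\u00efve + 42"))
```

Because the lexer only ever advances to the end of a successful match, byte offsets always fall on token boundaries, and codepoint-level indexing is not needed.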