> Tom Lord wrote:
>> [*] What exactly is a "Unicode character?" The answer can vary
>> depending on context. In some contexts it might mean a Unicode
>> abstract character -- the kind of value to which a codepoint
>> (integer in the range 0..10ffff) is assigned. In other contexts,
>> it may mean certain kinds of sequences of abstract characters.
>> One goal for SRFI-52 is to remain agnostic about the answer
>> to that question.
Robby Findler wrote:
> I'm still relatively new to unicode, so I apologize if this is a
> foolish question (rtfm ptrs welcome!), but I wonder why you would want
> to remain agnostic on this point. Can you explain why unicode-code
> points would be a bad choice, and what other choices might exist?

Short version: In general, a single character on your screen may
actually be made up of several Unicode code points. For example, the
grapheme[*] é (small E with acute accent) can be encoded as a base
character (small E, U+0065) followed by a combining mark (combining
acute accent, U+0301).
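
To make that concrete, here is a minimal sketch in Python (chosen only
for illustration; it is not part of SRFI-52). It assumes the standard
unicodedata module, and code_points is a hypothetical helper name:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # small e followed by COMBINING ACUTE ACCENT

def code_points(s):
    # Hypothetical helper: list the code points of s in U+XXXX notation.
    return [f"U+{ord(c):04X}" for c in s]

print(code_points(precomposed))   # one code point
print(code_points(decomposed))    # two code points

# The two strings render the same but do not compare equal
# until they are brought to the same normalization form.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```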

Most internal Unicode encodings use code points as the basic "character"
unit. In those systems, the letter é (when stored as base character plus
combining mark) is one symbol on screen but two "character" units in
memory. Other systems combine the code points much earlier, so that é is
only one "character" unit both on screen and in memory. (For example,
Bear's scheme stores characters as bignums with each code point stored
as a "big digit.")
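
The difference between the two kinds of unit can be sketched roughly as
follows, again in Python. The grapheme_units function is a hypothetical
helper that only attaches combining marks to the preceding base
character; real grapheme segmentation has more rules than this, and this
is not how Bear's implementation works:

```python
import unicodedata

def grapheme_units(s):
    # Hypothetical, simplified grouping: each base code point plus any
    # combining marks that follow it forms one unit. Full grapheme
    # cluster rules cover more cases than this.
    units = []
    for ch in s:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch      # attach the mark to the previous unit
        else:
            units.append(ch)     # start a new unit
    return units

text = "re\u0301sume\u0301"      # "résumé" with both accents decomposed

print(len(text))                  # units when the unit is a code point
print(len(grapheme_units(text)))  # units when the unit is a grapheme
```

With code points as the unit, indexing into text can land on a bare
combining mark; with grouped units, every index yields a whole letter.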

There are advantages and disadvantages to both approaches. The "unit is
code point" method makes string indexing and mutation more difficult,
and it makes procedures like char-upcase nonsensical (because a single
unit is, in general, only part of a grapheme). The "unit is grapheme"
approach avoids most of that -- although letters like ß are still a
problem for case-folding -- but generally requires more space to store
the same data.
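
The ß problem can be seen in a short Python sketch (Python's built-in
string methods stand in here for Scheme's case procedures; this is only
an illustration, not a proposal for SRFI-52):

```python
eszett = "\u00df"            # ß, LATIN SMALL LETTER SHARP S

upper = eszett.upper()       # full case mapping, applied to a string
folded = eszett.casefold()   # case folding for caseless comparison

# One unit in, two units out: no single "character" is the uppercase
# of ß under the full mapping, so a char -> char procedure cannot
# return it.
print(len(eszett), len(upper))
print(upper == "SS")
print(folded == "ss")
```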

[*] "Grapheme" is the name for "what humans think of when you talk about
characters," more or less.

Bradd W. Szonye