
Re: introduction

This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.

To: srfi-52@xxxxxxxxxxxxxxxxx
Subject: Re: introduction
From: "Bradd W. Szonye" <bradd+srfi@xxxxxxxxxx>
Date: Tue, 10 Feb 2004 16:26:07 -0800
Delivered-to: srfi-52@xxxxxxxxxxxxxxxxx
In-reply-to: <20040210210235.8DC976C10D@xxxxxxxxxxxxxxxxxxxxx>
Mail-followup-to: srfi-52@xxxxxxxxxxxxxxxxx
References: <200402102106.NAA13314@xxxxxxxxxxxxxxxxxxxxxxx> <20040210210235.8DC976C10D@xxxxxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.4.1i

> Tom Lord wrote:
>> [*] What exactly is a "Unicode character?"  The answer can vary
>>     depending on context.  In some contexts it might mean a Unicode
>>     abstract character -- the kind of value to which a codepoint
>>     (integer in the range 0..10ffff) is assigned.  In other contexts,
>>     it may mean certain kinds of sequences of abstract characters.
>> 
>>     One goal for SRFI-52 is to remain agnostic about the answer 
>>     to that question.

Robby Findler wrote:
> I'm still relatively new to unicode, so I apologize if this is a
> foolish question (rtfm ptrs welcome!), but I wonder why you would want
> to remain agnostic on this point. Can you explain why unicode-code
> points would be a bad choice, and what other choices might exist?

Short version: In general, a single character on your screen may
actually be made of several Unicode code points. For example, the
grapheme[*] é (small E with acute accent) can be encoded as a base
character (small E) plus a combining mark (acute accent).
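
To make the two encodings of é concrete, here is a small sketch in Python (my choice of language for illustration; nothing in SRFI-52 depends on it). It assumes only the standard unicodedata module, and the variable names are made up for the example.

```python
import unicodedata

# The same grapheme, spelled two ways.
composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # small e followed by a combining acute accent

# Python strings count code points, so the two spellings differ in length.
print(len(composed))     # 1
print(len(decomposed))   # 2

# They are not equal as code point sequences...
print(composed == decomposed)  # False

# ...but Unicode normalization maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Both strings display as the same single letter, which is exactly why "character" is ambiguous here.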

Most internal Unicode encodings use code points as the basic "character"
unit. In those systems, the letter é is one symbol on screen but two
"character" units in memory. Other systems combine the code points much
earlier, such that é is only one "character" unit both on-screen and
in-memory. (For example, Bear's scheme stores characters as bignums with
each code point stored as a "big digit.")
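
As a rough illustration of the second approach, the sketch below groups code points into grapheme-like units by attaching each combining mark to the base character before it. This is a deliberate simplification that I am assuming for the example (real grapheme segmentation has more rules), and it is not a description of how Bear's implementation works. The function name clusters is hypothetical.

```python
import unicodedata

def clusters(s):
    """Group code points into grapheme-like units.

    Simplifying assumption: a unit is a base character followed by any
    combining marks (code points with a nonzero combining class).
    """
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the previous unit
        else:
            out.append(ch)     # start a new unit
    return out

word = "re\u0301sume\u0301"    # "résumé" with both accents as combining marks

print(len(word))               # 8 code point units
print(len(clusters(word)))     # 6 grapheme-like units
```

With the "unit is code point" view the word has eight units; with the grouped view it has six, matching what a reader sees on screen.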

There are advantages and disadvantages to both approaches. The "unit is
code point" method makes string indexing and mutation more difficult,
and it makes procedures like char-upcase nonsensical (because a single
unit is, in general, only part of a grapheme). The "unit is grapheme"
approach avoids most of that -- although letters like ß are still a
problem for case-folding, since its uppercase form is the two letters
SS -- but generally requires more space to store the same data.
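
The pitfalls above are easy to reproduce in any system whose strings are indexed by code point. The following sketch uses Python as one such system (an assumption for illustration only); the variable names are invented for the example.

```python
# Indexing by code point can return only part of a grapheme.
decomposed = "e\u0301"          # é as base letter plus combining acute
first_unit = decomposed[0]
print(first_unit == "e")        # True: the accent was left behind

# Mutation by code point can move a mark onto the wrong letter.
word = "e\u0301x"               # "éx"
reversed_units = word[::-1]     # code points reversed: x, accent, e
print(reversed_units == "x\u0301e")  # True: the accent now follows the x

# Even a single code point is a problem for a char-to-char upcase:
# the uppercase form of ß is two letters.
upper = "\u00df".upper()
print(upper)                    # SS
print(len(upper))               # 2
```

The last case shows why a procedure that maps one character to one character cannot upcase ß correctly, whichever unit is chosen.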

[*] "Grapheme" is the name for "what humans think of when you talk about
    characters," more or less.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
