
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: Alan Watson <a.watson@xxxxxxxxxxxxxxxx>
Subject: Re: Surrogates and character representation
From: Tom Emerson <tree@xxxxxxxxxxxxx>
Date: Sun, 24 Jul 2005 16:18:39 -0400
Cc: srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <42E3DAA1.5060502@xxxxxxxxxxxxxxxx>
References: <1122002894.6607.29.camel@xxxxxxxxxxxxxx> <17120.28178.788826.533753@xxxxxxxxxxxxxxxxxxxxxx> <20050722040917.GB7576@NYCMJCOWA2> <17120.30080.768671.539970@xxxxxxxxxxxxxxxxxxxxxx> <878xzykn0y.fsf@xxxxxxxxxxxxxxxxx> <17122.31220.22073.72951@xxxxxxxxxxxxxxxxxxxxxx> <20050724053713.GM2784@NYCMJCOWA2> <42E3D086.90403@xxxxxxxxxxxxxxxx> <17123.54753.371934.424875@xxxxxxxxxxxxxxxxxxxxxx> <42E3DAA1.5060502@xxxxxxxxxxxxxxxx>
Reply-to: tree@xxxxxxxxxxxxx

Alan Watson writes:
> Using UTF-8 internally for a Scheme on a Plan 9 system is not obviously 
> a bad idea. Sure, you don't have direct indexing, but you avoid 
> conversion when you talk to the C library and OS.

True enough.
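
To make the "no direct indexing" cost concrete, here is a minimal sketch in Python (not Scheme, and not anything proposed in this thread). The helper name utf8_index is hypothetical. It assumes the input is valid UTF-8 and finds the n-th code point by scanning for non-continuation bytes, which is linear in the length of the string rather than constant time.

```python
def utf8_index(buf, n):
    """Return the n-th code point (0-based) of valid UTF-8 bytes.

    Hypothetical helper: it walks the buffer from the start, so the
    cost grows with n instead of being a constant-time lookup.
    """
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1      # index of the code point currently being scanned
    start = None    # byte offset where the requested code point begins
    for i, b in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else starts
        # a new code point.
        if (b & 0xC0) != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    # The requested code point was the last one in the buffer.
    return buf[start:].decode("utf-8")


data = "a\u00e9\u20ac\U0001D11Ez".encode("utf-8")
print(utf8_index(data, 3))  # the 4-byte character U+1D11E
```

The upside Alan describes is that the buffer can be handed to a UTF-8 C library or OS interface without any conversion.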

> Using UTF-16 internally doesn't give you direct indexing because of 
> characters outside the BMP, but it might make sense on Windows boxes for 
> precisely the same reason.

This is a valid point. Python took the view that, because UTF-16 is
used internally by default, direct indexing into a string could yield
half of a surrogate pair. The feeling (as I remember; I may be wrong)
was that astral-plane characters are rare enough that the common case
(i.e., the BMP) should not be penalized.
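
A small illustration of the surrogate-pair issue, as a sketch only. Modern Python 3 strings index by code point, so the example encodes to UTF-16 explicitly and indexes the resulting 16-bit code units. The helper names (utf16_units, is_high_surrogate, is_low_surrogate, combine) are made up for this example.

```python
import struct


def utf16_units(s):
    """Return the list of 16-bit code units of s encoded as UTF-16."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def combine(hi, lo):
    """Rebuild the code point encoded by a high/low surrogate pair."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


# 'a', U+1D11E (outside the BMP), 'b'
units = utf16_units("a\U0001D11Eb")
print([hex(u) for u in units])

# Direct indexing by code unit lands on half of a surrogate pair.
assert is_high_surrogate(units[1]) and is_low_surrogate(units[2])
print(hex(combine(units[1], units[2])))
```

Three characters occupy four code units here, so unit index 1 is not a character by itself, and index 2 is not the third character.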

> Using UCS-32 internally in these cases would involve translation to talk
> to the library and OS and would further make my emacs use about four
> times as much memory as it does now (which brings us back to the
> representation for infinity).

Yes, though the glibc folks decided that the wchar_t type should be a
4-byte Unicode value. Python gives you the option of building with a
4-byte or 2-byte "Unicode" character. (In Python, Unicode strings and
"narrow" strings are separate types.)
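
The "about four times as much memory" estimate for mostly-ASCII text can be checked with a short sketch. The function name encoded_sizes is hypothetical, and the byte counts cover only the encoded text, not any per-string overhead a real implementation would add.

```python
def encoded_sizes(s):
    """Return the number of bytes s occupies in each encoding form.

    The "-le" codecs are used so that no byte-order mark is counted.
    """
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Pure ASCII: the 4-byte form is four times the size of UTF-8.
print(encoded_sizes("hello"))

# One ASCII character plus one character outside the BMP.
print(encoded_sizes("a\U0001D11E"))
```

For text dominated by characters outside ASCII the ratio shrinks, which is one reason no single representation wins everywhere.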

> In general, any single representation is a bad idea in some circumstances.

Absolutely.

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
