
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: tree@xxxxxxxxxxxxx
Subject: Re: Surrogates and character representation
From: Per Bothner <per@xxxxxxxxxxx>
Date: Wed, 27 Jul 2005 20:43:43 -0700
Cc: srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <17128.20298.707693.881280@xxxxxxxxxxxxxxxxxxxxxx>
References: <y9lu0ig46v8.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <17127.44572.207464.724852@xxxxxxxxxxxxxxxxxxxxxx> <5fb7e0870507271853a6defce@xxxxxxxxxxxxxx> <17128.19464.258589.23946@xxxxxxxxxxxxxxxxxxxxxx> <5fb7e08705072720162f6a8d1a@xxxxxxxxxxxxxx> <17128.20298.707693.881280@xxxxxxxxxxxxxxxxxxxxxx>
User-agent: Mozilla Thunderbird 1.0.6-1.1.fc4 (X11/20050720)

Tom Emerson wrote:

Let's look at how I handle these in Python right now: the UTF-8 data
is read and transcoded to the internal Unicode string format.


Ah, so you're not doing random access on "multimegabyte text files", as
we assumed from your initial message.

If you have the luxury of reading your entire file into memory (and in
the process expanding its size by a good bit) you can of course do all
kinds of processing and index-building.

It appears (from http://www.jorendorff.com/articles/unicode/python.html)
that Python unicode strings are UTF-16 strings, so character offsets
will break as soon as you go beyond the Basic Multilingual Plane.
Scheme implementations can of course fix this, though it means using
4 bytes per character.  Hence the discussion.
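
To make the offset problem concrete, here is a minimal sketch. It assumes
a current Python 3, where strings are indexed by code point, so the UTF-16
view is simulated by encoding explicitly. The helper names utf16_units and
utf16_offset are made up for this illustration and are not part of any
library.

```python
def utf16_units(text):
    """Count the 16-bit code units needed to encode text as UTF-16."""
    # "utf-16-le" writes no byte-order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2


def utf16_offset(text, index):
    """Code-unit offset of the character at code-point position index."""
    return utf16_units(text[:index])


bmp = "a\u4e2db"         # every character is inside the BMP
astral = "a\U00010400b"  # U+10400 is outside the BMP (surrogate pair)

# Inside the BMP, code-point and code-unit counts agree.
print(len(bmp), utf16_units(bmp))

# Outside the BMP they diverge, so the offset of "b" shifts by one.
print(len(astral), utf16_units(astral))
print(utf16_offset(astral, 2))

# A fixed-width representation avoids this at 4 bytes per character.
print(len(astral.encode("utf-32-le")))
```

In the second string the final "b" is the third character but sits at
code-unit offset 3 rather than 2, which is the kind of drift that breaks
character offsets computed against a UTF-16 representation.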
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/

