[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation



Per Bothner writes:
> If you have the luxury of reading your entire file into memory (and in
> the process expanding its size by a good bit) you can of course do all
> kinds of processing and index-building.

I have text files containing 100MB worth of UTF-8 encoded text with
character offsets in supplemental files. This happens regularly in
corpus linguistics.

> It appears (from http://www.jorendorff.com/articles/unicode/python.html)
> that Python unicode strings are UTF-16 strings, so character offsets
> will break as soon as you go beyond the Basic Multilingual Plane.
> Scheme implementations can of course fix this, though it means using
> 4 bytes per character.  Hence the discussion.

Yes, it falls apart with Astral plane characters, but these are
fortunately rare.  When you build the Python interpreter you can set
the size of internal Unicode characters: 2-bytes or 4-bytes. I use a
4-byte Unicode build of the interpreter when I deal with Astral plane.

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"