
Re: Surrogates and character representation



Alex Shinn writes:
> Do either of those actually supply UTF-32 files along with data
> files holding codepoint offsets?  UTF-8 is by far the most common
> storage format for Unicode, and required by most network protocols.

Character offsets, irrespective of encoding. Generally the files are
UTF-8 encoded. If I have a Chinese file, the first three characters
will have character offsets 0, 1, and 2, but when encoded in UTF-8
they will sit at byte offsets 0, 3, and 6. If, as is often the case,
ASCII-range characters are present as well, I cannot assume any fixed
underlying character width. I don't have byte offsets. The standoff
markup will work regardless of the character encoding of the original
file.
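
Roughly, in Python, the divergence looks like this (just an
illustrative sketch; the sample string is made up):

    text = u"\u4e2d\u6587\u5b57A"   # three Chinese characters plus ASCII 'A'

    for i in range(len(text)):
        # byte offset of character i in the UTF-8 encoding of the string
        byte_offset = len(text[:i].encode("utf-8"))
        print("%d -> byte %d" % (i, byte_offset))

    # 0 -> byte 0
    # 1 -> byte 3
    # 2 -> byte 6
    # 3 -> byte 9

The standoff markup records the 0, 1, 2, 3 column; the byte column
depends entirely on the encoding of the file on disk.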

> Regardless, this has nothing to do with strings.  This involves
> seeking to a byte position in a file, and extracting (and optionally
> converting to the internal encoding) a chunk of text.

I'm not sure how you can say that.

Let's look at how I handle these in Python right now: the UTF-8 data
is read and transcoded to the internal Unicode string format. From
there I can use the offsets read from the standoff markup to access
the characters directly. Very simple. All the ugly transcoding is done
at the library level: I don't worry about it. If the original file
isn't in UTF-8 but is in, say, CP936, and I have the appropriate
transcoder to convert to the internal Unicode string, the offsets
continue to work.
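
Concretely, it's on the order of this (a sketch only: the file name,
the offsets, and the annotation span are hypothetical):

    import codecs

    # Decode the raw bytes into Python's internal Unicode string.
    f = codecs.open("corpus.txt", "r", encoding="utf-8")
    text = f.read()
    f.close()

    # Offsets from the standoff markup are character offsets into text.
    start, end = 17, 42
    span = text[start:end]      # direct character indexing, no byte math

    # If the source were CP936 instead, only the codec name changes:
    #   codecs.open("corpus.txt", "r", encoding="cp936")
    # and the same character offsets still select the same text.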

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"