[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Alex Shinn writes:
> Do either of those actually supply UTF-32 files along with data
> files holding codepoint offsets?  UTF-8 is by far the most common
> storage format for Unicode, and required by most network protocols.

Character offsets, irrespective of encoding. Generally these are UTF-8
encoded. If I have a Chinese file the first three characters will have
character offsets 0,1,2, but when encoded in UTF-8 these will be at 0,
3, 6. If, as is often the case, ASCII-range characters exist too, I
cannot assume any given underlying character width. I don't have byte
offsets. The standoff markup will work regardless of the character
encoding of the original file.

> Regardless, this has nothing to do with strings.  This involves
> seeking to a byte position in a file, and extracting (and optionally
> converting to the internal encoding) a chunk of text.

I'm not sure how you can say that.

Let's look at how I handle these in Python right now: the UTF-8 data
is read and transcoded to the internal Unicode string format. From
there I can use the offsets read from the standoff markup to access
the characters directly. Very simple. All the ugly transcoding is done
at the library level: I don't worry about it. If the original file
isn't in UTF-8, but in say CP936, and I have the appropriate
transcoder to convert to the internal Unicode string, the offsets
continue to work.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"