Re: Surrogates and character representation



On 7/28/05, Tom Emerson <tree@xxxxxxxxxxxxx> wrote:
> 
> I'm not missing his point, actually. The stand-off markup may be
> generated by someone else, say the data provider (in the case of data
> acquired from the LDC or ELDA) and hence I do not have any Scheme
> serialized data, rather character offsets into a UTF-8 scheme.

Does either of those actually supply UTF-32 files along with data
files holding codepoint offsets?  UTF-8 is by far the most common
storage format for Unicode, and required by most network protocols.

Regardless, this has nothing to do with strings.  This involves
seeking to a byte position in a file, and extracting (and optionally
converting to the internal encoding) a chunk of text.
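
A minimal sketch of that file-level operation, in Python rather than
Scheme.  The name read_span, the sample text, and the offsets are all
hypothetical, and it assumes the stand-off markup gives byte offsets
that fall on character boundaries of a UTF-8 file:

```python
import os
import tempfile


def read_span(path, byte_start, byte_end, encoding="utf-8"):
    """Return the text stored in bytes [byte_start, byte_end) of a file.

    Hypothetical helper: the offsets are byte positions, not codepoint
    indices, and are assumed to land on character boundaries.
    """
    with open(path, "rb") as f:          # binary mode: seek counts bytes
        f.seek(byte_start)               # jump straight to the span
        raw = f.read(byte_end - byte_start)
    # Convert the extracted chunk to the language's internal string type.
    return raw.decode(encoding)


# Demo on a throwaway file. In UTF-8, "ï" and "é" take two bytes each,
# so the last word occupies bytes 7 through 11.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("naïve café".encode("utf-8"))
    path = tmp.name

try:
    word = read_span(path, 7, 12)        # word now holds the decoded chunk
finally:
    os.remove(path)
```

No string indexing is involved at any point: the seek is on bytes, and
only the extracted chunk is ever decoded.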

-- 
Alex