
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



On 7/28/05, Tom Emerson <tree@xxxxxxxxxxxxx> wrote:
> 
> I'm not missing his point, actually. The stand-off markup may be
> generated by someone else, say the data provider (in the case of data
> acquired from the LDC or ELDA) and hence I do not have any Scheme
> serialized data, rather character offsets into a UTF-8 scheme.

Does either of those actually supply UTF-32 files along with data
files holding codepoint offsets?  UTF-8 is by far the most common
storage format for Unicode, and is required by most network protocols.

Regardless, this has nothing to do with strings.  This involves
seeking to a byte position in a file, and extracting (and optionally
converting to the internal encoding) a chunk of text.
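
To make that concrete, here is a rough sketch in Python (not Scheme, and
not anything proposed in the SRFI) of how stand-off codepoint offsets
could be resolved against a UTF-8 file without a string type being
involved. The names byte_offset and extract are made up for
illustration, and the sketch assumes the data is well-formed UTF-8 and
small enough to scan in memory.

```python
import io


def byte_offset(data: bytes, cp_index: int) -> int:
    """Return the byte position where code point number cp_index starts.

    Assumes data is well-formed UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx, so every other byte begins a new code point.
    """
    seen = 0
    for pos, b in enumerate(data):
        if b & 0xC0 != 0x80:  # not a continuation byte
            if seen == cp_index:
                return pos
            seen += 1
    if seen == cp_index:  # offset just past the last code point
        return len(data)
    raise IndexError("code point offset past end of data")


def extract(stream, start_cp: int, end_cp: int) -> str:
    """Extract code points [start_cp, end_cp) from a seekable UTF-8 byte stream."""
    data = stream.read()
    start = byte_offset(data, start_cp)
    end = byte_offset(data, end_cp)
    stream.seek(start)                 # seek to a byte position
    chunk = stream.read(end - start)   # extract a chunk of bytes
    return chunk.decode("utf-8")       # convert to the internal encoding


sample = "naïve 日本語 text".encode("utf-8")
print(extract(io.BytesIO(sample), 6, 9))
```

In practice one would probably build the offset table once per file, or
have the provider ship byte offsets directly, rather than rescanning for
every annotation.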

-- 
Alex