Re: Surrogates and character representation
On 7/28/05, Tom Emerson <tree@xxxxxxxxxxxxx> wrote:
> I'm not missing his point, actually. The stand-off markup may be
> generated by someone else, say the data provider (in the case of data
> acquired from the LDC or ELDA) and hence I do not have any Scheme
> serialized data, rather character offsets into a UTF-8 scheme.
Does either of those actually supply UTF-32 files along with data
files holding codepoint offsets? UTF-8 is by far the most common
storage format for Unicode, and required by most network protocols.
Regardless, this has nothing to do with strings. This involves
seeking to a byte position in a file, and extracting (and optionally
converting to the internal encoding) a chunk of text.
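
To make that concrete, here is a rough sketch in Python (not Scheme, and not anything from an existing library) of what I mean. The names extract_chunk and codepoint_to_byte_offset are made up for illustration, and it assumes the stand-off offsets are byte offsets into a UTF-8 file that begin and end on character boundaries. The second helper is only needed if the provider hands you code point offsets instead.

```python
def extract_chunk(path, byte_offset, byte_length, encoding="utf-8"):
    """Seek to byte_offset in the file, read byte_length bytes, and decode.

    Assumes the span starts and ends on character boundaries; otherwise
    decode() raises UnicodeDecodeError.
    """
    with open(path, "rb") as f:       # binary mode: offsets are in bytes
        f.seek(byte_offset)
        raw = f.read(byte_length)
    return raw.decode(encoding)       # convert to the internal string type


def codepoint_to_byte_offset(data, cp_offset):
    """Map a code point offset to a byte offset within UTF-8 bytes `data`.

    Simple but O(n): decodes the data, takes the first cp_offset code
    points, and measures their UTF-8 encoded length.
    """
    prefix = data.decode("utf-8")[:cp_offset]
    return len(prefix.encode("utf-8"))
```

Nothing here depends on how the language represents strings internally; the only string that ever exists is the decoded chunk.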