
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Alex Shinn writes:
> You're missing Per's point.  Those features have to have been
> assigned by some previous text processing, which had to know
> the location in the text in order to choose a tag.  Those locations
> could just as easily be represented by opaque pointers as by
> codepoint offsets.  To store these pointers in a separate file they
> just need to be serializable.  The obvious pointer representation
> for UTF-8 strings would be the byte offset, an integer, which
> serializes as is.

I'm not missing his point, actually. The stand-off markup may be
generated by someone else, say the data provider (in the case of data
acquired from the LDC or ELDA), and hence I do not have any
Scheme-serialized data, only character offsets into UTF-8 text.
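
To make the cost concrete, here is a minimal sketch (in Python, purely
for illustration) of what a consumer of such stand-off markup has to do
when the text is held as UTF-8 bytes: each character offset must be
turned into a byte offset by scanning. The function name and the sample
annotation below are hypothetical and not taken from any LDC or ELDA
format.

```python
def char_to_byte_offset(data: bytes, char_offset: int) -> int:
    """Return the byte index where code point number char_offset starts.

    Assumes data is well-formed UTF-8. The scan is linear in the
    length of the text.
    """
    count = 0
    for i, b in enumerate(data):
        # Continuation bytes have the form 10xxxxxx; any other byte
        # begins a new code point.
        if (b & 0xC0) != 0x80:
            if count == char_offset:
                return i
            count += 1
    if count == char_offset:
        return len(data)  # offset just past the last character
    raise IndexError("character offset past end of text")


# Hypothetical stand-off annotation: (start_char, end_char, tag).
data = "naïve 日本語 text".encode("utf-8")
start_char, end_char, tag = 6, 9, "JA"

start = char_to_byte_offset(data, start_char)
end = char_to_byte_offset(data, end_char)
span = data[start:end].decode("utf-8")
```

With the sample above, the character span 6..9 maps to the byte span
7..16 and selects the three Japanese characters.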

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"