Re: Surrogates and character representation



Alex Shinn writes:
> You're missing Per's point.  Those features have to have been
> assigned by some previous text processing, which had to know
> the location in the text in order to choose a tag.  Those locations
> could just as easily be represented by opaque pointers as by
> codepoint offsets.  To store these pointers in a separate file they
> just need to be serializable.  The obvious pointer representation
> for UTF-8 strings would be the byte offset, an integer, which
> serializes as is.

I'm not missing his point, actually. The stand-off markup may be
generated by someone else, say the data provider (in the case of data
acquired from the LDC or ELDA), so I do not have any Scheme-serialized
data, only character offsets into UTF-8 text.
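
To make the cost concrete, here is a rough sketch of what consuming
such offsets involves when the string is held as UTF-8 bytes. It is in
Python rather than Scheme purely for illustration, and the names
(codepoint_to_byte_offsets, the sample text, the [6, 9) span) are made
up for this example, not taken from any provider's format. It assumes
the offsets count code points and that the input is valid UTF-8.

```python
def codepoint_to_byte_offsets(data: bytes) -> list[int]:
    """Map code point index -> byte offset for valid UTF-8 input.

    table[i] is the byte offset where code point i starts; the final
    entry is len(data), so half-open spans [start, end) can be sliced.
    """
    table = []
    for pos, byte in enumerate(data):
        # UTF-8 continuation bytes look like 10xxxxxx; every other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            table.append(pos)
    table.append(len(data))
    return table


# Hypothetical sample: "naive" with U+00EF, then three CJK characters.
text = "na\u00efve \u65e5\u672c\u8a9e text"
data = text.encode("utf-8")
table = codepoint_to_byte_offsets(data)

# A stand-off annotation given as code point offsets [6, 9).
start, end = 6, 9
span = data[table[start]:table[end]].decode("utf-8")
```

Building the table is one linear scan of the text, after which each
externally supplied character offset becomes a byte offset by lookup.
Without it, every offset costs a scan from the start of the string.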

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"