Re: Surrogates and character representation
Alex Shinn writes:
> Do either of those actually supply UTF-32 files along with data
> files holding codepoint offsets? UTF-8 is by far the most common
> storage format for Unicode, and required by most network protocols.
Character offsets, irrespective of encoding. Generally these are UTF-8
encoded. If I have a Chinese file, the first three characters will have
character offsets 0, 1, 2, but when encoded in UTF-8 they will sit at
byte offsets 0, 3, and 6. If, as is often the case, ASCII-range
characters are mixed in as well, I cannot assume any fixed underlying
character width. I don't have byte offsets. The standoff markup works
regardless of the character encoding of the original file.
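To make the point concrete, here is a small sketch (the sample string is
my own, not from the original data): character offsets are stable, while
the corresponding byte offsets depend on the encoding.

```python
# Hypothetical sample: three Chinese characters followed by ASCII.
text = "中文字abc"

# Compute the UTF-8 byte offset at which each character starts.
byte_offsets = []
pos = 0
for ch in text:
    byte_offsets.append(pos)
    pos += len(ch.encode("utf-8"))  # each Chinese character takes 3 bytes

# Character offsets are simply 0..5; the byte offsets differ:
print(byte_offsets)  # [0, 3, 6, 9, 10, 11]
```

The first three characters live at character offsets 0, 1, 2 but at byte
offsets 0, 3, 6, exactly as described above; once ASCII is mixed in, no
single character width holds.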
> Regardless, this has nothing to do with strings. This involves
> seeking to a byte position in a file, and extracting (and optionally
> converting to the internal encoding) a chunk of text.
I'm not sure how you can say that.
Let's look at how I handle these in Python right now: the UTF-8 data
is read and transcoded to the internal Unicode string format. From
there I can use the offsets read from the standoff markup to access
the characters directly. Very simple. All the ugly transcoding is done
at the library level: I don't worry about it. If the original file
isn't in UTF-8 but in, say, CP936, and I have the appropriate
transcoder to convert to the internal Unicode string, the offsets
continue to work.
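The workflow above can be sketched as follows (the sample text and
offsets are hypothetical; `cp936` is Python's name for the GBK codec):

```python
# Simulate a file stored in a legacy encoding rather than UTF-8.
data = "中文字abc".encode("cp936")

# All transcoding happens at the library level, on the way in:
text = data.decode("cp936")

# The standoff offsets are character offsets, so they index the
# internal Unicode string directly, unchanged:
for offset in (0, 1, 2):
    print(offset, text[offset])
```

Swap `"cp936"` for `"utf-8"` and nothing else changes: the offsets keep
working because they never refer to bytes.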
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"