
Re: Surrogates and character representation



If you have large UTF-8 text files, clearly the most efficient solution
is to use byte indexes.  That allows you to:
(1) use random access on the actual text files, without first reading
them into memory and expanding them to UTF-32.
(2) map the file as-is into memory and index into the resulting buffer
without any conversion of the data or the indexes.
It follows that the most efficient internal representation is also
UTF-8, since it matches the files, and allows you to use the same
byte indexes without conversion.
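
As a rough illustration of points (1) and (2), here is a minimal Python
sketch that maps a UTF-8 file into memory and pulls out a span by byte
index, decoding only that span.  The helper names (read_span, demo) and
the sample text are made up for the example, and it assumes the byte
indexes fall on character boundaries.

```python
import mmap
import os
import tempfile

def read_span(path, start, end):
    """Return the text stored in bytes [start, end) of a UTF-8 file.

    The file is mapped as-is; only the requested span is decoded.
    Assumes start and end fall on UTF-8 character boundaries.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return buf[start:end].decode("utf-8")

def demo():
    # Hypothetical sample data, written to a temporary file.
    data = "naïve café\nsecond line\n".encode("utf-8")
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Byte indexes computed on the encoded data are valid for the
        # file and for the mapped buffer alike, with no conversion.
        start = data.index(b"caf")
        end = data.index(b"\n")
        return read_span(path, start, end)
    finally:
        os.remove(path)
```

Here demo() returns the word "café", found by byte position alone.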

This argument assumes you're willing to standardize on UTF-8 for
your text files, which is a reasonable thing to do, but may be
difficult to agree on.  If you don't agree that the canonical
representation is UTF-8, then using character indexes may be better.
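
To make the cost of the alternative concrete: if character indexes are
used against UTF-8 data, each lookup needs a scan to find the matching
byte position.  A minimal sketch follows; the function name is made up
for illustration, and it assumes the input is valid UTF-8.

```python
def codepoint_to_byte_offset(data: bytes, cp_index: int) -> int:
    """Return the byte offset at which code point number cp_index starts.

    Linear scan: bytes of the form 10xxxxxx are continuation bytes,
    so every other byte starts a new code point.
    """
    count = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # lead byte or ASCII byte
            if count == cp_index:
                return i
            count += 1
    if count == cp_index:           # one past the last code point
        return len(data)
    raise IndexError("code point index out of range")
```

The scan is proportional to the size of the text before the index,
which is the conversion that byte indexes avoid.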

Another argument for using code point offsets rather than byte offsets
applies when the offsets are going to be read by humans, perhaps in
email or journal articles, since people unfamiliar with UTF-8 may be
confused by UTF-8 offsets.  However, this is a fairly weak argument,
since you have the same issue with composite characters (a base
character followed by combining marks): in that case code point
offsets will also not match the characters that people see.
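
A small Python sketch of that mismatch, using "e" followed by U+0301
COMBINING ACUTE ACCENT.  The count of visible characters here is only
an approximation (it skips code points with a nonzero combining class)
and is not full grapheme cluster segmentation; the function name is
made up for the example.

```python
import unicodedata

def count_units(s):
    """Return (UTF-8 bytes, code points, approximate visible characters)."""
    n_bytes = len(s.encode("utf-8"))
    n_codepoints = len(s)
    # Approximation: a combining mark attaches to the preceding base
    # character, so it does not count as a separate visible character.
    n_visible = sum(1 for c in s if unicodedata.combining(c) == 0)
    return n_bytes, n_codepoints, n_visible

decomposed = "cafe\u0301"   # 'e' + COMBINING ACUTE ACCENT
counts = count_units(decomposed)
```

For this string the three counts are 6, 5 and 4, so neither byte
offsets nor code point offsets line up with what a reader sees.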
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/