This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
If you have large UTF-8 text files, the most efficient solution is clearly to use byte indexes. That allows you to: (1) use random access on the actual text files, without first reading them into memory and expanding them to UTF-32; (2) map a file as-is into memory and index into the resulting buffer without any conversion of the data or the indexes.

It follows that the most efficient internal representation is also UTF-8, since it matches the files and allows you to use the same byte indexes without conversion.

This argument assumes you're willing to standardize on UTF-8 for your text files, which is a reasonable thing to do, but may be difficult to agree on. If you don't agree that the canonical representation is UTF-8, then using character indexes may be better.

Another argument for using codepoint offsets rather than byte offsets is that the offsets may be used by humans, perhaps in email or journal articles, and people unfamiliar with UTF-8 may be confused by byte offsets. However, this is a fairly weak argument, since you have the same issue with combining characters: in that case codepoint offsets will also not match the characters that people see.

-- 
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/
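[The byte-offset versus codepoint-offset distinction above can be demonstrated concretely. A minimal sketch in Python (chosen for illustration; the SRFI itself concerns Scheme strings), showing that byte and codepoint offsets diverge for multi-byte UTF-8 characters, and that codepoint offsets in turn diverge from perceived characters once combining marks are involved:]

```python
# Illustrative sketch (Python, not Scheme) of the offset mismatches
# discussed above.
s = "café"                    # 'é' as the single codepoint U+00E9
b = s.encode("utf-8")
print(len(s))                 # 4 codepoints
print(len(b))                 # 5 bytes: 'é' encodes as 2 bytes in UTF-8
print(s.index("é"))           # codepoint offset 3
print(b.index("é".encode("utf-8")))  # byte offset 3 here, but any multi-byte
                              # character shifts all later byte offsets

t = "cafe\u0301"              # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(t))                 # 5 codepoints, though a reader sees 4 characters
```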