This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Per Bothner writes:
> If you have large UTF-8 text files, clearly the most efficient solution
> is to use byte indexes. That allows you to:
> (1) use random-access on the actual text files, without first reading
> them into memory and expanding them to UTF-32.
> (2) map the file as-is into memory and index into the resulting buffer
> without any conversion of the data or the indexes.
> It follows that the most efficient internal representation is also
> UTF-8, since it matches the files, and allows you to use the same
> byte indexes without conversion.

Yes, this is great in theory, but the fact of the matter is that we have to deal with data that isn't like this and cannot be converted to it. Again, as I said earlier, codepoint indexes are not tied to a particular encoding. When getting data from multiple sources, you have to deal with these differences.

> This argument assumes you're willing to standardize on UTF-8 for
> your text files, which is a reasonable thing to do, but may be
> difficult to agree on. If you don't agree that the canonical
> representation is UTF-8, then using character indexes may be better.

It is completely reasonable, but linguists (with some exceptions) generally neither know nor care about encodings. They don't think about them: encodings exist below the level they are interested in. They create data sets as they find convenient, and I have to work with that.

-- 
Tom Emerson                                   Basis Technology Corp.
Software Architect                          http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
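[Editor's note: a minimal sketch, not part of the original message, illustrating the point above that codepoint indexes are encoding-independent while byte indexes depend on the chosen encoding. Python is used here purely for illustration; the string `"héllo"` and the variable names are hypothetical.]

```python
# The same text viewed as codepoints and as two different byte encodings.
text = "héllo"                    # 'é' is a single codepoint

utf8 = text.encode("utf-8")       # 'é' occupies 2 bytes in UTF-8
utf16 = text.encode("utf-16-le")  # every char here occupies 2 bytes

# The codepoint index of the first 'l' is the same in any encoding:
cp_index = text.index("l")                            # 2

# The byte offset of that same 'l' differs per encoding:
utf8_offset = utf8.index(b"l")                        # 3 ('é' spans bytes 1-2)
utf16_offset = utf16.index("l".encode("utf-16-le"))   # 4 (2 bytes per char)

print(cp_index, utf8_offset, utf16_offset)
```

The codepoint index (2) survives any re-encoding of the data; the byte offsets (3 and 4) are only meaningful relative to one specific encoding, which is the crux of the disagreement above.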