Re: Surrogates and character representation



Tom Emerson wrote:
> Representing strings internally in UTF-8 is a loss though, since you
> lose random access to the string.

Random access to a previously accessed position works just fine - just use the byte offset.
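
To make the byte-offset point concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name `find_byte_offset` is made up). It locates a substring once, keeps the byte offset, and later jumps straight back to that position without rescanning:

```python
def find_byte_offset(data: bytes, needle: str) -> int:
    """Return the byte offset of the first occurrence of needle, or -1."""
    # Searching the raw bytes is safe: UTF-8 is self-synchronizing, so a
    # valid encoded needle cannot match starting in the middle of a character.
    return data.find(needle.encode("utf-8"))

text = "naïve café".encode("utf-8")

# Remember where "café" starts. This is a byte position, not a character
# index: the "ï" before it occupies two bytes.
offset = find_byte_offset(text, "café")

# Later: constant-time access to the remembered position.
rest = text[offset:].decode("utf-8")
print(offset, rest)  # 7 café
```

The character index of "café" would be 6, but nothing here ever needs it; the saved byte offset is enough to get back.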

Random access to a position in a string that has not been previously accessed is not in itself useful.

> For some applications this isn't a big deal, but in general using UTF-8
> as an internal representation is a bad idea.

It's the other way round. Using UTF-8 as an internal representation is just fine for *applications*. The problem is that certain *API*s have a concept of indexing into a string, and unfortunately R5RS is one of them. In itself indexing of strings is a useless feature, as it can be replaced by a sequential-access cursor/iterator API - but historically the Scheme cursor/iterator API uses integers for the "cursor". And existing code moves the "cursor" forwards by adding 1.
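
A rough sketch of what such a cursor API could look like over a UTF-8 buffer, again in Python purely for illustration. The names `cursor_start`, `cursor_next`, `cursor_ref`, and `cursor_at_end` are hypothetical and not taken from R5RS or any SRFI; the cursor here happens to be a byte offset, but callers are meant to treat it as opaque and advance it only through `cursor_next`:

```python
def cursor_start(data: bytes) -> int:
    """Cursor for the first character."""
    return 0

def cursor_at_end(data: bytes, cur: int) -> bool:
    """True once the cursor has moved past the last character."""
    return cur >= len(data)

def cursor_next(data: bytes, cur: int) -> int:
    """Advance the cursor past one encoded character."""
    cur += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip them.
    while cur < len(data) and (data[cur] & 0xC0) == 0x80:
        cur += 1
    return cur

def cursor_ref(data: bytes, cur: int) -> str:
    """Return the character at the cursor."""
    return data[cur:cursor_next(data, cur)].decode("utf-8")

def chars(data: bytes) -> list:
    """Walk the whole buffer with the cursor API."""
    out = []
    cur = cursor_start(data)
    while not cursor_at_end(data, cur):
        out.append(cursor_ref(data, cur))
        cur = cursor_next(data, cur)
    return out

data = "aλ😀".encode("utf-8")
print(chars(data))  # ['a', 'λ', '😀']
```

With this buffer the cursor takes the values 0, 1, 3 and then 7 (the end). Code that instead advances by adding 1 would go from 1 to 2, which is the second byte of "λ" - exactly the breakage that integer cursors plus "add 1" cause on a UTF-8 representation.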
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/