
Re: Surrogates and character representation



John.Cowan wrote:
Per Bothner scripsit:


It's the other way round. Using UTF-8 as an internal representation is just fine for *applications*. The problem is that certain *API*s have a concept of indexing into a string, and unfortunately R5RS is one of them. In itself, indexing of strings is a useless feature, as it can be replaced by a sequential-access cursor/iterator API - but historically the Scheme cursor/iterator API uses integers for the "cursor", and existing code moves the "cursor" forwards by adding 1.


By the same token, random-access disks are a useless feature, for they
can be replaced by sequential-access DECtapes that can be rewound and
selectively rewritten.  But at a price.

You're misunderstanding my point, perhaps because I was unclear. There are very few applications where you want to "get the N'th record of a file", in the sense that N is semantically meaningful. There are lots of applications where you want to get to a record fast, using random access given a "cookie": i.e. some value that the implementation can efficiently map to the disk location of the record. The cookie may be the disk address of the record, or its offset in a file, neither of which need have any direct relationship to N, especially if you have variable-length records.
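
To make the "cookie" idea concrete, here is a minimal sketch in Python. The names write_records and read_record and the 4-byte length prefix are assumptions made up for this illustration, not anything from a real file format. The cookie is simply the byte offset at which a record starts, so it gives fast random access without saying anything about N.

```python
import io

def write_records(stream, records):
    """Append variable-length records; return one opaque cookie per record."""
    cookies = []
    for rec in records:
        data = rec.encode("utf-8")
        cookies.append(stream.tell())               # cookie = byte offset of the record
        stream.write(len(data).to_bytes(4, "big"))  # hypothetical 4-byte length prefix
        stream.write(data)
    return cookies

def read_record(stream, cookie):
    """Jump straight to the record named by the cookie; no scan over earlier records."""
    stream.seek(cookie)
    length = int.from_bytes(stream.read(4), "big")
    return stream.read(length).decode("utf-8")

stream = io.BytesIO()
cookies = write_records(stream, ["a", "longer record", "λx"])
print(read_record(stream, cookies[2]))
```

The caller only stores and passes back the cookies; it never needs to know that the third record is "record number 2", nor how long the records before it are.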

Similarly, it is often useful to have random access in a long string, perhaps one representing an Emacs buffer. However, you want to efficiently access sub-strings, not characters. Furthermore, you're interested in substrings defined in terms of previously-seen positions (or "marks" in the Emacs sense), not character indexes. For example, the substring matching a regexp.
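
As a rough sketch of what I mean (in Python, with a hypothetical Utf8Buffer class invented for this example), a buffer stored as UTF-8 can hand out byte offsets as opaque marks. Stepping a mark forward, searching, and extracting a substring between two marks never require counting characters. The sketch assumes the pattern passed to search is ASCII-only, since it is matched against the raw bytes.

```python
import re

class Utf8Buffer:
    """Hypothetical text buffer stored as UTF-8; marks are opaque byte offsets."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    def start(self):
        return 0

    def end(self):
        return len(self._data)

    def next(self, mark):
        """Advance a mark past one character by inspecting the UTF-8 lead byte."""
        first = self._data[mark]
        if first < 0x80:
            width = 1
        elif first < 0xE0:
            width = 2
        elif first < 0xF0:
            width = 3
        else:
            width = 4
        return mark + width

    def char_at(self, mark):
        return self._data[mark:self.next(mark)].decode("utf-8")

    def search(self, pattern, from_mark=0):
        """Return (start_mark, end_mark) of the first match, or None.
        Assumes an ASCII-only pattern, because matching is done on bytes."""
        m = re.compile(pattern.encode("ascii")).search(self._data, from_mark)
        return None if m is None else (m.start(), m.end())

    def substring(self, start_mark, end_mark):
        return self._data[start_mark:end_mark].decode("utf-8")

buf = Utf8Buffer("héllo wörld, héllo again")
found = buf.search("w[^ ]*d")
print(buf.substring(*found))
```

Here the match starts at mark 7 although the "w" is the character with index 6; the caller never needs the character index, only the marks returned by the search.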

Specifically, can you think of any application where this suggestion would lead to performance problems:
http://srfi.schemers.org/srfi-75/mail-archive/msg00050.html
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/