Re: Issues with Unicode
Date: Tue, 25 Apr 2006 20:27:07 -1000 (HST)
From: Shiro Kawai <shiro@xxxxxxxx>
Alternative implementations of strings have been discussed on
this list, and in some threads on comp.lang.scheme, I think.
I'd like to draw attention to one point which hasn't been
raised, IIRC. (Maybe it is too trivial and everybody knows
about it; if so, sorry for the noise.)
I can't recall whether this ever came up here, but last year, when
this SRFI was still fresh and under heavy discussion, I wrote up an
alternative proposal for a Unicode-supporting -- although *not*
Unicode-mandating -- string API, where strings are collections of
grapheme clusters indexed by opaque cursors, not character indices,
and whose binary encoding is separated into BLOB->STRING and
STRING->BLOB[!] procedures and abstracted by text codec descriptors.
The text of the document is here:
It came out of many extensive discussions with John Cowan, Jorgen
Schaefer, and probably a number of other persons whom I've forgotten
by now. Here are some of the most important points about it, off the
top of my head:
1. It doesn't require Unicode support, for instance in the Scheme
system that runs on your doorknob. More seriously, the API simply
does not specify anything about particular code point mappings or
text codecs other than that ASCII must be supported.
2. It's high-level. We can sweep things like normalization wholly
under the rug with it. We needn't mandate a particular internal
string representation; the API would work just as well with all
strings as UTF-8 strings internally, as with all strings as UTF-32
strings internally, as with all strings as pairs of text codec and
actual storage internally.
3. Further on the point that we can use UTF-8 internally: the API
permits efficient variable-width string representations such as
UTF-8, because strings are indexed by opaque cursors, which may be
implemented as octet indices for constant-time access, whereas
natural-number character indices would require O(n) access time.
Moreover, higher-level text structures, such as grapheme clusters,
words, sentences, or paragraphs, would require explicit stepping
as with string cursors anyway, or else O(n) access time.
4. Strings are immutable. The application of mutability in old
R5RS-style strings was extremely limited, anyway: you can change
existing characters, but you can't insert or delete, so you can't,
say, change a whole _word_ in a string, if the substitute has a
length different from the original. It just so happened that all
characters had the same width in all practical implementations, so
we could swap in new ones as we pleased, but this assumption
doesn't hold up very well if we want to extend our text-processing
capabilities beyond that limited world and to higher-level text
structures such as words and sentences. Also, because strings are
immutable, we can more safely share storage, and it is not
unreasonable to mandate the existence of an O(1) STRING-SLICE
procedure, like SRFI 13's SUBSTRING/SHARED.
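
To make points 2 through 4 concrete, here is a minimal sketch in
Python (not Scheme, and not the API from the proposal itself). All
names in it (Text, Cursor, from_blob, to_blob, slice) are hypothetical
stand-ins chosen for illustration. It assumes UTF-8 internal storage,
and for brevity its cursors step over code points rather than grapheme
clusters, which a real implementation of the proposal would have to
handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    """Opaque position in a Text; callers should not inspect _offset."""
    _offset: int


class Text:
    """Immutable string stored as UTF-8 and indexed by cursors (hypothetical)."""

    def __init__(self, data, start=0, end=None):
        self._data = data  # shared bytes object, never mutated
        self._start = start
        self._end = len(data) if end is None else end

    @classmethod
    def from_blob(cls, blob, codec="utf-8"):
        # Rough analogue of BLOB->STRING: decode with a named codec,
        # then store as UTF-8 internally.
        return cls(bytes(blob).decode(codec).encode("utf-8"))

    def to_blob(self, codec="utf-8"):
        # Rough analogue of STRING->BLOB: re-encode with a named codec.
        return self._data[self._start:self._end].decode("utf-8").encode(codec)

    def start(self):
        return Cursor(self._start)

    def end(self):
        return Cursor(self._end)

    def next(self, cur):
        # Step over one code point: skip the lead octet, then any
        # continuation octets (bit pattern 10xxxxxx).
        i = cur._offset + 1
        while i < self._end and (self._data[i] & 0xC0) == 0x80:
            i += 1
        return Cursor(i)

    def ref(self, cur):
        # The cursor already is an octet offset, so no scan from the
        # beginning of the string is needed.
        if not (self._start <= cur._offset < self._end):
            raise IndexError("cursor out of range")
        return self._data[cur._offset:self.next(cur)._offset].decode("utf-8")

    def slice(self, a, b):
        # O(1): the new Text shares storage and differs only in offsets.
        # This is safe because nothing can mutate the shared octets.
        return Text(self._data, a._offset, b._offset)
```

For example, decoding the Latin-1 octets of "naïve" with
Text.from_blob(blob, "latin-1"), stepping the start cursor twice with
next, and calling ref yields "ï"; slicing from that cursor to end()
gives a Text that shares the same underlying octets.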
There are, of course, still some problems with it. I couldn't think
of a good literal syntax, for instance. However, I think the basic
idea of the proposal is a considerable improvement over the current,
historically motivated, mutable character vector model of strings.
Some of the fancier implementations might not go well with
preemptive multithreading; if a mutation of a string touches more
than one place in the string object, it creates a hazard.
While I agree that strings ought to be immutable, as you recommended
afterward, I don't think this is really a very good reason: I can't
imagine why anyone would *want* to share a mutable string between
threads badly enough for synchronization to be the default.
(It might be convenient to have mutable strings for editor-like
applications, which would also allow length-changing mutation. I'd
rather think of that as another type of object that can be built
on top of immutable strings, e.g. a buffer object realized by
a balanced tree of string segments.)
This would definitely be useful. It would also definitely fall
outside the scope of basic Unicode support in R6RS, so I think SRFI 75
shouldn't even try to specify any mutable string data in general.
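
As an aside on the buffer idea above, a tree of immutable string
segments might be sketched as follows. This is Python, purely for
illustration; Leaf, Node, split, replace, and to_str are hypothetical
names, and the rebalancing that a real "balanced tree" would need is
omitted.

```python
class Leaf:
    """A single immutable string segment."""

    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Node:
    """Concatenation of two sub-trees; caches the total length."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length


def split(rope, i):
    """Return (left, right) split at index i; the input tree is untouched."""
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:i]), Leaf(rope.text[i:])
    if i <= rope.left.length:
        a, b = split(rope.left, i)
        return a, Node(b, rope.right)
    a, b = split(rope.right, i - rope.left.length)
    return Node(rope.left, a), b


def replace(rope, start, end, new_text):
    """Return a new tree with [start, end) replaced by new_text.

    The replacement may differ in length from the span it replaces.
    """
    left, rest = split(rope, start)
    _, right = split(rest, end - start)
    return Node(Node(left, Leaf(new_text)), right)


def to_str(rope):
    """Flatten the tree back into one string."""
    if isinstance(rope, Leaf):
        return rope.text
    return to_str(rope.left) + to_str(rope.right)
```

With buf = Leaf("hello world"), the call replace(buf, 6, 11,
"everyone") returns a new tree that flattens to "hello everyone",
while buf itself is unchanged, so the "mutation" is really the
construction of a new value over shared immutable segments.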