Re: strings draft
On Wed, 21 Jan 2004, Tom Lord wrote:
> By somewhat reasonable expectation, there must be at least 256
> distinct Scheme characters and INTEGER->CHAR must be defined for all
> integers in the range `0..255'. There are many circumstances in
> which conversions between octets and characters are desirable and
> the requirements of this expectation say that such conversion is
> always possible. It is quite possible to imagine implementations in
> which this is not the case: in which, for example, a (fully general)
> octet stream can not be read and written using READ-CHAR and DISPLAY
> (applied to characters). Such an implementation might introduce
> non-standard procedures for reading and writing octets and
> representing arrays of octets. While such non-standard extensions
> may be desirable for independent reasons, I see no good reason not
> to define at least a subset of Scheme characters which is mapped to
> the set of octet values.
I think that this is a problem. We need a portable method of
reading/writing an arbitrary octet stream, full stop. As
characters become more complicated than octets, the two concepts
must be divorced from each other; otherwise there will be endless
hair as this exception or that rears its ugly head.
So I'd propose READ-CHAR and DISPLAY which read or write "a
character" abstracting away issues of encoding, multibyte character
sets, endianness, etc. according to either application defaults or
port properties, and two new routines, READ-OCTET and WRITE-OCTET,
which read and write binary values exactly eight bits wide and take
or return exact integers in the 0..255 range.
In fact, READ-OCTET and WRITE-OCTET would in that case become primitive,
since READ-CHAR and DISPLAY could be implemented in terms of them but
the reverse would not be true.
This neatly sidesteps the issue of needing character mappings for
every member of the range 128-255, and separates the ideas of octet
and character at the lowest level.
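
A minimal sketch of that layering, in Python rather than Scheme for
brevity. The names read_octet, write_octet, read_char and write_char
are stand-ins for the procedures discussed above, and UTF-8 is assumed
as the port's encoding purely for illustration; a real port would
consult its own properties.

```python
import io

def read_octet(port):
    """Return the next octet as an integer in 0..255, or None at end of stream."""
    chunk = port.read(1)
    return chunk[0] if chunk else None

def write_octet(octet, port):
    """Write exactly one octet; reject anything outside 0..255."""
    if not 0 <= octet <= 255:
        raise ValueError("octet out of range: %r" % (octet,))
    port.write(bytes([octet]))

def read_char(port):
    """Read one character using only read_octet (assumption: UTF-8 port)."""
    first = read_octet(port)
    if first is None:
        return None
    # The lead octet says how many continuation octets follow.
    if first < 0x80:
        extra = 0
    elif first >> 5 == 0b110:
        extra = 1
    elif first >> 4 == 0b1110:
        extra = 2
    elif first >> 3 == 0b11110:
        extra = 3
    else:
        raise ValueError("invalid UTF-8 lead octet")
    octets = [first]
    for _ in range(extra):
        nxt = read_octet(port)
        if nxt is None:
            raise ValueError("truncated character")
        octets.append(nxt)
    return bytes(octets).decode("utf-8")

def write_char(ch, port):
    """Write one character using only write_octet (assumption: UTF-8 port)."""
    for octet in ch.encode("utf-8"):
        write_octet(octet, port)

port = io.BytesIO()
write_char("\u00e9", port)   # one character, two octets on the wire
write_octet(255, port)       # an octet that is not a character in this encoding
port.seek(0)
print(read_char(port), read_octet(port))
```

The octet 255 round-trips through the octet procedures but could
never be read back by read_char under this encoding, which is the
asymmetry described above.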
> Pika is of the "approximately 2^21 characters" variety.
> Specifically, the Pika CHAR? type will in effect be a _superset_ of
> the set of Unicode codepoints. Each 21-bit codepoint will
> correspond to a Pika character. For each such character, there
> will be (2^4-1) (15) additional related characters representing the
> basic code point modified by a combination of any of four
FWIW, I'm using Unicode codepoints in the private use area as combining
characters to represent buckybits. I think this is compatible, but
conversions between representations will introduce yet more hair.
> R5RS requires a partial ordering of characters in which upper and
> lower case variants of "the same character" are treated as equal.
> Most problematically: R5RS requires that every alphabetic character
> have both an upper and lower case variant. This is a problem
> because Unicode defines abstract characters which, at least
> intuitively, are alphabetic -- but which lack such case mappings.
This problem goes away in the infinite-character-set universe.
If we restrict discussion to the characters that can appear in
canonical string representation (meaning no ligatures), every
cased character in Unicode, with the single exception of eszett,
has both a lowercase and an uppercase form; the catch is that the
uppercase and lowercase forms may require different numbers of
combining codepoints to represent.
Eszett is in a class by itself, being a canonical lowercase character
and having an uppercase form which is, linguistically, a different
number of characters, as well as being a different number of codepoints.
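
Both effects can be observed with a short Python sketch. It relies on
the Unicode tables bundled with the Python interpreter, which are
newer than the Unicode version under discussion here, so treat it as
an illustration rather than a statement about any Scheme
implementation. The helper name nfc_upper is made up for this example.

```python
import unicodedata

def nfc_upper(s):
    """Uppercase with full case mapping, then normalize to composed form (NFC)."""
    return unicodedata.normalize("NFC", s.upper())

eszett = "\u00df"    # LATIN SMALL LETTER SHARP S
j_caron = "\u01f0"   # LATIN SMALL LETTER J WITH CARON (one codepoint)

# Eszett: the uppercase form is two letters, not one.
print(len(eszett), len(nfc_upper(eszett)))

# j with caron: there is no precomposed uppercase, so the uppercase
# form needs a base letter plus a combining codepoint.
print(len(j_caron), len(nfc_upper(j_caron)))
```

In the second case the result is still one character in the
multi-codepoint sense (a base letter plus a combining mark); in the
eszett case it is two letters, which is what sets eszett apart.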
I was initially driven to the multi-codepoint representation by the
attempt to solve this particular problem in reconciling the unicode
standard with R5RS, and I wound up with a 99.999% solution.
> We'll explore the topic further, later, but briefly: it does not
> appear that "good Unicode support" and "R5RS requirements for case
> mappings" are compatible -- at least not in a simple way.
There's simple and there's simple and there's simple. It works
"simply", at least from the programmer's POV, in a world where
strings are aggressively canonicalized and characters use a
multi-codepoint representation, with the sole exception of the
eszett character. This is complicated to implement, but all the
complication is under the hood from the programmer's POV.
>** What _is_ a Character, Anyway
> /=== R6RS Recommendation:
> R6RS should explicitly define a _portable_character_set_
> containing the characters mentioned earlier: `a..z', `A..Z',
> space, formfeed, newline, carriage return, and the punctuation
> required by Scheme syntax.
> Additionally, R6RS should define an _optional_ syntax for
> Unicode codepoints. I propose:
> and in strings:
> where XXXXX is an (arbitrary length) string of hexadecimal digits.
It's important to note the terminating '.' in the representation
for use in strings. Without it, an escape followed by a character
that is itself a hexadecimal digit would be ambiguous, since XXXXX
can be of any length.
If the character set is not restricted to a known width, I think
codepoint separators are handier, especially since most characters
are in the 0..255 or 0..65535 range.
Instead of writing #\U+C32000000AF for a combining sequence
it would be handier and clearer, and easier to parse since you
don't have to do bignum modulus operations to work out where
to break up the number read, to write #\U+C32:AF or similar.
This has the additional advantage that it doesn't "guess" the
size of the codepoints (Unicode currently fixes it at 21 bits,
but the example above allots 32 bits to each codepoint, as the
current most-general encoding does).
It becomes handier and clearer yet if some of the most useful
entities have names, so you can write something like #\A:Macron
or equivalently #\A:AF as a character.
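
To make the parsing difference concrete, here is a rough Python
sketch of both readers. The function names and the `bits` parameter
are invented for this example, and the `#\U+` prefix handling is an
assumption about how a reader might see the token.

```python
PREFIX = r"#\U+"

def parse_separated(literal):
    """Split on ':' and read each piece as one hexadecimal codepoint."""
    if not literal.startswith(PREFIX):
        raise ValueError("not a codepoint literal")
    return [int(part, 16) for part in literal[len(PREFIX):].split(":")]

def parse_fixed_width(literal, bits):
    """Read one big number, then peel off `bits`-wide codepoints from the low end."""
    if not literal.startswith(PREFIX):
        raise ValueError("not a codepoint literal")
    value = int(literal[len(PREFIX):], 16)
    points = []
    while True:
        value, low = divmod(value, 1 << bits)   # the bignum modulus step
        points.append(low)
        if value == 0:
            break
    points.reverse()
    return points

print(parse_separated(r"#\U+C32:AF"))
print(parse_fixed_width(r"#\U+C32000000AF", 32))
print(parse_fixed_width(r"#\U+C32000000AF", 21))
```

The separated form yields the same two codepoints whatever width the
reader assumes. The packed form only does so when the reader's width
matches the writer's; reading the 32-bit example as 21-bit fields
gives a different sequence.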
>* Scheme Strings Meet Unicode
> /=== R6RS Recommendation:
> R6RS should strongly encourage implementations to make the
> expected-case complexity of STRING-REF and STRING-SET! O(1).
I'd fail this. My strings are O(log N) access where N is the length
of the string. All told, I'd say this is in fact a performance win
since it means I can do copy-on-write tricks with small substrings
(strands) of the string rather than copy the whole string every time
somebody wants to save both the original version and a slightly-
changed version of it (which happens a lot when people are editing
a multi-megabyte document and there's an undo stack).
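
A toy version of the idea, sketched in Python. This is not my actual
implementation; the class names, the strand size, and the function
names are all made up for the example. Strings are stored as a
balanced tree of short strands, indexing walks down the tree, and an
update copies only the nodes on the path to the affected strand.

```python
class Leaf:
    """One strand: a short run of text."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)

class Node:
    """Interior node caching the total length of its subtree."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

def build(text, strand=4):
    """Build a balanced tree whose leaves hold at most `strand` characters."""
    if len(text) <= strand:
        return Leaf(text)
    mid = len(text) // 2
    return Node(build(text[:mid], strand), build(text[mid:], strand))

def ref(tree, i):
    """Index by walking down the tree: one step per level."""
    while isinstance(tree, Node):
        if i < tree.left.length:
            tree = tree.left
        else:
            i -= tree.left.length
            tree = tree.right
    return tree.text[i]

def with_char(tree, i, ch):
    """Return a new tree with position i replaced, sharing untouched subtrees."""
    if isinstance(tree, Leaf):
        return Leaf(tree.text[:i] + ch + tree.text[i + 1:])
    if i < tree.left.length:
        return Node(with_char(tree.left, i, ch), tree.right)
    return Node(tree.left, with_char(tree.right, i - tree.left.length, ch))

def to_string(tree):
    if isinstance(tree, Leaf):
        return tree.text
    return to_string(tree.left) + to_string(tree.right)

old = build("hello, world")
new = with_char(old, 7, "W")
print(to_string(old), "/", to_string(new), "/", new.left is old.left)
```

Both versions stay usable after the update, and the half of the tree
that was not touched is shared rather than copied, which is what
makes an undo stack over a large document cheap.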
> Most of the possible answers to "what is a Scheme character" are
> consistent with the view that characters correspond to (possibly a
> subset of) Unicode codepoints.
> One of the possible answers to that question has the CHAR? type
> correspond to a _sequence_ of Unicode code points.
> /=== R6RS Recommendation:
> While R6RS should not require that CHAR? be a subset of Unicode,
> it should specify the semantics of string indexes for strings
> which _are_ subsets of Unicode.
> Specifically, if a Scheme string consists of nothing but Unicode
> codepoints (including substrings which form combining sequences),
> string indexes _must_ be Unicode codepoint offsets.
> That proposed modification to R6RS presents a (hopefully small)
> problem for Ray Dillinger. He would like (for quite plausible
> reasons) to have CHAR? values which correspond to a _sequence_ of
> Unicode codepoints. While I have some ideas about how to
> _partially_ reconcile his ideas with this proposal, I'd like to hear
> his thoughts on the matter.
Computing the codepoint-index on demand would require a traversal
of the string, an O(N) operation, using my current representation.
That's clearly intolerable. But in the same tree structure where I
now just keep character indexes, I can add additional fields for
codepoint indexes as well, making it an O(log N) operation. This
would add a constant factor to the processing times for my string
operations, since I'd have to update two different sets of indexes
instead of one on a write, but it's feasible and it would add
other useful capabilities which would manifest on the scheme side,
where I'd be introducing new functions based on codepoint indexes,
in addition to the existing functions based on character indexes.
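
Roughly, the extra bookkeeping looks like the following Python sketch.
Again, this is an illustration under assumptions, not my actual code:
the names IndexedLeaf, IndexedNode, n_chars, n_points and
codepoint_offset are invented, and a character is modelled as a short
string of one or more codepoints.

```python
class IndexedLeaf:
    """A strand of characters; each character is a str of >= 1 codepoints."""
    def __init__(self, chars):
        self.chars = list(chars)
        self.n_chars = len(self.chars)                    # character count
        self.n_points = sum(len(c) for c in self.chars)   # codepoint count

class IndexedNode:
    """Interior node caching both counts for its subtree."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.n_chars = left.n_chars + right.n_chars
        self.n_points = left.n_points + right.n_points

def build_indexed(chars, strand=2):
    """Build a balanced tree with at most `strand` characters per leaf."""
    if len(chars) <= strand:
        return IndexedLeaf(chars)
    mid = len(chars) // 2
    return IndexedNode(build_indexed(chars[:mid], strand),
                       build_indexed(chars[mid:], strand))

def codepoint_offset(tree, char_index):
    """Map a character index to the codepoint offset where that character starts."""
    offset = 0
    while isinstance(tree, IndexedNode):
        if char_index < tree.left.n_chars:
            tree = tree.left
        else:
            # Skip the whole left subtree using its cached counts.
            char_index -= tree.left.n_chars
            offset += tree.left.n_points
            tree = tree.right
    # Only the final strand is scanned, and strands are short.
    return offset + sum(len(c) for c in tree.chars[:char_index])

chars = ["a", "e\u0301", "b", "o\u0308\u0304", "c"]
tree = build_indexed(chars)
print([codepoint_offset(tree, i) for i in range(len(chars))])
```

The descent touches one node per level plus one short strand, so the
conversion stays logarithmic in the string length. The cost is that
every write has to refresh both cached counts along its path.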
However, I do strongly feel that these additional routines should
be just that - additional. They do not deal with "characters" per
se, but specifically with a single method of representing characters.
So they're in the same spirit as things that apply binary bitmasks
to floating-point constants or other such representation-dependent