Re: strings draft

    > From: bear <bear@xxxxxxxxx>

    > On Wed, 21 Jan 2004, Tom Lord wrote:

    > >  By somewhat reasonable expectation, there must be at least 256
    > >  distinct Scheme characters and INTEGER->CHAR must be defined for all
    > >  integers in the range `0..255'.  There are many circumstances in
    > >  which conversions between octets and characters are desirable and
    > >  the requirements of this expectation say that such conversion is
    > >  always possible.  It is quite possible to imagine implementations in
    > >  which this is not the case: in which, for example, a (fully general)
    > >  octet stream can not be read and written using READ-CHAR and DISPLAY
    > >  (applied to characters).  Such an implementation might introduce
    > >  non-standard procedures for reading and writing octets and
    > >  representing arrays of octets.  While such non-standard extensions
    > >  may be desirable for independent reasons, I see no good reason not
    > >  to define at least a subset of Scheme characters which is mapped to
    > >  the set of octet values.

    > I think that this is a problem.  We need a portable method of
    > reading/writing an arbitrary octet stream, full stop.  As
    > characters become more complicated than octets, the two concepts
    > must be divorced from each other; otherwise there will be endless
    > hair as this exception or that rears its ugly head.

    > So I'd propose READ-CHAR and DISPLAY which read or write "a
    > character" abstracting away issues of encoding, multibyte character
    > sets, endianness, etc. according to either application defaults or
    > port properties, and two new routines, READ-OCTET and WRITE-OCTET,
    > which read and write binary values exactly eight bits wide and take
    > or return exact integers in the 0..255 range.

    > In fact, READ-OCTET and WRITE-OCTET would in that case become primitive,
    > since READ-CHAR and DISPLAY could be implemented in terms of them but
    > the reverse would not be true.

    > This neatly sidesteps the issue of needing character mappings for
    > every member of the range 128-255, and separates the ideas of octet
    > and character at the lowest level.
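
The layering proposed in the quoted text above can be sketched as
follows.  This is only an illustration in Python, not Scheme: the
names read_octet, write_octet and read_char are stand-ins for the
proposed procedures, and the choice of UTF-8 as the port's encoding is
an assumption made purely for the example.

```python
import io

def read_octet(port):
    """Return the next octet as an integer in 0..255, or None at end of stream."""
    b = port.read(1)
    return b[0] if b else None

def write_octet(port, n):
    """Write one integer in 0..255 as a single octet."""
    if not 0 <= n <= 255:
        raise ValueError("octet out of range: %r" % (n,))
    port.write(bytes([n]))

def read_char(port):
    """Read one character using only read_octet (UTF-8 is assumed here)."""
    first = read_octet(port)
    if first is None:
        return None
    # The lead octet tells us how many continuation octets follow.
    if first < 0x80:
        extra = 0
    elif first >> 5 == 0b110:
        extra = 1
    elif first >> 4 == 0b1110:
        extra = 2
    elif first >> 3 == 0b11110:
        extra = 3
    else:
        raise ValueError("invalid UTF-8 lead octet")
    octets = [first]
    for _ in range(extra):
        nxt = read_octet(port)
        if nxt is None:
            raise ValueError("truncated UTF-8 sequence")
        octets.append(nxt)
    # decode() also rejects malformed continuation octets.
    return bytes(octets).decode("utf-8")
```

Here read_char is built entirely on read_octet, while nothing in the
character layer is needed to move raw octets around.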

Hmm.  Well, one example of what it fails to sidestep is the issue of
making the values representable by the C `char' type a subset of
CHAR?.  It's also a fairly sorry approach to take for implementing
many network protocols in a way that is simple, direct, and "tolerant
of what it receives".
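
For comparison, here is a rough sketch of the 256-characters
expectation itself: if the integer-to-character conversion is defined
for every integer in 0..255, any octet sequence round-trips through a
string.  Python's chr and ord stand in for INTEGER->CHAR and
CHAR->INTEGER; the helper names are made up for this example.

```python
def octets_to_string(octets):
    """Map each octet to the character with the same integer value."""
    return "".join(chr(n) for n in octets)

def string_to_octets(s):
    """Inverse mapping; fails on characters outside the 0..255 subset."""
    out = bytearray()
    for ch in s:
        n = ord(ch)
        if n > 255:
            raise ValueError("character has no octet value: %r" % (ch,))
        out.append(n)
    return bytes(out)
```

With this mapping a protocol implementation can read arbitrary octets
as characters and write them back unchanged, which is the "tolerant"
behaviour referred to above.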

In the name of intellectual honesty I have to admit that the various
places where the 256-chars requirement shows up in the recommendations
for R6RS and a portable FFI are pretty separable from everything else.
I could "lose this fight", so to speak, and it wouldn't have much
impact on the proposals.  But I will continue to argue that we
shouldn't go down that route.

    > >  R5RS requires a partial ordering of characters in which upper and
    > >  lower case variants of "the same character" are treated as equal.

    > >  Most problematically: R5RS requires that every alphabetic character
    > >  have both an upper and lower case variant.  This is a problem
    > >  because Unicode defines abstract characters which, at least
    > >  intuitively, are alphabetic -- but which lack such case mappings.

    > This problem goes away in the infinite-character-set universe.

I don't think so.  You may minimize it -- perhaps even reduce it to a
single character (at the cost of insisting on some canonicalization
forms rather than others, as I recall) -- but the problem doesn't
entirely go away.

As you admit:

    > If we restrict discussion to the characters that can appear in
    > canonical string representation (meaning no ligatures)

Which is fine for certain implementations but not for the standard.

    > Every
    > cased character in unicode, with the single exception of eszett,
    > has a lower-case and an upper-case; 

A single exception is an exception nonetheless.   I think that what we
should take from such exceptions is not their singularity (who knows
what tomorrow may bring) but rather the "in general" programmatic
structure that they require.
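
Both kinds of exception are easy to observe.  The sketch below uses
Python's built-in string methods, which apply the Unicode case
mappings; case_report is a hypothetical helper written only for this
illustration.

```python
def case_report(ch):
    """Summarise how a single character behaves under case conversion."""
    upper, lower = ch.upper(), ch.lower()
    return {
        "alphabetic": ch.isalpha(),
        "upper": upper,
        "lower": lower,
        # True when upcasing changes the number of codepoints.
        "length_changes": len(upper) != len(ch),
        # True when neither conversion changes the character at all.
        "caseless": upper == ch and lower == ch,
    }

eszett = case_report("\u00df")   # LATIN SMALL LETTER SHARP S
aleph = case_report("\u05d0")    # HEBREW LETTER ALEF
```

The eszett upcases to the two-character string "SS", and the aleph is
alphabetic but has no case variants at all, so any code that assumes a
one-to-one mapping between upper and lower case characters needs a
general fallback path.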

    > the catch is that the
    > uppercase and lowercase versions of it may require different
    > numbers of combining codepoints to represent.

Yeah, I think you ought to alpha-rename STRING-REF and STRING-SET! and
various other procedures in your code.  I think you want
COMBINING-SEQUENCE? not CHAR? for this infinite set you have in mind.

    > Eszett is in a class by itself, being a canonical lowercase character
    > and having an uppercase form which is, linguistically, a different
    > number of characters, as well as being a different number of codepoints.
    > I was initially driven to the multi-codepoint representation by the
    > attempt to solve this particular problem in reconciling the unicode
    > standard with R5RS, and I wound up with a 99.999% solution.

Hmm.  Well, of course it's not a very scientific statement -- but I
think my proposals lead to a 100% solution.   Neither of us works for
Jack Welch, as far as I know, so we aren't constrained by his 9-nines
compromise :-)

    > >    and in strings:
    > >
    > >		\U+XXXXX.
    > >
    > >    where XXXXX is an (arbitrary length) string of hexadecimal digits.

    > It's important to note the terminating '.' in the representation
    > for use in strings.  Otherwise there is an ambiguity introduced.

I wish I'd typed ";" instead of "." but -- yeah.
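
A small sketch of why the terminator matters, assuming the ";" form
mentioned above (the draft itself used ".").  The function name and
the regular expression are illustrative only and are not part of any
proposal.

```python
import re

# Backslash, "U+", one or more hex digits, then the terminating ";".
_ESCAPE = re.compile(r"\\U\+([0-9A-Fa-f]+);")

def expand_escapes(text):
    """Replace each \\U+XXXX; escape with the codepoint it names."""
    # chr() raises an exception for values above U+10FFFF.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Without the terminator, the text \U+41BC could mean either the single
codepoint U+41BC or U+41 followed by the letters "BC".  With it,
\U+41;BC and \U+41BC; are distinct and unambiguous.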

    > I would say that if the character set is not restricted to a known
    > width, I think it's handier with codepoint separators, especially
    > since most characters are in the 0..255 or 0..65535 range.

    > Instead of writing #\U+C32000000AF for a combining sequence

I don't think you get to do that.   I don't think combining sequences
can be CHAR?.  Sorry.   Useful data type -- yes.  CHAR? -- no.
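
To make the distinction concrete, here is a sketch that treats
combining sequences as a separate layer on top of a codepoint-indexed
string.  It is a deliberate simplification (a base codepoint plus any
following codepoints with a non-zero combining class), not the full
Unicode segmentation rules, and combining_sequences is a hypothetical
name.

```python
import unicodedata

def combining_sequences(s):
    """Group a codepoint string into base-plus-combining-mark sequences."""
    seqs = []
    for ch in s:
        if seqs and unicodedata.combining(ch) != 0:
            seqs[-1] += ch      # attach the mark to the preceding base
        else:
            seqs.append(ch)     # start a new sequence
    return seqs

s = "e\u0301a"                  # "e", COMBINING ACUTE ACCENT, "a"
by_codepoint = len(s)
by_sequence = len(combining_sequences(s))
```

The string keeps its codepoint indexes, and the sequence view is a
different data type derived from it rather than a redefinition of
what a character is.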

    > >* Scheme Strings Meet Unicode

    > >  /=== R6RS Recommendation:

    > >    R6RS should strongly encourage implementations to make the
    > >    expected-case complexity of STRING-REF and STRING-SET! O(1).

    > >  \========

    > I'd fail this. My strings are O(log N) access where N is the length
    > of the string. All told, I'd say this is in fact a performance win
    > since it means I can do copy-on-write tricks with small substrings
    > (strands) of the string rather than copy the whole string every time
    > somebody wants to save both the original version and a slightly-
    > changed version of it (which happens a lot when people are editing
    > a multi-megabyte document and there's an undo stack).

Really?  You'd fail?   On small strings you almost certainly wouldn't,
just because they're small.   If your trees are at all balanced, you
probably wouldn't fail on "medium" strings either.   And meanwhile, I
think you should look at some form of self-balancing tree -- then you
won't fail at all.  (Assuming, that is, that you don't move away from
the idea that "STRING? is a tree", as I think you probably should ---
the tree is something else, not STRING?.)
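
For readers who have not seen the technique, a minimal sketch of a
tree-of-strands text structure with copy-on-write updates follows.
It is written under my own assumptions (a fixed strand size, a tree
balanced once at build time, no rebalancing) and is not a description
of either implementation discussed here.

```python
class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

def build(text, strand=4):
    """Split text into strands of at most `strand` characters."""
    if len(text) <= strand:
        return Leaf(text)
    mid = len(text) // 2
    return Node(build(text[:mid], strand), build(text[mid:], strand))

def ref(rope, k):
    """Return the character at index k by walking down from the root."""
    if not 0 <= k < rope.length:
        raise IndexError(k)
    while isinstance(rope, Node):
        if k < rope.left.length:
            rope = rope.left
        else:
            k -= rope.left.length
            rope = rope.right
    return rope.text[k]

def set_char(rope, k, ch):
    """Return a new rope with index k replaced; untouched strands are shared."""
    if not 0 <= k < rope.length:
        raise IndexError(k)
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:k] + ch + rope.text[k + 1:])
    if k < rope.left.length:
        return Node(set_char(rope.left, k, ch), rope.right)
    return Node(rope.left, set_char(rope.right, k - rope.left.length, ch))
```

An update copies only the nodes on the path from the root to one
strand, so the old and new versions can both be kept cheaply.  Access
cost is proportional to the depth of the tree.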

(Totally off-topic aside: for various reasons, I think you want your
trees in C.   My libhackerlab is starting to implement this.   Maybe
we should collaborate.)

    > >  Most of the possible answers to "what is a Scheme character" are
    > >  consistent with the view that characters correspond to (possibly a
    > >  subset of) Unicode codepoints.

    > >  One of the possible answers to that question has the CHAR? type
    > >  correspond to a _sequence_ of Unicode code points.
    > >
    > >  /=== R6RS Recommendation:
    > >
    > >   While R6RS should not require that CHAR? be a subset of Unicode,
    > >   it should specify the semantics of string indexes for strings
    > >   which _are_ subsets of Unicode.
    > >
    > >   Specifically, if a Scheme string consists of nothing but Unicode
    > >   codepoints (including substrings which form combining sequences),
    > >   string indexes _must_ be Unicode codepoint offsets.
    > >
    > >  \========
    > >
    > >
    > >  That proposed modification to R6RS presents a (hopefully small)
    > >  problem for Ray Dillinger.   He would like (for quite plausible
    > >  reasons) to have CHAR? values which correspond to a _sequence_ of
    > >  Unicode codepoints.   While I have some ideas about how to
    > >  _partially_ reconcile his ideas with this proposal, I'd like to hear
    > >  his thoughts on the matter.

    > Computing the codepoint-index on demand would require a traversal
    > of the string, an O(N) operation, using my current representation.
    > That's clearly intolerable. But in the same tree structure where I
    > now just keep character indexes, I can add additional fields for
    > codepoint indexes as well, making it an O(log N) operation.  

And if you were to use self-balancing trees, it would be an
expected-case O(1) operation.
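
The bookkeeping described in the quoted text (codepoint counts kept
next to character counts in each tree node) might look roughly like
the sketch below.  Every name here is hypothetical, "character" means
a base codepoint plus its combining marks as in the quoted design, and
the tree is built once without rebalancing.

```python
import unicodedata

def split_sequences(s):
    """Group codepoints into base-plus-combining-mark sequences."""
    seqs = []
    for ch in s:
        if seqs and unicodedata.combining(ch) != 0:
            seqs[-1] += ch
        else:
            seqs.append(ch)
    return seqs

class IndexedLeaf:
    def __init__(self, seqs):
        self.seqs = seqs
        self.chars = len(seqs)                        # character count
        self.codepoints = sum(len(x) for x in seqs)   # codepoint count

class IndexedNode:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.chars = left.chars + right.chars
        self.codepoints = left.codepoints + right.codepoints

def build_indexed(seqs, strand=2):
    if len(seqs) <= strand:
        return IndexedLeaf(seqs)
    mid = len(seqs) // 2
    return IndexedNode(build_indexed(seqs[:mid], strand),
                       build_indexed(seqs[mid:], strand))

def codepoint_offset(tree, char_index):
    """Convert a character index to a codepoint offset in one descent."""
    if not 0 <= char_index < tree.chars:
        raise IndexError(char_index)
    offset = 0
    while isinstance(tree, IndexedNode):
        if char_index < tree.left.chars:
            tree = tree.left
        else:
            char_index -= tree.left.chars
            offset += tree.left.codepoints   # skip the whole left subtree
            tree = tree.right
    return offset + sum(len(x) for x in tree.seqs[:char_index])
```

The conversion touches one node per level plus one strand, so no
traversal of the whole string is needed.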

    > This would add a constant factor to the processing times for my
    > string operations, since I'd have to update two different sets
    > of indexes instead of one on a write, but it's feasible and it
    > would add other useful capabilities which would manifest on the
    > scheme side, where I'd be introducing new functions based on
    > codepoint indexes, in addition to the existing functions based
    > on character indexes.

    > However, I do strongly feel that these additional routines should
    > be just that - additional.  They do not deal with "characters" per
    > se, but specifically with a single method of representing
    > characters.

Only because you are in a mindset where you want CHAR? to be a
combining char sequence.  There are so many problems with permitting
that as a conformant Scheme that I think it has to be rejected.  You
need to pick a different type for what you currently call CHAR?.