
Re: strings draft

    > From: bear <bear@xxxxxxxxx>

    > >    > >   While R6RS should not require that CHAR? be a subset of Unicode,
    > >    > >   it should specify the semantics of string indexes for strings
    > >    > >   which _are_ subsets of Unicode.

    > I'll go along with that, and I have no trouble conforming.  In
    > that case the "large characters" may simply be regarded as lying
    > outside unicode.

An interesting interpretation, and one I hadn't intended to permit,
but I don't see any problem with it (for R6RS).  None at all.  Sweet,
sweet, sweet.

Maybe beyond R6RS we can duke it out later over some Unicode SRFIs :-)
(Actually, my hunch is there isn't much -- a little, perhaps -- to
disagree about there.)

    > >    > Computing the codepoint-index on demand would require a traversal
    > >    > of the string, an O(N) operation, using my current representation.
    > >    > That's clearly intolerable. But in the same tree structure where I
    > >    > now just keep character indexes, I can add additional fields for
    > >    > codepoint indexes as well, making it an O(log N) operation.

    > > And if you were to use self-balancing trees, it would be an
    > > expected-case O(1) operation.

    > ???  Balanced trees are still trees, and if the tree is exp(n) long
    > there are still O(n) links to follow from the root to a leaf.  Can
    > you explain what you mean?

I mean that you should (maybe -- I don't know enough about your apps
and environment) use splay trees, so that string algorithms displaying
a reasonable degree of locality will be able to locate indexes in
expected-case O(1).  Of course, if memory constraints and usage
patterns guarantee that your particular non-splay trees stay shallow
-- that's just as good.
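
Concretely -- and this is just a sketch, not your actual node layout,
which I don't know -- here is the augmented-tree idea in portable
Scheme (SRFI 9 records): every internal node caches the character and
codepoint totals of its left subtree, so translating a character
index to a codepoint index is one descent, O(log N) when the tree is
balanced:

    ;; Sketch only: internal nodes cache totals for their LEFT subtree;
    ;; leaves are flat chunks, represented here as just a vector giving
    ;; the number of codepoints occupied by each character in the chunk.
    (define-record-type node
      (make-node left right lchars lcodes)
      node?
      (left   node-left)
      (right  node-right)
      (lchars node-lchars)     ; characters in the left subtree
      (lcodes node-lcodes))    ; codepoints in the left subtree

    (define (char-index->codepoint-index tree i)
      (let loop ((t tree) (i i) (codes 0))
        (if (node? t)
            (if (< i (node-lchars t))
                (loop (node-left t) i codes)
                (loop (node-right t)
                      (- i (node-lchars t))
                      (+ codes (node-lcodes t))))
            ;; leaf: sum the codepoint counts of the first i characters
            (let scan ((k 0) (codes codes))
              (if (= k i)
                  codes
                  (scan (+ k 1) (+ codes (vector-ref t k))))))))

    ;; Example: characters a, e+combining-acute, b, in two chunks.
    (define tree
      (make-node (vector 1 2)   ; leaf: "a" (1 codepoint), e+acute (2)
                 (vector 1)     ; leaf: "b"
                 2              ; characters in left subtree
                 3))            ; codepoints in left subtree
    (char-index->codepoint-index tree 2)   ; => 3

Splaying each accessed node to the root is an orthogonal change to the
same structure, and it is what buys the expected-case O(1) for access
patterns with locality.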


    > > Only because you are in a mindset where you want CHAR? to be a
    > > combining char sequence.  There are so many problems with permitting
    > > that as a conformant Scheme that I think it has to be rejected.  You
    > > need to pick a different type for what you currently call CHAR?.

    > I want char? to be a character, and to never have to care, on the
    > scheme side of things, about how exactly it's represented, whether
    > it's in unicode as a single codepoint or several, or etc.  I don't
    > give a flying leap whether unicode regards it as a combining
    > sequence or not, and I don't want to *have* to give a flying leap
    > about whether it's represented as a combining sequence or not.  If
    > it functions linguistically as a character, I want to be able to
    > write scheme code that treats it as a character.  I don't want to
    > ever have to worry or think about the representation, until I'm doing
    > I/O or passing a value across an FFI and the representation becomes
    > important.

I think that the "In bear's Scheme, a string containing a
(non-unitary) combining sequence simply doesn't count as a string of
Unicode characters" interpretation eliminates all the conflicts I was
worried about between my proposal and your implementation.  Perfect.
Excellent.  Sweet, sweet, sweet.
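
To make the two readings concrete, a hypothetical example (the escape
syntax is R6RS-style, and the second result is my understanding of
your model, not output from any implementation I've run):

    (define s "e\x0301;")   ; e followed by COMBINING ACUTE ACCENT

    ;; In a codepoint-based Scheme, each CHAR? is a Unicode codepoint:
    (string-length s)       ; => 2
    (string-ref s 1)        ; => #\x0301, the combining mark alone

    ;; In bear's Scheme, the combining sequence is one linguistic
    ;; character, so:
    (string-length s)       ; => 1
    ;; ...and s simply isn't "a string of Unicode characters" in the
    ;; draft's sense, which is exactly the out described above.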


-t