Re: strings draft

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

    > From: bear <bear@xxxxxxxxx>

    > >    > >   While R6RS should not require that CHAR? be a subset of Unicode,
    > >    > >   it should specify the semantics of string indexes for strings
    > >    > >   which _are_ subsets of Unicode.

    > I'll go along with that, and I have no trouble conforming.  In
    > that case the "large characters" may simply be regarded as lying
    > outside unicode.

That's an interesting interpretation, one I hadn't intended to permit,
but I don't see any problem with it (for R6RS).  None at all.  Sweet,
sweet, sweet.

Maybe beyond R6RS we can duke it out later over some Unicode SRFIs :-)
(Actually, my hunch is there isn't even _much_ (a little, perhaps) to
disagree about there.)

    > >    > Computing the codepoint-index on demand would require a traversal
    > >    > of the string, an O(N) operation, using my current representation.
    > >    > That's clearly intolerable. But in the same tree structure where I
    > >    > now just keep character indexes, I can add additional fields for
    > >    > codepoint indexes as well, making it an O(log N) operation.

    > > And if you were to use self-balancing trees, it would be an
    > > expected-case O(1) operation.

    > ???  Balanced trees are still trees, and if the tree is exp(n) long
    > there are still O(n) links to follow from the root to a leaf.  Can
    > you explain what you mean?

I mean that you should (maybe; I don't know enough about your apps and
environment) use splay trees, so that string algorithms displaying a
reasonable degree of locality will be able to locate indexes in
expected-case O(1).  Of course, if memory constraints and usage
guarantee that your particular non-splay trees are shallow, that's
just as good.
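
To make the indexing scheme under discussion concrete, here is a
minimal sketch in Python of a tree whose nodes carry both a character
count and a codepoint count, so that a character index can be mapped
to a codepoint index by walking one root-to-leaf path.  Everything in
it is an assumption made for illustration: the names (Node, build,
codepoint_index), the model of a character as a tuple of codepoints,
and the chunk size.  It is not bear's actual representation, and it
uses a statically balanced tree rather than a splay tree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical model: one "character" is a tuple of one or more codepoints.
Char = Tuple[int, ...]


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    chunk: Optional[List[Char]] = None  # only leaves hold a chunk
    n_chars: int = 0                    # characters in this subtree
    n_codepoints: int = 0               # codepoints in this subtree


def leaf(chunk: List[Char]) -> Node:
    return Node(chunk=chunk, n_chars=len(chunk),
                n_codepoints=sum(len(c) for c in chunk))


def join(left: Node, right: Node) -> Node:
    # An interior node caches the totals of both children.
    return Node(left=left, right=right,
                n_chars=left.n_chars + right.n_chars,
                n_codepoints=left.n_codepoints + right.n_codepoints)


def build(chars: List[Char], chunk_size: int = 4) -> Node:
    """Build a balanced tree by repeatedly halving the character list."""
    if len(chars) <= chunk_size:
        return leaf(list(chars))
    mid = len(chars) // 2
    return join(build(chars[:mid], chunk_size),
                build(chars[mid:], chunk_size))


def codepoint_index(node: Node, char_index: int) -> int:
    """Return the codepoint offset at which character char_index starts.

    Follows a single root-to-leaf path, then scans at most one chunk.
    """
    if not 0 <= char_index <= node.n_chars:
        raise IndexError(char_index)
    offset = 0
    while node.chunk is None:
        if char_index < node.left.n_chars:
            node = node.left
        else:
            # Skip the whole left subtree using its cached totals.
            char_index -= node.left.n_chars
            offset += node.left.n_codepoints
            node = node.right
    return offset + sum(len(c) for c in node.chunk[:char_index])


# "a", "e" + combining acute, "b", "o" + diaeresis + macron, "c", "d"
chars = [(0x61,), (0x65, 0x301), (0x62,),
         (0x6F, 0x308, 0x304), (0x63,), (0x64,)]
tree = build(chars, chunk_size=2)
starts = [codepoint_index(tree, i) for i in range(len(chars) + 1)]
```

With a bounded chunk size, the lookup cost is dominated by the depth
of the tree, which is where the choice between a balanced tree and a
splay tree comes in.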

    > > Only because you are in a mindset where you want CHAR? to be a
    > > combining char sequence.  There are so many problems with permitting
    > > that as a conformant Scheme that I think it has to be rejected.  You
    > > need to pick a different type for what you currently call CHAR?.

    > I want char? to be a character, and to never have to care, on the
    > scheme side of things, about how exactly it's represented, whether
    > it's in unicode as a single codepoint or several, or etc.  I don't
    > give a flying leap whether unicode regards it as a combining
    > sequence or not, and I don't want to *have* to give a flying leap
    > about whether it's represented as a combining sequence or not.  If
    > it functions linguistically as a character, I want to be able to
    > write scheme code that treats it as a character.  I don't want to
    > ever have to worry or think about the representation, until I'm doing
    > I/O or passing a value across an FFI and the representation becomes
    > important.

I think that the "In bear's Scheme, a string containing a
(non-unitary) combining sequence simply doesn't count as a string of
Unicode characters" interpretation eliminates all the conflicts I was
worried about between my proposal and your implementation.  Perfect.
Excellent.  Sweet, sweet, sweet.