Re: strings draft
> From: bear <bear@xxxxxxxxx>
> > > > While R6RS should not require that CHAR? be a subset of Unicode,
> > > > it should specify the semantics of string indexes for strings
> > > > which _are_ subsets of Unicode.
> I'll go along with that, and I have no trouble conforming. In
> that case the "large characters" may simply be regarded as lying
> outside unicode.
An interesting interpretation that I hadn't intended to permit but I
don't see any problem with it (for R6RS). None at all. Sweet,
sweet, sweet.
Maybe beyond R6RS we can duke it out later over some Unicode SRFIs :-)
(Actually, my hunch is there isn't even _much_ (a little, perhaps) to
disagree about there.)
> > > Computing the codepoint-index on demand would require a traversal
> > > of the string, an O(N) operation, using my current representation.
> > > That's clearly intolerable. But in the same tree structure where I
> > > now just keep character indexes, I can add additional fields for
> > > codepoint indexes as well, making it an O(log N) operation.
> > And if you were to use self-balancing trees, it would be an
> > expected-case O(1) operation.
> ??? Balanced trees are still trees, and if the tree is exp(n) long
> there are still O(n) links to follow from the root to a leaf. Can
> you explain what you mean?
I mean that you should (maybe, don't know enough about your apps and
environment) use splay trees so that string algorithms displaying a
reasonable degree of locality will be able to locate indexes in
expected-case O(1). Of course, if memory constraints and usage
guarantee that your particular non-splay trees are always
shallow -- that's just as good.
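
To make the O(log N) part concrete, here is a rough sketch of the kind of
tree being discussed, with a codepoint count kept alongside the character
count in every node. It is only an illustration under my own assumptions,
not bear's actual representation: the names Node, build and codepoint_index
are made up, a "character" is modelled as a tuple of codepoints, and the
tree is a plain balanced tree built once, not a splay tree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption for this sketch: a "character" is a tuple of codepoints
# (a base codepoint plus any combining marks).
Char = Tuple[int, ...]


@dataclass
class Node:
    chars: int                          # characters in this subtree
    codepoints: int                     # codepoints in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    leaf: Optional[List[Char]] = None   # set only on leaf nodes


def build(chars: List[Char], leaf_size: int = 4) -> Node:
    """Build a balanced tree whose nodes carry both counts."""
    if len(chars) <= leaf_size:
        return Node(len(chars), sum(len(c) for c in chars), leaf=list(chars))
    mid = len(chars) // 2
    l = build(chars[:mid], leaf_size)
    r = build(chars[mid:], leaf_size)
    return Node(l.chars + r.chars, l.codepoints + r.codepoints, l, r)


def codepoint_index(root: Node, char_index: int) -> int:
    """Return the codepoint offset at which character char_index starts."""
    if not 0 <= char_index <= root.chars:
        raise IndexError(char_index)
    node, offset = root, 0
    # Descend from the root; each step either goes left or skips the
    # whole left subtree, adding its codepoint count to the offset.
    while node.leaf is None:
        if char_index < node.left.chars:
            node = node.left
        else:
            char_index -= node.left.chars
            offset += node.left.codepoints
            node = node.right
    # Inside a leaf, only a bounded number of characters are scanned.
    return offset + sum(len(c) for c in node.leaf[:char_index])
```

The descent follows one link per level, so the cost is proportional to the
depth of the tree plus the leaf size. A splay tree would keep the same two
counts per node and additionally move recently used nodes toward the root,
which is where the locality argument comes in.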
> > Only because you are in a mindset where you want CHAR? to be a
> > combining char sequence. There are so many problems with permitting
> > that as a conformant Scheme that I think it has to be rejected. You
> > need to pick a different type for what you currently call CHAR?.
> I want char? to be a character, and to never have to care, on the
> scheme side of things, about how exactly it's represented, whether
> it's in unicode as a single codepoint or several, or etc. I don't
> give a flying leap whether unicode regards it as a combining
> sequence or not, and I don't want to *have* to give a flying leap
> about whether it's represented as a combining sequence or not. If
> it functions linguistically as a character, I want to be able to
> write scheme code that treats it as a character. I don't want to
> ever have to worry or think about the representation, until I'm doing
> I/O or passing a value across an FFI and the representation becomes
> important.
I think that the "In bear's Scheme, a string containing a
(non-unitary) combining sequence simply doesn't count as a string of
Unicode characters" interpretation eliminates all the conflicts I was
worried about between my proposal and your implementation. Perfect.
Excellent. Sweet, sweet, sweet.
-t