[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: strings draft

On Thu, 22 Jan 2004, Tom Lord wrote:

>As you admit:
>    > If we restrict discussion to the characters that can appear in
>    > canonical string representation (meaning no ligatures)
>Which is fine for certain implementations but not for the standard.
>    > Every
>    > cased character in unicode, with the single exception of eszett,
>    > has a lower-case and an upper-case;
>A single exception is an exception nonetheless.   I think that what we
>should take from such exceptions is not their singularity (who knows
>what tomorrow may bring) but rather the "in general" programmatic
>structure that they require.

True, and this single exception means that I will need, like every
other implementor, to caution users that case changing operations
may change the length of a string and the indexes of codepoints after
the problematic character.  However, I can make the actual occurrence
of this relevant only to german-using audiences, for whom it will be
well known and well understood.

>Yeah, I think you ought to alpha-rename STRING-REF and STRING-SET! and
>various other procedures in your code.  I think you want
>COMBINING-SEQUENCE? not CHAR? for this infinite set you have in mind.

I'm pretty sure I don't.  I'm talking about characters, not codepoints.
Codepoints are representation; characters are ideas.  Some characters
take more than one codepoint to represent.

>    > I would say that if the character set is not restricted to a known
>    > width, I think it's handier with codepoint separators, especially
>    > since most characters are in the 0...255 or 0..65535 range.
>    > Instead of writing #\U+C32000000AF for a combining sequence
>I don't think you get to do that.   I don't think combining sequences
>can be CHAR?.  Sorry.   Useful data type -- yes.  CHAR? -- no.

Eh, okay.  There's no real difference in what can be expressed.  The
U+ notation strongly suggests unicode, but I can compatibly use it
for a character set whose first 2^21 mappings are the same mappings as

>    > >  /=== R6RS Recommendation:
>    > >    R6RS should strongly encourage implementations to make the
>    > >    expected-case complexity of STRING-REF and STRING-SET! O(1).
>    > >  \========
>    > I'd fail this. My strings are O(Log N) access where N is the length
>    > of the string.

>Really?  You'd fail?   On small strings you almost certainly wouldn't,

strings up to ~1000 general characters, or 4080 latin-1 chars - O(1).
   These are a single vector or 'strand'.
strings up to ~3000 general characters, or 12240 latin-1 chars - O(2).
    Traverse root node, operate in one of three child node strands.
strings up to ~9000 general characters, or 36720 latin-1 chars - O(3).
    root node, branch node, strand.
strings up to ~27000 general characters, or 110160 latin-1 chars - 0(4).
    root node, branch node, branch node, strand.

and so on.  It's a 2/3 B-tree of "strands" where each strand is
a 4 kbyte vector.  4 32-bit values per strand are used for overhead.
If the characters are in the latin-1 range, a strand can contain up
to 4080 characters. But any strand which contains any characters outside
latin-1 is represented using a 32-bit representation, and potentially
several consecutive 32-bit values within such a strand may belong to the
same character.

The loss is that referring to an indexed character in a string is
O(n).  The win is that inserting or deleting characters is also
O(n) and I can avoid spending a lot of storage on copies and a lot
of CPU on making copies since different strings can share strands.

>just because they're small.   If your trees are at all balanced, you
>probably wouldn't fail on "medium" strings either.

It's a self-balancing tree structure, but the length of the path
from root to strand is still proportional to the logarithm base 3
of the string length. This is a law of the universe with trees.

>think you should look at some form of self-balancing tree -- then you
>won't fail at all.  (Assuming, that is, that you don't come away from
>"STRING? is a tree" as I think you probably should --- the tree is
>something else, not STRING?.)

A string is a sequence of characters.  I looked at strings and
the operations I'm doing on them in language processing.  I
insert, delete, prepend, append, and copy strings a lot.  When
insertion or deletion requires copying the rest of the string
to shift it forward or back one or two places, and every string
has to have its own storage for data 99% of which is copied
wholesale from other strings, and you have to do all that copying,
it's intolerably slow.  The balanced-tree representation with
potentially shared leaves supports those operations better than
the vector representation, _especially_ as strings grow larger
than a few kilobytes. I'm working with text corpora, so the
average string length is large - anything from a ten-kilobyte
wall street journal article to a two-megabyte novel.

I'll cheerfully create routines that copy an entire string into
a single vector of a specific representation for an FFI to
play with, and I'll create routines that copy them back into
trees when the FFI says it's done playing, but I won't go back
to using vectors internally and winding up waiting while
megabytes of data get copied back and forth to move something
up or down a few character spaces every time one word is
inserted or deleted near the front.

>    > >   While R6RS should not require that CHAR? be a subset of Unicode,
>    > >   it should specify the semantics of string indexes for strings
>    > >   which _are_ subsets of Unicode.

I'll go along with that, and I have no trouble conforming.  In
that case the "large characters" may simply be regarded as lying
outside unicode.

>    > Computing the codepoint-index on demand would require a traversal
>    > of the string, an O(N) operation, using my current representation.
>    > That's clearly intolerable. But in the same tree structure where I
>    > now just keep character indexes, I can add additional fields for
>    > codepoint indexes as well, making it an O(log N) operation.
> And if you were to use self-balancing trees, it would be an
> expected-case O(1) operation.

???  Balanced trees are still trees, and if the tree is exp(n) long
there are still O(n) links to follow from the root to a leaf.  Can
you explain what you mean?

> Only because you are in a mindset where you want CHAR? to be a
> combining char sequence.  There are so many problems with permitting
> that as a conformant Scheme that I think it has to be rejected.  You
> need to pick a different type for what you currently call CHAR?.

I want char? to be a character, and to never have to care, on the
scheme side of things, about how exactly it's represented, whether
it's in unicode as a single codepoint or several, or etc.  I don't
give a flying leap whether unicode regards it as a combining
sequence or not, and I don't want to *have* to give a flying leap
about whether it's represented as a combining sequence or not.  If
it functions linguistically as a character, I want to be able to
write scheme code that treats it as a character.  I don't want to
ever have to worry or think about the representation, until I'm doing
I/O or passing a value across an FFI and the representation becomes