
Re: strings draft



Thanks for the detailed reply.  Now I'm getting the point.

 * An implementation is free to have non-Unicode-compatible
   char/string, as long as it meets the minimum requirement,
   which is not much more than current R5RS with some
   clarification (case mapping issues aside).

 * _If_ an implementation can also have a subset of Unicode-
   compatible char/string, this subset of char/string should
   follow the codepoint-index.  The index handling of the rest
   of char/string is up to the implementation.

Did I get it right?

So, when the EUCJP Scheme reads a string

 "\U+30AB.\U+309A."

Then it can produce a string which consists of a single character,
EUCJP #xA5F7.  That is outside the scope of your document,
so the implementation is free to behave as follows:

 (define x "\U+30AB.\U+309A.")
 (string-length x) => 1
 (string-ref x 0)  => <character EUCJP #xA5F7>
 (let ((y (string-copy x)))
   (string-set! y 0 #\a)
   y) => "a"

If so, I have no problem adopting the "codepoint index" proposal.


[About O(1) property]

From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Fri, 23 Jan 2004 16:45:16 -0800 (PST)

>     > No.  String search, regexp match, or precalculated prefix/suffix
>     > database, all can return some sort of reference that directly
>     > points into the string, so that the subsequent use of such 
>     > reference wouldn't need to count characters.
>     > (The implementation that shares substrings and uses copy-on-write
>     > for string mutation, those basic operations even can efficiently
>     > return substring directly.)
> 
> Well, I don't think it's that simple.
> 
> It would be hard to implement those "string reference objects" to
> preserve the O(1) property in the face of STRING-SET! given a flat, 
> variable-width, string representation.
>
> And if you have a tree representation or something like what I
> described for Pika -- then you don't need those "string reference
> objects" after all.   They might be nice for independent reasons -- but
> you won't need them to get O(1) string-ops.

I'm not sure we're talking about the same issue.
Probably I mixed up two issues.

 * For STRING-REF and SUBSTRING, a string pointer object
   allows O(1) access to known locations of a string, even in a
   variable-width character string representation.  And we hardly
   lose anything on an "array of characters" implementation, since
   such an implementation can just use an integer index as a string
   pointer object---it doesn't need to be a disjoint object at all.

   What it loses is the ability to extract a character/string by
   index without prior knowledge of the target string.  What I'm
   saying is that this is not a common case (maybe only when you're
   parsing fixed-column syntax?).  But I might be missing something,
   and I'd appreciate a concrete example.

 * For STRING-SET!, an implementation that does copy-on-write of the
   whole string can't have the O(1) property, regardless of whether
   it uses an "array of characters", variable-width characters, a rope
   or another tree representation (you can get close though, if you
   use a tree and only share the leaves, for example).  And I argue
   that replacing exactly one character at a specific location within
   a string is not a common case---it is rather a special case of
   generic string replacement such as srfi-13's string-xcopy!.  There
   may be applications that use such "one character replacement"
   heavily, but I don't think they are common enough that O(1)
   STRING-SET! should be a "strong recommendation".  Again, I may be
   missing something, though.
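
To make the first point concrete, here is a minimal sketch of what I
mean by a string pointer.  Everything in it is an assumption for
illustration only: the string is modelled as a vector of UTF-8 octets,
the "pointer" is simply a byte offset, the names sp-ref, sp-next and
sp-search are hypothetical, and the input is assumed to be well-formed.
A real implementation would use its own internal representation.

```scheme
;; Hypothetical sketch: a UTF-8 string held as a vector of octets.
;; A "string pointer" is just a byte offset into that vector.
;; Assumes well-formed UTF-8 and pointers that sit on a lead octet.

;; Number of octets in the sequence whose lead octet is B.
(define (utf8-width b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; Decode the code point at byte offset PTR.  The cost depends only on
;; the width of that one character, not on where PTR is in the string.
(define (sp-ref octets ptr)
  (let* ((b (vector-ref octets ptr))
         (w (utf8-width b)))
    (let loop ((i 1)
               (cp (case w
                     ((1) b)
                     ((2) (- b #xC0))
                     ((3) (- b #xE0))
                     (else (- b #xF0)))))
      (if (= i w)
          cp
          (loop (+ i 1)
                (+ (* cp 64)
                   (- (vector-ref octets (+ ptr i)) #x80)))))))

;; Advance a pointer to the start of the next character.
(define (sp-next octets ptr)
  (+ ptr (utf8-width (vector-ref octets ptr))))

;; Linear search returning a pointer to the first occurrence of the
;; code point CP, or #f if it does not occur.
(define (sp-search octets cp)
  (let loop ((ptr 0))
    (cond ((>= ptr (vector-length octets)) #f)
          ((= (sp-ref octets ptr) cp) ptr)
          (else (loop (sp-next octets ptr))))))

;; "a", U+30AB, "b" encoded in UTF-8.
(define s (vector #x61 #xE3 #x82 #xAB #x62))
(define p (sp-search s #x62))  ; the scan happens once, here
(define c (sp-ref s p))        ; later access does no counting
```

The search pays the linear cost once; any later use of p goes straight
to the character.  On an "array of characters" implementation the same
interface works with p being an ordinary integer index.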

You mentioned that you came to the O(1) recommendation through your
experience.  If it's not too much trouble, I'd like to hear about the
concrete experience that made you think so.

>     > It's OK to have STRING-REF as well---after all, we have LIST-REF
>     > and nobody complains its O(N) complexity.
> 
> In some sense, I think that the strong recommendation for O(1)
> string-ops is already present in the spec.   Were it not, why wouldn't
> the string syntax be a fancy way to write lists and STRING? and LIST?
> not disjoint?

The same argument could be made to ask why the string syntax isn't
a fancy way to write vectors, with STRING? and VECTOR? not disjoint.

I don't know what the rrrs authors thought when they decided to have
a disjoint string type.  Some old discussion, such as:
http://www.swiss.ai.mit.edu/ftpdir/scheme-mail/HTML/rrrs-1985/msg00002.html
suggests that they viewed a string as an array of characters.
But at least such a view isn't explicitly stated in R5RS, and I
consider that fortunate.


[About character-set independence]

>     > What I felt ambiguous is the degree of "character-set independence"
>     > you're aiming at.   If we'd like to have a character-set independent
>     > language spec,  we need to be much more careful to separate
>     > Unicode-specific issues and character-set independent issues.
> 
> Hey, I'm partisan but fair, I think.
> 
> My recommendations suggest _requirements_ for the portable character
> set.  Those aren't Unicode specific.  My recommendations suggest
> _requirements_for_implementations_providing_optional_features_: and
> some of those are indeed Unicode specific.  

As long as it is clear that portable Scheme code can't rely on those
features, I'm fine with it.

>     > > How would you remove that restriction in a way that supports writing
>     > > portable FFI-using code?
> 
>     > What I'm picking there is the word "must". 
>     > scm_extract_string8 can put answer in eucjp packed format into
>     > t_uchar* array if the implementation supports that, so I don't
>     > see why this restriction is needed.
> 
> I would not object to an addition to the portable FFI which is
> 
> 	scm_extract_string_opaque
> 	scm_enter_string_opaque
> 
> that returns/accepts the data from a string, plus its length, but says
> nothing about how the data is encoded.  Its purpose would be to
> extract that data in the "most convenient form" for a given
> implementation.   Would that do?

I don't object to scm_{extract|enter}_string_opaque, but I
still fail to see why scm_{extract|enter}_string8 shouldn't
handle both.

>     > Of course using such encoding wouldn't be portable.  But so
>     > as iso8859_1 implementation is asked to convert the string
>     > into iso8859_2.
> 
> I don't see why it wouldn't be portable.   I was thinking it would be
> helpful to have a "libscheme-ffi-helpers.a" with the necessary tables.

Because iso8859-2 doesn't have INVERTED EXCLAMATION MARK (iso8859-1
#xA1), for example.  The implementation can return an error and that's
fine, but then, why not eucjp?

Alternatively, you can specify "when an iso8859-1 implementation is
asked to extract the string in iso8859-2, then it can map iso8859-1
characters that have no corresponding iso8859-2 character
to the iso8859-2 characters with the same codepoint".
Although it's a hack, it's what CCS/CES-unaware software does
all the time.  But then again, there's no reason that an iso8859-1
implementation can't extract a string as jisx0201, with a rule
similar to the one described above.
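
The two policies can be sketched as follows.  This is only an
illustration under stated assumptions: latin1->latin2 and its policy
argument are hypothetical names, not part of any proposed FFI, and the
table lists just a few of the codes in #xA0..#xFF that denote the same
character in both sets, so it is deliberately incomplete.

```scheme
;; Partial, illustrative table: some iso8859-1 codes at or above #xA0
;; that denote the same character at the same code in iso8859-2.
(define latin1-codes-shared-with-latin2
  '(#xA0 #xA4 #xA7 #xA8 #xAD #xB0 #xB4 #xB8))

;; Map an iso8859-1 code to an iso8859-2 code.
;; POLICY is 'same-codepoint (the hack: pass the octet through as-is)
;; or anything else (strict: return #f so the caller can signal an
;; error).
(define (latin1->latin2 code policy)
  (cond ((< code #xA0) code)  ; ASCII and C1 controls are common
        ((memv code latin1-codes-shared-with-latin2) code)
        ((eq? policy 'same-codepoint) code)
        (else #f)))

(define strict-result (latin1->latin2 #xA1 'strict))          ; #f
(define hacked-result (latin1->latin2 #xA1 'same-codepoint))  ; #xA1
```

With the same-codepoint policy, INVERTED EXCLAMATION MARK comes out as
octet #xA1, which an iso8859-2 reader will take to be a different
character (A WITH OGONEK).  Nothing in either policy is specific to
the iso8859 family.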

> Neither the 0..256 mapping nor the O(1) access time are _required_
> in the proposed Scheme changes.
[...]
> Requiring the 0..256 mapping in the FFI means just that `char' can
> always be converted to CHAR? and back again.   Is that really so
> onerous?

A C 'char' carries no encoding information; it is merely an
integer with a limited range.  If we want Scheme characters
to be defined more strictly, the programmer should be more
conscious of the distinction between octets and characters.

It wasn't a requirement in the proposal, but "explicitly and
strongly encouraging" it will in fact encourage the bad practice
of treating an octet and a character as the same thing.  I'm afraid
that it encourages people to write code that uses strings as a
buffer for an octet stream, for example.

--shiro