
Re: strings draft

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

    > From: Shiro Kawai <shiro@xxxxxxxx>

    > No.  String search, regexp match, or precalculated prefix/suffix
    > database, all can return some sort of reference that directly
    > points into the string, so that the subsequent use of such 
    > reference wouldn't need to count characters.
    > (In an implementation that shares substrings and uses copy-on-write
    > for string mutation, those basic operations can even efficiently
    > return a substring directly.)

Well, I don't think it's that simple.

It would be hard to implement those "string reference objects" to
preserve the O(1) property in the face of STRING-SET! given a flat, 
variable-width, string representation.
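
To make the difficulty concrete, here is a minimal sketch in C, assuming
a flat UTF-8 buffer and treating a "string reference object" as nothing
more than a saved byte offset. The names utf8_width, utf8_offset,
flat_string_set and reference_goes_stale are hypothetical, invented for
this illustration; they are not part of Pika, Gauche, or the SRFI-50
draft.

```c
#include <stddef.h>
#include <string.h>

/* Bytes in the UTF-8 sequence whose lead byte is b (input assumed valid). */
static size_t utf8_width(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Byte offset of character number idx: an O(n) scan from the start. */
static size_t utf8_offset(const unsigned char *s, size_t idx)
{
    size_t off = 0;
    while (idx-- > 0)
        off += utf8_width(s[off]);
    return off;
}

/* Stand-in for STRING-SET!: overwrite character idx with the rlen-byte
   sequence repl, shifting the tail when the width changes.  The caller
   guarantees the buffer is large enough.  Returns the new byte length. */
static size_t flat_string_set(unsigned char *s, size_t len, size_t idx,
                              const unsigned char *repl, size_t rlen)
{
    size_t off = utf8_offset(s, idx);
    size_t old = utf8_width(s[off]);
    memmove(s + off + rlen, s + off + old, len - off - old);
    memcpy(s + off, repl, rlen);
    return len - old + rlen;
}

/* Returns 1 if a byte-offset reference taken before the mutation no
   longer points at the character it was taken for. */
static int reference_goes_stale(void)
{
    static const unsigned char e_acute[] = { 0xC3, 0xA9 };  /* U+00E9 */
    unsigned char buf[16] = "abc";
    size_t len = 3;
    size_t ref = utf8_offset(buf, 2);      /* "reference" to 'c' */

    len = flat_string_set(buf, len, 0, e_acute, sizeof e_acute);
    (void)len;

    /* 'c' has moved one byte to the right; ref still names the old slot. */
    return buf[ref] != 'c' && buf[utf8_offset(buf, 2)] == 'c';
}
```

Keeping such a reference valid would mean either tracking every live
reference so that STRING-SET! can patch it, or recomputing the offset
with a fresh scan, and the second option gives up the O(1) property.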

And if you have a tree representation or something like what I
described for Pika -- then you don't need those "string reference
objects" after all.   They might be nice for independent reasons -- but
you won't need them to get O(1) string-ops.

    > And to implement search, regexp, or prefix/suffix arrays, the
    > access of string is mostly sequential, or requires "random hopping"
    > in a small amount.  Sequential access can be efficiently implemented,
    > using string ports, for example, than using integer index.

    > It's OK to have STRING-REF as well---after all, we have LIST-REF
    > and nobody complains its O(N) complexity.

In some sense, I think that the strong recommendation for O(1)
string-ops is already present in the spec.   Were it not, why wouldn't
the string syntax be a fancy way to write lists and STRING? and LIST?
not disjoint?

    > [About character-set independence]

    > What I felt ambiguous is the degree of "character-set independence"
    > you're aiming at.   If we'd like to have a character-set independent
    > language spec,  we need to be much more careful to separate
    > Unicode-specific issues and character-set independent issues.

Hey, I'm partisan but fair, I think.

My recommendations suggest _requirements_ for the portable character
set.  Those aren't Unicode specific.  My recommendations suggest
_requirements_for_implementations_providing_optional_features_: and
some of those are indeed Unicode specific.  Perhaps not here on the
SRFI-50 list but I am willing to argue that, for Scheme, Unicode
deserves preferential treatment.

    > It would be nice that Scheme language spec allows a local
    > implementation that uses different CCS/CES.

I think that my recommendations are consistent with that.  (EBCDIC
excepted :-)

    > Using Unicode codepoints as the portable means of hex notation
    > (#\U+XXXX) is ok.

It should be fine for any implementation.   Implementations will only
be required to parse that syntax for the intersection of the
abstract characters they provide and the abstract characters of
Unicode.   Such a translation table can be compressed to a few tens of
kilobytes, I think, for any of the character sets you are thinking of.

    > The integer indexing is a different issue.  EUCJP #xA5F7
    > character is mapped to two subsequent unicode codepoints,
    > U+30AB and U+309A.   On the other hand, U+30AB itself is
    > mapped to EUCJP #xA5AB, and U+309A doesn't have corresponding
    > character in EUCJP.
    > If STRING-REF has to be unicode codepoint index, I don't see
    > how it should work.

It only has to be a codepoint index for those strings which the
implementation agrees consist entirely of Unicode characters.   As
with bear's munging of combining sequences into (non-Unicode) CHAR?
values, you can declare that EUCJP #xA5F7 is not a Unicode character.
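
As a rough illustration, a codepoint-to-local-code table for an EUCJP
implementation might be sketched in C as below. The only non-ASCII row
is the U+30AB / #xA5AB pair quoted above; the two ASCII rows assume that
EUCJP encodes ASCII characters as themselves. The names ucs_to_local,
NO_LOCAL_CHAR and local_from_ucs are hypothetical and come from no
existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Sentinel meaning "this codepoint has no single local character". */
#define NO_LOCAL_CHAR 0xFFFFFFFFu

struct ucs_to_local {
    uint32_t ucs;     /* Unicode codepoint */
    uint32_t local;   /* code in the implementation's own character set */
};

/* A tiny excerpt, sorted by codepoint.  A real table would carry one
   row per character in the intersection of the two character sets. */
static const struct ucs_to_local table[] = {
    { 0x0041, 0x0041 },   /* LATIN CAPITAL LETTER A */
    { 0x007A, 0x007A },   /* LATIN SMALL LETTER Z */
    { 0x30AB, 0xA5AB },   /* KATAKANA LETTER KA */
};

/* Binary search, as a reader of #\U+XXXX syntax might use it. */
static uint32_t local_from_ucs(uint32_t ucs)
{
    size_t lo = 0;
    size_t hi = sizeof table / sizeof table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].ucs == ucs)
            return table[mid].local;
        if (table[mid].ucs < ucs)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NO_LOCAL_CHAR;
}
```

Under this reading, U+309A simply has no row, so the reader may reject
#\U+309A, and #xA5F7 never appears in the table because no single
codepoint maps to it.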

The only portable requirement that arises here concerns strings
containing nothing but characters in the "portable character set" of
Scheme -- a subset of ASCII.  (Beyond the requirement, the rule about
indexes provides a guideline for Unicode-centric implementors, bear
somewhat excepted.)

    > >   My proposal does _not_
    > >   require conforming implementations to use Unicode and does not
    > >   preclude implementations that include characters not found in
    > >   Unicode (Pika's support of buckybits is an example).

    > Requirements for unicode codepoint index and 256 character mapping
    > (as I explain later) implies the implementation to use
    > Unicode-compatible charset.

In portable code, the codepoint index rule affects only strings
containing nothing but characters from the "Scheme portable character
set" -- a subset of ASCII.

The 256 character mapping is a recommendation, not a requirement.

    > >     >     * In the portable FFI section, some APIs state the encoding
    > >     >       must be one of utf-8, iso8859-* or ascii, and I don't see
    > >     >       the reason of such restrictions.

    > > How would you remove that restriction in a way that supports writing
    > > portable FFI-using code?

    > What I'm picking there is the word "must". 
    > scm_extract_string8 can put answer in eucjp packed format into
    > t_uchar* array if the implementation supports that, so I don't
    > see why this restriction is needed.

I would not object to an addition to the portable FFI which is


that returns/accepts the data from a string, plus its length, but says
nothing about how the data is encoded.  Its purpose would be to
extract that data in the "most convenient form" for a given
implementation.   Would that do?

    >   indicated encoding (which must be one of `uni_utf8',
    >   `uni_iso8859_*', or `uni_ascii') 

    > Of course using such encoding wouldn't be portable.  But so
    > as iso8859_1 implementation is asked to convert the string
    > into iso8859_2.

I don't see why it wouldn't be portable.   I was thinking it would be
helpful to have a "libscheme-ffi-helpers.a" with the necessary tables.

    > Gauche can be compiled using EUCJP, and doesn't have a problem
    > communicating with Unicode world so far.  But I don't see
    > [0..256] mapping, "Unicode codepoint index", and O(1) accesses
    > are essential for such an implementation to communicate with
    > Unicode world.

Neither the 0..256 mapping nor the O(1) access time are _required_
in the proposed Scheme changes.

The 0..256 mapping _is_ required in the proposed portable FFI.
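
One way to read that requirement is as a round-trip guarantee for every
C byte value. The sketch below assumes an 8-bit unsigned char and uses a
plain unsigned long as a stand-in for a CHAR? value; scm_from_byte,
byte_from_scm and all_bytes_round_trip are hypothetical names, not
functions from the draft FFI.

```c
#include <limits.h>

/* Stand-in for the FFI direction "C byte -> CHAR?": every byte value
   maps to a distinct character value. */
static unsigned long scm_from_byte(unsigned char b)
{
    return (unsigned long)b;
}

/* Stand-in for "CHAR? -> C byte".  Returns 1 and stores the byte on
   success, 0 if the character lies outside the byte range. */
static int byte_from_scm(unsigned long ch, unsigned char *out)
{
    if (ch > UCHAR_MAX)
        return 0;
    *out = (unsigned char)ch;
    return 1;
}

/* Returns 1 if every byte value survives the trip there and back. */
static int all_bytes_round_trip(void)
{
    unsigned int i;

    for (i = 0; i <= UCHAR_MAX; i++) {
        unsigned char back;
        if (!byte_from_scm(scm_from_byte((unsigned char)i), &back))
            return 0;
        if (back != (unsigned char)i)
            return 0;
    }
    return 1;
}
```

Nothing in this sketch constrains what an implementation does with
characters outside that range; byte_from_scm just reports failure for
them.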

Recommending those things in R6RS would only encourage implementations
to provide them and to warn programmers relying on them that they may
have some performance surprises or functionality surprises using

Requiring the 0..256 mapping in the FFI means just that `char' can
always be converted to CHAR? and back again.   Is that really so