Re: strings draft



I think the goal of the document is a bit ambiguous.  Specifically,
I feel there are three issues intermixed; they are related to each
other, but it'll be clearer to separate the discussion as much
as possible.

1. Defining a Unicode API: "If a Scheme implementation supports
   an API that explicitly deals with Unicode, it should be so and so".

   This goal tries to define a common API to interface with the outside
   Unicode world, while allowing implementations to choose their
   internal representations.

2. Addressing Unicode-specific issues: "If a Scheme implementation
   uses Unicode as its native character representation, it should be
   so and so".

   Some issues raised in the document are based on the assumption
   that the implementation uses Unicode or iso-8859-* CCS (coded
   character set) in its native representation.

   If the document limits its scope to "the implementations that use
   Unicode/iso8859-* internally", it's fine.  Is that the intention
   of the document?

    * If the implementation uses EUCJP as its internal CES, it
      will have difficulty with the recommendation that INTEGER->CHAR
      support [0,255], since EUCJP does not have a full mapping
      in this range, although it has many more than 256 characters.
      I think it's possible that (integer->char #xa1) on such
      implementations returns a "pseudo character", which doesn't
      correspond to any character in the EUCJP CES but is guaranteed
      to produce #xa1 when passed to char->integer.  But the
      effects would be undefined if such a character is used within
      a string.  (An implementation can also choose integers
      different from the codepoint values to fulfill this "256 character"
      requirement, but it wouldn't be intuitive.)

    * The "What _is_ a Character" section doesn't refer to an
      implementation where a CHAR? value corresponds to a
      codepoint of a non-Unicode, non-iso8859-* CCS/CES.

    * In the portable FFI section, some APIs state that the encoding
      must be one of utf-8, iso8859-* or ascii, and I don't see
      the reason for such a restriction.
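
To make the EUCJP point above concrete, here is a rough sketch in Python
(not Scheme, and not any real implementation's behaviour).  It assumes
Python's euc_jp codec as a stand-in for an implementation's native
encoding, and the names PseudoChar, integer_to_char and char_to_integer
are hypothetical.  It shows that the single byte #xa1 is not a character
by itself in EUCJP, and how a "pseudo character" could still round-trip
through char->integer.

```python
class PseudoChar:
    """Hypothetical placeholder for an integer with no EUC-JP character."""

    def __init__(self, code):
        self.code = code


def integer_to_char(n):
    # Only covers the [0, 255] range discussed above; bytes([n]) raises
    # ValueError outside that range.
    try:
        return bytes([n]).decode("euc_jp")
    except UnicodeDecodeError:
        # No single-byte character here: fall back to a pseudo character.
        return PseudoChar(n)


def char_to_integer(c):
    if isinstance(c, PseudoChar):
        return c.code
    # Assumption: a character's integer is its EUC-JP byte sequence
    # read as a big-endian number.
    return int.from_bytes(c.encode("euc_jp"), "big")


ascii_a = integer_to_char(0x41)        # an ordinary character, "A"
pseudo = integer_to_char(0xA1)         # a lone lead byte, so a PseudoChar
wide_space = b"\xa1\xa1".decode("euc_jp")  # the two-byte ideographic space
```

In this sketch char_to_integer(pseudo) gives back #xa1, but pseudo is
not something that could be stored in an EUCJP-encoded string, which is
exactly the undefined case mentioned above.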

3. Determining the least common set of assumptions about characters
   and strings the language/FFI spec should define.

   Mostly in the "R6RS recommendation" section.  Some of them seem
   to try to be codeset-independent, while some of them seem to
   assume a Unicode/iso8859-* codeset implicitly.  So I wonder
   which is the intention of the document.


Another issue: is there a rationale for the "strong encouragement"
of O(1) access for string-ref and string-set!?  There are
algorithms that truly need random access, but in many cases
an index is used just to mark a certain location in the string;
e.g. if you want (string-ref str 3), it's rare that you know
'3' is significant before you know about str---it's more likely
that somebody (a string search function, regexp matcher, or suffix
database...) told you that the 3rd character of a particular
string in str is significant.  In such cases, you use an index
not because the algorithm requires it, but because it is just
one of the possible means of referring to a position within a string.
I feel that accessing strings by index is a kind of premature
optimization, which comes from the history when strings were
simply arrays of bytes.  Some algorithms may still require
random access, but we already have a primitive datatype that
guarantees O(1) access---a vector.  I don't think we should
make a string just a vector with a restricted element type.
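
As a rough illustration of the index-versus-reference point, here is a
small Python sketch (Python only for convenience; the class Utf8String
and its methods are hypothetical, not from the draft).  The string is
stored in a variable-width encoding, so access by character index has to
walk from the start, while a position handed back by a search can be
dereferenced directly.

```python
class Utf8String:
    """Hypothetical string stored as UTF-8 bytes (variable width)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")

    @staticmethod
    def _width(lead):
        # Length in bytes of the UTF-8 sequence starting with this lead byte.
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def ref_by_index(self, k):
        # O(k): skip over k characters from the start of the string.
        pos = 0
        for _ in range(k):
            pos += self._width(self.data[pos])
        return self.ref_at(pos)

    def search(self, needle):
        # Returns a cursor (a byte offset here), or -1 if not found.
        return self.data.find(needle.encode("utf-8"))

    def ref_at(self, cursor):
        # O(1): decode the one character that starts at the cursor.
        width = self._width(self.data[cursor])
        return self.data[cursor:cursor + width].decode("utf-8")


s = Utf8String("日本語 text")
cursor = s.search("語")      # the searcher tells us where the match is
found = s.ref_at(cursor)     # direct access, no character counting
same = s.ref_by_index(2)     # same character, reached by walking
```

The caller never needed the number 2; it only needed some way to refer
to the place the search found, and a cursor serves that purpose without
requiring O(1) string-ref.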

--shiro