Re: strings draft





    > From: Shiro Kawai <shiro@xxxxxxxx>

(last bit first:)

    > I feel that accessing strings by index is a kind of premature
    > optimization, which came from the history when strings were
    > simply an array of bytes.    

I used to think so too -- but that was before I started writing C code
that allows UTF-8 and UTF-16 (and soon UTF-32) to be mixed and matched
freely in a single application.

I think that the approach I'm taking with Pika -- internally using an
N-bit encoding when all characters in a string fit in N-bits -- does a
pretty good job of restoring the "a string is an array" simplicity in
an international context.
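As a rough illustration (not Pika's actual code -- the function name and
signature are my invention), the representation choice could be sketched
in C like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the N-bit representation choice: a string is
   stored as an array of 8-, 16-, or 32-bit units, picked by the widest
   codepoint it contains, so STRING-REF stays a plain array index. */
int
units_needed (const uint32_t *cps, size_t n)
{
  int width = 8;
  for (size_t i = 0; i < n; ++i)
    {
      if (cps[i] > 0xffff)
        return 32;              /* needs full 32-bit units */
      if (cps[i] > 0xff)
        width = 16;             /* at least 16-bit units */
    }
  return width;
}
```

A string of pure ASCII or Latin-1 stays one byte per character; mixing
in a single astral-plane character widens the whole string to 32-bit
units, but indexing remains O(1) in every case.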


(and, the rest:)

    > I think the goal of the document is a bit ambiguous.  Specifically,
    > I feel there are three issues intermixed; they are related to each
    > other, but it'll be clearer to separate the discussion as much as
    > possible.


You summarized the three issues that you see as intermixed this way:

    > 1. Defining unicode API: "If a Scheme implementation supports
    >    API that explicitly deals with Unicode, it should be so and
    >    so".  [i.e., FFI issues re Unicode] [....]

    > 2. Addressing unicode-specific issues: "If a Scheme implementation
    >    uses Unicode as its native character representation, it should be
    >    so and so". [i.e., Scheme language optional requirements]

    > 3. Determining the least common set of assumptions about characters
    >    and strings the language/FFI spec should define. 
    >    [i.e., Scheme Language and FFI required features]

and, overall, you're concerned (understandably) about the bias of the
draft towards Unicode and iso8859-* in general, and, in particular,
about proposals in the draft that may not be reasonable for
implementations using other character sets (such as EUCJP).

Ok, so, my reply.

I intended there to be three topics, each with two subtopics --
similar to, but not quite the same as, your summary:

   1. Scheme language changes needed to support Unicode
   1a. requirements
   1b. optional feature requirements

   2. String handling in a portable FFI
   2a. requirements
   2b. optional feature requirements

   3. Examination of a particular implementation
   3a. native FFI spec
   3b. implementation strategy

I think that, at this stage, those three topics do belong together.
On this list, we're most directly concerned with (2) ("portable FFI")
but I don't think we can design such an FFI usefully without
addressing (1) ("Scheme language").  For example, FFI-using code is
seriously impoverished if it cannot reliably exchange integer string
indexes with Scheme -- so we need to nail down what those indexes mean
in the Scheme language.  Another example: portable FFI-using code
cannot reliably enter strings into Scheme without stronger character
set guarantees than are found in R5RS.  And if we're concerned with (1)
and/or (2), then we also need at least one and hopefully more examples
of (3) ("particular implementation(s)") because we need to have some
confidence that we aren't specifying something that is unimplementable
or has too great an impact on implementations.

I am, indeed, _somewhat_ biased towards Unicode -- but I also
acknowledge the desirability of not ruling that an implementation
using a different system (such as EUCJP) is necessarily non-standard. 
I _think_ I did a good job at that but I'll address the specific
concerns that you raised below.

There is the question: Should future Revised Standards explicitly
mention Unicode in the core of the specification, as I've recommended,
or should such materials be separated from the core to an annex where
they might stand in parallel with analogous sections for, for
example, EUCJP?  I suspect that such a question is mostly political,
not technical.  I'll admit my bias that I think the way forward for
computing generally is to converge on Unicode -- I'd like it in the
core -- but it's a minor point and a topic for another context, I
think.


    >    If the document limits its scope to "the implementations that
    >    use Unicode/iso8859-* internally", it's fine.  Is that the
    >    intention of the document?

No, it's not -- in two ways:

First:

  The proposed changes to the Scheme standard do not require
  implementations to use Unicode or iso8859-*.  They do, however,
  require that portable Scheme programs manipulating multilingual text
  take some of their semantics (in particular, the meaning of string
  indexes and hex string constants) from Unicode.  Absent such a
  semantic, integer constant string indexes and string index
  adjustments (e.g., +1) will not have a portable meaning (for
  example).
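To make that concrete, here is a small C sketch (mine, not from the
draft) of why the semantics must be pinned down: in UTF-8 a byte index
and a codepoint index diverge as soon as any non-ASCII character
appears, so "index + 1" is ambiguous until the standard says which one
an integer string index denotes.

```c
#include <stddef.h>

/* Count the codepoints in the first nbytes of a UTF-8 buffer.
   Continuation bytes have the bit pattern 10xxxxxx; every other
   byte starts a new codepoint. */
size_t
utf8_codepoint_count (const char *s, size_t nbytes)
{
  size_t count = 0;
  for (size_t i = 0; i < nbytes; ++i)
    if (((unsigned char) s[i] & 0xc0) != 0x80)  /* not a continuation byte */
      ++count;
  return count;
}
```

A three-byte buffer holding "é" followed by "x" contains only two
codepoints, so a byte-based index 2 and a codepoint-based index 2 name
different positions in the same string.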


Second:

  Yes, I am proposing that R6RS explicitly embrace Unicode, mostly in
  the form of optional features defining how characters beyond the
  minimal required character set are handled.   My proposal does _not_
  require conforming implementations to use Unicode and does not
  preclude implementations that include characters not found in
  Unicode (Pika's support of buckybits is an example).

  As I understand it, there is resistance in some places to Unicode
  for two reasons: (1) legacy systems using other character sets (not
  much anyone can do about that -- I don't think Scheme should bend
  over backwards to be EBCDIC-friendly, either); (2) still surviving
  controversy about Han Unification.

  (2) is a topic we shouldn't debate on this list or try to solve
  here.  For Scheme, though, I think we need to look at an overarching
  technical reason for embracing Unicode: it is simply the best
  structured design for an international character set that anybody
  has yet imagined.  The programmatic structure of Unicode and Unicode
  string encodings is the best that there is and that structure is
  orthogonal to the controversial issues.  (Contrast with, for
  example, shift-encodings.)  Unicode does define The Right Thing for
  the programmatic aspects of string handling -- Scheme should simply
  embrace that.

  (To make this clearer: _if_ unification really is a broken idea, it
  can be fixed in Unicode.   Were it fixed in Unicode, none of the 
  requirements and recommendations of the draft would change.  No 
  portable use of those features would break.)
  



    >     * If the implementation uses EUCJP as its internal CES, it
    >       will face difficulty for the recommendation of INTEGER->CHAR
    >       to support [0,255], since EUCJP does not have full mappings
    >       in this range, although it has much more characters than 256.
    >       I think it's possible that (integer->char #xa1) on such
    >       I think it's possible that (integer->char #xa1) on such
    >       implementations returns a "pseudo character", which doesn't
    >       correspond to any character in the EUCJP CES but is
    >       guaranteed to produce #xa1 when passed to char->integer.
    >       But the effects would be undefined if such a character is
    >       used within a string.  (An implementation can also choose
    >       integers other than the codepoint value to fulfill this
    >       "256 character" requirement, but it wouldn't be intuitive).

You say that "the effects would be undefined if such a character
is used within a string".   I don't see why that would have to be the
case -- I only see how a particular string implementation could have
that bug.

There are 128 unassigned octet sequences available in EUCJP that won't
be mistaken for any real character, right?  Purely internally, those
sequences could be used in the string representations to represent the
effect of something like (STRING-SET! s (INTEGER->CHAR 161)).
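A sketch of that trick in C (the lead octet 0x80, the function name,
and the two-byte layout are all my invention, purely for illustration;
only the codepoints up to #xff at issue here are handled):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a Scheme using EUC-JP internally could smuggle the
   codepoints #x80..#xff (which EUC-JP cannot represent directly) into
   its strings as two-byte sequences led by an octet that is unassigned
   in EUC-JP.  Encodes one codepoint into buf; returns bytes written. */
#define PSEUDO_LEAD 0x80

size_t
eucjp_put_pseudo (uint8_t *buf, uint32_t cp)
{
  if (cp < 0x80)
    {
      buf[0] = (uint8_t) cp;            /* plain ASCII byte */
      return 1;
    }
  buf[0] = PSEUDO_LEAD;                 /* unassigned lead octet */
  buf[1] = (uint8_t) (cp & 0xff);       /* the pseudo-character's code */
  return 2;
}
```

Because the lead octet never begins a real EUC-JP character, such
pseudo-characters round-trip through CHAR->INTEGER without colliding
with genuine text.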

Alternatively, you could use multiple string representations
internally, similar to those I described for Pika.

The problem you mention isn't unique to EUCJP -- it occurs for a
naively implemented UTF-8 or UTF-16-based Scheme as well.   The
solutions are analogous.



    >     * "What _is_ a Character" section doesn't refer to an
    >       implementation where a CHAR? value corresponds to a
    >       codepoint of a non-Unicode, non-iso8859-* CCS/CES.

Most likely, depending on the details of the implementation you have
in mind, I would put it in the "roughly corresponds to a Unicode
codepoint" category.

The sample implementation described in the draft (Pika Scheme)
includes non-Unicode codepoints (characters with buckybits set).
I state elsewhere in the draft that it is in the "roughly a Unicode
codepoint" class.


    >     * In the portable FFI section, some APIs state the encoding
    >       must be one of utf-8, iso8859-* or ascii, and I don't see
    >       the reason of such restrictions.

How would you remove that restriction in a way that supports writing
portable FFI-using code?

The only plausible alternative I see would be for the FFI to also
include a means to enter or extract strings using the Posix multibyte
support.   I think that this would be essentially impossible for
FFI-implementors to provide in anything other than implementations
which use the multibyte routines internally.   It would, in effect, be
a requirement that Scheme be based on the C multibyte routines rather
than Unicode.
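For concreteness, such an FFI entry point would presumably look
something like the following (everything here is an assumption about
what the API would be -- scheme_enter_string is a made-up name standing
in for whatever the FFI's actual entry function is):

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a locale-dependent multibyte string to wide characters via
   the Posix multibyte routines, as a prelude to entering it into
   Scheme.  Returns 0 on success, -1 on failure. */
int
enter_multibyte (const char *mb)
{
  size_t n = mbstowcs (NULL, mb, 0);    /* measure first */
  if (n == (size_t) -1)
    return -1;                          /* invalid multibyte sequence */
  wchar_t *ws = malloc ((n + 1) * sizeof *ws);
  if (!ws)
    return -1;
  mbstowcs (ws, mb, n + 1);
  /* ... hand ws to the (hypothetical) scheme_enter_string () ... */
  free (ws);
  return 0;
}
```

Note that the result depends on the current locale, which is exactly the
problem: code written this way is only portable among implementations
whose internals agree with the C multibyte machinery.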



    > 3. Determining the least common set of assumptions about characters
    >    and strings the language/FFI spec should define.

    >    Mostly in the "R6RS recommendation" section.  Some of them
    >    seem to try to be codeset-independent, while some seem to
    >    assume the Unicode/iso8859-* codeset implicitly.  So I wonder
    >    which is the intention of the document.

The intention is to make optional Unicode support well-defined and
useful as a basis for writing portable Scheme code -- while saying no
more in the core specification of Scheme than is necessary to
accomplish that.

I believe it will be practical for implementations internally using,
for example, EUCJP to conform to the requirements -- and even to
provide a useful level of support for the optional Unicode facilities.



    > Another issue: is there a rationale about "strong encouragement"
    > of O(1) access of string-ref and string-set!?   There are
    > algorithms that truly need random access, but in many cases,
    > index is used just to mark certain location of the string;
    > e.g. if you want (string-ref str 3), it's rare that you know
    > '3' is significant before you know about str---it's more likely
    > that somebody (string search function, regexp matcher, or suffix
    > database...) told you that the 3rd character of a particular
    > string in str is significant.  In such cases, the reason you
    > use index is not because the algorithm requires it, but just
    > one of the possible means to have a reference within a string.

I don't follow your arguments about the distinction between an integer
constant string index and one received as a parameter.   In both
cases, the question is "what is the computational complexity of
STRING-REF, STRING-SET!, and SUBSTRING  ?"
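The complexity question is easy to see with a variable-width encoding:
a naive STRING-REF over UTF-8 has to scan from the start of the string,
making it O(n) rather than O(1), which is precisely what a fixed-width
(N-bit) representation avoids.  A hedged C sketch (names mine):

```c
#include <stddef.h>

/* Return the byte offset of codepoint number `idx` in a UTF-8 buffer,
   or nbytes if idx is out of range.  The scan over continuation bytes
   (bit pattern 10xxxxxx) is what makes this O(n). */
size_t
utf8_index_to_offset (const unsigned char *s, size_t nbytes, size_t idx)
{
  size_t off = 0;
  while (off < nbytes && idx > 0)
    {
      ++off;
      /* skip the continuation bytes of the current character */
      while (off < nbytes && (s[off] & 0xc0) == 0x80)
        ++off;
      --idx;
    }
  return off;
}
```

Whether the index arrives as a constant or from a regexp matcher, this
is the cost STRING-REF pays unless the representation guarantees O(1)
indexing.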


-t