Re: strings draft

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

Thanks for the detailed reply.
First of all, I don't intend to discuss the effectiveness of Unicode
in a multilingual environment and/or Han Unification at all.  My point
is to allow alternatives.

My reply goes as follows:

 * About O(1) string access
 * About character-set independence

[About O(1) string access]

From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Thu, 22 Jan 2004 09:45:53 -0800 (PST)

>     > I feel that accessing strings by index is a kind of premature
>     > optimization, which came from the history when strings were
>     > simply an array of bytes.    
> I used to think so too -- but that was before I started writing C code
> that allows UTF-8 and UTF-16 (and soon UTF-32) to be mixed and matched
> freely in a single application.
> I think that the approach I'm taking with Pika -- internally using an
> N-bit encoding when all characters in a string fit in N-bits -- does a
> pretty good job of restoring the "a string is an array" simplicity in
> an international context.

I see that your approach (having multiple encodings) works, but I
don't see the reason to "strongly recommend" it.

>     > Another issue: is there a rationale about "strong encouragement"
>     > of O(1) access of string-ref and string-set!?   There are
>     > algorithms that truly need random access, but in many cases,
>     > index is used just to mark certain location of the string;
>     > e.g. if you want (string-ref str 3), it's rare that you know
>     > '3' is significant before you know about str---it's more likely
>     > that somebody (string search function, regexp matcher, or suffix
>     > database...) told you that the 3rd character of a particular
>     > string in str is significant.  In such cases, the reason you
>     > use index is not because the algorithm requires it, but just
>     > one of the possible means to have a reference within a string.
> I don't follow your arguments about the distinction between an integer
> constant string index and one received as a parameter.   In both
> cases, the question is "what is the computational complexity of

No.  String search, regexp match, or precalculated prefix/suffix
database, all can return some sort of reference that directly
points into the string, so that the subsequent use of such 
reference wouldn't need to count characters.
(In an implementation that shares substrings and uses copy-on-write
for string mutation, those basic operations can even return a
substring directly and efficiently.)

And to implement search, regexp, or prefix/suffix arrays, access to
the string is mostly sequential, or requires only a small amount of
"random hopping".  Sequential access can be implemented more
efficiently using string ports, for example, than using an integer
index.

It's OK to have STRING-REF as well---after all, we have LIST-REF
and nobody complains about its O(N) complexity.

[About character-set independence]

What I find ambiguous is the degree of "character-set independence"
you're aiming at.   If we'd like to have a character-set independent
language spec,  we need to be much more careful to separate
Unicode-specific issues from character-set independent issues.

> I intended there to be three topics, each with two subtopics --
> similar but not quite the same as your summary:
>    1. Scheme language changes needed to support Unicode
>    1a. requirements
>    1b. optional feature requirements
>    2. String handling in a portable FFI
>    2a. requirements
>    2b. optional feature requirements
>    3. Examination of a particular implementation
>    3a. native FFI spec
>    3b. implementation strategy
> I think that, at this stage, those three topics do belong together.
> On this list, we're most directly concerned with (2) ("portable FFI")
> but I don't think we can design such an FFI usefully without
> addressing (1) ("Scheme language").

We need to define some kind of minimum requirements on the language (1).
For example, your point here is a reasonable one.

> Another example: portable FFI-using code can
> not reliably enter strings into Scheme without stronger character set
> guarantees than are found in R5RS.  

(Although I wonder about the following:

> For example, FFI-using code is
> seriously impoverished if it can not reliably exchange integer string
> indexes with Scheme

If an opaque "string reference" liberates Scheme code from indexing
strings by integer, there's not much reason to want to pass an integer
index to the FFI.)

> There is the question: Should future Revised Standards explicitly
> mention Unicode in the core of the specification, as I've recommended,
> or should such materials be separated from the core to an annex where
> they might stand in parallel with analogous sections for, for
> example, EUCJP?  I suspect that such a question is mostly political,
> not technical.  I'll admit my bias that I think the way forward for
> computing generally is to converge on Unicode -- I'd like it in the
> core -- but it's a minor point and a topic for another context, I
> think.

I have no doubt that future information exchange will be done
in Unicode.  Still, there are niche areas where people want to
handle unusual character sets.  For example:

 * Handling literature and historical documents on computers
   requires more characters than Unicode has, such as the ones
   "Mojikyo" http://www.mojikyo.org/  provides.

 * The Japanese local character set JISX-0213 added thousands of
   characters, of which a few hundred were undefined in Unicode
   at that time.  It took a couple of years until Unicode
   included them (except for about a dozen characters which are
   not in Unicode yet).

 * Legacy documents.  You mentioned that Scheme shouldn't bend
   over backwards to be EBCDIC-friendly and it's true.  The
   issue here is that converting a legacy encoding to Unicode
   is not a lossless conversion.  If you convert EUCJP to
   Unicode in one program, and then convert back with another
   program, it is not guaranteed that you get the same document
   as the original.  So, in some areas, one needs to use
   the legacy encodings.

It would be nice if the Scheme language spec allowed a local
implementation that uses a different CCS/CES.

>   The proposed changes to the Scheme standard do not require
>   implementations to use Unicode or iso8859-*.  They do, however,
>   require that portable Scheme programs manipulating multilingual text
>   take some of their semantics (in particular, the meaning of string
>   indexes and hex string constants) from Unicode.  Absent such a
>   semantic, integer constant string indexes and string index
>   adjustments (e.g., +1) will not have a portable meaning (for
>   example).

Using Unicode codepoints as the portable means of hex notation
(#\U+XXXX) is ok.
Integer indexing is a different issue.  The EUCJP #xA5F7
character is mapped to two consecutive Unicode codepoints,
U+30AB and U+309A.   On the other hand, U+30AB itself is
mapped to EUCJP #xA5AB, and U+309A doesn't have a corresponding
character in EUCJP.
If STRING-REF has to use a Unicode codepoint index, I don't see
how it should work.
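
The mismatch can be checked with a short Python sketch.  It assumes
that Python's "euc_jisx0213" codec is an acceptable stand-in for the
EUCJP encoding discussed here; an implementation with different
conversion tables may behave differently.

```python
# Assumption: Python's "euc_jisx0213" codec stands in for EUCJP.

# One EUCJP character (#xA5F7) followed by ASCII "a".
data = bytes([0xA5, 0xF7]) + b"a"
text = data.decode("euc_jisx0213")

# The single EUCJP character becomes two Unicode codepoints.
codepoints = [hex(ord(c)) for c in text]

# In EUCJP terms "a" is character number 1 (the second character);
# in Unicode codepoint terms it sits at a different index.
eucjp_index_of_a = 1
codepoint_index_of_a = text.index("a")

# U+30AB on its own maps to a different EUCJP character, #xA5AB.
ka = "\u30ab".encode("euc_jisx0213")

print(codepoints, eucjp_index_of_a, codepoint_index_of_a, ka.hex())
```

So the same position in the same string gets a different integer
depending on whether characters are counted in EUCJP or in Unicode
codepoints.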

>   My proposal does _not_
>   require conforming implementations to use Unicode and does not
>   preclude implementations that include characters not found in
>   Unicode (Pika's support of buckybits is an example).

The requirements for a Unicode codepoint index and the 256-character
mapping (as I explain later) imply that the implementation uses a
Unicode-compatible charset.

>   As I understand it, there is resistance in some places to Unicode
>   for two reasons: (1) legacy systems using other character sets (not
>   much anyone can do about that -- I don't think Scheme should bend
>   over backwards to be EBCDIC-friendly, either); (2) still surviving
>   controversy about Han Unification.

As I mentioned, there are niche areas that want to use other
encodings, and there are some areas that have to deal with (1).

I don't see much problem about (2)---no other encodings completely
solve the problem, so we have to live with it anyway.  However,
there are encodings that solve the problem in a different way
(e.g. iso2022), which is why I want to stick to codeset-independent

>     >     * If the implementation uses EUCJP as its internal CES, it
>     >       will have difficulty with the recommendation that INTEGER->CHAR
>     >       support [0,255], since EUCJP does not have full mappings
>     >       in this range, although it has many more characters than 256.
>     >       I think it's possible that (integer->char #xa1) on such
>     >       implementations returns a "pseudo character", which doesn't
>     >       correspond to any character in EUCJP CES but is guaranteed
>     >       to produce #xa1 when passed to char->integer.  But the
>     >       effects would be undefined if such a character is used within
>     >       a string.  (An implementation can also choose different
>     >       integers than the codepoint value to fulfill this "256 character"
>     >       requirement, but it wouldn't be intuitive).
> You say that "the effects would be undefined if such a character
> is used within a string".   I don't see why that would have to be the
> case -- I only see how a particular string implementation could have
> that bug.
> There are 128 unassigned octet sequences available in EUCJP that won't
> be mistaken for any real character, right?  Purely internally, those
> sequences could be used in the string representations to represent the
> effect of something like (STRING-SET! s (INTEGER->CHAR 161)).

Strictly speaking, yes.  I can map integers #x80 to #xff into an
undefined region of EUCJP.  That's what I said "it wouldn't be

Here's EUCJP packed encoding:

  Character               EUCJP packed encoding.
  ASCII                   [#xNN]           NN in 0..7f
  JISX0201                [#x8e #xNN]      NN in 0..ff
  JISX0213 1st plane      [#xNN #xMM]      NN in a0..ff, MM in a0..ff
  JISX0213 2nd plane      [#x8f #xNN #xMM] NN in a0..ff, MM in a0..ff

So, I can use something like [#x8f #x20 #xNN] to represent
(integer->char #xNN) where NN is 80 to ff.   However, you said:

  There are many circumstances in which conversions between
  octets and characters are desirable 

And I don't see such circumstances.  What exactly does this
requirement solve?  Even if I use a pseudo-character
representation like the one above, there's no way to read
and write such a character from/to a port reliably.
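
For concreteness, the pseudo-character packing could look like the
following Python sketch.  The function names are hypothetical, and the
byte ranges are taken from the table above.  It only shows that the
internal round trip works; it does nothing for the port I/O problem.

```python
# Sketch of the [#x8f #x20 #xNN] pseudo-character packing.
# All function names here are hypothetical.

def pseudo_char_encode(n: int) -> bytes:
    """Pack (integer->char n), 0x80 <= n <= 0xff, as [0x8f 0x20 n]."""
    if not 0x80 <= n <= 0xFF:
        raise ValueError("only 0x80..0xff need a pseudo character")
    return bytes([0x8F, 0x20, n])


def pseudo_char_decode(seq: bytes) -> int:
    """Recover the integer from a packed pseudo character."""
    if len(seq) != 3 or seq[0] != 0x8F or seq[1] != 0x20:
        raise ValueError("not a pseudo character")
    return seq[2]


def is_real_plane2_char(seq: bytes) -> bool:
    """True if seq matches the 'JISX0213 2nd plane' row of the table."""
    return (len(seq) == 3 and seq[0] == 0x8F
            and 0xA0 <= seq[1] <= 0xFF and 0xA0 <= seq[2] <= 0xFF)


packed = pseudo_char_encode(0xA1)
print(packed.hex(), hex(pseudo_char_decode(packed)),
      is_real_plane2_char(packed))
```

Because the second byte #x20 is outside a0..ff, the sequence cannot be
mistaken for a real 2nd-plane character, but it is also not something
any external EUCJP reader or writer would understand.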

> The problem you mention isn't unique to EUCJP -- it occurs for a
> naively implemented UTF-8 or UTF-16-based Scheme as well.   The
> solutions are analogous.

Not really.   Character encoding schemes (CES) for East Asian
languages typically include more than one coded character set (CCS).
So, the "codepoint" in a particular CCS isn't the same as the
encoded value in the CES---e.g. JISX0201 defines codes in the #x00
to #x7f region, which conflicts with ASCII.  In programs you have to
stick to the CES code value.

In Unicode world, one CCS (Unicode) has many CES (utf-8,
utf-16, ...) but you can always map one codepoint to the other.
So you can treat them consistently in CCS codepoint.
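
A small Python sketch of the Unicode side of this contrast: one
codepoint, three encoding schemes, and the codepoint is recovered
unchanged from each.  U+3042 is just an arbitrary example character.

```python
codepoint = 0x3042  # HIRAGANA LETTER A, an arbitrary example
ch = chr(codepoint)

encodings = ("utf-8", "utf-16-be", "utf-32-be")

# The byte sequences differ from one CES to another...
encoded = {name: ch.encode(name) for name in encodings}

# ...but each of them decodes back to the same CCS codepoint.
recovered = {name: ord(encoded[name].decode(name)) for name in encodings}

for name in encodings:
    print(name, encoded[name].hex(), hex(recovered[name]))
```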

>     >     * "What _is_ a Character" section doesn't refer to an
>     >       implementation where a CHAR? value corresponds to a
>     >       codepoint of non-Unicode, non-iso8859-* CCS/CES.
> Most likely, depending on the details of the implementation you have
> in mind, I would put it in the "roughly corresponds to a Unicode
> codepoint" category.

Maybe EUCJP can fall into the category where a CHAR? value corresponds
to a "combining sequence"---because of the EUCJP #xA5F7 <-> [U+30AB U+309A]
mapping.  But you can't derive "the integer range [0..255] can map to
characters" from that.

>     >     * In the portable FFI section, some APIs state the encoding
>     >       must be one of utf-8, iso8859-* or ascii, and I don't see
>     >       the reason of such restrictions.
> How would you remove that restriction in a way that supports writing
> portable FFI-using code?

What I'm picking at there is the word "must".
scm_extract_string8 can put the answer in eucjp packed format into
a t_uchar* array if the implementation supports that, so I don't
see why this restriction is needed.

  indicated encoding (which must be one of `uni_utf8',
  `uni_iso8859_*', or `uni_ascii') 

Of course using such an encoding wouldn't be portable.  But the
same is true when an iso8859_1 implementation is asked to convert
the string into iso8859_2.

>     > 3. Determining the least common set of assumptions about characters
>     >    and strings the language/FFI spec should define.
>     >    Mostly in "R6RS recommendation" section.  Some of them seem
>     >    to try to be codeset-independent, while some of them seem to
>     >    assume Unicode/iso8859-* codeset implicitly.  So I wonder
>     >    which is the intention of the document.
> The intention is to make optional Unicode support well-defined and
> useful as a basis for writing portable Scheme code --- while saying no
> more in the core specification of Scheme than is necessary to
> accomplish that.
> I believe it will be practical for implementations internally using,
> for example, EUCJP to conform to the requirements -- and even to
> provide a useful level of support for the optional Unicode facilities.

Gauche can be compiled to use EUCJP, and hasn't had a problem
communicating with the Unicode world so far.  But I don't see that
the [0..255] mapping, the "Unicode codepoint index", and O(1) access
are essential for such an implementation to communicate with
the Unicode world.