Re: strings draft
I think the goal of the document is a bit ambiguous. Specifically,
I feel there are three issues intermixed; they are related to each
other, but it'll be clearer to separate the discussion as much
as possible.
1. Defining a Unicode API: "If a Scheme implementation supports
an API that explicitly deals with Unicode, it should be so and so".
This goal tries to define a common API to interface with the outer
Unicode world, while allowing each implementation to choose its
internal representation.
2. Addressing Unicode-specific issues: "If a Scheme implementation
uses Unicode as its native character representation, it should be
so and so".
Some issues raised in the document are based on the assumption
that the implementation uses Unicode or iso-8859-* CCS (coded
character set) in its native representation.
If the document limits its scope to "implementations that use
Unicode/iso8859-* internally", that's fine. Is that the intention
of the document?
* If the implementation uses EUCJP as its internal CES, it
will have difficulty with the recommendation that INTEGER->CHAR
support [0,255], since EUCJP does not have full mappings
in this range, although it has many more than 256 characters.
I think it's possible that (integer->char #xa1) on such
implementations returns a "pseudo character", which doesn't
correspond to any character in EUCJP CES but is guaranteed
to produce #xa1 when passed to char->integer. But the
effects would be undefined if such a character is used within
a string. (An implementation can also choose integers other
than the codepoint values to fulfill this "256 character"
requirement, but it wouldn't be intuitive.)
* The "What _is_ a Character" section doesn't refer to an
implementation where a CHAR? value corresponds to a
codepoint of a non-Unicode, non-iso8859-* CCS/CES.
* In the portable FFI section, some APIs state that the encoding
must be one of utf-8, iso8859-* or ascii, and I don't see
the reason for such restrictions.
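The EUCJP point in the first item above can be checked with a short
sketch. It uses Python rather than Scheme, and assumes that Python's
built-in "euc_jp" codec is a fair stand-in for an implementation's
internal EUCJP CES; decodes_alone is a hypothetical helper name.

```python
# Sketch: does a single byte form a complete character in EUC-JP?
# Assumption: Python's built-in "euc_jp" codec models the CES discussed above.

def decodes_alone(byte_value: int, encoding: str = "euc_jp") -> bool:
    """Return True if the single byte is a complete character in the encoding."""
    try:
        bytes([byte_value]).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# A byte in the ASCII range is a complete character on its own.
ascii_ok = decodes_alone(0x41)

# 0xA1 is only the first byte of a two-byte sequence, so by itself it
# names no character; this is the gap in the [0,255] range.
a1_ok = decodes_alone(0xA1)

# Followed by a second byte, it forms exactly one character.
pair = bytes([0xA1, 0xA1]).decode("euc_jp")
```

So whatever (integer->char #xa1) returns on such an implementation, it
cannot be an ordinary EUCJP character, which is why a "pseudo character"
is needed.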
3. Determining the least common set of assumptions about characters
and strings that the language/FFI spec should define.
This is mostly in the "R6RS recommendation" section. Some of them
seem to try to be codeset-independent, while others seem to
assume a Unicode/iso8859-* codeset implicitly. So I wonder
which is the intention of the document.
Another issue: is there a rationale for the "strong encouragement"
of O(1) access for string-ref and string-set!? There are
algorithms that truly need random access, but in many cases
an index is used just to mark a certain location in the string;
e.g. if you want (string-ref str 3), it's rare that you know
'3' is significant before you know about str. It's more likely
that somebody (string search function, regexp matcher, or suffix
database...) told you that the 3rd character of the particular
string in str is significant. In such cases, the reason you
use an index is not that the algorithm requires it; it is just
one of the possible means to have a reference into a string.
I feel that accessing strings by index is a kind of premature
optimization, which comes from the history when strings were
simply arrays of bytes. Some algorithms may still require
random access, but we already have a primitive datatype that
guarantees O(1) access: a vector. I don't think we should
make a string just a vector with a restricted element type.
--shiro