This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
I think the goal of the document is a bit ambiguous. Specifically, I feel there are three issues intermixed; they are related to each other, but it will be clearer to separate the discussion as much as possible.

1. Defining a Unicode API: "If a Scheme implementation supports an API that explicitly deals with Unicode, it should be so and so". This goal tries to define a common API to interface with the outer Unicode world, while allowing each implementation to choose its own internal representation.

2. Addressing Unicode-specific issues: "If a Scheme implementation uses Unicode as its native character representation, it should be so and so". Some issues raised in the document are based on the assumption that the implementation uses a Unicode or iso-8859-* CCS (coded character set) in its native representation. If the document limits its scope to "implementations that use Unicode/iso8859-* internally", that's fine. Is that the intention of the document?

* If an implementation uses EUC-JP as its internal CES, it will have difficulty with the recommendation that INTEGER->CHAR support [0,255], since EUC-JP does not map every value in this range, although it has many more than 256 characters in total. Such an implementation could make (integer->char #xa1) return a "pseudo character", which does not correspond to any character in the EUC-JP CES but is guaranteed to produce #xa1 when passed to CHAR->INTEGER; the effects of using such a character within a string, however, would be undefined. (An implementation could also choose integers other than the codepoint values to fulfill this "256 characters" requirement, but that would not be intuitive.)

* The "What _is_ a Character" section doesn't refer to implementations where a CHAR? value corresponds to a codepoint of a non-Unicode, non-iso8859-* CCS/CES.

* In the portable FFI section, some APIs state that the encoding must be one of utf-8, iso8859-* or ascii, and I don't see the reason for such restrictions.

3.
Determining the least common set of assumptions about characters and strings that the language/FFI spec should define. This is mostly the "R6RS recommendation" section. Some of its items seem to try to be codeset-independent, while others seem to assume a Unicode/iso8859-* codeset implicitly, so I wonder which is the intention of the document.

Another issue: is there a rationale for the "strong encouragement" of O(1) access for string-ref and string-set!? There are algorithms that truly need random access, but in many cases an index is used just to mark a certain location in a string. For example, if you call (string-ref str 3), it is rare that you knew '3' was significant before you knew about str; it is more likely that somebody (a string search function, a regexp matcher, a suffix database, ...) told you that the 3rd character of that particular string str is significant. In such cases you use an index not because the algorithm requires it, but merely as one possible means of keeping a reference into a string.

I feel that accessing strings by index is a kind of premature optimization, which came from the history when strings were simply arrays of bytes. Some algorithms may still require random access, but we already have a primitive datatype that guarantees O(1) access---a vector. I don't think we should make a string merely a vector with a restricted element type. --shiro
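The EUC-JP gap mentioned under point 2 can be checked concretely. This is my illustration, not part of the SRFI: a small Python sketch using Python's euc_jp codec as a stand-in for an EUC-JP-native Scheme, probing which values in [0,255] correspond to a character on their own.

```python
# Sketch (illustration only): find which code values in [0,255] do NOT
# decode as a standalone EUC-JP character. In EUC-JP, 0x00-0x7f are
# ASCII; every byte above 0x7f is either invalid alone or only a lead
# byte (SS2/SS3/JIS X 0208 lead) of a multi-byte sequence.
unmappable = []
for i in range(256):
    try:
        bytes([i]).decode('euc_jp')
    except UnicodeDecodeError:
        unmappable.append(i)

print(len(unmappable), hex(min(unmappable)))  # → 128 0x80
```

So (integer->char #xa1) on an EUC-JP-native implementation has no real character to denote, which is exactly why a "pseudo character" (round-tripping through char->integer but undefined inside strings) is one way out.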
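The point about indices being mere references can be sketched as follows. This is a hypothetical design of my own (the class and method names are not from the SRFI): a string type over UTF-8 bytes whose search results are opaque cursors, so that marking and revisiting a location never requires O(1) character indexing.

```python
# Hypothetical sketch: search returns an opaque cursor (here a byte
# offset into the UTF-8 encoding), not a character index. The consumer
# never does index arithmetic, so the representation needn't support
# O(1) string-ref.
class Utf8String:
    def __init__(self, s: str):
        self._bytes = s.encode('utf-8')

    def search(self, needle: str):
        """Return an opaque cursor to the first match, or None."""
        pos = self._bytes.find(needle.encode('utf-8'))
        return pos if pos >= 0 else None

    def char_at(self, cursor) -> str:
        """Decode the single character the cursor points at."""
        b = self._bytes[cursor:cursor + 4]  # a UTF-8 char is at most 4 bytes
        for n in (1, 2, 3, 4):              # shortest valid prefix wins
            try:
                return b[:n].decode('utf-8')
            except UnicodeDecodeError:
                continue
        raise ValueError("cursor is not at a character boundary")

s = Utf8String("día tres")
cur = s.search("í")        # somebody (a search function) hands us a cursor
print(s.char_at(cur))      # → í
```

Nothing here needs random access by character position; a vector remains available when an algorithm genuinely requires O(1) element access.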