This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
> From: Shiro Kawai <shiro@xxxxxxxx>

(last bit first:)

> I feel that accessing strings by index is a kind of premature
> optimization, which came from the history when strings were
> simply an array of bytes.

I used to think so too -- but that was before I started writing C code that allows UTF-8 and UTF-16 (and soon UTF-32) to be mixed and matched freely in a single application. I think that the approach I'm taking with Pika -- internally using an N-bit encoding when all characters in a string fit in N bits -- does a pretty good job of restoring the "a string is an array" simplicity in an international context.

(and, the rest:)

> I think the goal of the document is a bit ambiguous. Specifically,
> I feel there are three issues intermixed; they are related to each
> other, but it'll be clearer to separate the discussion as much
> as possible.

You summarized the three issues that you see as intermixed this way:

> 1. Defining unicode API: "If a Scheme implementation supports
>    API that explicitly deals with Unicode, it should be so and
>    so". [i.e., FFI issues re Unicode]
[....]
> 2. Addressing unicode-specific issues: "If a Scheme implementation
>    uses Unicode as its native character representation, it should be
>    so and so". [i.e., Scheme language optional requirements]
> 3. Determining the least common set of assumptions about characters
>    and strings the language/FFI spec should define.
>    [i.e., Scheme Language and FFI required features]

and, overall, you're concerned (understandably) about the bias of the draft towards Unicode and iso8859-* in general and, in particular, about proposals in the draft that may not be reasonable for implementations using other character sets (such as EUCJP).

Ok, so, my reply.

I intended there to be three topics, each with two subtopics -- similar to, but not quite the same as, your summary:

1. Scheme language changes needed to support Unicode
   1a. requirements
   1b. optional feature requirements
2. String handling in a portable FFI
   2a. requirements
   2b. optional feature requirements
3. Examination of a particular implementation
   3a. native FFI spec
   3b. implementation strategy

I think that, at this stage, those three topics do belong together. On this list, we're most directly concerned with (2) ("portable FFI"), but I don't think we can design such an FFI usefully without addressing (1) ("Scheme language"). For example, FFI-using code is seriously impoverished if it cannot reliably exchange integer string indexes with Scheme -- so we need to nail down what those indexes mean in the Scheme language. Another example: portable FFI-using code cannot reliably enter strings into Scheme without stronger character set guarantees than are found in R5RS. And if we're concerned with (1) and/or (2), then we also need at least one, and hopefully more, examples of (3) ("particular implementation(s)"), because we need some confidence that we aren't specifying something that is unimplementable or has too great an impact on implementations.

I am, indeed, _somewhat_ biased towards Unicode -- but I also acknowledge the desirability of not ruling that an implementation using a different system (such as EUCJP) is necessarily non-standard. I _think_ I did a good job at that, but I'll address the specific concerns that you raised below.

There is the question: should future Revised Standards explicitly mention Unicode in the core of the specification, as I've recommended, or should such materials be separated from the core into an annex, where they might stand in parallel with analogous sections for, for example, EUCJP? I suspect that such a question is mostly political, not technical. I'll admit my bias that I think the way forward for computing generally is to converge on Unicode -- I'd like it in the core -- but it's a minor point and a topic for another context, I think.

> If the document limits its scope to "the implementations that use
> Unicode/iso8859-* internally", it's fine.
> Is that the intention of the document?

No, it's not -- in two ways:

First: the proposed changes to the Scheme standard do not require implementations to use Unicode or iso8859-*. They do, however, require that portable Scheme programs manipulating multilingual text take some of their semantics (in particular, the meaning of string indexes and hex string constants) from Unicode. Absent such a semantics, integer constant string indexes and string index adjustments (e.g., +1) would not have a portable meaning, for example.

Second: yes, I am proposing that R6RS explicitly embrace Unicode, mostly in the form of optional features defining how characters beyond the minimal required character set are handled. My proposal does _not_ require conforming implementations to use Unicode, and it does not preclude implementations that include characters not found in Unicode (Pika's support of buckybits is an example).

As I understand it, there is resistance in some places to Unicode for two reasons: (1) legacy systems using other character sets (not much anyone can do about that -- I don't think Scheme should bend over backwards to be EBCDIC-friendly, either); (2) still-surviving controversy about Han Unification. (2) is a topic we shouldn't debate on this list or try to solve here. For Scheme, though, I think we need to look at an overarching technical reason for embracing Unicode: it is simply the best-structured design for an international character set that anybody has yet imagined. The programmatic structure of Unicode and Unicode string encodings is the best there is, and that structure is orthogonal to the controversial issues. (Contrast with, for example, shift-encodings.) Unicode does define The Right Thing for the programmatic aspects of string handling -- Scheme should simply embrace that.

(To make this clearer: _if_ unification really is a broken idea, it can be fixed in Unicode. Were it fixed in Unicode, none of the requirements and recommendations of the draft would change. No portable use of those features would break.)

> * If the implementation uses EUCJP as its internal CES, it
>   will face difficulty with the recommendation that INTEGER->CHAR
>   support [0,255], since EUCJP does not have full mappings
>   in this range, although it has many more characters than 256.
>   I think it's possible that (integer->char #xa1) on such
>   implementations returns a "pseudo character", which doesn't
>   correspond to any character in the EUCJP CES but is guaranteed
>   to produce #xa1 when passed to char->integer. But the
>   effects would be undefined if such a character is used within
>   a string. (An implementation can also choose integers other
>   than the codepoint value to fulfill this "256 character"
>   requirement, but it wouldn't be intuitive.)

You say that "the effects would be undefined if such a character is used within a string". I don't see why that would have to be the case -- I only see how a particular string implementation could have that bug. There are 128 unassigned octet sequences available in EUCJP that won't be mistaken for any real character, right? Purely internally, those sequences could be used in the string representation to represent the effect of something like (STRING-SET! s (INTEGER->CHAR 161)). Alternatively, you could use multiple string representations internally, similar to those I described for Pika. The problem you mention isn't unique to EUCJP -- it occurs for a naively implemented UTF-8- or UTF-16-based Scheme as well. The solutions are analogous.

> * The "What _is_ a Character" section doesn't refer to an
>   implementation where a CHAR? value corresponds to a
>   codepoint of a non-Unicode, non-iso8859-* CCS/CES.

Most likely, depending on the details of the implementation you have in mind, I would put it in the "roughly corresponds to a Unicode codepoint" category.
The sample implementation described in the draft (Pika Scheme) includes non-Unicode codepoints (characters with buckybits set). I state elsewhere in the draft that it is in the "roughly a Unicode codepoint" class.

> * In the portable FFI section, some APIs state the encoding
>   must be one of utf-8, iso8859-* or ascii, and I don't see
>   the reason for such restrictions.

How would you remove that restriction in a way that supports writing portable FFI-using code? The only plausible alternative I see would be for the FFI to also include a means to enter or extract strings using the Posix multibyte support. I think that this would be essentially impossible for FFI implementors to provide in anything other than implementations which use the multibyte routines internally. It would, in effect, be a requirement that Scheme be based on the C multibyte routines rather than Unicode.

> 3. Determining the least common set of assumptions about characters
>    and strings the language/FFI spec should define.
>    Mostly in the "R6RS recommendation" section. Some of them seem
>    to try to be codeset-independent, while some of them seem to
>    assume the Unicode/iso8859-* codeset implicitly. So I wonder
>    which is the intention of the document.

The intention is to make optional Unicode support well-defined and useful as a basis for writing portable Scheme code -- while saying no more in the core specification of Scheme than is necessary to accomplish that. I believe it will be practical for implementations internally using, for example, EUCJP to conform to the requirements -- and even to provide a useful level of support for the optional Unicode facilities.

> Another issue: is there a rationale for the "strong encouragement"
> of O(1) access for string-ref and string-set!? There are
> algorithms that truly need random access, but in many cases,
> an index is used just to mark a certain location in the string;
> e.g. if you want (string-ref str 3), it's rare that you know
> '3' is significant before you know about str -- it's more likely
> that somebody (a string search function, regexp matcher, or suffix
> database...) told you that the 3rd character of a particular
> string in str is significant. In such cases, the reason you
> use an index is not because the algorithm requires it, but
> because it is just one of the possible means of keeping a
> reference into a string.

I don't follow your argument about the distinction between an integer constant string index and one received as a parameter. In both cases, the question is "what is the computational complexity of STRING-REF, STRING-SET!, and SUBSTRING?"

-t