This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
> From: Shiro Kawai <shiro@xxxxxxxx>

> No.  String search, regexp match, or precalculated prefix/suffix
> database, all can return some sort of reference that directly
> points into the string, so that the subsequent use of such
> reference wouldn't need to count characters.
> (The implementation that shares substrings and uses write-on-copy
> for string mutation, those basic operations even can efficiently
> return substring directly.)

Well, I don't think it's that simple.

It would be hard to implement those "string reference objects" to
preserve the O(1) property in the face of STRING-SET! given a flat,
variable-width, string representation.  And if you have a tree
representation or something like what I described for Pika -- then you
don't need those "string reference objects" after all.  They might be
nice for independent reasons -- but you won't need them to get O(1)
string-ops.

> And to implement search, regexp, or prefix/suffix arrays, the
> access of string is mostly sequential, or requires "random hopping"
> in a small amount.  Sequential access can be efficiently implemented,
> using string ports, for example, than using integer index.
> It's OK to have STRING-REF as well---after all, we have LIST-REF
> and nobody complains its O(N) complexity.

In some sense, I think that the strong recommendation for O(1)
string-ops is already present in the spec.  Were it not, why wouldn't
the string syntax be a fancy way to write lists and STRING? and LIST?
not disjoint?

> [About character-set independence]

> What I felt ambiguous is the degree of "character-set independence"
> you're aiming at.  If we'd like to have a character-set independent
> language spec, we need to be much more careful to separate
> Unicode-specific issues and character-set independent issues.

Hey, I'm partisan but fair, I think.

My recommendations suggest _requirements_ for the portable character
set.  Those aren't Unicode specific.
My recommendations suggest
_requirements_for_implementations_providing_optional_features_: and
some of those are indeed Unicode specific.

Perhaps not here on the SRFI-50 list but I am willing to argue that,
for Scheme, Unicode deserves preferential treatment.

> It would be nice that Scheme language spec allows a local
> implementation that uses different CCS/CES.

I think that my recommendations are consistent with that.  (EBCDIC
excepted :-)

> Using Unicode codepoints as the portable means of hex notation
> (#\U+XXXX) is ok.

It should be fine for any implementation.  Implementations will only
be required to parse that syntax for the intersection of the abstract
characters they provide and the abstract characters of Unicode.  Such
a translation table can be compressed to a few 10K, I think, for any
of the character sets you are thinking of.

> The integer indexing is an different issue.  EUCJP #xA5F7
> character is mapped to two subsequent unicode codepoints,
> U+30AB and U+309A.  On the other hand, U+30AB itself is
> mapped to EUCJP #xA5AB, and U+309A doesn't have corresponding
> character in EUCJP.

> If STRING-REF has to be unicode codepoint index, I don't see
> how it should work.

It only has to be a codepoint index for those strings which the
implementation agrees consist entirely of Unicode characters.  As with
bear's munging of combining sequences into (non-Unicode) CHAR? values,
you can declare that EUCJP #xA5F7 is not a Unicode character.

The only portable requirement that arises here concerns strings
containing nothing but characters in the "portable character set" of
Scheme -- a subset of ASCII.  (Beyond the requirement, the rule about
indexes provides a guideline for Unicode-centric implementors, bear
somewhat excepted.)

> > My proposal does _not_
> > require conforming implementations to use Unicode and does not
> > preclude implementations that include characters not found in
> > Unicode (Pika's support of buckybits is an example).
> Requirements for unicode codepoint index and 256 character mapping
> (as I explain later) implies the implementation to use
> Unicode-compatible charset.

In portable code, the codepoint index rule affects only strings
containing nothing but characters from the "Scheme portable character
set" -- a subset of ASCII.

The 256 character mapping is a recommendation, not a requirement.

> > > * In the portable FFI section, some APIs state the encoding
> > >   must be one of utf-8, iso8859-* or ascii, and I don't see
> > >   the reason of such restrictions.

> > How would you remove that restriction in a way that supports writing
> > portable FFI-using code?

> What I'm picking there is the word "must".
> scm_extract_string8 can put answer in eucjp packed format into
> t_uchar* array if the implementation supports that, so I don't
> see why this restriction is needed.

I would not object to an addition to the portable FFI which is

    scm_extract_string_opaque
    scm_enter_string_opaque

that returns/accepts the data from a string, plus its length, but says
nothing about how the data is encoded.  Its purpose would be to
extract that data in the "most convenient form" for a given
implementation.  Would that do?

> indicated encoding (which must be one of `uni_utf8',
> `uni_iso8859_*', or `uni_ascii')

> Of course using such encoding wouldn't be portable.  But so
> as iso8859_1 implementation is asked to convert the string
> into iso8859_2.

I don't see why it wouldn't be portable.  I was thinking it would be
helpful to have a "libscheme-ffi-helpers.a" with the necessary tables.

> Gauche can be compiled using EUCJP, and doesn't have a problem
> communicating with Unicode world so far.  But I don't see
> [0..256] mapping, "Unicode codepoint index", and O(1) accesses
> are essential for such an implementation to communicate with
> Unicode world.

Neither the 0..256 mapping nor the O(1) access time are _required_ in
the proposed Scheme changes.
The 0..256 mapping _is_ required in the proposed portable FFI.

Recommending those things in R6RS would only encourage implementations
to provide them and to warn programmers relying on them that they may
have some performance surprises or functionality surprises using
Gauche.

Requiring the 0..256 mapping in the FFI means just that `char' can
always be converted to CHAR? and back again.  Is that really so
onerous?

-t
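
To make the last point concrete, here is a minimal sketch of what "`char' can always be converted to CHAR? and back again" could look like on the C side. Everything named demo_* is hypothetical and is not part of the SRFI 50 draft; the identity mapping is only an assumption for illustration, and an EUCJP-based implementation could presumably substitute a lookup table. It also assumes 8-bit chars.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Scheme CHAR? value; not the SRFI 50 type. */
typedef struct {
    uint32_t code;  /* implementation-chosen character number */
} demo_scm_char;

/* Total direction: every unsigned char (0..255, assuming 8-bit chars)
   maps to some CHAR? value.  The identity mapping is an assumption. */
static demo_scm_char demo_char_to_scm(unsigned char c)
{
    demo_scm_char s = { (uint32_t)c };
    return s;
}

/* Partial direction: only CHAR? values in the 0..255 image convert back.
   Returns 1 and stores the result on success, 0 otherwise. */
static int demo_scm_to_char(demo_scm_char s, unsigned char *out)
{
    assert(out != NULL);
    if (s.code > 255u)
        return 0;
    *out = (unsigned char)s.code;
    return 1;
}

/* Returns 1 if all 256 values survive the round trip, 0 otherwise. */
static int demo_round_trips_all_bytes(void)
{
    unsigned int i;
    for (i = 0; i < 256u; i++) {
        unsigned char back = 0;
        if (!demo_scm_to_char(demo_char_to_scm((unsigned char)i), &back))
            return 0;
        if (back != (unsigned char)i)
            return 0;
    }
    return 1;
}
```

The requirement only constrains the first direction plus the round trip. A CHAR? value outside the image (a buckybit character, say, or a character with no single-byte counterpart) is simply rejected by demo_scm_to_char rather than forced into a `char'.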