I'm going to try to organize what I have to say and just make it a bullet list instead of responding to a half-dozen different things.

* Immutable strings: with Unicode and threads, they are the only viable implementation strategy. The hairy underpinnings of a threaded Unicode environment, if you don't want locking to absolutely thrash, will use purely functional structures and lock-free algorithms. But the interface is not the implementation: once you've done the legwork for immutable strings, providing string-set! and similar is a very short further trip. Constructing a new string every time a mutation happens sounds like a big chore, but a good implementation can do it in O(1) or O(log n) with a pretty large log base, reusing 99%+ of the structure of the extant string. (See the first sketch after this list.)

* Under the heading of "what is efficient should be very easy to use", and given that immutable strings *will* be more efficient in Unicode-enabled multithreaded schemata, we need a family of string functions that explicitly return new strings, leaving the extant string unchanged, rather than doing any kind of mutation. But there is plenty of time for this; we don't have to get them into R6RS, and in fact we need R6RS's Unicode specs before they can be written.

* Removing string-set! would be way too much of a flag day for existing Scheme code. You'd have to spend R7RS implementing all kinds of functional-string operations that we don't have now because they're not sensible for mutable array-based strings, add the ones you missed in R8RS, start deprecating string-set! in R9RS, and only then could you *think* about making strings immutable starting with R10RS or R11RS. It would be far more sensible for a new dialect of Lisp simply never to have string-set! than for Scheme to try to get rid of it.

* Someone has asserted that indexing by grapheme cluster implies stepping and is O(n). That is only true for strings implemented as C implements them. It is perfectly feasible to represent characters internally as constant-width values, using a subrange of those values as indexes into a table of interned characters: you intern any character that falls into that subrange and any character too wide for your fixed-width units. Then you index in O(1) and don't worry at all about whether some characters are big. Of course, in such an implementation indexing by codepoint would be problematic; the double-indexing methods (where both codepoint and grapheme-cluster positions are indexed) I can think of right now are all O(log n). (See the second sketch after this list.)

* Regarding what ought to be legal as an identifier: I think control characters, whitespace (properties Zs, Zl, Zp), and delimiters (the opening, closing, initial, and final punctuation properties Ps, Pe, Pi, and Pf, plus connectors, Pc) ought not to appear in identifiers. I wouldn't be at all upset if a standard also forbade combining characters; after all, identifiers and symbol names don't need the full functionality of strings. Nor would I be upset if a standard forbade all characters not yet assigned as of Unicode 4.1.0, with the implication that the prohibition would be permanent across Scheme report revisions, even though later Unicode versions doubtless will come along.

* I think you ought to be able to tell within the first three characters that a symbol is not a number. Right now the standard says the first one character, and I'm okay with that too. (See the third sketch after this list.)
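To make the structure-sharing claim concrete, here is a minimal sketch of a persistent rope: a two-way tree of small string chunks. The names (leaf, node, rope-ref, rope-set) are mine, not from any SRFI, and a real implementation would balance the tree and pick chunk sizes carefully; the point is only that a "mutation" rebuilds one spine of O(log n) nodes and shares every other subtree with the old string.

    ;; A leaf holds an ordinary small string; a node caches the total
    ;; length of its subtree so indexing can descend in O(log n).
    (define (leaf str) (vector 'leaf str))

    (define (rope-length r)
      (if (eq? (vector-ref r 0) 'leaf)
          (string-length (vector-ref r 1))
          (vector-ref r 3)))

    (define (node l r)
      (vector 'node l r (+ (rope-length l) (rope-length r))))

    (define (rope-ref r i)
      (if (eq? (vector-ref r 0) 'leaf)
          (string-ref (vector-ref r 1) i)
          (let ((llen (rope-length (vector-ref r 1))))
            (if (< i llen)
                (rope-ref (vector-ref r 1) i)
                (rope-ref (vector-ref r 2) (- i llen))))))

    ;; Functional update: returns a new rope, leaves the old one alone.
    (define (rope-set r i ch)
      (if (eq? (vector-ref r 0) 'leaf)
          (let ((s (string-copy (vector-ref r 1))))
            (string-set! s i ch)   ; a private copy, so bashing it is safe
            (leaf s))
          (let* ((l    (vector-ref r 1))
                 (rt   (vector-ref r 2))
                 (llen (rope-length l)))
            (if (< i llen)
                (node (rope-set l i ch) rt)              ; right side shared
                (node l (rope-set rt (- i llen) ch)))))) ; left side shared

    ;; (define r1 (node (leaf "immut") (leaf "able")))
    ;; (define r2 (rope-set r1 0 #\I))
    ;; (rope-ref r2 0)  => #\I
    ;; (rope-ref r1 0)  => #\i   -- the extant string is unchanged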
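And here is a sketch of the interning scheme for O(1) grapheme-cluster indexing. Everything below (the unit-limit cutoff, the linear-scan intern table) is illustrative only: a real implementation would hash, and would choose the spare subrange to suit its fixed-width unit. A string then becomes a vector of such units, so vector-ref indexes by grapheme cluster in O(1) no matter how wide any particular cluster is.

    ;; Units below unit-limit encode a single codepoint directly; units
    ;; at or above it are slots in the intern table, each slot holding
    ;; the full codepoint sequence of one wide cluster.
    (define unit-limit #x110000)
    (define intern-table (make-vector 0))

    (define (intern-cluster! cps)    ; cps = list of codepoints
      (let ((n (vector-length intern-table)))
        (let loop ((i 0))
          (cond ((= i n)             ; not seen before: append a new slot
                 (set! intern-table
                       (list->vector
                        (append (vector->list intern-table) (list cps))))
                 (+ unit-limit i))
                ((equal? (vector-ref intern-table i) cps)
                 (+ unit-limit i))   ; already interned: reuse its unit
                (else (loop (+ i 1)))))))

    (define (unit->cluster u)        ; O(1): direct code or one table lookup
      (if (< u unit-limit)
          (list u)
          (vector-ref intern-table (- u unit-limit))))

    ;; (define u (intern-cluster! '(#x0065 #x0301)))  ; e + combining acute
    ;; (unit->cluster u)  => (101 769)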
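Finally, a toy version of the three-character rule. The numeric prefixes recognized here are hypothetical, not R5RS's actual grammar; the point is just that the decision "this token cannot be a number" needs only a bounded, three-character lookahead.

    ;; #t iff the first three characters leave open the possibility that
    ;; TOK is a number; #f means it must be an identifier (or an error).
    (define (number-possible? tok)
      (let ((n (string-length tok)))
        (cond ((= n 0) #f)
              ((char-numeric? (string-ref tok 0)) #t)    ; "42..."
              ((and (> n 1)
                    (memv (string-ref tok 0) '(#\+ #\- #\.))
                    (char-numeric? (string-ref tok 1)))
               #t)                                       ; "+4...", ".5..."
              ((and (> n 2)
                    (char=? (string-ref tok 0) #\#)
                    (memv (char-downcase (string-ref tok 1))
                          '(#\b #\o #\d #\x #\e #\i)))
               #t)                                       ; "#x1f", "#e99"
              (else #f))))                               ; identifier territory

    ;; (number-possible? "+")    => #f  -- "+" is safely an identifier
    ;; (number-possible? "1+i")  => #t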
I think that the standard needs to explicitly specify a method for partitioning the set of all strings into a few groups:

* things an implementation is required to accept as identifiers
* things an implementation is free to accept as identifiers
* things an implementation may not accept as numbers or identifiers (reserved for non-numeric, non-identifier extensions)
* things an implementation is free to accept as numbers
* things an implementation is required to accept as numbers

(and so on: required and permitted syntaxes for booleans, characters, and other constant formats.)

The first category would be simple, like the BNF from RNRS. The second category would be things that, *if* they appear in a program, *must* be either identifiers or syntax errors. You shove 'em off here because some folks may want them as identifiers whilst others hate what the pattern-matching for them does to their parsers (an example would be my alternate three-characters rule for determining that something isn't a number). The third category is for extensions and unspecified kinds of data; here's where I'd put those opening and closing and initial and final punctuations I wanted kept out of identifiers. The fourth category would be for classes of strings reserved for numeric extensions and numeric constant formats that implementations aren't required to implement. And the fifth would be for the numeric formats that all implementations *are* required to implement.

Right now we have a minimal set of strings guaranteed to be legal identifiers (the BNF for symbols in R5RS) and a very minimal numeric syntax that's required to be accepted (a limited range of integers, with the limit unspecified), and every implementation makes its own call as to how to interpret all other tokens. 21/22 may be an identifier in a Scheme that doesn't support rationals, #q2+2i+j+4k may be a number in a Scheme that supports quaternions, and #33.28 may be a boolean in a system with fuzzy-logic extensions. There is absolutely *nothing* you can do to define a syntactic class of objects that is guaranteed *not* to step on number or identifier syntax in every Scheme. This is a mess: it inhibits innovation and introduces bugs into code that works fine elsewhere, and the standard *does* need to help sort it out. (A sketch of the partition as a classifier follows.)
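Here is the five-way partition rendered as an executable sketch. The predicates are toy stand-ins for whatever BNF the report would actually supply (and tokens are assumed nonempty); the point is only the shape of the thing: every token lands in exactly one bucket, and an implementation's freedom lives entirely in the two "free" buckets.

    (define (initial-char? c)          ; roughly R5RS <initial>
      (or (char-alphabetic? c)
          (memv c '(#\! #\$ #\% #\& #\* #\/ #\: #\< #\= #\> #\? #\^ #\_ #\~))))

    (define (required-number? tok)     ; toy: unsigned decimal integers only
      (let loop ((i 0))
        (or (= i (string-length tok))
            (and (char-numeric? (string-ref tok i))
                 (loop (+ i 1))))))

    (define (required-identifier? tok) ; toy: the minimal R5RS-style BNF
      (initial-char? (string-ref tok 0)))

    (define (reserved-syntax? tok)     ; toy: bracket classes set aside
      (memv (string-ref tok 0) '(#\[ #\] #\{ #\})))

    (define (classify-token tok)
      (cond ((required-number? tok)     'number/required)     ; category 5
            ((required-identifier? tok) 'identifier/required) ; category 1
            ((reserved-syntax? tok)     'reserved)            ; category 3
            ((char=? (string-ref tok 0) #\#)
             'number/free)             ; category 4: #q2+2i+j+4k lands here
            (else 'identifier/free)))  ; category 2: 21/22 lands here

    ;; (classify-token "lambda")       => identifier/required
    ;; (classify-token "42")           => number/required
    ;; (classify-token "21/22")        => identifier/free
    ;; (classify-token "#q2+2i+j+4k")  => number/free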