This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
> From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)
>
> There should be string-id=? (or some other name) which implements the
> Scheme identifier matching rules, which should be specified for the
> required character set, and left unspecified for all other
> characters.
>
> None of this requires or even implicitly uses a case mapping function.
>
> > The standard would still need to specify CHAR-DOWNCASE.
>
> Why? Is there some government bureau that will shut us down if the
> next RnRS eliminates it?
>
> I don't mind STRING-DOWNCASE, of course, which should have a locale
> argument and be specified to permit the Correct Unicode Thing.

Ok -- I think we can agree on some things. You're roughly right, I
think.

We should also point readers in general to:

    http://www.unicode.org/reports/tr15/#Programming_Language_Identifiers

which is Annex 7 ("Programming Language Identifiers") of Unicode
Technical Report 15 ("Unicode Normalization Forms").

Enclosed is a more fleshed-out and improved description of the
approach you're advocating, plus its reconciliation with my
suggestions for R6RS (which, frankly, don't need to change very much
-- mostly this just involves adding new material).

For SRFI-50 list relevance: let me point out that this doesn't change
the proposed char/string FFI at all. On the other hand, the fact that
the recommendations for R6RS continue to work out nicely is
confirmation that the analysis that leads to those FFI recommendations
is sound.

So far we've more or less made peace with R5RS, my recommendations for
R6RS, Thomas Bushnell's thoughts on supporting linguistically sane
Scheme identifiers, Shiro's concerns about implementations using
character sets other than Unicode and its subsets/extensions, Bear's
work on infinite character sets, and the emerging design of Pika.

I think what Thomas B. is suggesting is better provided by this:

* (identifier? s) => <bool>

  Return #f unless `s' is a legal identifier name. It is required
  that:

      (identifier? (symbol->string s)) => #t    ; for all symbols s

* (fold-identifier name) => folded

  Where NAME is a string containing an identifier name and FOLDED is a
  string containing an equivalent identifier name. Two identifiers are
  equivalent if and only if:

      (string=? (fold-identifier a) (fold-identifier b))

  FOLD-IDENTIFIER is required to be idempotent:

      (string=? (fold-identifier a)
                (fold-identifier (fold-identifier a)))
      => #t    ; for all identifiers a

  and, of course, IDENTIFIER? is closed under FOLD-IDENTIFIER:

      (or (not (identifier? s))
          (identifier? (fold-identifier s)))
      => #t    ; for all strings s

  The definition of FOLD-IDENTIFIER must be consistent with the
  recommendations of Annex 7 ("Programming Language Identifiers") of
  Unicode Technical Report 15 for identifier names comprised entirely
  of Unicode characters. For this purpose, the characters of the
  portable Scheme character set are considered to be Unicode
  characters. (A short summary of the implications of this requirement
  for portable identifiers is that, given a portable identifier,
  FOLD-IDENTIFIER must map #\A..#\Z to #\a..#\z.)

  (FOLD-IDENTIFIER is preferable to STRING-ID=? because it produces a
  canonical form of each identifier explicitly rather than implicitly.
  The canonical form is useful because it can be hashed, stored in a
  trie, etc. It would be impractical to implement, for example, a
  symbol table in a compiler given only STRING-ID=?.)

* (concatenate-identifiers s0 s1 ...) => id

  Return a string ID containing an identifier name which is the
  concatenation of the arguments, which must themselves be identifier
  names. If all of the arguments are portable Scheme identifiers, then
  this function must behave identically to STRING-APPEND.

  (As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed because
  IDENTIFIER? won't be closed under STRING-APPEND -- but I could be
  mistaken about that. More research is needed.)

Now, what becomes of the character class procedures such as
CHAR-NUMERIC?
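As a concrete (and entirely non-normative) illustration of those
contracts, here is a small Python sketch of IDENTIFIER?,
FOLD-IDENTIFIER, and the induced equivalence, restricted to the
portable (R5RS) character set. The character sets and function names
below are my own assumptions for illustration, not part of the
proposal:

```python
# Hypothetical model of IDENTIFIER? / FOLD-IDENTIFIER over the portable
# Scheme character set only.  The initial/subsequent character sets are
# the R5RS ones; a real implementation would extend them.

PORTABLE_ID_START = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "!$%&*/:<=>?^_~")
PORTABLE_ID_EXTEND = PORTABLE_ID_START | set("0123456789+-.@")

def is_identifier(s):
    """IDENTIFIER?: false unless s is a legal identifier name."""
    return (len(s) > 0
            and s[0] in PORTABLE_ID_START
            and all(c in PORTABLE_ID_EXTEND for c in s[1:]))

def fold_identifier(s):
    """FOLD-IDENTIFIER: for portable identifiers, the required fold is
    just #\\A..#\\Z -> #\\a..#\\z (input is ASCII, so lower() suffices)."""
    assert is_identifier(s)
    return s.lower()

def identifiers_equivalent(a, b):
    """Two identifiers are equivalent iff their folds are string=?."""
    return fold_identifier(a) == fold_identifier(b)
```

Note how the required properties fall out directly: folding an
already-folded portable identifier changes nothing (idempotence), and
the fold of a portable identifier is again a portable identifier
(closure).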
I think that these should be retained and corrected so that one can
write a portable Scheme lexical analyzer which can accept as input
programs using the character set extensions of its host
implementation. From what I can tell, that would require the new
procedures:

* (char-id-start? c) => <bool>

  Return #t if C is a valid first character in an identifier.

* (char-id-extend? c) => <bool>

  Return #t if C is a valid non-first character in an identifier.

* (canonicalize-identifier s) => ID | #f

  Given a string S comprised of at least one CHAR-ID-START? character
  followed by any number of CHAR-ID-EXTEND? characters, return a valid
  identifier name (in the sense of IDENTIFIER?) corresponding to S, or
  #f if no such identifier name can be constructed. If S consists only
  of portable Scheme characters, the result must be STRING=? to S and
  not EQ? to S.

* (string->parsed-symbol s)

  S must be an IDENTIFIER? string. Return the symbol denoted by that
  identifier if it were used in a quoted context in a Scheme
  expression. (Note how this differs from STRING->SYMBOL.)

* (string->parsed-character s) => <char> | #f

  Given a string S whose contents are syntactically a character
  constant, return the character that constant denotes, or #f if there
  is no such character.

If we want to permit extended string syntaxes, at least this is
needed:

* (string->parsed-string s) => <string> | #f

  S must be a string whose contents are syntactically a string
  constant. Return the string that constant denotes, or #f if there is
  no such string.

Perhaps we'd also want similar procedures for other areas of syntactic
extensibility.

Now, what about the character ordering procedures (e.g. CHAR<?,
STRING<?, etc.)? I think these should remain unchanged -- they should
relate to the integer mappings of characters. (Implementations or
future standards are free to add locale parameters or introduce
alternative procedures which are linguistically sensitive.)

What about case-independent character ordering (e.g., CHAR-CI<?
and STRING-CI<?)? I see no compelling reason to eliminate them at this
stage -- they're still useful. I think they should be specified to be
consistent with the single-character default case foldings of Unicode,
where the portable character set is considered to consist of Unicode
characters. This will allow portable Scheme programs to use these
procedures to write programs which accurately manipulate Scheme
programs that use nothing but the portable character set. It would,
for example, allow a portable-character-set implementation of
FOLD-IDENTIFIER.

It also reifies into Scheme a sanctioned (even if non-preferred) sense
of Unicode character case -- while Scheme should _also_ evolve
facilities for linguistically preferable case handling, these
facilities will be useful for Scheme programs communicating with other
systems that use only the single-character case mappings. (Again,
implementations and future standards are not precluded from adding
additional parameters or new procedures for default or locale-specific
case handling.)

What about the case mappings (CHAR-UPCASE and CHAR-DOWNCASE)? Again:
retain them; specify them as using the Unicode single-character
mappings; permit implementations to add parameters or new procedures.
The result allows portable Scheme programs to handle portable Scheme
program texts and captures a useful Unicode text process.

In terms of my "strings draft" -- there is one R6RS recommendation
that should change more substantially than the tweaks suggested above.
I wanted to modify 6.3.4 to say:

These procedures [the character classes] return #t if their arguments
are alphabetic, numeric, whitespace, upper case, or lower case
characters, respectively; otherwise they return #f.

These procedures _must_ be consistent with the procedure READ provided
by the implementation. For example, if a character is
CHAR-ALPHABETIC?, then it must also be suitable for use as the first
character of an identifier.
`a..z' and `A..Z' _must_ be alphabetic and _must_ be respectively
lower and upper case. #\space, #\tab, and #\formfeed _must_ be
CHAR-WHITESPACE?. `0..9' _must_ be CHAR-NUMERIC?.

No character may cause more than one of the procedures
CHAR-ALPHABETIC?, CHAR-NUMERIC?, and CHAR-WHITESPACE? to return #t. No
character may cause more than one of the procedures CHAR-UPPER-CASE?
and CHAR-LOWER-CASE? to return #t.

Programmers are advised that these procedures are unlikely to be
suitable for linguistic programming in portable code, while
implementors are strongly encouraged to define them in ways that make
them a reasonable approximation of their linguistic counterparts.

It should say:

These procedures [the character classes] return #t if their arguments
are valid identifier start characters, valid identifier extension
characters, alphabetic, numeric, whitespace, upper case, or lower case
characters, respectively; otherwise they return #f.

These procedures _must_ be consistent with the procedure READ provided
by the implementation. For example, if a character is CHAR-ID-START?,
then it must also be suitable for use as the first character of an
identifier.

`a..z' and `A..Z' _must_ be id-start and id-extend characters and
_must_ be respectively lower and upper case. `a..z' and `A..Z' _must_
be alphabetic. If the argument to CHAR-ALPHABETIC? is a Unicode
character, then it must return #t if and only if the character is in
one of the Unicode general categories Lu, Ll, Lt, Lm, Lo, or Nl.
#\space, #\tab, and #\formfeed _must_ be CHAR-WHITESPACE?. `0..9'
_must_ be CHAR-NUMERIC?.

No character may cause more than one of the procedures CHAR-ID-START?,
CHAR-NUMERIC?, and CHAR-WHITESPACE? to return #t. No character may
cause more than one of the procedures CHAR-UPPER-CASE? and
CHAR-LOWER-CASE? to return #t.
Programmers are advised that these procedures are unlikely to be
suitable for linguistic programming in portable code, while
implementors are strongly encouraged to define them in ways that make
them a reasonable approximation of their linguistic counterparts.

A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
procedures hinges on the assumption that the portable Scheme character
set is a proper subset of Unicode. One can imagine a Scheme standard
that insisted on Unicode and that required a much larger set of valid
identifier characters. Though abstractly attractive, such requirements
would preclude tiny implementations of Scheme. Having a small and
simply structured portable character set, and then adding on to that a
level of _optional_ conformance for all of Unicode, is a far more
practical idea.

-t
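P.S. For readers who want to see two of the Unicode-specific
requirements above stated operationally, here is a rough,
non-normative Python sketch of (a) the proposed CHAR-ALPHABETIC?
condition via the general categories Lu, Ll, Lt, Lm, Lo, and Nl, and
(b) case-insensitive comparison via single-character case mappings
over the portable character set. The function names are hypothetical:

```python
# Non-normative models of two requirements discussed above; nothing
# here is part of the proposal itself.

import unicodedata

# CHAR-ALPHABETIC? for a Unicode character: true iff its general
# category is one of Lu, Ll, Lt, Lm, Lo, or Nl.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def char_alphabetic(ch):
    return unicodedata.category(ch) in ALPHABETIC_CATEGORIES

# CHAR-UPCASE / CHAR-DOWNCASE / CHAR-CI=? over the portable character
# set, using only single-character mappings.  For a..z / A..Z the
# single-character default fold is exactly A..Z -> a..z.
def simple_downcase(ch):
    return chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch

def simple_upcase(ch):
    return chr(ord(ch) - 32) if "a" <= ch <= "z" else ch

def char_ci_eq(a, b):
    return simple_downcase(a) == simple_downcase(b)

def char_ci_lt(a, b):
    # Consistent with the integer (code point) ordering of the folded
    # characters, per the recommendation above that CHAR<? relate to
    # the integer mappings.
    return ord(simple_downcase(a)) < ord(simple_downcase(b))
```

On the portable character set this coincides with the Unicode
single-character default case foldings; outside it, a real
implementation would consult the Unicode simple case mapping tables
instead of the ASCII arithmetic used here.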