This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
> From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)
>
> > * (identifier? s) => <bool>
>
> This is fine.  An implementation should be allowed to always return #t
> from this function, even though not every such string could be parsed
> as an identifier by the reader.  (This is for the sake of eval, at
> least.)

Hmm.... I don't think so.  It should deal with source texts -- eval'able
forms being something else.

Which makes me realize, incidentally, that this requirement that I
stated:

    It is required that:

        (identifier? (symbol->string s)) => #t

    for all symbols s.

is wrong (and should just be dropped).

> > The definition of FOLD-IDENTIFIER must be consistent with the
> > recommendations of Annex 7 ("Programming Language Identifiers") of
> > Unicode Technical Report 15 for identifier names comprised
> > entirely of Unicode characters.
>
> Again, I would suggest that we merely advocate this, but not require
> it.

Things like that can be split into the R6RS part and parts for SRFIs or
later standards.  The key thing is to make sure that nothing R6RS
requires is inconsistent with that report.  The secondary thing is to
guide implementors towards that report.

> > (FOLD-IDENTIFIER is preferable to STRING-ID=? because it
> > produces a canonical form of each identifier explicitly
> > rather than implicitly.  The canonical form is useful because
> > it can be hashed, stored in a trie, etc.  It would be
> > impractical to implement, for example, a symbol table in a
> > compiler given only STRING-ID=?.)
>
> I think my worry is that it is not obvious that an implementation even
> has an implicit folding available, at least, not cheaply.  There
> should perhaps be a hash function to go with string-id=? to help.
> Many implementations will of course implement these things by
> folding.  But if you think that really string-id=? should be allowed
> to implement arbitrary equivalence classes (provided that the standard
> character set works right), it isn't obvious to me that
> fold-identifier can be cheap, and that it might well be more expensive
> than whatever straightforward test is used.

I'm having trouble imagining an implementation that doesn't have or
couldn't trivially implement a FOLD-IDENTIFIER procedure.
Mathematically, such a procedure is always possible.  The combination of
those considerations causes me to prefer the more general
FOLD-IDENTIFIER.

> > * (concatenate-identifiers s0 s1 ...) => id
> >
> > Return a string ID, containing an identifier name which
> > is the concatenation of the arguments, which must themselves
> > be identifier names.
> >
> > (As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed
> > because IDENTIFIER? won't be closed under STRING-APPEND -- but
> > I could be mistaken about that.  More research is needed.)
>
> In the cases where identifier? isn't closed under string-append,
> concatenate-identifiers might need to do more work than just
> concatenate.

That's right.  That's the rationale for having it instead of relying on
STRING-APPEND.

> (What does "the concatenation of the arguments" mean, if
> not string-append?)

It means "do those extra things".  I specifically want to ensure a
mechanism for doing things like making structure access procedure names
derived from structure names.  Absent CONCATENATE-IDENTIFIERS, this does
not appear to be possible except over the portable character set.

> > * (char-id-start? c) => <bool>
> >
> > Return #t if C is a valid first character in an identifier.
> >
> > * (char-id-extend? c) => <bool>
> >
> > Return #t if C is a valid non-first character in an identifier.
>
> These may be contextual.  A character may be allowed in the beginning
> of an identifier, but only if something else is true later on.
> (Consider the "if it's not a number, it's an identifier" rule of the
> current standard.)
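To make the symbol-table point concrete, here is a sketch using the
proposed FOLD-IDENTIFIER.  (The hash-table procedures are illustrative
stand-ins in the style of SRFI 69, not part of any proposal here.)

    ;; Key the table on the canonical (folded) form.  Any folding
    ;; procedure works; what matters is that it is explicit.
    (define (intern! table id value)
      (hash-table-set! table (fold-identifier id) value))

    (define (lookup table id)
      (hash-table-ref/default table (fold-identifier id) #f))

    ;; With only STRING-ID=? there is no key to hash or store in a
    ;; trie; lookup degenerates to a linear scan with STRING-ID=?
    ;; against every known identifier.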
> Perhaps a system might want to have functions like this, but I'd like
> to see more experience before standardizing something.

Disagree.  These are consistent both with Unicode "best practice" and
Scheme syntax.  Recall that CANONICALIZE-IDENTIFIER is permitted to
return #f (analogously to STRING->NUMBER).

(It might be worth explicitly requiring that any numeric syntax
extensions made by an implementation are such that they are consistent
with these definitions.  It's not absolutely necessary, but it would
simplify lexing.  In other words:

    (or (not (string->number s))
        (= 0 (string-length s))
        (not (char-id-start? (string-ref s 0)))
        (not (map-and char-id-extend?
                      (string->list (substring s 1 (string-length s))))))
    => #t

 for all strings s.)

> > What about case independent character ordering (e.g., CHAR-CI<? and
> > STRING-CI<?)?  I see no compelling reason to eliminate them at this
> > stage -- they're still useful.  I think they should be specified to
> > be consistent with the single-character default case foldings of
> > Unicode, where the portable character set is considered to consist
> > of Unicode characters.  This will allow portable Scheme programs to
> > use these procedures to write programs which accurately manipulate
> > Scheme programs that use nothing but the portable character set.
>
> string-ci<? is fine, but must have a locale argument.  If you want to
> have a standardly specified "default case foldings of Unicode" locale,
> that's fine with me.  Ditto for char-ci<?.

Unicode provides roughly three classes of case
conversion/folding/matching:

~ default length preserving -- linguistically suboptimal but has useful
  closure and compatibility properties

~ default length varying -- locale independent, linguistically very
  good

~ locale length varying -- locale dependent, linguistically perfect
  wrt. a given locale

(I suppose in theory there are also implied locale-specific,
single-character mappings -- these can be seen, for my purposes here,
as a special case of locale length varying.)

Scheme's STRING-CI<? should use the first (default length preserving)
because it is maximally upward compatible with R5RS, is sufficient for
processing programs that use only the portable character set, is a
needed tool to put in the Unicode toolbox, and is the interpretation
that best preserves the simple quasi-algebraic properties relating
character and string orderings (such as one might want for implementing
a trie of identifiers).

Nothing about that requirement precludes adding additional parameters
or procedures to handle the other two (or three) kinds of case mapping.

> > What about case mappings (CHAR-UPCASE and CHAR-DOWNCASE)?  Again:
> > retain them; specify them as using the Unicode single character
> > mappings; permit implementations to add parameters or new procedures
> > -- the result allows portable Scheme programs to handle portable
> > Scheme program texts and captures a useful Unicode text process.
>
> No, no, no.  Don't make functions that are known to be wrong.  This is
> a bad idea.  It's like requiring < to work for complex numbers, and
> then comparing magnitude, and saying "well, that's close enough".
> It's not.

It's not like complex numbers.  Characters are, at best,
quasi-algebraic.  Numbers are algebraic.  Comparing complex numbers
that way is usually nonsensical; comparing characters this way is a
standardized text process with many uses.

Character and string orderings over the portable character set are
related by a partial ordering of characters (defined in terms of the
case of the portable characters) serving as the basis of a lexical
ordering of strings.  Regardless of any linguistic interpretation,
these are handy things to keep around for processing portable Scheme
source texts.
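A sketch of the interpretation being advocated.  (CHAR-FOLDCASE here
stands for the single-character default case folding, and STRING-MAP is
an assumed helper; neither is part of the proposal text.)

    ;; STRING-CI<? as a lexical ordering over the single-character
    ;; default case foldings: fold each character independently, then
    ;; compare lexicographically.
    (define (string-ci<? a b)
      (string<? (string-map char-foldcase a)
                (string-map char-foldcase b)))

    ;; Length preserving by construction, so the ordering of strings
    ;; is still determined character-by-character -- exactly the
    ;; property a trie of identifiers depends on.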
The Unicode extension (via single-character default case mappings) of
the partial order that applies to the portable Scheme character set is
the one that is both maximally upward compatible and the most carefully
thought-about/negotiated approximation for this kind of text
processing.

A "systems programming" Scheme with full Unicode support will _need_
the default length preserving case mappings --- to talk with other
systems, if nothing else.  Any Scheme with full Unicode support and
length-varying case mappings can provide the default length preserving
mappings nearly for free.

At _most_, while we _should_ presumably be in full agreement about what
functionality should be available (all three kinds of case mapping),
we're arguing over the ridiculous question of which of those
functionalities forms like:

    (string-ci<? a b)

refer to.  The choice I'm advocating is the most upward compatible one,
by far.

> You can case map strings, and this should certainly be allowed.  It
> should also have a locale argument.

That functionality should be present in a good Unicode Scheme, I agree.
My R6RS recommendations are perfectly consistent with that.

> You cannot sensibly case-map characters except in the "unicode single
> character mappings" locale; and why should we have special privileged
> functions there?  It will only encourage people to *use* the
> functions, and their code will then be non-portable precisely when it
> matters.
>
> At the very least, make it allowed for char-upcase to simply fail to
> give any answer, and provide a locale argument.  Or allow char-upcase
> to return a string.

I haven't precluded char-upcase from being extended to accept an
optional locale argument, or from returning strings when that argument
is provided.  Of the behaviors one might request with a locale
argument, I've picked precisely the one defined by the Unicode
standards for situations where casemapping a character must return a
character.
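For a concrete illustration of why a character-to-character mapping
must sometimes differ from the full mapping, consider U+00DF (ß): its
full uppercase mapping is the two-character string "SS", so the
single-character default mapping leaves it alone.  (The locale-extended
STRING-UPCASE below is the proposed extension, not an existing
procedure.)

    (char-upcase #\ß)   ; => #\ß -- no single uppercase character
                        ;    exists, so the default single-character
                        ;    mapping is the identity

    ;; The length-varying result belongs at the string level, e.g.
    ;; under a length-varying STRING-UPCASE extension:
    ;; (string-upcase "straße") => "STRASSE"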
> > A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
> > procedures hinges on the assumption that the portable Scheme
> > character set is a proper subset of Unicode.
>
> I'm assuming that (or at least, I want to make it possible), but I do
> *not* think that char-upcase and char-downcase are good ideas.

They are valuable because they provide a simple model for processing
texts written using the portable Scheme character set, and because they
can be compatibly extended to implement a standard Unicode text
process.

> string-upcase and string-downcase, by contrast, are unobjectionable,
> provided they get a locale argument.

Linguistic text processing is a separate matter from character-based
text processing and from processing portable Scheme source texts.

Character-based text processing is computationally useful and makes
perfectly good sense wrt. Unicode.  By non-coincidence, it is a
superset of what's needed for processing portable Scheme source texts.

Meanwhile, extensions such as FOLD-IDENTIFIER provide sufficient
mechanism for implementations and future standards to extend their
lexical syntax in linguistically sensitive ways without, at the same
time, requiring linguistic text processing facilities in the core of
Scheme.  Linguistic text processing facilities can instead be added as
libraries and extensions to standard procedures.

-t