This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
Tom Lord <lord@xxxxxxx> writes:

> We should also point readers in general to:
>
>     http://www.unicode.org/reports/tr15/#Programming_Language_Identifiers
>
> which is Annex 7 ("Programming Language Identifiers") of Unicode
> Technical Report 15 ("Unicode Normalization Forms").

Yes. I think the Unicode suggestions for programming language
identifiers are good ones, and we should both point to them and
strongly suggest their use. I'm not quite prepared to say that we
should standardize Scheme to require it (even on Unicode platforms).

> * (identifier? s) => <bool>

This is fine. An implementation should be allowed to always return #t
from this function, even though not every such string could be parsed
as an identifier by the reader. (This is for the sake of eval, at
least.)

> The definition of FOLD-IDENTIFIER must be consistent with the
> recommendations of Annex 7 ("Programming Language Identifiers") of
> Unicode Technical Report 15 for identifier names comprised
> entirely of Unicode characters.

Again, I would suggest that we merely advocate this, but not require
it.

> For this purpose, the characters
> of the portable Scheme character set are considered to be Unicode
> characters. (A short summary of the implications of this
> requirement for portable identifiers is that given a portable
> identifier, FOLD-IDENTIFIER must map #\A..#\Z to #\a..#\z.)

On the other hand, we should certainly specify exactly the behavior of
the function for the required character set, agreed.

> (FOLD-IDENTIFIER is preferable to STRING-ID=? because it
> produces a canonical form of each identifier explicitly
> rather than implicitly. The canonical form is useful because
> it can be hashed, stored in a trie, etc. It would be
> impractical to implement, for example, a symbol table in a
> compiler given only STRING-ID=?.)

I think my worry is that it is not obvious that an implementation even
has an implicit folding available, at least not cheaply.
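(To make the portable-character-set case concrete: restricted to
#\A..#\Z -> #\a..#\z, the explicit folding Tom describes is a
one-liner. This is my own R5RS-level sketch, not text from the
proposal, and it deliberately ignores everything beyond the portable
set, where the Annex 7 rules would have to take over:)

```scheme
;; A minimal sketch of FOLD-IDENTIFIER over the portable Scheme
;; character set only: the sole required folding there is
;; #\A..#\Z -> #\a..#\z, which char-downcase already performs.
;; Full Unicode identifiers would need the TR15 Annex 7 rules.
(define (fold-identifier s)
  (list->string (map char-downcase (string->list s))))

;; (fold-identifier "Call-With-Current-Continuation")
;;   => "call-with-current-continuation"
;; The result is a canonical form, so it can be hashed or stored
;; in a trie directly -- the property claimed for FOLD-IDENTIFIER
;; over STRING-ID=?.
```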
There should perhaps be a hash function to go with string-id=? to
help. Many implementations will of course implement these things by
folding. But if you think that string-id=? really should be allowed to
implement arbitrary equivalence classes (provided that the standard
character set works right), it isn't obvious to me that
fold-identifier can be cheap; it might well be more expensive than
whatever straightforward test is used.

> * (concatenate-identifiers s0 s1 ...) => id
>
> Return a string ID, containing an identifier name which
> is the concatenation of the arguments, which must themselves
> be identifier names.
>
> (As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed
> because IDENTIFIER? won't be closed under STRING-APPEND -- but
> I could be mistaken about that. More research is needed.)

In the cases where identifier? isn't closed under string-append,
concatenate-identifiers might need to do more work than just
concatenate. (What does "the concatenation of the arguments" mean, if
not string-append?)

> * (char-id-start? c) => <bool>
> Return #t if C is a valid first character in an identifier.
>
> * (char-id-extend? c) => <bool>
> Return #t if C is a valid non-first character in an identifier.

These may be contextual. A character may be allowed at the beginning
of an identifier, but only if something else is true later on.
(Consider the "if it's not a number, it's an identifier" rule of the
current standard.) Perhaps a system might want to have functions like
this, but I'd like to see more experience before standardizing
something.

> What about case independent character ordering (e.g., CHAR-CI<? and
> STRING-CI<?)? I see no compelling reason to eliminate them at this
> stage -- they're still useful. I think they should be specified to be
> consistent with the single-character default case foldings of Unicode,
> where the portable character set is considered to consist of Unicode
> characters.
> This will allow portable Scheme programs to use these
> procedures to write programs which accurately manipulate Scheme
> programs that use nothing but the portable character set.

string-ci<? is fine, but must have a locale argument. If you want to
have a standardly specified "default case foldings of Unicode" locale,
that's fine with me. Ditto for char-ci<?.

> What about case mappings (CHAR-UPCASE and CHAR-DOWNCASE)? Again:
> retain them; specify them as using the Unicode single character
> mappings; permit implementations to add parameters or new procedures
> -- the result allows portable Scheme programs to handle portable
> Scheme program texts and captures a useful Unicode text process.

No, no, no. Don't make functions that are known to be wrong. This is a
bad idea. It's like requiring < to work for complex numbers by
comparing magnitudes, and saying "well, that's close enough". It's
not.

You can case-map strings, and this should certainly be allowed. It
should also have a locale argument. You cannot sensibly case-map
characters except in the "Unicode single character mappings" locale;
and why should we have specially privileged functions there? It will
only encourage people to *use* the functions, and their code will then
be non-portable precisely when it matters.

At the very least, make it allowed for char-upcase to simply fail to
give any answer, and provide a locale argument. Or allow char-upcase
to return a string.

> A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
> procedures hinges on the assumption that the portable Scheme character
> set is a proper subset of Unicode.

I'm assuming that (or at least, I want to make it possible), but I do
*not* think that char-upcase and char-downcase are good ideas.
string-upcase and string-downcase, by contrast, are unobjectionable,
provided they get a locale argument.

Thomas