Re: strings draft

    > From: Shiro Kawai <shiro@xxxxxxxx>

    > > but all implementations must either refuse to read

    > > 	"\U+30AB.\U+309A."

    > > or have

    > > 	(string-length "\U+30AB.\U+309A.") => 2

    > I see.  I think it's reasonable and acceptable.   EUCJP
    > implementation can inform the user that it can't read the constant.  
    > There are a couple of edge cases that I'd like to be clearer.
    > Can it map U+30AB to EUCJP #xA5AB, and U+309A to some
    > alternative character that designates unrecognized character?
    > (U+3013 is used in Japan traditionally).   It'll satisfy
    > codepoint index requirements.  Though
    > (string-ref "\U+30AB.\U+309A." 1) would be a surprise.

    > This can be either way---if it's not allowed in the proposal,
    > I can provide a flag so the implementation can behave either
    > "strictly conforming Unicode API" or "loose mode".

If your implementation can read:

        "\U+30AB.\U+309A."

doesn't that mean it should also read:

        (list #\U+30AB #\U+309A)

I'm not sure how to reconcile those.
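
To make the two behaviors concrete, here is a rough sketch in Python
(not Scheme, since the \U+XXXX. syntax is only a draft).  It assumes
Python's euc_jp codec is a fair stand-in for an EUCJP implementation.
The names to_eucjp_strict and to_eucjp_loose are made up for
illustration and come from neither the draft nor any implementation.

```python
GETA = "\u3013"  # GETA MARK, the traditional "unrecognized character" stand-in

def to_eucjp_strict(text):
    """Strict mode: refuse any string containing a codepoint EUC-JP lacks."""
    return text.encode("euc_jp")  # raises UnicodeEncodeError on failure

def to_eucjp_loose(text):
    """Loose mode (assumed behavior): substitute GETA per unmappable codepoint."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("euc_jp")
        except UnicodeEncodeError:
            out += GETA.encode("euc_jp")
    return bytes(out)

s = "\u30ab\u309a"  # KATAKANA LETTER KA + COMBINING SEMI-VOICED SOUND MARK
loose = to_eucjp_loose(s).decode("euc_jp")
```

In loose mode the string still has two characters, so codepoint
indexing holds, but the character at index 1 is U+3013 rather than
U+309A, which is the string-ref surprise.  to_eucjp_strict(s) raises
instead of producing a string.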

    > Another edge case.  Suppose U+30AB and U+309A codepoints are
    > written directly (without escaping) in the source code.
    > EUCJP implementation can still load such a file, if it is informed
    > that the source is in one of Unicode CES.   It will convert
    > those two codepoints into one EUCJP #xA5AB character during
    > reading, so it'll produce a string of one character.
    > Is it an out of scope of the Unicode API?

I specifically intend the R6RS recommendations to _not_ preclude that
interpretation.  Yes, you should be able to read that string constant
from some Unicode stream and wind up with a one-character string.

If someone writes a non-portable program that says "This program
assumes that all string constants are Unicode [and, in such and such a
canonicalization form, etc.]" then that program wouldn't necessarily
run correctly on your implementation.
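
As a sketch of that last point, again in Python with the euc_jp codec
standing in for your implementation: the hypothetical reader below
folds a combining mark into the preceding character.  That is only my
guess at one way of arriving at the single #xA5AB character you
describe, not a statement of what any implementation does.

```python
import unicodedata

def read_as_eucjp(text):
    """Hypothetical reader: drop combining marks that follow a base
    character, then round-trip what is left through EUC-JP."""
    kept = []
    for ch in text:
        if kept and unicodedata.combining(ch):
            continue  # assumption: the mark is folded into the base character
        kept.append(ch)
    return "".join(kept).encode("euc_jp").decode("euc_jp")

source = "\u30ab\u309a"           # two codepoints in the Unicode source
as_read = read_as_eucjp(source)   # what the EUC-JP implementation holds
portable_assumption_holds = (len(as_read) == len(source))
```

A program that hard-codes the assumption that this constant has length
2 would find length 1 here, so portable_assumption_holds comes out
false.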