Re: strings draft
> From: Shiro Kawai <shiro@xxxxxxxx>
> > but all implementations must either refuse to read
> > "\U+30AB.\U+309A."
> > or have
> > (string-length "\U+30AB.\U+309A.") => 2
> I see. I think it's reasonable and acceptable. An EUCJP
> implementation can inform the user that it can't read the constant.
>
> There are a couple of edge cases that I'd like clarified.
>
> Can it map U+30AB to EUCJP #xA5AB, and U+309A to some
> alternative character that designates an unrecognized character?
> (U+3013 is traditionally used for this in Japan.) That would satisfy
> the codepoint index requirements, though the result of
> (string-ref "\U+30AB.\U+309A." 1) would be a surprise.
> This can go either way: if it's not allowed in the proposal,
> I can provide a flag so the implementation can behave either as a
> "strictly conforming Unicode API" or in "loose mode".
If your implementation can read:
"\U+30AB.\U+309A."
doesn't that mean it should also read:
(list #\U+30AB #\U+309A)
I'm not sure how to reconcile those.
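
To make the "loose mode" idea concrete, here is a rough sketch in Python
(not Scheme, only because Python ships an EUC-JP codec). The function name
to_eucjp and the loose flag are made up for illustration and are not part
of any proposal. It encodes one code point at a time, so the codepoint
count is preserved, and substitutes U+3013 for anything EUC-JP cannot
represent:

```python
import unicodedata

# KATAKANA LETTER KA followed by the combining semi-voiced sound mark.
s = "\u30AB\u309A"

def to_eucjp(text, loose=False):
    """Encode each code point to EUC-JP separately (hypothetical helper).

    Strict mode refuses unmappable code points; loose mode substitutes
    U+3013 (GETA MARK), as suggested in the quoted message.
    """
    out = []
    for ch in text:
        try:
            out.append(ch.encode("euc_jp"))
        except UnicodeEncodeError:
            if not loose:
                raise
            out.append("\u3013".encode("euc_jp"))
    return out

# Unicode has no precomposed form for this pair, so NFC leaves it alone
# and the string stays two code points long.
assert unicodedata.normalize("NFC", s) == s
assert len(s) == 2
```

In loose mode the result still has two elements, so index 1 exists, but it
holds the substitute character rather than anything resembling U+309A.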
> Another edge case: suppose the U+30AB and U+309A codepoints are
> written directly (without escaping) in the source code.
> An EUCJP implementation can still load such a file if it is informed
> that the source is in one of the Unicode character encoding schemes
> (CES). It will convert those two codepoints into one EUCJP #xA5AB
> character during reading, so it'll produce a string of one character.
> Is that outside the scope of the Unicode API?
I specifically intend the R6RS recommendations to _not_ preclude that
interpretation. Yes, you should be able to read that string constant
from some Unicode stream and wind up with a one character string
constant.
If someone writes a non-portable program that says "This program
assumes that all string constants are Unicode [and, in such and such a
canonicalization form, etc.]" then that program wouldn't necessarily
run correctly on your implementation.
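
As a toy model of that situation (Python again, with an invented pair table
and invented names; a real reader would consult its native character set
instead), a reader that merges known code point pairs yields a
one-character string where a Unicode-counting program expects two:

```python
# Hypothetical table: code point pairs that the native character set
# stores as one character. The value is just a placeholder label.
NATIVE_PAIRS = {("\u30AB", "\u309A"): "<KA+SEMI-VOICED>"}

def read_native(text):
    """Return a list of native characters, merging known pairs."""
    chars = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in NATIVE_PAIRS:
            chars.append(NATIVE_PAIRS[pair])
            i += 2
        else:
            chars.append(text[i])
            i += 1
    return chars

source = "\u30AB\u309A"          # two code points in the Unicode stream
native = read_native(source)     # one character after reading

# A program that hard-codes "this constant has length 2" is the
# non-portable case: that holds for the source, not the native string.
assert len(source) == 2
assert len(native) == 1
```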
-t