Re: strings draft
From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Fri, 23 Jan 2004 20:31:32 -0800 (PST)
> > So, when the EUCJP Scheme reads a string
>
> > "\U+30AB.\U+309A."
>
> > Then it can produce a string which consists of a single character
> > EUCJP #xA5F7.
>
> Eh... no. The final language should be such that that string
> constant denotes a string of two Unicode codepoints.
[...]
> but all implementations must either refuse to read
>
> "\U+30AB.\U+309A."
>
> or have
>
> (string-length "\U+30AB.\U+309A.") => 2
I see. I think that's reasonable and acceptable. An EUCJP
implementation can inform the user that it can't read the constant.
There are a couple of edge cases that I'd like to have clarified.
Can it map U+30AB to EUCJP #xA5AB, and U+309A to some
substitute character that stands for an unrecognized character?
(U+3013 is traditionally used for this in Japan.) That would satisfy
the codepoint index requirements, though the result of
(string-ref "\U+30AB.\U+309A." 1) would be a surprise.
This can go either way: if it's not allowed in the proposal,
I can provide a flag so that the implementation behaves either as a
"strictly conforming Unicode API" or in "loose mode".
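To make the first edge case concrete, here is a minimal sketch of the strict/loose flag, written in Python so that it does not depend on the proposed literal syntax. The table is a toy containing only the codepoints discussed here, and the names `read_escaped`, `UnreadableConstant`, and `strict` are hypothetical, not part of any proposal or implementation.

```python
# Toy Unicode -> EUC-JP table with only the entries relevant to this thread.
# A real implementation would consult its full conversion tables instead.
GETA = 0x3013  # U+3013 GETA MARK, the traditional "unknown character" stand-in

TO_EUCJP = {
    0x30AB: 0xA5AB,  # KATAKANA LETTER KA
    GETA: 0xA2AE,    # GETA MARK (JIS X 0208 row 2, cell 14)
}


class UnreadableConstant(Exception):
    """Raised in strict mode when a codepoint has no native character."""


def read_escaped(codepoints, strict=True):
    """Map each escaped codepoint to exactly one native character.

    Strict mode refuses the constant; loose mode substitutes the geta mark,
    so the string length still equals the number of codepoints.
    """
    out = []
    for cp in codepoints:
        if cp in TO_EUCJP:
            out.append(TO_EUCJP[cp])
        elif strict:
            raise UnreadableConstant(f"cannot represent U+{cp:04X}")
        else:
            out.append(TO_EUCJP[GETA])
    return out
```

In loose mode, `read_escaped([0x30AB, 0x309A], strict=False)` yields two characters, so codepoint indexing holds, but the character at index 1 is the geta mark rather than U+309A. That is the surprise mentioned above.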
Another edge case: suppose the U+30AB and U+309A codepoints are
written directly (without escaping) in the source code.
An EUCJP implementation can still load such a file if it is informed
that the source is in one of the Unicode CESs. It will convert
those two codepoints into one EUCJP #xA5F7 character during
reading, so it will produce a string of one character.
Is that outside the scope of the Unicode API?
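A rough sketch of that reader-side conversion, again in Python and under stated assumptions: `PAIRS`, `SINGLES`, and `read_unescaped` are hypothetical names, and the tables hold only the two mappings named in this thread (U+30AB alone as #xA5AB, and the U+30AB U+309A sequence as the single character #xA5F7 quoted at the top of this mail).

```python
# Hypothetical tables: a base codepoint followed by a combining mark that the
# native encoding represents as one precomposed character, plus plain singles.
PAIRS = {(0x30AB, 0x309A): 0xA5F7}
SINGLES = {0x30AB: 0xA5AB}


def read_unescaped(codepoints):
    """Convert decoded source codepoints to native characters while reading.

    A known base+mark pair collapses to one native character, so the result
    can be shorter than the input codepoint sequence.
    """
    out = []
    i = 0
    while i < len(codepoints):
        pair = tuple(codepoints[i:i + 2])
        if pair in PAIRS:
            out.append(PAIRS[pair])
            i += 2
        else:
            # Raises KeyError for codepoints outside this toy table.
            out.append(SINGLES[codepoints[i]])
            i += 1
    return out
```

Here `read_unescaped([0x30AB, 0x309A])` produces a one-character string, whereas the escaped constant is required to have length 2. The open question is whether that difference is acceptable.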
> > If so, I have no problem to adopt the "codepoint index" proposal.
>
> Well, how about if I agree to every bit of that except for the syntax
> you used for the string constant?
I can agree with the "codepoint index" proposal, provided the above
points are clarified.
It has become much clearer to me anyway. Thanks.
--shiro