
Re: strings draft

From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Fri, 23 Jan 2004 20:31:32 -0800 (PST)

>     > So, when the EUCJP Scheme reads a string
>     >  "\U+30AB.\U+309A."
>     > Then it can produce a string which consists of a single character
>     > EUCJP #xA5F7.  
> Eh... no.   The final language should be such that that string
> constant denotes a string of two Unicode codepoints.
> but all implementations must either refuse to read
> 	"\U+30AB.\U+309A."
> or have
> 	(string-length "\U+30AB.\U+309A.") => 2

I see.  I think that's reasonable and acceptable.  An EUCJP
implementation can inform the user that it can't read the constant.

There are a couple of edge cases that I'd like clarified.

Can it map U+30AB to EUCJP #xA5AB, and U+309A to some
substitute character that stands for an unrecognized character?
(U+3013 is traditionally used for this in Japan.)  That would satisfy
the codepoint index requirements, though the result of
(string-ref "\U+30AB.\U+309A." 1) would be a surprise.

This can go either way.  If it's not allowed in the proposal,
I can provide a flag so that the implementation behaves in either
"strictly conforming Unicode API" mode or "loose mode".
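
To make the two modes concrete, here is a minimal sketch (in Python, purely
for illustration; it is not part of any proposal).  The names read_constant,
UnreadableConstant and TO_EUCJP are hypothetical, the table holds only the
entries needed for this example, and the EUCJP code used for U+3013 is my
assumption.

```python
class UnreadableConstant(Exception):
    """Raised in strict mode when a codepoint has no EUCJP counterpart."""


# Tiny illustrative table: Unicode codepoint -> EUCJP code.
TO_EUCJP = {
    0x30AB: 0xA5AB,  # KATAKANA LETTER KA (mapping given in the message)
    0x3013: 0xA2AE,  # GETA MARK (assumed EUCJP value, used as the substitute)
}
GETA = 0x3013


def read_constant(codepoints, strict=True):
    """Convert a list of Unicode codepoints to a list of EUCJP codes.

    strict=True  -> refuse the constant if any codepoint is unmappable.
    strict=False -> replace each unmappable codepoint with the geta mark,
                    so the result keeps one element per input codepoint.
    """
    out = []
    for cp in codepoints:
        if cp in TO_EUCJP:
            out.append(TO_EUCJP[cp])
        elif strict:
            raise UnreadableConstant("no EUCJP character for U+%04X" % cp)
        else:
            out.append(TO_EUCJP[GETA])
    return out
```

In loose mode the result for U+30AB U+309A still has length 2, so codepoint
indices line up, but the element at index 1 is the substitute rather than
the combining mark.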

Another edge case.  Suppose the U+30AB and U+309A codepoints are
written directly (without escaping) in the source code.
An EUCJP implementation can still load such a file, if it is told
that the source is in one of the Unicode CESs.  It will convert
those two codepoints into the single EUCJP character #xA5F7 during
reading, so it'll produce a string of one character.
Is that outside the scope of the Unicode API?
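
A rough sketch of the input conversion in question, again in Python and only
as an illustration.  convert_source, COMPOSED and SINGLE are hypothetical
names; the composed value #xA5F7 is the one from the quoted text at the top
of this message, and the tables contain nothing beyond this one example.

```python
# (base codepoint, combining mark) -> one precomposed EUCJP code.
COMPOSED = {(0x30AB, 0x309A): 0xA5F7}
# Single codepoint -> EUCJP code.
SINGLE = {0x30AB: 0xA5AB}


def convert_source(codepoints):
    """Convert decoded source text (Unicode codepoints) to EUCJP codes.

    A known base+mark pair is folded into one EUCJP character, so the
    output can be shorter than the input.
    """
    out = []
    i = 0
    while i < len(codepoints):
        pair = tuple(codepoints[i:i + 2])
        if pair in COMPOSED:
            out.append(COMPOSED[pair])
            i += 2
        elif codepoints[i] in SINGLE:
            out.append(SINGLE[codepoints[i]])
            i += 1
        else:
            raise ValueError("no EUCJP character for U+%04X" % codepoints[i])
    return out
```

Here the conversion happens before the reader ever sees a string constant,
which is why the resulting string has one character and not two.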

>     > If so, I have no problem to adopt the "codepoint index" proposal.
> Well, how about if I agree to every bit of that except for the syntax
> you used for the string constant?

I can agree with the "codepoint index" proposal, provided the above
points are clarified.
It has become much clearer to me anyway.  Thanks.