Re: strings draft

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

To: shiro@xxxxxxxx
Subject: Re: strings draft
From: Tom Lord <lord@xxxxxxx>
Date: Sat, 24 Jan 2004 11:01:07 -0800 (PST)
Cc: srfi-50@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-50@xxxxxxxxxxxxxxxxx
In-reply-to: <20040123.184907.184810441.shiro@xxxxxxxx> (message from Shiro Kawai on Fri, 23 Jan 2004 18:49:07 -1000 (HST))
References: <200401240045.QAA28248@xxxxxxxxxxxxxxxxxxxxxxx> <20040123.172656.899859146.shiro@xxxxxxxx> <200401240431.UAA28992@xxxxxxxxxxxxxxxxxxxxxxx> <20040123.184907.184810441.shiro@xxxxxxxx>



    > From: Shiro Kawai <shiro@xxxxxxxx>

    > > but all implementations must either refuse to read

    > > 	"\U+30AB.\U+309A."

    > > or have

    > > 	(string-length "\U+30AB.\U+309A.") => 2

    > I see.  I think it's reasonable and acceptable.   An EUCJP
    > implementation can inform the user that it can't read the constant.
    > 
    > There are a couple of edge cases that I'd like to be clearer.
    > 
    > Can it map U+30AB to EUCJP #xA5AB, and U+309A to some
    > alternative character that designates an unrecognized character?
    > (U+3013 is traditionally used for this in Japan.)   That would satisfy
    > the codepoint index requirements, though
    > (string-ref "\U+30AB.\U+309A." 1) would be a surprise.

    > This can be either way---if it's not allowed in the proposal,
    > I can provide a flag so the implementation can behave in either
    > "strictly conforming Unicode API" mode or "loose mode".

If your implementation can read:

	"\U+30AB.\U+309A."

doesn't that mean it should also read:

        (list #\U+30AB #\U+309A)

I'm not sure how to reconcile those.
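
To make the question concrete, here is a rough sketch in Python (not Scheme, and not any real implementation's reader). It assumes a hypothetical reader in which string constants and character constants go through the same per-code-point routine, with the "loose mode" substitution of U+3013 described in the quoted message. The names REPRESENTABLE, read_code_point, read_string_constant and read_char_constant are made up for illustration.

```python
# Hypothetical model of the "strict" vs. "loose" reader discussed above.
REPRESENTABLE = {0x30AB}  # assumed: code points with a native EUC-JP character
GETA_MARK = 0x3013        # substitute character mentioned in the quoted message


def read_code_point(cp, loose):
    """Return the code point the reader would store for one escape."""
    if cp in REPRESENTABLE:
        return cp
    if loose:
        return GETA_MARK  # loose mode: substitute instead of refusing
    raise ValueError(f"cannot read U+{cp:04X}")


def read_string_constant(cps, loose=False):
    """Model of reading a string constant written as a series of escapes."""
    return [read_code_point(cp, loose) for cp in cps]


def read_char_constant(cp, loose=False):
    """Model of reading a single character constant."""
    return read_code_point(cp, loose)
```

Under that assumption the two forms stand or fall together: in loose mode both the string and the character list come back as U+30AB followed by U+3013, and in strict mode both are refused. A reader that accepted the string but refused the second character constant would be the inconsistent case.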


    > Another edge case.  Suppose the U+30AB and U+309A codepoints are
    > written directly (without escaping) in the source code.
    > An EUCJP implementation can still load such a file, if it is informed
    > that the source is in one of the Unicode CESs.   It will convert
    > those two codepoints into one EUCJP #xA5AB character during
    > reading, so it'll produce a string of one character.
    > Is that out of scope for the Unicode API?

I specifically mean the R6RS recommendations to _not_ preclude that
interpretation.  Yes, you should be able to read that string constant
from some Unicode stream and wind up with a one character string
constant.

If someone writes a non-portable program that says "This program
assumes that all string constants are Unicode [and, in such and such a
canonicalization form, etc.]" then that program wouldn't necessarily
run correctly on your implementation.
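
As a rough illustration of how such a program could go wrong, here is a Python sketch (not Scheme). It assumes a hypothetical reader, read_native, whose native repertoire has a single character for the pair U+30AB U+309A; NATIVE_PAIRS and the placeholder name "<ka-handakuten>" are made up. Unicode has no precomposed character for this pair, so NFC normalization leaves it as two code points.

```python
import unicodedata

# The two code points as written, unescaped, in a Unicode source file.
SOURCE = "\u30ab\u309a"

# Hypothetical native repertoire: the pair is one native character.
NATIVE_PAIRS = {("\u30ab", "\u309a"): "<ka-handakuten>"}


def read_native(text):
    """Convert Unicode text to a list of native characters, merging known pairs."""
    out, i = [], 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in NATIVE_PAIRS:
            out.append(NATIVE_PAIRS[pair])
            i += 2
        else:
            out.append(text[i])  # pass other characters through unchanged
            i += 1
    return out


# What a "strings are Unicode" program would expect for the length.
unicode_length = len(unicodedata.normalize("NFC", SOURCE))
# What the hypothetical native reader would report.
native_length = len(read_native(SOURCE))
```

Here unicode_length is 2 and native_length is 1, so code that indexes the constant by Unicode code point position would be off on the second implementation.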

-t
