Response to SRFI 75.

Okay, I have a few things about this SRFI that I want to point out.

First, I feel that SRFIs are not the proper forum for pre-publishing
R6RS material.  But this is a matter of taste, and if people disagree
I'll shut up about it.

Second, I see no point in limiting the representation of Unicode
characters to 2, 4, or 8 hexadecimal digits.  With the 8-digit
format, one would always have to pad with two constant zero digits
which carry no information, since no codepoint needs more than six
digits.  Every implementor supporting hex numbers has already had to
write code that reads hexadecimal numbers of arbitrary length, and
since a trailing delimiter is required in the new syntax (a move I
agree with, btw), the limited selection of fixed lengths avoids no
confusion.
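
To make the point concrete, here is a rough sketch in Python (not
Scheme, only because it is easy to check).  The "\x" prefix, the ";"
delimiter and the name parse_hex_escape are assumptions made for
illustration, not the SRFI's actual syntax.  With a delimiter present,
nothing in the reader depends on the number of digits:

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_hex_escape(text, start=0, prefix="\\x", delimiter=";"):
    """Parse prefix + one or more hex digits + delimiter at text[start:].

    Returns (character, index just past the delimiter).
    The prefix and delimiter are hypothetical, chosen for illustration.
    """
    if not text.startswith(prefix, start):
        raise ValueError("escape prefix not found")
    digits_start = start + len(prefix)
    end = text.find(delimiter, digits_start)
    if end == -1:
        raise ValueError("missing trailing delimiter")
    digits = text[digits_start:end]
    # Accept any number of digits; the delimiter marks the end, so no
    # fixed width is needed to disambiguate.
    if not digits or any(d not in HEX_DIGITS for d in digits):
        raise ValueError("expected one or more hexadecimal digits")
    code = int(digits, 16)
    # Reject values outside the Unicode range and surrogate codepoints.
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return chr(code), end + 1
```

Under these assumptions "\x41;", "\x0041;" and "\x00000041;" all read
as the same character, and "\x3bb;" needs no padding at all.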

Third, I think that char-upcase, char-downcase, string-upcase,
and string-downcase should be added to the list of functions that
"may not produce the results an end-user would consider sensible
with a particular locale," mainly to clarify what the document
elsewhere says: that they implement the case mappings from
UnicodeData.txt, rather than locale-dependent case-mappings.

Fourth, in general there are still problems if you're sticking to the
simplistic "codepoint equals character" model.  In particular, some
characters, especially accented characters, have uppercase and
lowercase versions which are different numbers of codepoints.  Thus,
in the "codepoint equals character" model, one case is a character and
the other case -- isn't.  The other case, in fact, is something
impossible to return from a routine whose return value is a
"character."  This introduces range confusion in both char-downcase
and char-upcase, and this in turn (I believe) hoses your suggested
implementations of char-ci=?, char-ci<?, char-ci>?, char-ci<=? and
char-ci>=?.  You need to either remove the restriction and allow
multi-codepoint characters, or embrace the restriction and explicitly
state that the results of these functions are undefined in cases where
the lowercase form is in fact not a single codepoint.
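
A small illustration of the problem, again as a Python sketch rather
than Scheme.  It relies on Python's str.upper and str.lower, which
apply the full Unicode case mappings; the two helper names are made
up for this example and are not proposed API.  Each helper returns
None when the other case does not fit in a single codepoint:

```python
def single_codepoint_upcase(ch):
    """Return the uppercase of ch if it is one codepoint, else None."""
    upper = ch.upper()
    return upper if len(upper) == 1 else None


def single_codepoint_downcase(ch):
    """Return the lowercase of ch if it is one codepoint, else None."""
    lower = ch.lower()
    return lower if len(lower) == 1 else None


# U+00DF (sharp s) uppercases to the two codepoints "SS".
print(single_codepoint_upcase("\u00df"))    # None
# U+0130 (I with dot above) lowercases to "i" followed by U+0307.
print(single_codepoint_downcase("\u0130"))  # None
# U+00E9 (e with acute) has a one-codepoint uppercase, U+00C9.
print(single_codepoint_upcase("\u00e9"))
```

The None results are exactly the cases where a routine that must
return a "character" has nothing correct to return.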

Fifth, I think you need to add to the general set of character
predicates defined by SRFI-14 one additional predicate: char-unused?
which returns true for characters which are inside the valid range
but which are not actually mapped to any character in Unicode.
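
As a sketch of what I mean, in Python and assuming that "unused"
corresponds to the Unicode general category Cn (which covers both
unassigned codepoints and noncharacters), the predicate is a single
table lookup.  The name char_unused simply mirrors the proposed
char-unused?:

```python
import unicodedata


def char_unused(ch):
    """True if ch is in the valid range but has general category Cn.

    Assumption: "unused" is taken to mean category Cn, i.e. unassigned
    codepoints and noncharacters.  The answer depends on the version
    of the Unicode database the implementation ships with.
    """
    return unicodedata.category(ch) == "Cn"


print(char_unused("A"))          # False
print(char_unused(chr(0xFFFF)))  # True: U+FFFF is a noncharacter
```

Since new characters are assigned with each Unicode release, the
result for a given codepoint can change from true to false over time.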

Sixth, is there any way for a Scheme implementation to support
characters and strings in additional encodings different from
Unicode and not necessarily subsets of it, and remain compliant?
For example, several Schemes have character sets that more
accurately describe keystrokes than characters, containing
entities such as "META-J" and "SHIFT-F10" that have no
corresponding Unicode entities.  For another example, there
are several Asian scripts that Unicode is observed to
make a hash of, representing the same character at several
different codepoints, and people who work with these scripts
prefer other encodings.