
Re: Parsing Scheme [was Re: strings draft]

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.




    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > Tom Lord <lord@xxxxxxx> writes:

    > > On the other hand, if [a], [b], and [c] are all portable, equivalent,
    > > standard Scheme programs -- then in Turkish implementations,
    > > CHAR-UPCASE, CHAR-DOWNCASE and friends must behave in a linguistically
    > > odd manner.  

    > Not true!  

    > You can make [a], [b], and [c] all do the Right Thing, and not even
    > *have* CHAR-UPCASE or CHAR-DOWNCASE at all!

    > What they require is string-ci=? to behave Properly, in the contexts
    > where the Scheme reader uses it.

CHAR-UPCASE and CHAR-DOWNCASE are mandatory and STRING-CI=? is defined
in terms of CHAR-CI=?
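
To make that dependency concrete, here is a minimal sketch of a
string comparison built on nothing but CHAR-CI=?.  The name
MY-STRING-CI=? is hypothetical, chosen so as not to shadow the
standard procedure; this is one plausible reading of the R5RS
wording, not text taken from the report.

```scheme
;; Sketch only: MY-STRING-CI=? is a hypothetical name.
;; Two strings are equal in the case-independent sense when they
;; have the same length and CHAR-CI=? holds at every index.
(define (my-string-ci=? s1 s2)
  (let ((n (string-length s1)))
    (and (= n (string-length s2))
         (let loop ((i 0))
           (or (= i n)
               (and (char-ci=? (string-ref s1 i) (string-ref s2 i))
                    (loop (+ i 1))))))))
```

Whatever CHAR-CI=? says about a pair of characters is therefore
inherited directly by the string comparison, and so by anything
that compares identifiers this way.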

If [a], [b], and [c] are all portable, equivalent, standard Scheme
programs then this portable, standard program:


    (let loop ((c (read-char)))
      (if (not (eof-object? c))
          (begin
            (display (char-downcase c))
            (loop (read-char)))))

must be able to read any one of them and write as output a scheme
program with identical meaning, at _least_ if the resulting program is
read by the same implementation running the conversion.

There are two choices.   Either that program is permitted to convert
[b] and [c] into something other than [a] (such as by including some
dotless i's in the output) or it must convert [b] and [c] to [a].

In the latter case, CHAR-DOWNCASE behaves in a linguistically odd manner
for Turkish speakers because it either converts #\I to #\i or #\I to #\I.

In the former case, the Turkish implementation must provide that:

	(char-ci=? dotless-i #\i)

which is, again, linguistically odd.

    > The question the reader needs to ask is "are these sequences of
    > characters the same identifier".  

Yes, and in R5RS that means "Are the constituent characters of the identifier
equal in a case-independent sense?"   The rest follows from that.

You say R5RS should not define identifier equivalence that way:

    > > I'm not so sure that that's terrible (and my proposals
    > > for R6RS reflect that assessment): those procedures are doomed to
    > > behave in a linguistically odd manner for a substantial number of
    > > reasons, in many other contexts besides Turkish implementations.

    > So punt them.  CHAR-UPCASE and CHAR-DOWNCASE are entirely unnecessary,
    > and since they cannot be sensibly implemented, and are entirely
    > unneeded, drop them!

The character casemappings would still need to be defined to specify
Scheme.  Reifying that definition into Scheme in the form of those
procedures is only natural.



    > > Rather, I propose that the standard character procedures be explicitly
    > > related to both the syntax of portable standard Scheme and the syntax
    > > of particular implementations.  For example, R6RS should require that:

    > > 	(char-downcase #\I) => #\i

    > Why?  R6RS should not have char-downcase at all.

The standard would still need to specify CHAR-DOWNCASE.   It would
still need to be possible to write portable CHAR-DOWNCASE with
whatever machinery the standard did provide.   There is no good reason
not to take the simple route of directly reifying
CHAR-DOWNCASE into Scheme.   There is a very good reason to do so:  so
that portable programs can accurately manipulate non-portable source texts.
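
As a rough illustration of what "writing portable CHAR-DOWNCASE"
could look like, here is a sketch that spells out the case mapping
for the 26 letters of portable Scheme syntax as a pair of tables.
The names UPPER-LETTERS, LOWER-LETTERS and MY-CHAR-DOWNCASE are
hypothetical, and the restriction to those 26 letters is an
assumption made for the example, not something a standard says.

```scheme
;; Sketch only: all three names below are hypothetical.
;; The mapping is written out explicitly, so it does not depend on
;; the implementation's character encoding or locale.
(define upper-letters "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
(define lower-letters "abcdefghijklmnopqrstuvwxyz")

(define (my-char-downcase c)
  (let loop ((i 0))
    (cond ((= i (string-length upper-letters)) c)   ; not in the table: unchanged
          ((char=? c (string-ref upper-letters i))
           (string-ref lower-letters i))            ; e.g. #\I maps to #\i
          (else (loop (+ i 1))))))
```

Characters outside the table pass through unchanged, which is
where an implementation's own, non-portable mappings would have
to come in.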

-t