
Re: SRFI withdrawn; comments on the possible future

Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

>     * `string-normalize-nfd', `string-normalize-nfkd',
>       `string-normalize-nfc', and `string-normalize-nfkc', which each
>       accept a string and produce its normalization according to normal
>       form D, KD, C, or KC, respectively.

If the basic concept of the SRFI - a string being a sequence of
code points - does not change, I do think these procedures are
useful (contrary to bear and Alex Shinn). An implementation can
still normalize internally in the "usual case", and if the
programmer enforces a different normalization, that's eir problem.

STRING=? and similar procedures need to define which kind of
normalization they work on (or just "the same normalization for
all arguments").
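
To illustrate why the comparison procedures have to say which
normalization they work on, here is a minimal sketch in Python (not
Scheme; it uses the standard `unicodedata` module, and `same_text` is a
hypothetical helper name). The two strings below render identically but
differ as code point sequences, so a plain code point comparison and a
comparison under a shared normalization form disagree:

```python
import unicodedata

composed = "\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)            # plain code point comparison: False
print(same_text(composed, decomposed))   # both normalized to NFC first: True
```

Any of the four forms would serve here, as long as all arguments are
brought to the same one.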

STRING-DOWNCASE, STRING-APPEND etc. need to define whether they
may normalize their arguments, and if so, which normalization they
return. If the normalization shouldn't be prescribed, another
procedure, STRING-NORMALIZE (or similar), needs to be added to
return the normalization the implementation prefers.
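
The STRING-APPEND case is the least obvious one, so a small sketch may
help (again Python with the standard `unicodedata` module; `is_nfc` and
`append_nfc` are hypothetical helper names, not part of any proposal).
Two strings that are each in form C can concatenate to a string that is
not, which is why an appending procedure has to either renormalize or
document that it does not:

```python
import unicodedata

def is_nfc(s):
    """True if s is unchanged by normalization to form C."""
    return unicodedata.normalize("NFC", s) == s

def append_nfc(*parts):
    """Concatenate, then renormalize the result to form C."""
    return unicodedata.normalize("NFC", "".join(parts))

base = "e"            # U+0065, in form C on its own
accent = "\u0301"     # lone COMBINING ACUTE ACCENT, also in form C on its own
joined = base + accent

print(is_nfc(base), is_nfc(accent))   # True True
print(is_nfc(joined))                 # False: the pair composes to U+00E9
print(append_nfc(base, accent) == "\u00e9")   # True
```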

A higher-level string API can (and should) be built on top of the
strings defined in this SRFI.

> The #\newline character
> -----------------------
> It is likely that #\newline will be removed from Scheme leaving only
> #\linefeed. Since R6RS will pin down characters to Unicode scalar
> values, the right name for the character is #\linefeed.

I'm always in favor of breaking stuff to get a clean result.

> Another view is that #\newline can serve as an abstraction of the
> end-of-line character sequence which is returned by read-char
> when the end-of-line character sequence is read (be it
> #\linefeed, or #\return, or #\return followed by #\linefeed).
> So even though #\newline and #\linefeed are the same characters,
> Scheme programs might use #\newline to highlight that the
> character is being used to denote the end-of-line sequence. The
> name #\newline would also reinforce the link with the escape
> sequence "\n" in strings.

If #\newline is considered to be some kind of abstraction of the
end-of-line character sequence, please remember that Unicode
canonical new line code points, to finally get rid of all these

> Escape sequences
> ----------------

>    with semi-colon terminator          without terminator
>    "A\x42;C" = "ABC"                   "A\x42\x43" = "ABC"
>    "\x41;\x42;\x43;" = "ABC"           "\x41\x42\x43" = "ABC"
>    "\x03BB;x.x" = "λx.x"               "\x03BBx.x" = "λx.x"

I agree with bear that the semicolon is a bad choice - why not use
the colon?
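
As a rough sketch of what the terminator buys (Python rather than
Scheme; `decode_hex_escapes` and `decode_greedy` are hypothetical
helpers, and the greedy rule for the unterminated form is only an
assumption about how such escapes might be read), a decoder with an
explicit terminator is unambiguous regardless of which character is
chosen, while an unterminated escape swallows any hex digits that
follow it:

```python
import re

def decode_hex_escapes(text, terminator=";"):
    """Replace \\x<hex digits><terminator> with the corresponding character."""
    pattern = re.compile(r"\\x([0-9A-Fa-f]+)" + re.escape(terminator))
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

def decode_greedy(text):
    """Unterminated variant: take as many hex digits as possible (assumed rule)."""
    return re.sub(r"\\x([0-9A-Fa-f]+)", lambda m: chr(int(m.group(1), 16)), text)

print(decode_hex_escapes(r"A\x42;C"))                     # ABC
print(decode_hex_escapes(r"\x03BB;x.x"))                  # λx.x
print(decode_hex_escapes(r"\x03BB:x.x", terminator=":"))  # λx.x

# Without a terminator, "A" followed by literal "BC" cannot be written
# as \x41BC: the greedy reading yields the single character U+41BC.
print(len(decode_greedy(r"\x41BC")))                      # 1
```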


> Using less-than and greater-than characters, which are not actual
> brackets, avoids this problem:
>     #\x<03BB> = #\λ

Braces have been offered as an alternative:


> However, they become somewhat more difficult to read when multiple
> escapes appear in a string:
>    "\x<41>\x<42>\x<43>" = "ABC"


> In either case, the trade-off is that Scheme strings are unlikely to be
> compatible with any other language's string syntax. A consequence is
> that there is additional burden on the programmer who must learn yet
> another string and character syntax.

I do think it's good that we don't go with bad decisions made by
other languages just because the decision has been made by them.

> Symbol characters
> -----------------
> [...]
> Meanwhile, the symbol escapes are similar yet not identical to the
> escapes in strings and characters, so there is a potential for mistakes
> if the programmer is not careful. For example one might expect a\nb to
> be a valid symbol, but it is an error.

Why not allow the same escapes in symbols and in strings?

All in all I like the changes you propose (modulo the comments
above). Thanks for the good work!

        -- Jorgen

((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))