
Allowing ASCII only, string escapes, and normalization

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: srfi-75@xxxxxxxxxxxxxxxxx
Subject: Allowing ASCII only, string escapes, and normalization
From: Jorgen Schaefer <forcer@xxxxxxxxx>
Date: Thu, 28 Jul 2005 20:47:52 +0200
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (gnu/linux)

Hi there!
Some more comments from my side.

Allowing ASCII only
===================
The current draft summarizes two objections raised on this list:
that the SRFI mandates too much for systems targeted at small
devices, and that it mandates too little for more sophisticated
implementations. I think the SRFI is a good middle ground and
allows a transition from the old string processing to newer and
more sophisticated designs. The latter problem can therefore only
be addressed by SRFIs which specify the better interfaces.

To mitigate the former problem, I just went over the draft again
with an eye for where it precludes an implementation from using
only ASCII. There's not much. If an implementation were allowed to
signal an error on unsupported code points, it would be trivial
for an implementation to support just ASCII (or Latin-1), as the
code points 0-127 (0-255) are equivalent in Unicode and ASCII
(Latin-1). This would open the specification for small devices.
(And even for other character sets, you only need a simple
translation table, signalling errors on all other code points.)

This would mean that an implementation can support Unicode
code points fully or partially, just as implementations can support
the numeric tower fully or partially.
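
As a rough illustration of "signal an error on unsupported code
points", here is a minimal sketch in Python (not Scheme, and not
anything the draft specifies). The names integer_to_char,
UnsupportedCodePoint and MAX_SUPPORTED are hypothetical and exist
only for this example.

```python
# Hypothetical sketch: an ASCII-only implementation that signals an
# error on unsupported code points instead of accepting them.
MAX_SUPPORTED = 0x7F  # assumption: ASCII only; 0xFF would give Latin-1


class UnsupportedCodePoint(ValueError):
    """Raised for code points outside the supported repertoire."""


def integer_to_char(code_point, limit=MAX_SUPPORTED):
    """Return the character for code_point, or signal an error."""
    if not 0 <= code_point <= limit:
        raise UnsupportedCodePoint(
            "code point U+%04X is not supported" % code_point)
    # Code points 0-127 (0-255) mean the same in Unicode and ASCII
    # (Latin-1), so no translation table is needed here.
    return chr(code_point)
```

With the default limit, integer_to_char(0x41) returns "A", while
integer_to_char(0x263A) raises UnsupportedCodePoint. Passing
limit=0xFF gives the Latin-1 variant.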


String Escapes
==============
My biggest problem with the current draft is still the \x, \u and
\U escapes. More and more, I come to think that delimited escapes
are the way to go. Specifically, parenthesized escapes, i.e.
"Foo\x(0A)Bar".

This has a number of advantages. We don't need u and U anymore, as
there's no ambiguity on what is part of the escape and what is
not. It is easy to read. And it is even friendly to users from
other languages: If a \x escape is not followed by a parenthesis,
an appropriate syntax error can be signalled, even explaining the
correct syntax.
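
To make the proposal concrete, here is a minimal sketch of a reader
for such escapes, written in Python for illustration only. The
function name expand_hex_escapes is hypothetical, and the sketch
handles nothing but \x: other escapes, including an escaped
backslash, are deliberately ignored.

```python
import re

# Hypothetical sketch of the proposed syntax: \x followed by one or
# more hex digits in parentheses, e.g. "Foo\x(0A)Bar".
_HEX_ESCAPE = re.compile(r"\\x(\(([0-9A-Fa-f]+)\))?")


def expand_hex_escapes(text):
    """Expand \\x(HEX) escapes; reject \\x without parentheses."""
    def replace(match):
        if match.group(1) is None:
            # A bare \x (as written in other languages) gets an
            # error message that explains the correct syntax.
            raise SyntaxError(
                r"\x must be followed by parenthesized hex digits, "
                r"e.g. \x(0A)")
        return chr(int(match.group(2), 16))
    return _HEX_ESCAPE.sub(replace, text)
```

Because the closing parenthesis ends the escape, r"\x(41)1" expands
to "A1" with no ambiguity about the trailing digit, and r"Foo\x0ABar"
is rejected with a SyntaxError rather than silently misread.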

If the latter is deemed less important than being able to write
\x0A itself, the parentheses might be required only for hex
strings of a length other than two.

That problem does not exist for character constants, as those are
delimited by other means anyway, so #\xA20 is always unambiguous.
Hence we can drop u and U from character constants as well.

This (type of) syntax even has precedent, in Perl 6 of all
languages. Apparently, they use \x{263A} in strings, and allow
\x[263A] and \x<263A> as well in regular expressions. All types of
bracketing are optional and only used for disambiguation. Cf.
http://www.perl.com/pub/a/2002/06/04/apo5.html?page=7 and
http://www.mail-archive.com/perl6-documentation@xxxxxxxx/msg00140.html

(I don't think we should adopt such a DWIM attitude - requiring
the parentheses, and using only a single kind, looks like the best
way to me.)


Normalization
=============
String comparison on code point vectors without normalization is
useless. Hence, normalization will often be implemented right
away. Therefore, it might be useful to provide
STRING-NORMALIZE-NF{C,D} (maybe even NFKC/NFKD).
Cf. http://www.unicode.org/faq/normalization.html#1

If this is not included, a rationale should be added to the
document. At least it should mention normalization somewhere.
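
A short illustration of why unnormalized comparison is a problem,
using Python's standard unicodedata module. The wrapper names
string_normalize_nfc and string_normalize_nfd are hypothetical
stand-ins for the procedures suggested above, not an existing API.

```python
import unicodedata

composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # U+0065 plus U+0301 COMBINING ACUTE ACCENT


def string_normalize_nfc(s):
    """Hypothetical STRING-NORMALIZE-NFC: canonical composition."""
    return unicodedata.normalize("NFC", s)


def string_normalize_nfd(s):
    """Hypothetical STRING-NORMALIZE-NFD: canonical decomposition."""
    return unicodedata.normalize("NFD", s)


# Compared code point by code point, the two spellings differ ...
print(composed == decomposed)  # False
# ... but after normalization they compare equal.
print(string_normalize_nfc(composed)
      == string_normalize_nfc(decomposed))  # True
```

Both strings render as the same "e with acute accent", which is why
a plain code point comparison gives a surprising answer here.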


Greetings,
        -- Jorgen

-- 
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))

Follow-Ups:
- Re: Allowing ASCII only, string escapes, and normalization
  - From: John.Cowan
