
Re: Simplifying SRFI 109, part 1: entities



[Sorry for sitting on this one for a while.  I didn't forget, but
I needed to get some other things out of the way first.]

On 02/10/2013 12:04 AM, John Cowan wrote:
This is the first of two posts proposing simplifications (reductions in
scope) for SRFI 109.  The idea is that by removing variable elements,
this SRFI (unlike SRFIs 107 and 108) becomes purely lexical in scope:
the output of a SRFI-109-capable reader is the same thing for
a SRFI-109 string literal and a regular string, viz. an immutable Scheme
string object.

See my reply to part 2: Enclosed expressions are IMO a prime
feature of string *quasi*-literals.  Thus in general the reader
can't return a literal string.

The reader *could* return a literal string in cases where
there are no enclosed expressions, but I feel uncomfortable with
that - it seems a bit hacky and inconsistent.  For read/write
round-tripping we have the traditional string literals, so I
think it is cleaner to have the &{...} always return a ($string$ ...)
form.

In this first post, I argue against the provision of user-defined
entity names.  Currently, when an entity reference appears in a SRFI
109 string literal, it is expanded into the identifier $entity:<name>$,
where <name> is the entity referred to.  Thus &{Rom&acirc;nia} expands
to ($string$ "Rom" $entity:acirc$ "nia").  In principle, this permits a
user to rebind $entity:acirc$ to something else.  However, there seems no
reason why this should be allowed; it is only productive of confusion.
Such entity references should just expand directly to the character, so
that &{Rom&acirc;nia} becomes ($string$ "România"), or just "România".

If we accept that we always get a ($string$ ...) form, that much reduces
the benefit of the reader expanding named characters.  And there are
advantages to deferring it.

Deferring character name lookup allows user-defined character names
- or in general entity names (which can be longer strings).
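
As a rough illustration of that point (a sketch only, not the SRFI's
reference implementation): if the reader leaves &{Rom&acirc;nia} as
($string$ "Rom" $entity:acirc$ "nia"), then an entity name is just an
ordinary binding that gets resolved after reading.  The procedure
definition of $string$ below is a hypothetical stand-in that simply
concatenates its arguments, $entity:company$ is a made-up user-defined
entity, and an R7RS-style Scheme is assumed.

```scheme
;; Hypothetical stand-in for $string$ (an assumption for illustration):
;; concatenate the parts, accepting both strings and characters.
(define ($string$ . parts)
  (apply string-append
         (map (lambda (part)
                (if (char? part) (string part) part))
              parts)))

;; A standard entity, bound like any other identifier (U+00E2).
(define $entity:acirc$ "\xE2;")

;; A made-up user-defined entity; the name may stand for a longer string.
(define $entity:company$ "Example Widgets, Inc.")

;; The form the reader would produce for &{Rom&acirc;nia}:
(define romania ($string$ "Rom" $entity:acirc$ "nia"))

;; The form it would produce for &{Welcome to &company;!}:
(define greeting ($string$ "Welcome to " $entity:company$ "!"))
```

With read-time expansion, neither definition would be visible to the
reader, so the second literal would need a separate read-time table.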

Not hard-wiring in entity names is especially important for SRFI-107,
since the XML/SGML model does allow user-defined entity names.
Having these be hard-wired into the reader is not IMO in the spirit
of XML.  Even if the reader uses a user-programmable table it would
be information-losing for the reader to expand the entity names.
Even then, using a programmable read-time lookup table is
clearly less "Schemey" than using regular expand-time name-lookup.

If we defer entity name lookup for SRFI-107 then I think we should
do the same for SRFI-108 and SRFI-109, for simplicity.

Nor is it likely that anyone will need character entities past the 2237
already provided by the standard W3C list.  It is already a requirement
that systems not add names that conflict with any of these.  True, you
cannot write (say) Hindi in the Devanagari script using character entity
references only.  But if you are going to do that, you will probably
want to use a UTF-8 compatible editor with appropriate fonts.

I therefore believe that character entities should be expanded directly
into characters by the implementation.  This eliminates one of the
use cases for requiring SRFI 109 string literals to expand into calls
on $string$.

I would also strengthen, from a MAY to a SHOULD, the
recommendation to implement the whole standard list.

I did that in the new draft.  Let me know what you think of the change.
(I've also implemented this for Kawa.)  I should probably also state
that an implementation MUST support the standard Scheme character names.
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/