
Re: Why are byte ports "ports" as such?

This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.



John Cowan wrote:
Per Bothner scripsit:

A little knowledge is a dangerous thing ...

	A little Learning is a dang'rous Thing;
		--Pope, "Essay on Criticism"

Touché ...

We know that.  However, there is still no need for "character" [in the
Unicode sense] as a separate data type:

As I noted in my previous posting, "characters in the Unicode sense" is
not a well-defined notion.

Yes - and that's why I'm arguing against trying to model anything except
codepoints in Scheme,

Java uses 16-bit code units (not code points), not because the architects
didn't foresee the eventual use of the Astral Planes, but because the
benefits of uniform width were deemed by them to outweigh the necessity
of dealing with surrogate characters by hand.  Java now has some standard
library routines that hide surrogate characters.
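To make the code unit versus code point distinction concrete, here is a minimal Python sketch. The helper names `utf16_code_units` and `is_surrogate` are illustrative only; they are not part of any Java or Scheme API.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the basic plane.
sample = "a\U0001D11E"
units = utf16_code_units(sample)

code_point_count = len(sample)  # Python strings index by code point
code_unit_count = len(units)    # what a 16-bit-unit string would report
```

The string has two code points but three 16-bit code units, because the non-basic-plane character is stored as a surrogate pair.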

Unfortunately, the end result is somewhat complex, especially since 99%
of the time programmers can and will get away with ignoring non-basic-
plane characters.

However, there are ways to keep uniform-width strings without sacrificing
the codepoint view, provided you are willing to give up on string
mutability (which Java does not have).  One well-known approach is to
store 8-bit code units for strings that contain no codepoint above U+00FF,
16-bit code units for strings that contain no codepoint above U+FFFF,
and 32-bit code units for all other strings.
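A rough Python sketch of that width-selection scheme follows. The functions `pick_width`, `pack`, and `codepoint_at` are hypothetical names chosen for illustration, and the byte layout (big-endian, fixed width) is an assumption rather than a description of any particular implementation.

```python
def pick_width(text):
    """Return the bytes per code unit (1, 2, or 4) needed for text."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def pack(text):
    """Store every code point of text at the same fixed width."""
    width = pick_width(text)
    data = b"".join(ord(ch).to_bytes(width, "big") for ch in text)
    return width, data


def codepoint_at(width, data, index):
    """Constant-time access to the code point at position index."""
    if not 0 <= index < len(data) // width:
        raise IndexError("code point index out of range")
    start = index * width
    return int.from_bytes(data[start:start + width], "big")
```

Because the width is chosen once from the whole contents, this only works cleanly when the string cannot change afterwards: replacing one character with a wider one would force the entire string to be repacked.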

Personally, if I didn't have any compatibility constraints, I would just
store everything as UTF-8 strings, and allow indexing by code unit
(bytes).  How often does non-library code need to deal with characters?
Instead, the data types should be (immutable) "string" and "buffer".
The latter allows insertions and deletions in addition to replacement.
How often are strings in the sense of mutable fixed-length character
arrays useful to application programmers, except as a low-level
"chunk of memory" to implement other data types?  Basically never,
or as close to never as to render them unsuitable for Scheme.
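As a loose illustration of the "string" and "buffer" split, the Python sketch below uses the built-in `bytes` type as a stand-in for an immutable UTF-8 string and `bytearray` as a stand-in for a buffer. The helpers `insert` and `delete` are hypothetical, and all offsets are byte (code unit) offsets, assumed to fall on character boundaries because they come from searches rather than arithmetic.

```python
def insert(buf, offset, text):
    """Insert text into buf at a byte offset, growing the buffer."""
    buf[offset:offset] = text.encode("utf-8")


def delete(buf, start, end):
    """Delete the bytes in [start, end), shrinking the buffer."""
    del buf[start:end]


# Immutable "string": UTF-8 bytes, indexed by code unit.
immutable = "na\u00efve caf\u00e9".encode("utf-8")

# Offsets come from searching, so they land on character boundaries.
offset = immutable.find(b" ")

# Mutable "buffer": supports insertion and deletion, not just replacement.
buf = bytearray(immutable)
insert(buf, offset, " little")
```

Application code here never asks for "character number n"; it only passes around offsets that earlier operations returned.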

(Even parsers don't need to deal with characters, if you have
regular-expression lexing.  I.e. try to match the current input
position against a regular expression.  On success, return the
matched string, and move the position forwards.)
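The lexing loop described above might look roughly like this in Python, using the standard `re` module. The token names and patterns are made up for illustration; the point is only that the lexer works with positions and matched substrings, never with individual characters.

```python
import re

# Illustrative token patterns; each one matches at least one character.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", re.compile(r"[-+*/()=]")),
    ("SPACE", re.compile(r"\s+")),
]


def next_token(text, pos):
    """Match the input at pos; return (kind, matched string, new position)."""
    for kind, pattern in TOKEN_PATTERNS:
        m = pattern.match(text, pos)  # anchored at pos, not a search
        if m:
            return kind, m.group(), m.end()
    raise ValueError(f"no token matches at position {pos}")


def tokenize(text):
    """Repeatedly match at the current position and move it forwards."""
    pos = 0
    tokens = []
    while pos < len(text):
        kind, lexeme, pos = next_token(text, pos)
        if kind != "SPACE":
            tokens.append((kind, lexeme))
    return tokens
```

For example, `tokenize("x1 = 42 + y")` yields an identifier, an operator, a number, an operator, and an identifier, each carried as a matched string.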
--
	--Per Bothner
per@bothner.com   http://per.bothner.com/