
Re: Why are byte ports "ports" as such?

This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.



Thomas Bushnell BSG scripsit:

> I would probably have two different sorts of characters, just as most
> Scheme systems have two different kinds of integers.  Most characters
> can be encoded unboxed as single Unicode code points.  Some, which
> require more than one code point, would either need to be larger
> unboxed values (if the system permits), or boxed objects.  

Ah, now we come down to cases.

Okay, what can be an abstract character?  Is it system-dependent entirely?
Could the string "Emacs" consist of a single abstract character?
How about YOD+HE+VAV+HE?  How about the string "When in the Course of
human events...our Lives, our Fortunes, and our sacred Honour"?
Is there to be a constructor that takes a sequence of codepoints and
returns an abstract character?  What about abstract characters that
cannot be represented with any sequence of Unicode code points?

(These questions are intended to be funny, serious, and not sarcastic.)

> Strings could easily be arrays of Unicode code points, though I'm not
> certain that this is the best option, because it would impede random
> access to characters.  

If you need such a thing, you can certainly create your own index.
Whether such indexes are generally useful is an open question.
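
A minimal sketch of such a hand-built index, in Python.  It assumes,
purely for illustration, that a "character" is a base code point plus
any combining marks that follow it; real grapheme cluster segmentation
is more involved.  The names build_char_index and char_at are
hypothetical.

```python
import unicodedata

def build_char_index(s):
    """Return the code-point offsets at which each 'character' starts.

    Assumption: a character is one base code point followed by any
    code points with a non-zero canonical combining class.
    """
    starts = []
    for i, cp in enumerate(s):
        if i == 0 or unicodedata.combining(cp) == 0:
            starts.append(i)
    return starts

def char_at(s, index, n):
    """Random access: the n-th character, as a substring of code points."""
    start = index[n]
    end = index[n + 1] if n + 1 < len(index) else len(s)
    return s[start:end]

text = "e\u0301a"   # 'e' + COMBINING ACUTE ACCENT, then 'a'
index = build_char_index(text)
```

The string stays an array of code points; the index costs one pass to
build and then gives constant-time access to the n-th character.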

> I would have no objection to strings having two interfaces, one that
> operates on the characters and one that operates on the code points,
> though I'm hesitant about standardizing that.

Java has standardized two interfaces, one that operates on code points
and one that operates on 16-bit code units, and has a third interface
(BreakIterator) that operates on a variety of higher-level objects
and lazily creates the kind of index I mentioned above over grapheme
clusters, words, lines, sentences, ....
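
To make the code point versus 16-bit code unit distinction concrete,
here is a small Python sketch (not Java) that views one string both
ways.  The helper names are hypothetical; in Java the corresponding
counts would come from String.codePointCount and String.length.

```python
def code_points(s):
    """The string as a list of Unicode code points."""
    return [ord(c) for c in s]

def utf16_code_units(s):
    """The string as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

s = "a\U0001D11E"   # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF

points = code_points(s)        # [0x61, 0x1D11E]
units = utf16_code_units(s)    # [0x61, 0xD834, 0xDD1E]
```

Any code point above U+FFFF takes two code units (a surrogate pair),
so the two interfaces disagree about both length and indexing.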

> As for reading a file in UTF-8, that's like reading a file in any
> encoding.  The process of taking a sequence of bytes and mapping them
> to a sequence of characters requires a mapping function.  A splufty
> system would need to be able to read UTF-8, ASCII, ISO Latin 1, ISO
> Latin 2, etc.  There is no encoding-generic implementation of
> read-char, you need to know the encoding of the input stream to
> implement it correctly.

+1
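
The point that read-char needs to know the encoding can be sketched in
Python as follows.  The function read_char is a hypothetical stand-in
for Scheme's read-char, taking the encoding as an explicit argument;
it is an illustration under that assumption, not a proposed API.

```python
import codecs
import io

def read_char(stream, encoding):
    """Read bytes from a binary stream until one character decodes.

    Returns None at end of input.  A sketch only: a truncated
    multi-byte sequence at end of input is silently dropped.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        b = stream.read(1)
        if not b:
            return None
        ch = decoder.decode(b)
        if ch:
            return ch

raw = "\u00e9".encode("utf-8")   # the two bytes C3 A9

as_utf8 = read_char(io.BytesIO(raw), "utf-8")         # U+00E9, 2 bytes read
as_latin1 = read_char(io.BytesIO(raw), "iso-8859-1")  # U+00C3, 1 byte read
as_latin2 = read_char(io.BytesIO(raw), "iso-8859-2")  # U+0102, 1 byte read
```

The same bytes yield three different first characters, and a different
number of bytes consumed, depending on the mapping function chosen.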

-- 
It was impossible to inveigle           John Cowan <cowan@ccil.org>
Georg Wilhelm Friedrich Hegel           http://www.ccil.org/~cowan
Into offering the slightest apology
For his Phenomenology.                      --W. H. Auden, from "People" (1953)