
Re: Why are byte ports "ports" as such?

This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 are here. Eventually, the entire history will be moved there, including any new messages.

Thomas Bushnell BSG <tb@becket.net> writes:

>> Then it's impossible to implement a UTF-8 encoder. There is an
>> infinite number of potential characters, and there is no way to
>> examine what a given character means.
> What exactly makes it impossible?  There are an infinity of possible
> integers, and this hasn't hampered the implementation of <.

Integers support arithmetic. Individual bits or larger digits of an
integer can be counted and examined. You can index dictionaries by
integers. Any reasonable function on integers can be expressed by
composing primitive operations.
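
To make the point about integers concrete, here is a minimal sketch of a UTF-8 encoder that treats a code point as a plain integer and uses only comparisons, shifts and masks. The name `utf8_encode` is hypothetical, chosen for this illustration; it is not taken from SRFI 91 or from any proposal in this thread.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point, given as an integer, into UTF-8 bytes."""
    # Reject values outside the Unicode range and the surrogate block.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Every step relies on being able to examine the bits of the value, which is exactly what an opaque character type would have to provide in some other way.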

What operations would your characters support? I guess operations
similar to those of today's strings, i.e. determining the length,
extracting individual code points, and some way to build them from
code points (e.g. making a singleton from a given code point and
appending a code point to a character, assuming that characters are
immutable). There would also have to be some rules of normalization;
if NFC and NFD are indistinguishable, then extracting individual code
points would not necessarily yield the code points used to construct
the character.
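
The normalization issue can be seen with Python's standard `unicodedata` module. This is only an illustration at the level of strings of code points; the helper `code_points` is a hypothetical name introduced here.

```python
import unicodedata

def code_points(s: str) -> list[int]:
    """Return the code points of a string as integers."""
    return [ord(c) for c in s]

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both spellings normalize to the same sequence under either form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(code_points(nfc))  # code points after composition
print(code_points(nfd))  # code points after decomposition
```

If a character type treated the two spellings as the same value, it would have to pick one of these sequences when asked for its code points, regardless of which one was used to build it.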

But wouldn't it be simpler to just use strings of code points for what
you would use characters for? Strings of code points are needed anyway
when we work on a lower level, e.g. when we care whether the output is
NFC or NFD. So why not just make a library which provides iteration
over strings using substrings representing characters, normalization,
etc. - the same functionality, but without calling some groups of code
points "characters"?
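
A rough sketch of such a library function follows. It assumes a deliberately simplified rule: a new substring starts at every code point whose canonical combining class is zero. This is not the full grapheme cluster segmentation defined by Unicode, and the name `clusters` is hypothetical.

```python
import unicodedata
from typing import Iterator

def clusters(s: str) -> Iterator[str]:
    """Yield substrings made of a base code point plus following combining marks.

    Simplifying assumption: a code point continues the current substring
    only if its canonical combining class is nonzero.
    """
    start = 0
    for i in range(1, len(s)):
        if unicodedata.combining(s[i]) == 0:
            yield s[start:i]
            start = i
    if s:
        yield s[start:]
```

The results are ordinary strings, so the lower-level operations (indexing, normalization, encoding) remain available on them without a separate character type.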

>> established practice of using code points or even lower level code
>> units as Scheme characters.
> There is no "established practice" of doing this.  The established
> practice is to pretend that code points and abstract characters are
> the same.

This is exactly what I said. The established practice is to work in
terms of code points or even lower level.

You want to call the characters of this simplified view by the more
formal term "code points", and to call certain strings consisting of
well-formed sequences of combining characters "characters". Apart from
changing names, what does it accomplish? Changing practice just to
have nicer procedure names is a weak excuse.

   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/