Re: Why are byte ports "ports" as such?



Per Bothner <per@bothner.com> writes:

>>> What does char->integer return?  How does char<? work?  What is your
>>> proposed implementation for a "character" in the Unicode world, given
>>> that it is not a code-point?  How would you store characters in a
>>> string?
>> Storage is irrelevant.  An implementation would be free to store
>> characters however it wished.  char->integer and char<? can return
>> whatever the implementation pleases.  I would rather drop them, since
>> they have nothing really to do with characters.  They are functions on
>> *code points*, which are there because the R5RS authors did not bother
>> to distinguish code points from characters.
>
> I'm asking how *you* would implement a "character" data type.
> Assume you have 32-bit "scheme values".  Would you make characters
> immediate/unboxed values?  In that case, assume you have 28 bits.
> Or are characters pointers to objects in memory?  If so, how are
> they managed?  Are equal characters eq?  Suppose I have a UTF-8
> input file.  What does read-char do?  What is a string - an array
> of 32-bit Scheme values or could it be more compact?

I would probably have two different sorts of characters, just as most
Scheme systems have two different kinds of integers (fixnums and
bignums).  Most characters can be encoded unboxed as single Unicode
code points.  Some, which require more than one code point, would
either need to be larger unboxed values (if the system permits) or
boxed objects.  I suspect it would be efficient to uniquize (intern)
the boxed objects when characters are being used as isolated values,
so that equal characters are eq?, though I'm not certain of this.
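
To make the two-representation idea concrete, here is a rough sketch
in Python rather than Scheme.  Everything in it is an assumption made
for illustration: make_char and the _interned table are hypothetical
names, a bare int stands in for an unboxed value, and a tuple stands
in for a boxed object.

```python
# Hypothetical table used to uniquize multi-code-point characters.
_interned = {}

def make_char(code_points):
    """Return a character value for a sequence of code points.

    A single code point is represented by the bare int (standing in for
    an unboxed immediate).  A longer sequence becomes a tuple (standing
    in for a boxed object) and is interned, so that two equal characters
    are the very same object.
    """
    cps = tuple(code_points)
    if not cps:
        raise ValueError("a character needs at least one code point")
    if len(cps) == 1:
        return cps[0]
    # setdefault stores cps on first sight and returns the stored copy after.
    return _interned.setdefault(cps, cps)
```

With interning, an identity test (the analogue of eq?) succeeds on
boxed characters built independently from equal code points, at the
cost of a table lookup on every construction.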

Strings could easily be arrays of Unicode code points, though I'm not
certain that this is the best option, because it would impede random
access to characters.  (On the other hand, since you would like to
call the code points "characters", you also would not be able to have
random access to Unicode's abstract characters.)  I would have no
objection to strings having two interfaces, one that operates on the
characters and one that operates on the code points, though I'm
hesitant about standardizing that.
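
A string with both interfaces might look something like the sketch
below (Python, for illustration only).  The class and method names
are made up, and the grouping rule is a deliberate simplification: a
character is taken to be a base code point plus any combining marks
that follow it, which is not the full Unicode segmentation algorithm.
Keeping a table of start offsets is one possible way to get random
access to characters back.

```python
import unicodedata

class TwoViewString:
    """Hypothetical string stored as code points, with a character view."""

    def __init__(self, text):
        self._cps = [ord(c) for c in text]
        # Offsets at which a character starts: the first code point, and
        # every later code point that is not a combining mark.
        self._starts = [
            i for i, cp in enumerate(self._cps)
            if i == 0 or unicodedata.combining(chr(cp)) == 0
        ]

    # Interface 1: code points.
    def code_point_length(self):
        return len(self._cps)

    def code_point_ref(self, i):
        return self._cps[i]

    # Interface 2: characters, each returned as a tuple of code points.
    def char_length(self):
        return len(self._starts)

    def char_ref(self, k):
        start = self._starts[k]
        if k + 1 < len(self._starts):
            end = self._starts[k + 1]
        else:
            end = len(self._cps)
        return tuple(self._cps[start:end])
```

For "e" followed by U+0301 (combining acute) and then "a", the code
point interface sees three elements and the character interface sees
two.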

As for reading a file in UTF-8, that's like reading a file in any
encoding.  The process of taking a sequence of bytes and mapping them
to a sequence of characters requires a mapping function.  A splufty
system would need to be able to read UTF-8, ASCII, ISO Latin 1, ISO
Latin 2, etc.  There is no encoding-generic implementation of
read-char; you need to know the encoding of the input stream to
implement it correctly.
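
As an illustration of why the encoding has to be supplied, here is a
small Python sketch.  The function name read_char is borrowed from
Scheme, but the signature is my own assumption; the incremental
decoders come from Python's standard codecs module.  Note that it
yields one decoded code point at a time, so assembling multi-code-point
characters would be a further step on top of it.

```python
import codecs
import io

def read_char(stream, decoder):
    """Read bytes from a binary stream until the decoder yields output.

    Returns the decoded text (one code point for encodings such as
    UTF-8 or Latin 1), or None at end of input.  The decoder carries the
    encoding; the same bytes decode differently under different ones.
    """
    while True:
        b = stream.read(1)
        if not b:
            # Flush: raises UnicodeDecodeError if input ended mid-sequence.
            decoder.decode(b"", final=True)
            return None
        out = decoder.decode(b)
        if out:
            return out

data = "\u00e9".encode("utf-8")  # two bytes: 0xC3 0xA9

utf8 = codecs.getincrementaldecoder("utf-8")()
latin1 = codecs.getincrementaldecoder("latin-1")()

as_utf8 = read_char(io.BytesIO(data), utf8)      # consumes both bytes
as_latin1 = read_char(io.BytesIO(data), latin1)  # consumes only 0xC3
```

The same two bytes come back as a single e-acute under UTF-8, but as
U+00C3 (with U+00A9 still waiting in the stream) under Latin 1.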

As suggested above, since we agree that a string is *implemented* as a
sequence of code points, perhaps in UTF-8, we can both implement
Unicode strings the same way.

Thomas