
Octet vs Char (Re: strings draft)

I gave more thought to the "[0..255] integer <-> char mapping"
issue and its consequences if I adopt it in a multibyte-string
implementation.

In short, it can be implemented in a multibyte-CES Scheme, and
is useful in cases such as calling a C FFI function that takes a
C 'char' argument (which I think is the original intent), and
also for representing invalid octets in an input character stream.
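
For concreteness, here is a minimal sketch of how that mapping
could be used.  The names octet->char, char->octet and c-putc are
mine (c-putc stands for a hypothetical FFI binding), not part of
the draft:

    ;; The [0..255] integer <-> char mapping, spelled out.
    (define (octet->char n) (integer->char n))   ; n in [0..255]
    (define (char->octet c) (char->integer c))

    ;; Passing a raw octet through the hypothetical FFI binding:
    ;;   (c-putc (octet->char #xa9))
    ;; Keeping an invalid input octet around as a character:
    ;;   (define bad-octet-char (octet->char #xff))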

A multibyte CES that doesn't have a sensible character mapping
for the integer range [128..255] can use "illegal" byte sequences
to represent these exceptional values.
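
One possible convention, assuming a UTF-8-like internal CES (my
assumption; the draft doesn't fix a CES): an octet that cannot
start a valid sequence is kept verbatim as a deliberately illegal
one-byte sequence, and decodes back to (integer->char octet).

    ;; Which bytes can begin a valid UTF-8 sequence?
    (define (utf8-lead-byte? b)
      (or (< b #x80)                         ; ASCII
          (and (>= b #xc2) (<= b #xdf))      ; lead of 2-byte sequence
          (and (>= b #xe0) (<= b #xef))      ; lead of 3-byte sequence
          (and (>= b #xf0) (<= b #xf4))))    ; lead of 4-byte sequence

    ;; A byte failing this test (a stray #xff, or #xc0/#xc1) would be
    ;; represented as the exceptional character (integer->char b).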

Having such a character within a string may incur some penalties:

 * Two strings can't be compared directly in an mb implementation,
   since for two integers x and y,
    (< x y)  <=>  (char<? (integer->char x) (integer->char y))
   is required, and it's not easy to encode the characters
   corresponding to the integers [128..255] into illegal byte
   sequences while preserving their order relative to other
   codepoints (a quick check of this requirement is sketched
   after this list).

 * Displaying a string can no longer directly write out the internal
   mb representation (which is one of the main advantages of having
   mb strings) if the string contains these illegal byte sequences.
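
The first point amounts to a simple property; the following check
assumes the full [0..255] mapping is in place and uses only
standard procedures:

    ;; Does the character order agree with the integer order on
    ;; [0..255]?  Checking adjacent codes is enough, since char<?
    ;; is transitive.
    (define (octet-order-preserved?)
      (let loop ((x 0))
        (cond ((>= x 255) #t)
              ((char<? (integer->char x) (integer->char (+ x 1)))
               (loop (+ x 1)))
              (else #f))))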

But these penalties are minimized if such a string is flagged
specially and the use of such special characters is relatively rare.

I think using strings for binary I/O should be explicitly
discouraged, even though an octet sequence can be represented by
using such special characters.  It can be very inefficient
on some implementations, and it may cause problems on ports
that deal with character encodings.   Pure binary I/O can
be done better using srfi-4 uniform vectors and special
binary I/O primitives.
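
A rough sketch of that alternative, where read-octet stands in for
whatever binary input primitive an implementation provides (a thunk
returning an exact integer in [0..255], or #f at end of input); only
the u8vector operations come from srfi-4:

    ;; Fill VEC with raw octets; returns the number of octets read.
    (define (u8vector-fill-from! vec read-octet)
      (let loop ((i 0))
        (if (< i (u8vector-length vec))
            (let ((b (read-octet)))
              (if b
                  (begin (u8vector-set! vec i b)
                         (loop (+ i 1)))
                  i))
            i)))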

--shiro