
Re: Encodings.


It has long been my opinion that Scheme needs standardized binary I/O
capabilities.  Binary ports should be distinguished from character
ports at a low level, and binary I/O operations should be distinct
operations from character I/O operations.

I'm entirely happy with (read-char) and (write-char) reading or
writing "a character" (although I'll note that opinions vary on what
a character is, nuff said), or with (read) and (write) reading or
writing a data object in an external representation made of
characters.

Those are character operations, and when dealing strictly with
character operations, the appropriate place for concerns about
encoding, endianness, and external canonicalization is below the
level of the program's notice.  Fold all that stuff into the port
code for character ports and don't bother the programmer with it.  As
far as text-processing code is concerned, a character is a character
is a character, or at least that's how it should be.
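To make that concrete, here is a minimal sketch in Scheme.  The
procedure below counts the characters arriving on a character port;
nothing in it (or in its caller) depends on whether the port's
implementation decodes UTF-8, Latin-1, EBCDIC, or anything else from
the underlying bytes:

    ;; Count characters on a character port.  The encoding of the
    ;; underlying byte stream is entirely the port's business.
    (define (count-chars port)
      (let loop ((n 0))
        (if (eof-object? (read-char port))
            n
            (loop (+ n 1)))))

    ;; The same code works whatever encoding the platform uses:
    ;; (call-with-input-file "notes.txt" count-chars)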

That leaves implementors the freedom to implement their character
ports in terms of whatever abstraction their particular platform uses
for characters, whether it's UTF-16BE or UTF-32LE or ASCII or EBCDIC
or Latin-1 or ISO-foo or ISCII or PETSCII or the ZX81 charset or some
other embedded-processor weirdness.  This is not a bug, it's a
feature.  If there are multiple encodings/canonicalizations/etc. in
use on a system, let Schemes on those systems implement multiple
kinds of character ports.

But it follows that there is NO WAY we should rely on I/O of
characters through character ports to read or write a particular
binary representation for "raw" data such as sound and image files.
Attempting to do so is bad design, because it breaks an abstraction
barrier and presumes things which are beyond the program's proper
control or knowledge.

The only time programmers want to write characters that aren't in the
"normal" encoding/canonicalization/etc. is when they need really
close control of the exact format of I/O.  But when you need control
*that* close, you're not talking about a "character" port at all any
more; you're talking about binary I/O.  Rather than breaking the
abstraction barrier on character ports, you need a different kind of
port.  We need binary ports that support operations like (read-bytes)
and (write-bytes).
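As a sketch of what such an API might look like (the procedure names
are illustrative, not standardized; I'm assuming a read-bytes that
returns a list of byte values), here is how a program might pull a
little-endian 32-bit field out of, say, a WAV or BMP header, with no
character encoding involved at any point:

    ;; Hypothetical binary-port operation: read a little-endian
    ;; unsigned 32-bit integer.  Assumes (read-bytes 4 port) returns
    ;; a list of four bytes, least significant first in the file.
    (define (read-u32-le port)
      (let ((bytes (read-bytes 4 port)))
        (+ (list-ref bytes 0)
           (* 256 (list-ref bytes 1))
           (* 65536 (list-ref bytes 2))
           (* 16777216 (list-ref bytes 3)))))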

It may sometimes be necessary to read and write "characters" on these
ports; but character fields inside binary data formats tend to be
both very rigid and very diverse in their encodings, so character
operations on binary ports, if supported at all, should IMO take
mandatory arguments specifying their encoding/etc.
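For instance (again hypothetical names, assuming the read-bytes above
and a bytes->string conversion that takes an explicit encoding), a
fixed-width character field in a binary record might be read like so;
note there is no default encoding for the port to fall back on:

    ;; Hypothetical: read a fixed-width character field from a binary
    ;; record.  The encoding argument is mandatory.
    (define (read-fixed-string port len encoding)
      (bytes->string (read-bytes len port) encoding))

    ;; e.g. the 11-byte name field of a FAT directory entry:
    ;; (read-fixed-string bin-port 11 'ascii)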