This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-186@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
When reading from (or writing to) files, devices, pipes, sockets, or other sources (or sinks) of data, it's sometimes useful or necessary to handle textual data that are encoded differently from the system default encoding.
This can be done at a level above Scheme ports, but with some loss in convenience. In particular, the high-level Scheme procedures like read, write, and display only accept port arguments. This SRFI restores that convenience by making it possible to create transcoded ports: given a binary port, a corresponding textual port is returned.
Several different character encoding schemes exist that describe standard ways to encode characters and strings as byte sequences and to decode those sequences. Within this document, a codec is an immutable Scheme object that represents a Unicode or similar encoding scheme.
An end-of-line style is a symbol that describes how a textual port transcodes representations of line endings.
The symbols lf and none mean that no newline conversion is done.
The symbol crlf means that after decoding with a codec, the sequence CRLF, that is, a #\return followed by a #\newline, is replaced by #\newline. Correspondingly, the reverse is done before encoding.
Implementations may support additional symbols.
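As a rough illustration of the crlf style, the sketch below decodes the same bytes with two transcoders. It assumes an R6RS system whose (rnrs) library supplies make-transcoder, utf-8-codec, and bytevector->string with the behavior described in this document, and that end-of-line styles and error-handling modes may be passed as quoted symbols. The variable names are invented for the example.

```scheme
(import (rnrs))

;; The UTF-8 bytes of "one", CR, LF, "two".
(define bytes
  (u8-list->bytevector '(111 110 101 13 10 116 119 111)))

;; Assumption: styles and modes are given as plain symbols.
(define crlf-transcoder (make-transcoder (utf-8-codec) 'crlf 'replace))
(define none-transcoder (make-transcoder (utf-8-codec) 'none 'replace))

;; With crlf, the CR LF pair is collapsed into a single #\newline.
(define decoded-crlf (bytevector->string bytes crlf-transcoder))

;; With none, both characters survive decoding unchanged.
(define decoded-none (bytevector->string bytes none-transcoder))
```

Under these assumptions decoded-crlf has seven characters and decoded-none has eight, the extra one being the #\return.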
An error-handling mode is a symbol that specifies the behavior of textual I/O operations in the presence of encoding or decoding errors.
If a textual input operation encounters an invalid or incomplete character encoding, then if the error-handling mode is replace, it is treated as the character U+FFFD (or, if that character is not representable by the implementation or is not permitted in strings, as the character U+003F, question mark).
But if the error-handling mode is raise, an error satisfying i/o-decoding-error? is signaled, an appropriate number of bytes are ignored, and decoding continues with the following bytes.
If a textual output operation encounters a character it cannot encode, then if the error-handling mode is replace, a codec-specific replacement character is emitted by the transcoder, and encoding continues with the next character.
The replacement character is U+FFFD for transcoders whose codec can encode this character, but is U+003F (question mark) if it cannot.
But if the error-handling mode is raise, an error satisfying i/o-encoding-error? is raised, and encoding continues with the next character.
Implementations may support additional symbols.
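The two modes can be compared on the decoding side with a short sketch. It assumes an R6RS system in which (rnrs) provides the procedures named in this document and accepts quoted symbols for the style and mode arguments. The names bad-bytes, replaced, and raised? are hypothetical.

```scheme
(import (rnrs))

;; The byte #xFF never occurs in well-formed UTF-8.
(define bad-bytes (u8-list->bytevector '(97 255 98)))

;; replace: the invalid byte is decoded as U+FFFD.
(define replaced
  (bytevector->string bad-bytes
                      (make-transcoder (utf-8-codec) 'none 'replace)))

;; raise: the same input signals an error that satisfies
;; i/o-decoding-error?, which guard catches here.
(define raised?
  (guard (e ((i/o-decoding-error? e) #t))
    (bytevector->string bad-bytes
                        (make-transcoder (utf-8-codec) 'none 'raise))
    #f))
```

On an implementation that can represent U+FFFD in strings, the middle character of replaced is expected to be that character, and raised? is expected to be true.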
A transcoder is an immutable Scheme object that combines a codec, an end-of-line style, and an error-handling mode. Each transcoder represents some specific bidirectional (but not necessarily lossless), possibly stateful translation between byte sequences and Unicode characters and strings. Every transcoder can decode bytes as characters and encode characters as bytes.
(make-codec string)
Returns a codec representing the character encoding scheme whose standard name is the case-insensitive string string.
Standard character encoding names for HTML can be found at the WHATWG encoding specification, and implementations should recognize and support all of these. There are a total of 39 encodings, which have between them 218 standard names. Note that the "replacement" codec signals an error whenever it is used. Additional encoding names listed at the IANA page on character sets may also be recognized and supported.
If make-codec is called with an encoding name that the implementation does not support, an error is signaled.
(latin-1-codec)
(utf-8-codec)
(utf-16-codec)
These are predefined codecs for the ISO 8859-1, UTF-8, and UTF-16 encoding schemes.
A call to any of these procedures returns a value that is equal in the sense of eqv? to the result of any other call to the same procedure.
(native-eol-style)
Returns the default end-of-line style of the underlying platform, typically lf on Unix and crlf on Windows.
(i/o-decoding-error? obj)
Returns #t if obj is an exception raised when one of the operations for textual input from a port encounters a sequence of bytes that cannot be decoded into a character or string by the port's transcoder.
When such an exception is raised, the port's position is past the invalid encoding.
(i/o-encoding-error? obj)
Returns #t if obj is an exception raised when one of the operations for textual output to a port encounters a character that cannot be encoded into bytes by the port's transcoder.
(i/o-encoding-error-char i/o-encoding-condition)
Returns the character that could not be encoded when the condition i/o-encoding-condition was signaled.
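A small sketch of the encoding side follows. It assumes an R6RS system where (rnrs) exports these procedures and accepts quoted symbols for the style and mode arguments. The test string and variable names are made up for illustration.

```scheme
(import (rnrs))

;; U+03BB (Greek small letter lambda) has no ISO 8859-1 encoding.
(define text "a\x3BB;b")

;; replace: Latin-1 cannot encode U+FFFD either, so the
;; replacement emitted is U+003F (question mark), byte 63.
(define replaced-bytes
  (string->bytevector text
                      (make-transcoder (latin-1-codec) 'none 'replace)))

;; raise: catch the error and recover the offending character.
(define offending-char
  (guard (e ((i/o-encoding-error? e) (i/o-encoding-error-char e)))
    (string->bytevector text
                        (make-transcoder (latin-1-codec) 'none 'raise))
    #f))
```

Following the rules above, replaced-bytes should hold the bytes 97, 63, and 98, and offending-char should be the lambda character.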
(make-transcoder codec eol-style handling-mode)
Returns a transcoder with the behavior specified by its arguments.
(native-transcoder)
Returns an implementation-dependent transcoder that represents a possibly locale-dependent “native” transcoding. This should be equivalent to the transcoder employed by Scheme operations that open textual ports.
(transcoded-port binary-port transcoder)
Returns a new textual port with the specified transcoder from binary-port. The new textual port's state is largely the same as that of binary-port. If binary-port is an input port, the new textual port will be an input port and will decode the bytes of binary-port. If binary-port is an output port, the new textual port will be an output port and will write encoded characters to binary-port.
It is an error to call this procedure on binary-port after it has been read from or written to. It is also an error to read or write on binary-port after calling this procedure.
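The following sketch shows one way the port-wrapping procedure described above might be used on input. It assumes the R6RS name transcoded-port and an R6RS system where (rnrs) supplies it together with open-bytevector-input-port and get-line. The port and variable names are hypothetical.

```scheme
(import (rnrs))

;; The UTF-8 bytes of "hi", CR, LF, "yo", held in an in-memory binary port.
(define binary-in
  (open-bytevector-input-port
   (u8-list->bytevector '(104 105 13 10 121 111))))

;; Wrap the untouched binary port. From here on only text-in is used,
;; since reading binary-in directly would be an error.
(define text-in
  (transcoded-port binary-in
                   (make-transcoder (utf-8-codec) 'crlf 'replace)))

;; Ordinary textual input procedures now work on the decoded data.
(define first-line (get-line text-in))
(define second-line (get-line text-in))
```

With the crlf style the CR LF pair is read as a single line ending, so the two lines are expected to come back as "hi" and "yo".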
(bytevector->string bytevector transcoder)
Returns the string that results from decoding the bytevector according to the input direction of the transcoder.
(string->bytevector string transcoder)
Returns the bytevector that results from encoding the string according to the output direction of the transcoder.
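A round trip through both conversion procedures can be sketched as follows, assuming an R6RS system whose (rnrs) library provides them and accepts quoted symbols for the style and mode arguments. The names utf-8, encoded, and round-trip are invented for the example.

```scheme
(import (rnrs))

(define utf-8 (make-transcoder (utf-8-codec) 'none 'raise))

;; U+03BB occupies two bytes in UTF-8 (206 187); #\x occupies one (120).
(define encoded (string->bytevector "\x3BB;x" utf-8))

;; Decoding with the same transcoder recovers the original string.
(define round-trip (bytevector->string encoded utf-8))
```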
Every conforming R6RS implementation, including at least Chez, Guile, IronScheme, Larceny, Racket, and Vicare, already provides these procedures with the exception of make-codec, as there is no portable way to create a codec for a named encoding in R6RS in the (rnrs io ports) library.
Therefore, no implementation is provided here, especially since a portable implementation is not possible.
John Cowan does not claim copyright on his de minimis contributions to this SRFI. Its content is drawn almost entirely from R6RS, which does not have a copyright notice. It does, however, contain the following copyright license:
We intend this report to belong to the entire Scheme community, and so we grant permission to copy it in whole or in part without fee.
Nevertheless, in order to keep the lawyers happy, the standard SRFI license is subjoined:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.