Title

Transcoders and transcoded ports

Author

The R6RS editors; John Cowan (shepherd)

Status

This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-186@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This is an extract from the R6RS that documents its support for transcoders and transcoded ports. These provide a hook into the Scheme port system from below, allowing the creation of textual ports that provide non-default encoding and decoding from arbitrary binary ports. It has been lightly edited to fit R7RS style.

Rationale

When reading from (or writing to) files, devices, pipes, sockets, or other sources (or sinks) of data, it's sometimes useful or necessary to handle textual data that are encoded differently from the system default encoding.

This can be done at a level above Scheme ports, but with some loss in convenience. In particular, the high-level Scheme procedures like read, write, and display only accept port arguments. By making it possible to create transcoded ports that accept a binary port and return a corresponding textual port, convenience is served.

Specification

Several different character encoding schemes exist that describe standard ways to encode characters and strings as byte sequences and to decode those sequences. Within this document, a codec is an immutable Scheme object that represents a Unicode or similar encoding scheme.

An end-of-line style is a symbol that describes how a textual port transcodes representations of line endings. The symbols lf and none mean that no newline conversion is done. The symbol crlf means that after decoding with a codec, the sequence CRLF, that is, a #\return followed by a #\newline, is replaced by #\newline. Correspondingly, the reverse is done before encoding. Implementations may support additional symbols.
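As a minimal sketch of the crlf and lf styles, the following assumes an R6RS system whose (rnrs) library supplies these procedures and accepts plain symbols as end-of-line styles. The names crlf-utf8 and lf-utf8 are illustrative only and are not part of this SRFI.

```scheme
(import (rnrs))

;; Assumption: an R6RS implementation; crlf-utf8 and lf-utf8 are
;; hypothetical names chosen for this example.
(define crlf-utf8 (make-transcoder (utf-8-codec) 'crlf 'replace))
(define lf-utf8   (make-transcoder (utf-8-codec) 'lf 'replace))

;; Encoding: under crlf, each #\newline should be written as the
;; byte pair 13 10 (CR LF); under lf it stays the single byte 10.
(write (string->bytevector "a\nb" crlf-utf8))
(newline)
(write (string->bytevector "a\nb" lf-utf8))
(newline)

;; Decoding: under crlf, the CR LF byte pair should be read back
;; as a single #\newline.
(write (string=? "a\nb"
                 (bytevector->string #vu8(97 13 10 98) crlf-utf8)))
(newline)
```

Keeping the line-ending choice inside the transcoder means that code reading or writing the text only ever deals with #\newline.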

An error-handling mode is a symbol that specifies the behavior of textual I/O operations in the presence of encoding or decoding errors.

If a textual input operation encounters an invalid or incomplete character encoding, then if the error-handling mode is replace, the invalid encoding is treated as the character U+FFFD (or, if that character is not representable by the implementation or is not permitted in strings, as the character U+003F (question mark)). But if the error-handling mode is raise, an error satisfying i/o-decoding-error? is signaled, an appropriate number of bytes are ignored, and decoding continues with the following bytes.

If a textual output operation encounters a character it cannot encode, then if the error-handling mode is replace, a codec-specific replacement character is emitted by the transcoder, and encoding continues with the next character. The replacement character is U+FFFD for transcoders whose codec can encode this character, but is U+003F (question mark) if it cannot. But if the error-handling mode is raise, an error satisfying i/o-encoding-error? is raised, and encoding continues with the next character.

Implementations may support additional symbols.
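The difference between the replace and raise modes on input can be sketched as follows. This assumes an R6RS system providing these procedures through (rnrs). The names bad-utf8, lenient, strict, repaired, and outcome are hypothetical.

```scheme
(import (rnrs))

;; The byte #xFF never occurs in well-formed UTF-8, so this is
;; "a", one undecodable byte, then "b".
(define bad-utf8 #vu8(97 255 98))

;; Hypothetical names; both transcoders differ only in error handling.
(define lenient (make-transcoder (utf-8-codec) 'none 'replace))
(define strict  (make-transcoder (utf-8-codec) 'none 'raise))

;; replace: the undecodable byte is decoded as U+FFFD (code point
;; 65533) on implementations that can represent that character.
(define repaired (bytevector->string bad-utf8 lenient))
(write (map char->integer (string->list repaired)))
(newline)

;; raise: the decoding error is signaled and can be caught with guard.
(define outcome
  (guard (e ((i/o-decoding-error? e) 'decoding-error))
    (bytevector->string bad-utf8 strict)))
(write outcome)
(newline)
```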

A transcoder is an immutable Scheme object that combines a codec, an end-of-line style, and an error-handling mode. Each transcoder represents some specific bidirectional (but not necessarily lossless), possibly stateful translation between byte sequences and Unicode characters and strings. Every transcoder can decode bytes as characters and encode characters as bytes.

Procedures

(make-codec string)

Returns a codec representing the character encoding scheme whose standard name is the case-insensitive string string.

Standard character encoding names for HTML can be found at the WHATWG encoding specification, and implementations should recognize and support all of these. There are a total of 39 encodings, which have between them 218 standard names. Note that the "replacement" codec signals an error whenever it is used. Additional encoding names listed at the IANA page on character sets may also be recognized and supported.

If make-codec is called on a string that the implementation does not support, an error is signaled.

(latin-1-codec)
(utf-8-codec)
(utf-16-codec)

These are predefined codecs for the ISO 8859-1, UTF-8, and UTF-16 encoding schemes.

A call to any of these procedures returns a value that is equal in the sense of eqv? to the result of any other call to the same procedure.

(native-eol-style)

Returns the default end-of-line style of the underlying platform, typically lf on Unix and crlf on Windows.

(i/o-decoding-error? obj)

Returns #t if obj is an exception raised when one of the operations for textual input from a port encounters a sequence of bytes that cannot be decoded into a character or string by the port's transcoder.

When such an exception is raised, the port's position is past the invalid encoding.

(i/o-encoding-error? obj)

Returns #t if obj is an exception raised when one of the operations for textual output to a port encounters a character that cannot be encoded into bytes by the port's transcoder.

(i/o-encoding-error-char i/o-encoding-condition)

Returns the character that could not be encoded when the condition i/o-encoding-condition was signaled.
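A hedged sketch of how the encoding-error predicate and accessor might be used together, assuming an R6RS system in which (rnrs) exports them. The names strict-latin-1 and encode-or-report are hypothetical helpers, not part of this SRFI.

```scheme
(import (rnrs))

;; Hypothetical name: a Latin-1 transcoder that signals errors
;; instead of substituting a replacement character.
(define strict-latin-1 (make-transcoder (latin-1-codec) 'none 'raise))

;; Hypothetical helper: returns the encoded bytevector, or a list
;; naming the character that could not be encoded.
(define (encode-or-report str)
  (guard (e ((i/o-encoding-error? e)
             (list 'cannot-encode (i/o-encoding-error-char e))))
    (string->bytevector str strict-latin-1)))

;; U+00E9 lies inside Latin-1, so this string encodes normally.
(write (encode-or-report "caf\xE9;"))
(newline)

;; U+03BB (Greek small lambda) lies outside Latin-1, so the guard
;; clause runs and reports that character.
(write (encode-or-report "\x3BB;x"))
(newline)
```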

(make-transcoder codec eol-style handling-mode)

Returns a transcoder with the behavior specified by its arguments.

(native-transcoder)

Returns an implementation-dependent transcoder that represents a possibly locale-dependent “native” transcoding. This should be equivalent to the transcoder employed by Scheme operations that open textual ports.

(transcoded-port binary-port transcoder)

Returns a new textual port with the specified transcoder from binary-port. The new textual port's state is largely the same as that of binary-port. If binary-port is an input port, the new textual port will be an input port and will decode the bytes of binary-port. If binary-port is an output port, the new textual port will be an output port and will write encoded characters to binary-port.

It is an error to call this procedure on binary-port after it has been read from or written to. It is also an error to read or write on binary-port after calling this procedure.
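The input direction can be sketched as below. This assumes an R6RS system, so the binary port comes from the R6RS procedure open-bytevector-input-port and the text is read with get-string-all. Neither procedure is specified by this SRFI, and the names raw, text, and contents are hypothetical.

```scheme
(import (rnrs))

;; The UTF-8 bytes of a five-character string whose second
;; character, U+00E9, is encoded as the two bytes #xC3 #xA9.
(define raw
  (open-bytevector-input-port #vu8(104 195 169 108 108 111)))

;; Wrap the binary port before anything has been read from it.
;; From here on, only the textual port is used.
(define text
  (transcoded-port raw (make-transcoder (utf-8-codec) 'none 'raise)))

;; get-string-all is an R6RS procedure (an assumption of this sketch).
(define contents (get-string-all text))
(close-port text)

(write (string-length contents))
(newline)
```

Because the result is an ordinary textual port, it can also be handed to procedures such as read that only accept ports.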

(bytevector->string bytevector transcoder)

Returns the string that results from decoding the bytevector according to the input direction of the transcoder.

(string->bytevector string transcoder)

Returns the bytevector that results from encoding the string according to the output direction of the transcoder.
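As an illustration of these two procedures, the sketch below encodes one string with two codecs and decodes each result with the same transcoder that produced it. It assumes an R6RS system. The names utf8, latin1, s, utf8-bytes, and latin1-bytes are hypothetical.

```scheme
(import (rnrs))

;; Hypothetical names for two transcoders that differ only in codec.
(define utf8   (make-transcoder (utf-8-codec) 'none 'raise))
(define latin1 (make-transcoder (latin-1-codec) 'none 'raise))

;; Five characters; the third is U+00EF, which takes two bytes in
;; UTF-8 and one byte in Latin-1.
(define s "na\xEF;ve")

(define utf8-bytes   (string->bytevector s utf8))
(define latin1-bytes (string->bytevector s latin1))

(write (bytevector-length utf8-bytes))
(newline)
(write (bytevector-length latin1-bytes))
(newline)

;; Decoding with the transcoder that did the encoding should
;; recover the original string.
(write (string=? s (bytevector->string utf8-bytes utf8)))
(newline)
(write (string=? s (bytevector->string latin1-bytes latin1)))
(newline)
```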

Implementation

Every conforming R6RS implementation, including at least Chez, Guile, IronScheme, Larceny, Racket, and Vicare, already provides these procedures in the (rnrs io ports) library, with the exception of make-codec: R6RS offers no portable way to create a codec for a named encoding. No implementation is provided here, since a portable one is not possible.

Acknowledgements

This would have been much more difficult without the R6RS team, who produced a good-enough (as opposed to perfect) design that John was happy to adopt. Thanks also to Mikel More, who convinced him of the necessity of having transcoded ports in R7RS-large.

Copyright

John Cowan does not claim copyright on his de minimis contributions to this SRFI. Its content is drawn almost entirely from R6RS, which does not have a copyright notice. It does, however, contain the following copyright license:

We intend this report to belong to the entire Scheme community, and so we grant permission to copy it in whole or in part without fee.

Nevertheless, in order to keep the lawyers happy, the standard SRFI license is subjoined:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Arthur A. Gleckler