181: Custom ports (including transcoded ports)

by the R6RS editors; John Cowan (shepherd)

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-181@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This SRFI is derived from parts of library section 8.2.4, library section 8.2.7, library section 8.2.10, and library section 8.2.13 of the R6RS. These sections are themselves based on parts of SRFI 79, SRFI 80 and SRFI 81. These procedures provide a hook into the Scheme port system from below, allowing the creation of custom ports that behave as much as possible like the standard file, string, and bytevector ports, but that call a procedure to produce data to input ports or to consume data from output ports. Procedures for creating ports that transcode between bytes and characters are an important special case and are also documented in this SRFI.

Rationale

When reading from (or writing to) files, devices, pipes, sockets, or other sources (or sinks) of data, it's often useful or necessary to perform one or more transformations on the data.

All of these can be done at a level above Scheme ports, but with some loss in convenience. In particular, the high-level Scheme I/O procedures like read, write, and display only accept port arguments. By making it possible to create custom ports that accept a low-level read (write) operation, perform a transformation, and pass it on to some other port, convenience is served. It is also straightforward to chain custom ports together in order to create transformation pipelines.

Examples of such transformations are:

A very important case of transformation is character encoding and decoding. It's sometimes useful or necessary to handle textual data that are encoded differently from the system default encoding. This SRFI provides comprehensive facilities for handling many different encodings by creating a custom textual port on top of a binary port.

Specification

Note: The effect of char-ready? and u8-ready? on custom ports is unspecified.

Custom ports in general

The types of the arguments to the procedures of this section of this SRFI are as follows:

(make-custom-binary-input-port id read! get-position set-position! close)

Returns a newly created binary input port whose byte source is an arbitrary algorithm represented by the read! procedure.
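The following is a minimal sketch of a byte source driven by a thunk. It assumes the R6RS convention that read! is called as (read! bytevector start count), stores at most count bytes beginning at index start, and returns the number of bytes stored, with 0 meaning end of file. It also assumes that the library can be imported as (srfi 181) and that #f is accepted for the get-position, set-position!, and close arguments. The names thunk->binary-input-port and countdown are hypothetical.

```scheme
(import (scheme base) (srfi 181))

;; Hypothetical helper: wrap a thunk that returns either a byte or an
;; end-of-file object. The thunk must keep returning end-of-file once
;; it is exhausted.
(define (thunk->binary-input-port next-byte)
  (define (read! bv start count)
    (let loop ((i 0))
      (if (= i count)
          i
          (let ((b (next-byte)))
            (if (eof-object? b)
                i                         ; 0 here signals end of file
                (begin
                  (bytevector-u8-set! bv (+ start i) b)
                  (loop (+ i 1))))))))
  (make-custom-binary-input-port "thunk" read! #f #f #f))

;; Hypothetical byte source: yields n-1, n-2, ..., 0 and then end of file.
(define (countdown n)
  (lambda ()
    (if (zero? n)
        (eof-object)
        (begin (set! n (- n 1)) n))))

(define port (thunk->binary-input-port (countdown 3)))
(read-u8 port) ; 2
(read-u8 port) ; 1
```

Because the result is an ordinary binary input port, the standard procedures read-u8, peek-u8, and read-bytevector apply to it unchanged.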

(make-custom-textual-input-port id read! get-position set-position! close)

Returns a newly created textual input port whose character source is an arbitrary algorithm represented by the read! procedure.
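As an illustration of a transforming port, the sketch below upcases everything read from another textual input port. It assumes that read! is called as (read! buffer start count), where the buffer may be either a string or a vector of characters, and that it returns the number of characters stored, with 0 meaning end of file. The names buffer-set! and upcasing-input-port are hypothetical, as is the (srfi 181) library name.

```scheme
(import (scheme base) (scheme char) (srfi 181))

;; Store a character into the buffer handed to read!, which may be a
;; string or a vector of characters.
(define (buffer-set! buf i ch)
  (if (string? buf)
      (string-set! buf i ch)
      (vector-set! buf i ch)))

;; Hypothetical helper: a textual input port that upcases everything
;; read from the textual input port `source`.
(define (upcasing-input-port source)
  (define (read! buf start count)
    (let loop ((i 0))
      (if (= i count)
          i
          (let ((ch (read-char source)))
            (if (eof-object? ch)
                i                         ; 0 here signals end of file
                (begin
                  (buffer-set! buf (+ start i) (char-upcase ch))
                  (loop (+ i 1))))))))
  ;; Closing the custom port also closes the underlying source.
  (make-custom-textual-input-port
   "upcasing" read! #f #f (lambda () (close-port source))))

(define port (upcasing-input-port (open-input-string "hello, world")))
(read-string 5 port) ; "HELLO"
```

Since the source is itself a port, several such ports can be stacked to form a pipeline.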

(make-custom-binary-output-port id write! get-position set-position! close [flush])

Returns a newly created binary output port whose byte sink is an arbitrary algorithm represented by the write! procedure.
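A minimal sketch of a byte sink follows. It assumes that write! is called as (write! bytevector start count) and returns the number of bytes it consumed, as in R6RS. The helper name make-byte-counter is hypothetical; the sink discards the data and only counts it.

```scheme
(import (scheme base) (srfi 181))

;; Hypothetical helper: returns two values, a binary output port that
;; discards its data and a thunk reporting how many bytes it received.
(define (make-byte-counter)
  (let ((total 0))
    (define (write! bv start count)
      (set! total (+ total count))
      count)                              ; report every byte as consumed
    (values (make-custom-binary-output-port "counter" write! #f #f #f)
            (lambda () total))))

(define-values (port bytes-written) (make-byte-counter))
(write-bytevector (bytevector 1 2 3) port)
(flush-output-port port)                  ; the port system may buffer
(bytes-written) ; 3
```

The explicit flush-output-port call matters because an implementation is free to buffer output before handing it to write!.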

(make-custom-textual-output-port id write! get-position set-position! close [flush])

Returns a newly created textual output port whose character sink is an arbitrary algorithm represented by the write! procedure.
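The sketch below duplicates its output to two other textual output ports and forwards flush requests to both. It assumes that write! receives a string or a vector of characters together with a start index and a count, and that the optional flush argument is a thunk invoked when the custom port is flushed. The names buffer-ref and tee-output-port are hypothetical.

```scheme
(import (scheme base) (srfi 181))

;; Fetch a character from the buffer handed to write!, which may be a
;; string or a vector of characters.
(define (buffer-ref buf i)
  (if (string? buf)
      (string-ref buf i)
      (vector-ref buf i)))

;; Hypothetical helper: everything written to the result is written to
;; both sink-a and sink-b.
(define (tee-output-port sink-a sink-b)
  (define (write! buf start count)
    (do ((i start (+ i 1)))
        ((= i (+ start count)) count)     ; report all characters consumed
      (let ((ch (buffer-ref buf i)))
        (write-char ch sink-a)
        (write-char ch sink-b))))
  (define (flush)
    (flush-output-port sink-a)
    (flush-output-port sink-b))
  (make-custom-textual-output-port "tee" write! #f #f #f flush))

(define a (open-output-string))
(define b (open-output-string))
(define tee (tee-output-port a b))
(write-string "hi" tee)
(flush-output-port tee)
(get-output-string a) ; "hi"
(get-output-string b) ; "hi"
```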

(make-custom-binary-input/output-port id read! write! get-position set-position! close [flush])

Returns a newly created binary port that is both an input and an output port. Its byte source and sink are arbitrary algorithms represented by the read! and write! procedures. Each of the arguments behaves as specified in the description of make-custom-binary-input-port (for read!) or make-custom-binary-output-port (for the other arguments).

Note: R6RS provides custom textual input/output ports (i.e. textual ports that support both input and output), but they are difficult to implement and there are no clear use cases for them, so they have been removed from this SRFI.

(make-file-error obj ...)

Returns an object which satisfies the R7RS-small predicate file-error?. The use of the objs is implementation-defined. Custom ports may raise the result of this procedure from their open procedures.

Transcoded ports

In order to create a port that transcodes between characters and bytes, it is necessary to have a transcoder available. The following sections explain how to create and use transcoders.

Transcoders

A transcoder is an immutable Scheme object that combines a codec, an end-of-line style, and an error-handling mode (see the following sections for details). Each transcoder represents some specific bidirectional (but not necessarily lossless), possibly stateful translation between byte sequences and the Scheme-level characters and strings allowed by the implementation. Every transcoder can decode bytes as characters and encode characters as bytes.

(make-transcoder codec eol-style handling-mode)

Returns a transcoder with the behavior specified by its arguments.

(native-transcoder)

Returns an implementation-dependent transcoder that represents a possibly locale-dependent “native” transcoding. This should be equivalent to the transcoder employed by Scheme operations that open textual ports.

(transcoded-port binary-port transcoder)

Returns a new textual port with the specified transcoder from binary-port. The new textual port's externally visible state is largely the same as that of binary-port. If binary-port is an input port, the new textual port will be an input port and will decode the bytes of binary-port. If binary-port is an output port, the new textual port will be an output port and will write encoded characters to binary-port.

It is an error to call this procedure on binary-port after it has been read from or written to. It is also an error to read or write on binary-port after calling this procedure.
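A short sketch of decoding through a transcoded port follows. It assumes the (srfi 181) library name and an implementation whose characters include U+00E9. The end-of-line style none and the error-handling mode raise are described in later sections.

```scheme
(import (scheme base) (srfi 181))

;; The bytes #x63 #x61 #x66 spell "caf"; #xC3 #xA9 is the UTF-8
;; encoding of U+00E9 (e with acute accent).
(define binary
  (open-input-bytevector (bytevector #x63 #x61 #x66 #xC3 #xA9)))

(define textual
  (transcoded-port binary (make-transcoder (utf-8-codec) 'none 'raise)))

;; From here on only `textual` is used; reading from `binary` again
;; would be an error.
(define decoded (read-string 4 textual))
(string-length decoded)                 ; 4
(char->integer (string-ref decoded 3))  ; 233, that is, #xE9
```

Five bytes yield four characters because the last character occupies two bytes in UTF-8.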

(bytevector->string bytevector transcoder)

Returns the string that results from decoding the bytevector according to the input direction of the transcoder.

(string->bytevector string transcoder)

Returns the bytevector that results from encoding the string according to the output direction of the transcoder.
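These two procedures compose naturally into a re-encoding step. The sketch below assumes the (srfi 181) library name and an implementation whose strings can hold U+00E9; the helper latin-1->utf-8 is hypothetical.

```scheme
(import (scheme base) (srfi 181))

(define utf8   (make-transcoder (utf-8-codec)   'none 'replace))
(define latin1 (make-transcoder (latin-1-codec) 'none 'replace))

;; U+00E9 is one byte in ISO 8859-1 and two bytes in UTF-8.
(string->bytevector "\xE9;" latin1) ; #u8(233)
(string->bytevector "\xE9;" utf8)   ; #u8(195 169)

;; Hypothetical helper: decode with one transcoder, encode with another.
(define (latin-1->utf-8 bv)
  (string->bytevector (bytevector->string bv latin1) utf8))

(latin-1->utf-8 (bytevector 233))   ; #u8(195 169)
```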

Codecs

Several different character encoding schemes exist that describe standard ways to encode characters and strings as byte sequences and to decode those sequences. Within this document, a codec is an immutable Scheme object that represents a specific encoding scheme. A codec has one or more names, represented as strings, and whatever other properties it requires in order to implement specific rules for encoding and decoding.

(make-codec string)

Returns a codec representing the character encoding scheme one of whose names matches the string string case-insensitively.

Some encoding names, encodings, and corresponding algorithms can be found in the WHATWG encoding specification, and implementations should recognize and support all of these that are feasible given space constraints. There are a total of 39 encodings, which have between them 218 standard names. Note that the "replacement" codec signals an error whenever it is used. Additional encodings listed at the IANA page on character sets are not recommended.

If make-codec is called on a string that the implementation does not support, an error satisfying unknown-encoding-error? is signaled.

(latin-1-codec)
(utf-8-codec)
(utf-16-codec)

These are predefined codecs for the ISO 8859-1, UTF-8, and UTF-16 encoding schemes. When decoding, the implementation must respect any BOM present, but the implementation may assume either endianness if no BOM is present. When encoding, whether a BOM is output and what endianness is used are implementation-dependent. A call to any of these procedures returns a value that is equal in the sense of eqv? to the result of any other call to the same procedure.

(unknown-encoding-error? obj)

Returns #t if obj is a condition object raised by make-codec or one of a set of implementation-defined objects.

(unknown-encoding-error-name unknown-encoding-obj)

Extracts the name of the unknown encoding from unknown-encoding-obj and returns it as a string. It is an error to mutate this string.
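One way to use the codec procedures together is to fall back to a known codec when a requested name is not supported. This is a sketch only: the helper name codec-or-utf-8 and the (srfi 181) library name are assumptions, and which names make-codec accepts depends on the implementation.

```scheme
(import (scheme base) (scheme write) (srfi 181))

;; Hypothetical helper: look up a codec by name, falling back to UTF-8
;; and logging the unsupported name to the current error port.
(define (codec-or-utf-8 name)
  (guard (e ((unknown-encoding-error? e)
             (display "unsupported encoding: " (current-error-port))
             (display (unknown-encoding-error-name e) (current-error-port))
             (newline (current-error-port))
             (utf-8-codec)))
    (make-codec name)))

;; Assuming no codec carries this name, the fallback is returned.
(eqv? (codec-or-utf-8 "no-such-encoding") (utf-8-codec)) ; #t
```

The eqv? comparison relies on the guarantee that repeated calls to utf-8-codec return equivalent values.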

End-of-line styles

An end-of-line style is a symbol that describes how a textual port transcodes representations of line endings. In order to conform to this SRFI, implementations must support at least three kinds of line endings: a #\newline character, a #\return character, and a #\return followed by a #\newline, which is known as a CRLF sequence. Note that these match the line endings recognized by the R7RS read-line procedure even when invoked on a non-transcoding port. Implementations may support other line endings as well.

The end-of-line style symbol none means that no line ending conversion is performed in either direction. On an input port, any other symbol will convert any line ending into a #\newline character. On an output port, the symbol crlf causes any line ending to be output as a CRLF sequence, whereas the symbol lf causes any line ending to be output as a #\newline character. All other characters remain unchanged. Implementations may support additional symbols.
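The effect of the three required symbols can be sketched with the conversion procedures from the Transcoders section, under the assumption that string->bytevector and bytevector->string apply the end-of-line style in the same way that output and input ports do. The helper names are hypothetical, and the results in the comments are what the rules above call for.

```scheme
(import (scheme base) (srfi 181))

;; Hypothetical helpers that vary only the end-of-line style.
(define (encode-with-eol str eol)
  (string->bytevector str (make-transcoder (utf-8-codec) eol 'replace)))

(define (decode-with-eol bv eol)
  (bytevector->string bv (make-transcoder (utf-8-codec) eol 'replace)))

;; Output direction: 10 is #\newline, 13 is #\return.
(encode-with-eol "a\nb" 'none) ; #u8(97 10 98)
(encode-with-eol "a\nb" 'lf)   ; #u8(97 10 98)
(encode-with-eol "a\nb" 'crlf) ; #u8(97 13 10 98)

;; Input direction: any style other than none folds CRLF to #\newline.
(decode-with-eol (bytevector 97 13 10 98) 'lf)   ; "a\nb"
(decode-with-eol (bytevector 97 13 10 98) 'none) ; "a\r\nb"
```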

(native-eol-style)

Returns the default end-of-line style of the underlying platform, typically lf on Unix and crlf on Windows.

Error-handling modes

An error-handling mode is a symbol that specifies the behavior of textual I/O operations in the presence of encoding or decoding errors.

If a textual input operation encounters an invalid or incomplete character encoding, then if the error-handling mode is replace, the erroneous bytes are treated as the character #\xFFFD, or, if that character is not representable by the implementation or is not permitted in strings, as the character #\? (question mark). But if the error-handling mode is raise, an error satisfying i/o-decoding-error? is signaled, an appropriate number of bytes are ignored, and decoding continues with the following bytes.

If a textual output operation encounters a character it cannot encode, and if the error-handling mode is replace, a replacement character is encoded instead, and encoding continues with the next character. The replacement character is #\xFFFD for transcoders that can encode this character, but is #\? (question mark) for transcoders that cannot. But if the error-handling mode is raise, an error satisfying i/o-encoding-error? is raised, and encoding continues with the next character.

Implementations may support additional symbols.

(i/o-decoding-error? obj)

Returns #t if obj is an exception raised when one of the operations for textual input from a port encounters a sequence of bytes that cannot be decoded into a character or string by the port's transcoder.

When such an exception is raised, the port's position is past the invalid encoding.
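The difference between the two required modes on input can be sketched as follows. The byte #xFF never occurs in well-formed UTF-8, so it serves as an invalid encoding. The helper name decode-with-mode and the (srfi 181) library name are assumptions.

```scheme
(import (scheme base) (srfi 181))

;; Hypothetical helper: decode up to 10 characters of UTF-8 using the
;; given error-handling mode, mapping a decoding error to a symbol.
(define (decode-with-mode bv mode)
  (let ((port (transcoded-port
               (open-input-bytevector bv)
               (make-transcoder (utf-8-codec) 'none mode))))
    (guard (e ((i/o-decoding-error? e) 'decoding-error))
      (read-string 10 port))))

(define bad (bytevector 97 #xFF 98))      ; "a", an invalid byte, "b"

(decode-with-mode bad 'raise)             ; decoding-error
(string-length (decode-with-mode bad 'replace))
                                          ; 3: "a", a replacement, "b"
```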

(i/o-encoding-error? obj)

Returns #t if obj is an exception raised when one of the operations for textual output to a port encounters a character that cannot be encoded into bytes by the port's transcoder.

(i/o-encoding-error-char i/o-encoding-condition)

Returns the character that could not be encoded when the condition i/o-encoding-condition was signaled.
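On the output side, a handler can recover the offending character. The sketch below assumes an implementation whose characters include U+03BB (Greek small letter lambda), which ISO 8859-1 cannot represent. The helper name first-unencodable-char is hypothetical.

```scheme
(import (scheme base) (srfi 181))

;; Hypothetical helper: returns the first character of str that the
;; codec cannot encode, or #f if the whole string can be encoded.
(define (first-unencodable-char str codec)
  (let ((port (transcoded-port
               (open-output-bytevector)
               (make-transcoder codec 'none 'raise))))
    (guard (e ((i/o-encoding-error? e)
               (i/o-encoding-error-char e)))
      (write-string str port)
      (flush-output-port port)            ; force any buffered encoding
      #f)))

(first-unencodable-char "abc" (latin-1-codec))       ; #f
(first-unencodable-char "a\x3BB;c" (latin-1-codec))  ; #\x3BB
```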

Implementation

Every conforming R6RS implementation, including at least Chez, IronScheme, Larceny, Racket, and Vicare, already provides these procedures in the (rnrs io ports) library, with the exceptions of make-file-error, which will already exist though not necessarily be exposed, and of make-codec, unknown-encoding-error?, and unknown-encoding-error-name. Therefore, no implementation is provided here, especially since a portable implementation is not possible.

However, strictly conforming R6RS implementations will not accept the flush argument, though a wrapper to accept and ignore it would be trivial. Furthermore, the read! and write! procedures will never be passed a vector of characters, but always a string.

This SRFI can be implemented on top of the Chicken procedures make-input-port and make-output-port in the (chicken ports) library. Chicken makes no provisions for getting and setting positions on either its built-in ports or custom ones. It also does not distinguish between textual and binary ports (as permitted by R7RS), and its strings can store binary data; indeed, interpretation as characters is up to a higher-level library such as the utf8 egg.

Shiro Kawai has provided a sample implementation that illustrates both transcoded ports and SRFI 192. It includes a positionable vector-backed custom port library to illustrate the use case of custom ports. The sample implementation, including the examples, can be found in the Git repository for SRFI 192 and in this .tgz archive.

Port positioning and peeking

The following considerations must be applied when peek-char and peek-u8 are used on custom ports. Much of the following is derived from Shiro Kawai's README file in the SRFI 192 implementation:

If the source of characters or bytes (collectively known as elements) underlying a custom input port is natively positionable (either because the source is itself a positionable port or because it is a random-access data structure like a list, vector, or string), then the custom port can support the SRFI 192 and R6RS procedures port-position and set-port-position!. This is true even if reading an element from the custom port involves reading a variable number of elements from the source. If on the other hand the source is not seekable, the get-position procedure can simply return the number of elements read from it so far, whereas the set-position! argument should be set to #f, in which case set-port-position! will be disabled.
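The two cases can be sketched side by side. The code assumes that get-position is a thunk returning the current position and that set-position! takes the new position as its single argument, as in R6RS. The helper names are hypothetical.

```scheme
(import (scheme base) (srfi 181))

;; Seekable source: a bytevector, so both position procedures are
;; supplied and simply expose the read index.
(define (seekable-bytevector-port source)
  (let ((pos 0))
    (define (read! bv start count)
      (let ((n (max 0 (min count (- (bytevector-length source) pos)))))
        (bytevector-copy! bv start source pos (+ pos n))
        (set! pos (+ pos n))
        n))                                 ; 0 signals end of file
    (make-custom-binary-input-port
     "seekable" read!
     (lambda () pos)                        ; get-position
     (lambda (new-pos) (set! pos new-pos))  ; set-position!
     #f)))

;; Non-seekable source: a thunk yielding a byte or end-of-file.
;; get-position reports how many bytes have been consumed so far, and
;; set-position! is #f so that repositioning is disabled.
(define (counting-thunk-port next-byte)
  (let ((consumed 0))
    (define (read! bv start count)
      (let ((b (next-byte)))
        (cond ((eof-object? b) 0)
              (else (bytevector-u8-set! bv start b)
                    (set! consumed (+ consumed 1))
                    1))))                   ; one byte per call is allowed
    (make-custom-binary-input-port
     "counting" read! (lambda () consumed) #f #f)))

(read-u8 (seekable-bytevector-port (bytevector 7 8 9))) ; 7
```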

However, the Scheme implementation's port system must cache the position of a custom port before peeking at it, because it may not be possible for the port to rewind its position. The peeked element is also cached, so that it can be returned on the next read. But if port-position is called before the peeked element is read, the port must return its cached position rather than calling the get-position procedure.

Acknowledgements

This SRFI would have been much more difficult to write without the R6RS team, who produced a good-enough (as opposed to perfect) design that John was happy to adopt. Thanks also to Mikel More, who convinced him of the necessity of having custom and transcoded ports in R7RS-large.

Copyright

Much of the content of this SRFI is drawn from R6RS, which does not have a copyright notice. It does, however, contain the following copyright license:

We intend this report to belong to the entire Scheme community, and so we grant permission to copy it in whole or in part without fee.

For the remaining content, the standard SRFI license applies:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

This permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Arthur A. Gleckler