243: Unreadable Data

by Lassi Kortela

Status

This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-243@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2022-11-18
Draft #1 published: 2022-11-20
Draft #2 published: 2023-06-05
Withdrawn: 2023-11-30
Lassi says: "The SRFI works as is but since people didn't like it it's fine to withdraw it."

Abstract

This SRFI suggests how the Scheme reader and writer should handle unreadable data in general, and unreadable objects in particular.

Rationale
Specification
Recommendations
Examples
Implementation
Acknowledgements
References

Rationale

What is unreadable data?

Lisp code is represented as data. A Lisp system can be asked to write any live object as an S-expression. However, it’s inevitable that some of those objects have complex environmental dependencies which are difficult or impossible to write down.

The prototypical example of such an object is a port. Other common examples are procedures, continuations, promises, parameters, environments, and libraries. Objects managed by a foreign function interface tend to be unreadable. Additionally, objects that stand in for end-of-file and unspecified values are commonly written as unreadable objects since it makes little sense to read them.

Existing syntax

Common Lisp reserves the lexical syntax #<...> for unreadable data.

Apart from being a de jure standard in Common Lisp, this syntax is a de facto standard in Scheme.

For example, Chicken writes a port as #<output port "(stdout)">.

MIT Scheme and STklos use square brackets #[...] instead of angle brackets.

For example, MIT Scheme writes a port as #[textual-i/o-port 12 for console].

Data versus objects

The syntaxes #<...> and #[...] look as if the brackets are intended to delimit an S-expression like parentheses delimit lists. But the delimiters tend to be illusory.

Angle brackets are not delimiters in any Scheme implementation nor in Common Lisp. They are identifier characters used in the names of well-known procedures such as < and string>=.

Scheme implementations accepting square brackets as list delimiters are common but far from universal, and these implementations tend to use #<...> syntax for unreadable data. STklos is the only known implementation which both accepts [...] lists and uses #[...] for unreadable data.

The use of identifier characters like < > as delimiters implies the writer will not output a well-formed S-expression. The standard Scheme write procedure is meant to output an internally consistent structure whereas display may take more liberties. The use of ill-formed syntax to write unreadable objects suits the spirit of display but not the spirit of write.

When one or more unreadable objects are nested in an otherwise readable structure, and are written using ill-formed syntax, the reader cannot recover any part of the structure. Nor can it recover subsequent structures from the same port. This will cause difficulties when S-expressions become more pervasive. As people work with larger and more heterogeneous expressions in a wider variety of contexts, it will be inconvenient to ensure that all written expressions consist only of readable objects.

Unreadable objects

In this SRFI we talk about unreadable data in general and unreadable objects in particular.

An unreadable object is any object written as a well-formed S-expression such that the reader cannot recover the original object, but can recover a stand-in object representing the original.

For example, assume an implementation which can parse the lexical syntax #[primitive append] and extract a list of two symbols, primitive and append. The syntax represents an unreadable object (presumably the standard procedure append) and the list is the stand-in for the original object.

Since the syntax is well-formed, the reader can keep reading past the unreadable object to recover more objects (if there are any). The reader can also handle structures containing any mix of readable and unreadable objects nested to an arbitrary depth. For example, the following structure is lightly adapted from Chez Scheme.

#[transcoded-port utf8-codec #[buffered-port #[binary-output-port stdout]]]

To differentiate between readable and unreadable objects in nested structures, this SRFI introduces a special data type for unreadable objects. (The data type simply encapsulates the stand-in object.)

Unreadable data

The syntax #<...> is deeply entrenched in Lisp and Scheme culture and cannot be rooted out in any reasonable amount of time. This SRFI tries to broker peace by not dictating any particular syntax. Implementations are free to keep using traditional syntax.

Unfortunately the #<...> syntax does not meet the requirements given above for unreadable objects. For example, the Common Lisp specification says:

#< is not valid reader syntax. The Lisp reader will signal an error of type reader-error on encountering #<. This syntax is typically used in the printed representation of objects that cannot be read back in.

Scheme implementations behave similarly: the reader signals an error on encountering the marker #< and makes no attempt to parse the text that follows.

We accommodate syntaxes of this kind by providing the minimal guarantee that the implementation stops reading after encountering the unreadable data marker and raises an exception. The exception handler may read the rest of the data from the port as unstructured text if the programmer so chooses.

Future ports

Current editions of RnRS say that read and write use textual ports. There is no fundamental reason for this restriction. Binary S-expressions have been demonstrated to work well, and could be accessed via the same programming interface as textual S-expressions. In fact, even non-S-expression formats such as JSON and ASN.1 could share the same interface.

Consequently the specification in this SRFI avoids talking about text. Instead it talks about data which can be either text or bytes.

It makes sense for a large Scheme implementation to support more than one variant of S-expression syntax or even to have a programmable reader and writer. There is no existing standard, but there is a consensus that the best approach is to attach a lexical syntax to each port object. This way different ports can use different syntax without getting mixed up.

This SRFI addresses the situation by requiring read and write to mind which port they are dealing with, and to use the appropriate syntax (if any) for unreadable data on that port.

Specification

Stand-in objects

(unreadable-object? object) => boolean

Return #t if object stands in for an unreadable object. Else return #f.

(unreadable-object stand-in) => unreadable-object

Make an unreadable object using the given stand-in, which can be any object.

(unreadable-object-stand-in unreadable-object) => stand-in

Return the stand-in of unreadable-object.
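The three procedures above map naturally onto a record type. The following is a minimal sketch in R7RS Scheme, not the definition used by either sample implementation; the record type name <unreadable-object> and the variable port-stand-in are illustrative assumptions.

```scheme
(import (scheme base))

;; Hypothetical record type: it simply encapsulates the stand-in.
(define-record-type <unreadable-object>
  (unreadable-object stand-in)
  unreadable-object?
  (stand-in unreadable-object-stand-in))

;; The stand-in can be any object; here it is a list of two symbols.
(define port-stand-in
  (unreadable-object '(binary-output-port stdout)))

(unreadable-object? port-stand-in)                 ; => #t
(unreadable-object? '(binary-output-port stdout))  ; => #f
(unreadable-object-stand-in port-stand-in)         ; => (binary-output-port stdout)
```

Because the wrapper is a distinct type, a plain list that merely looks like a stand-in is never mistaken for an unreadable object.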

Reading

(read [port]) => object

This RnRS procedure is expanded to account for unreadable data. (Similar modifications should be made to other procedures that read objects from ports.)

When encountering a top-level object that contains one or more unreadable objects, or is itself an unreadable object, an unreadable-error is raised and the unreadable-error-object is the offending top-level object. The top-level object has been encoded such that each unreadable object in it (including the top-level object itself, if it is unreadable) is wrapped in the unreadable-object data type. The port position lies directly after the top-level object. It is unspecified whether atmosphere (whitespace and comments) following the top-level object has been consumed. The programmer may attempt to read more objects from the port.

When encountering unreadable data that is not an unreadable object, an unreadable-error is raised and the unreadable-error-object is #f. The port position lies immediately after the marker which indicates the start of unreadable data. For example, in the case of a textual port for which the unreadable data marker is #< the next read-char will read the character immediately following the <. In general, it does not make sense to attempt to read objects from the port at this point.

(unreadable-error? object) => boolean

Return #t if object is an unreadable data error. Else return #f.

All unreadable data errors also satisfy the RnRS read-error? predicate. In other words, unreadable data errors are a subtype of read errors.

(unreadable-error-object error) => object

If all unreadable data were encoded as stand-in objects which the implementation was able to read, return the top-level object containing those unreadable objects. The top-level object may or may not itself be an unreadable object.

If the unreadable data was not encoded as a stand-in, the return value is #f.
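The two error cases call for different recovery: after an unreadable object the caller may keep reading, whereas after ill-formed unreadable data it should normally stop. The sketch below illustrates that control flow only. It assumes a locally defined record in place of the real error type, and the names read-all, script and scripted-reader are hypothetical; the scripted reader stands in for a read procedure that follows this SRFI.

```scheme
(import (scheme base))

;; Hypothetical stand-in for the SRFI's error type, so the sketch is self-contained.
(define-record-type <unreadable-error>
  (make-unreadable-error object)
  unreadable-error?
  (object unreadable-error-object))

;; Calls read-one until end-of-file or unreadable data; returns tagged results.
(define (read-all read-one)
  (let loop ((results '()))
    (let ((outcome
           (guard (err ((unreadable-error? err)
                        (let ((top (unreadable-error-object err)))
                          ;; #f means ill-formed unreadable data: give up.
                          (if top (list 'partial top) 'stop))))
             (let ((obj (read-one)))
               (if (eof-object? obj) 'stop (list 'ok obj))))))
      (if (eq? outcome 'stop)
          (reverse results)
          (loop (cons outcome results))))))

;; A scripted reader that imitates four successive calls to read.
(define script
  (list (lambda () '(a b))
        (lambda () (raise (make-unreadable-error '(c stand-in))))
        (lambda () 'd)
        (lambda () (raise (make-unreadable-error #f)))
        (lambda () 'never-reached)))

(define (scripted-reader)
  (if (null? script)
      (eof-object)
      (let ((step (car script)))
        (set! script (cdr script))
        (step))))

(define results (read-all scripted-reader))
;; results => ((ok (a b)) (partial (c stand-in)) (ok d))
```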

Writing

(write object [port])

This RnRS procedure is expanded to account for unreadable data. (Similar modifications should be made to display, write-shared, write-simple, and other procedures that write objects to ports.)

If object satisfies unreadable-object? or is some other type of unreadable object for which a stand-in can be generated, then the stand-in is written to port using an implementation-defined lexical syntax. The syntax may vary based on implementation-defined settings attached to port.

For example, the stand-in (procedure append) — a list of the two symbols procedure and append — could represent a procedure and could be written using the syntax #[procedure append].

If object cannot be written to port for syntax reasons then an exception satisfying unwritable-error? is raised.

(unwritable-error? object) => boolean

Return #t if object is an unwritable object error. Else return #f.

Possible causes for the error include the following.

The port's syntax cannot encode any unreadable data at all.
The implementation does not know how to encode a stand-in for this particular object.
The programmer has set an implementation-defined flag saying that trying to write unreadable data to the port should raise an error.

(unwritable-error-object error) => object

Returns the original object for which no stand-in could be written.

If one or more unwritable objects are nested within an otherwise writable object, it is unspecified whether the object returned by unwritable-error-object is the top-level object or one of the nested unwritable objects.
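As an illustration of the writing rules, here is a toy writer in R7RS Scheme. It is a sketch under several assumptions rather than a model of any real implementation: the record types are local stand-ins for the SRFI's types, the syntax is #[...], only list stand-ins are supported (following the recommendation that the stand-in be a list), and the names toy-write and write-to-string are hypothetical. Note that it may have written part of the output before it raises the error.

```scheme
(import (scheme base) (scheme write))

(define-record-type <unreadable-object>
  (unreadable-object stand-in)
  unreadable-object?
  (stand-in unreadable-object-stand-in))

;; Hypothetical stand-in for the SRFI's unwritable error type.
(define-record-type <unwritable-error>
  (make-unwritable-error object)
  unwritable-error?
  (object unwritable-error-object))

;; Writes proper lists, symbols, numbers and strings.
;; Unreadable objects are written as #[...] around their stand-in list.
(define (toy-write obj port)
  (define (write-items items)
    (let loop ((items items) (first? #t))
      (unless (null? items)
        (unless first? (write-char #\space port))
        (toy-write (car items) port)
        (loop (cdr items) #f))))
  (cond ((unreadable-object? obj)
         (let ((stand-in (unreadable-object-stand-in obj)))
           ;; Simplification: this toy syntax can only encode list stand-ins.
           (unless (list? stand-in)
             (raise (make-unwritable-error obj)))
           (write-string "#[" port)
           (write-items stand-in)
           (write-char #\] port)))
        ((list? obj)
         (write-char #\( port)
         (write-items obj)
         (write-char #\) port))
        ((or (symbol? obj) (number? obj) (string? obj))
         (write obj port))
        ;; No stand-in is known for anything else (procedures, ports, ...).
        (else (raise (make-unwritable-error obj)))))

(define (write-to-string obj)
  (let ((out (open-output-string)))
    (toy-write obj out)
    (get-output-string out)))

(write-to-string (list 'log (unreadable-object '(procedure append))))
;; => "(log #[procedure append])"

(guard (err ((unwritable-error? err) 'unwritable))
  (write-to-string (list 1 car)))
;; => unwritable
```

In this sketch the error object is the nested unwritable object rather than the top-level one, which is one of the two behaviours the specification permits.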

Recommendations

In implementations with an R6RS-style condition system, it is recommended that the condition types &unreadable and &unwritable be defined.
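One possible shape for these condition types in R6RS is sketched below. The choice of parent types (&lexical and &error), the single object field, and the constructor names are assumptions made for illustration; the SRFI only names the two condition types.

```scheme
(import (rnrs))

;; Assumption: unreadable data is treated as a kind of lexical violation.
(define-condition-type &unreadable &lexical
  make-unreadable-error unreadable-error?
  (object unreadable-error-object))

;; Assumption: failure to write is modelled as a plain &error subtype.
(define-condition-type &unwritable &error
  make-unwritable-error unwritable-error?
  (object unwritable-error-object))

(define c (make-unreadable-error '(1 2)))
(unreadable-error? c)        ; => #t
(lexical-violation? c)       ; => #t
(unreadable-error-object c)  ; => (1 2)
```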

The implementation should write well-formed unreadable objects instead of ill-formed unreadable data whenever feasible.

The unreadable object syntax should closely resemble the syntax for ordinary lists, for example by using parentheses or square brackets with a suitable prefix.

The stand-in should be a list.

The first element of the list should be a symbol. Known symbols are tracked in the Scheme Registry.

Examples

Assume an implementation that reads and writes unreadable objects using the lexical syntax #[...] such that the [...] part represents a list.

Then the following code snippet:

(guard (err
        ((unreadable-error? err)
         ;; The error object is the whole top-level list that was read.
         (let* ((top (unreadable-error-object err))
                (outer (unreadable-object-stand-in (list-ref top 5)))
                (inner (unreadable-object-stand-in (list-ref outer 2))))
           (define (wr msg obj) (display msg) (write obj) (newline))
           (wr "inner stand-in: " inner)
           (wr "outer stand-in: " outer)
           (wr "top object: " top))))

       ;; read-from-string is not an RnRS procedure; it is assumed to
       ;; read one object from the given string.
       (read-from-string "(here is an unreadable object #[1 2 #[3 4 5]])"))

will display the following output:

inner stand-in: (3 4 5)
outer stand-in: (1 2 #[3 4 5])
top object: (here is an unreadable object #[1 2 #[3 4 5]])

Implementation

Two sample implementations are provided.

One written in Scheme as a patch to Göran Weinholt's "laesare" library.
One written in C as a patch to the STklos Scheme implementation.

In general, the following implementation strategy is recommended.

When reading nested data, keep a flag to remember whether any unreadable objects have been encountered.
Every such object is wrapped in the unreadable-object data type and the wrapped object is stored in the structure being built. Keep in mind that unreadable objects can be nested.
When the top-level read procedure has built a complete object, it checks the unreadable object flag. If the flag is set, it raises an unreadable-error exception whose unreadable-error-object is the offending top-level object. If the flag is not set, it returns the object as usual.
If the reader encounters unreadable data that isn't an unreadable object, it immediately raises an unreadable-error exception whose unreadable-error-object is #f without trying to read anything more.
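The strategy above can be sketched as a toy reader in R7RS Scheme. This is not the code of either sample implementation. It assumes #[...] as the unreadable object syntax and #< as the unreadable data marker, handles only symbols, numbers and lists, and uses local records in place of the real types; the names toy-read and try-read are hypothetical.

```scheme
(import (scheme base) (scheme char))

(define-record-type <unreadable-object>
  (unreadable-object stand-in)
  unreadable-object?
  (stand-in unreadable-object-stand-in))

(define-record-type <unreadable-error>
  (make-unreadable-error object)
  unreadable-error?
  (object unreadable-error-object))

(define (toy-read port)
  (define seen-unreadable? #f)  ; flag: was any unreadable object encountered?
  (define (skip-whitespace)
    (let ((c (peek-char port)))
      (when (and (char? c) (char-whitespace? c))
        (read-char port)
        (skip-whitespace))))
  (define (delimiter? c)
    (or (eof-object? c) (char-whitespace? c) (memv c '(#\( #\) #\[ #\]))))
  (define (read-atom)
    (let loop ((chars '()))
      (if (delimiter? (peek-char port))
          (let ((text (list->string (reverse chars))))
            (or (string->number text) (string->symbol text)))
          (loop (cons (read-char port) chars)))))
  (define (read-list close)
    (skip-whitespace)
    (let ((c (peek-char port)))
      (cond ((eof-object? c) (error "toy-read: unexpected end of input"))
            ((char=? c close) (read-char port) '())
            (else (let ((head (read-datum)))
                    (cons head (read-list close)))))))
  (define (read-datum)
    (skip-whitespace)
    (let ((c (peek-char port)))
      (cond ((eof-object? c) c)
            ((char=? c #\() (read-char port) (read-list #\)))
            ((memv c '(#\) #\])) (error "toy-read: unexpected closer" c))
            ((char=? c #\#)
             (read-char port)
             (let ((next (read-char port)))
               (cond ((eqv? next #\[)
                      ;; Unreadable object: set the flag, wrap the stand-in.
                      (set! seen-unreadable? #t)
                      (unreadable-object (read-list #\])))
                     ((eqv? next #\<)
                      ;; Unreadable data: stop right after the marker.
                      (raise (make-unreadable-error #f)))
                     (else (error "toy-read: unsupported # syntax" next)))))
            (else (read-atom)))))
  ;; Build the complete top-level object first, then check the flag.
  (let ((datum (read-datum)))
    (if seen-unreadable?
        (raise (make-unreadable-error datum))
        datum)))

(define (try-read text)
  (guard (err ((unreadable-error? err)
               (list 'unreadable (unreadable-error-object err))))
    (toy-read (open-input-string text))))

(try-read "(a b 1)")      ; => (a b 1)
(try-read "(a #<port>)")  ; => (unreadable #f)

;; After #<, the handler may read the rest of the port as plain text.
(define p (open-input-string "#<output port> tail"))
(define rest-of-line
  (guard (err ((unreadable-error? err) (read-line p)))
    (toy-read p)))
;; rest-of-line => "output port> tail"
```

Raising only after the whole top-level object has been built is what leaves the port positioned directly after that object, so the caller can go on reading.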

Acknowledgements

Thanks to Marc Nieper-Wißkirchen for a thorough discussion of the problems in parsing #<...>.

References

The Common Lisp HyperSpec, section 2.4.8.20 (Sharpsign Less-Than-Sign)

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Arthur A. Gleckler