Title

Binary I/O

Author

Alex Shinn

Status

This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-56 @nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This SRFI extends Scheme with procedures to read and write binary data to and from ports, including utility procedures for writing various integer and floating point values in both big and little endian formats. Predicates are provided to test if binary I/O is allowed on a port, along with new procedures for creating such ports.

Rationale

R5RS implicitly provides access only to character I/O ports, with no portable means of reading or writing binary data. Such access is a prerequisite for handling binary data formats, implementing databases and creating encoding conversion libraries, among other tasks typically expected of a programming language.

Procedure Index

Port Extensions
binary-port? character-port?
open-binary-input-file open-binary-output-file
call-with-binary-input-file call-with-binary-output-file
with-input-from-binary-file with-output-to-binary-file
I/O Extensions
read-byte write-byte peek-byte byte-ready?
Endian Procedures
default-endian default-float-endian
Fixed Read Procedures
read-binary-uint read-binary-sint
read-binary-uint8 read-binary-uint16
read-binary-uint32 read-binary-uint64
read-binary-sint8 read-binary-sint16
read-binary-sint32 read-binary-sint64
Fixed Write Procedures
write-binary-uint write-binary-sint
write-binary-uint8 write-binary-uint16
write-binary-uint32 write-binary-uint64
write-binary-sint8 write-binary-sint16
write-binary-sint32 write-binary-sint64
Network Endian Procedures
read-network-uint16 read-network-uint32 read-network-uint64
read-network-sint16 read-network-sint32 read-network-sint64
write-network-uint16 write-network-uint32 write-network-uint64
write-network-sint16 write-network-sint32 write-network-sint64
Bignum Procedures
read-ber-integer write-ber-integer
IEEE Floating Point Procedures
read-ieee-float32 read-ieee-float64
write-ieee-float32 write-ieee-float64

Specification

Port Extensions

We extend Scheme with the following two predicates to test for allowed operations on a port:

binary-port? obj
character-port? obj

These predicates return #t if OBJ allows binary or character port operations respectively, and #f otherwise. Much like INPUT-PORT? and OUTPUT-PORT? these predicates are not necessarily disjoint.

Character port operations are the input and output operations specified in R5RS: READ, READ-CHAR, PEEK-CHAR, CHAR-READY?, WRITE, DISPLAY, NEWLINE and WRITE-CHAR, plus library procedures that can be defined in terms of these. It is an error to use a character port operation on a port for which CHARACTER-PORT? returns #f.

Existing R5RS procedures that instantiate ports are implicitly character ports, including OPEN-INPUT-FILE, OPEN-OUTPUT-FILE, CALL-WITH-INPUT-FILE, CALL-WITH-OUTPUT-FILE, WITH-INPUT-FROM-FILE, WITH-OUTPUT-TO-FILE, and extensions thereof.

The following six new analogous procedures may be used to instantiate ports for which BINARY-PORT? returns #t:

open-binary-input-file path
open-binary-output-file path
call-with-binary-input-file path proc
call-with-binary-output-file path proc
with-input-from-binary-file path thunk
with-output-to-binary-file path thunk

Even if an implementation makes no distinction between binary and character ports, it is recommended that you use one of the above forms whenever you use only binary operations on a port, both for portability and to document intent more clearly.

Assuming no file-system errors, the following hold:

    (call-with-input-file <file> character-port?)          =>  #t
    (call-with-input-file <file> binary-port?)             =>  unspecified
    (call-with-binary-input-file <file> character-port?)   =>  unspecified
    (call-with-binary-input-file <file> binary-port?)      =>  #t

Both binary and character ports may be input and/or output ports, so the existing CLOSE-INPUT-PORT and CLOSE-OUTPUT-PORT work as expected on all ports.

I/O Extensions

Binary port operations are defined in terms of the following four new procedures:

read-byte [port]
write-byte int [port]
peek-byte [port]
byte-ready? [port]

These behave similarly to their R5RS -CHAR analogs except that they take and return integer values representing a single octet from the port. Specifically, an octet is 8 bits (one byte), giving a range of 0 to 255 inclusive. It is an error to pass a value outside this range to WRITE-BYTE. It is an error to use a binary port operation on a port for which BINARY-PORT? returns #f.

For implementations that use ASCII or any of the single byte encodings (e.g. ISO-8859-*) as the native character encoding, don't change the integer value of the characters from the native octet value, and don't distinguish between binary and character ports, these new procedures could be defined as follows:

    (define (read-byte . opt)
      (let ((c (apply read-char opt)))
        (if (eof-object? c) c (char->integer c))))
    (define (write-byte int . opt)
      (apply write-char (integer->char int) opt))
    (define (peek-byte . opt)
      (let ((c (apply peek-char opt)))
        (if (eof-object? c) c (char->integer c))))
    (define byte-ready? char-ready?)

Schemes that use multi-byte encodings or don't handle arbitrary octets in I/O ports will have to define these as primitives.

Note that CHAR-READY? should only return #t if a full character value is available. If the beginning of a valid multiple octet sequence is found but no additional octets are in the input port, then #f is returned. BYTE-READY? can be used if you only wish to test the availability of any data regardless of character validity.

Library Procedures

The above extensions are sufficient to handle all forms of binary I/O; however, they are very low-level. We also provide the following library procedures, which can be defined in terms of the above, although Schemes concerned about efficiency will probably wish to implement them at a lower level.

Procedures are described below with their parameter lists. Parameters in [ brackets ] are optional and may be omitted or passed a value of #f to revert to the default value. The default value of an input port is always the result of (current-input-port) and of an output port is (current-output-port).
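
The optional-parameter convention described above can be sketched as follows. This is only an illustration: OPTIONAL-ARG and SHOW-ARGS are hypothetical names that do not appear in this SRFI, and the symbol 'little-endian merely stands in for whatever default-endian would report.

```scheme
;; Hypothetical helper, not part of this SRFI: return the first optional
;; argument unless it is missing or #f, in which case return DEFAULT.
(define (optional-arg opt default)
  (if (and (pair? opt) (car opt))
      (car opt)
      default))

;; Hypothetical stand-in for a read procedure taking [port] [endian].
;; It only reports which port and endian would be used.
(define (show-args . opt)
  (let ((port   (optional-arg opt (current-input-port)))
        (endian (optional-arg (if (pair? opt) (cdr opt) '())
                              'little-endian))) ; assumed default
    (list port endian)))
```

With these definitions, (show-args #f 'big-endian) selects the current input port together with an explicit endianness, which is the case the #f convention exists for.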

Endianness

Most of the procedures below accept an optional ENDIAN parameter, which is a symbol defined to be either 'big-endian or 'little-endian. This interface allows for future addition of endian types such as 'middle-endian-3412 where needed, though this SRFI does not define them.

When not given, the ENDIAN parameter defaults to the appropriate value for the current system's architecture. This value can be queried with the procedure:

default-endian
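
To make the two byte orders concrete, here is a minimal sketch of how the same octets combine into different integers. OCTETS->UINT is a hypothetical helper, not one of this SRFI's procedures, and it works on a list of octets rather than a port so that it stays self-contained.

```scheme
;; Hypothetical helper: combine a list of octets, given in the order they
;; appear in the stream, into an unsigned integer according to ENDIAN.
(define (octets->uint octets endian)
  ;; Big-endian streams put the most significant octet first, so we can
  ;; accumulate left to right.  For little-endian we reverse first.
  (let loop ((ls (if (eq? endian 'little-endian) (reverse octets) octets))
             (acc 0))
    (if (null? ls)
        acc
        (loop (cdr ls) (+ (* acc 256) (car ls))))))

(octets->uint '(#x12 #x34) 'big-endian)     ; => 4660  (#x1234)
(octets->uint '(#x12 #x34) 'little-endian)  ; => 13330 (#x3412)
```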

General Reading

read-binary-uint size [port] [endian]

Read an unsigned integer of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture). If fewer than SIZE octets are available in the port, return the eof-object.

read-binary-sint size [port] [endian]

Read a signed integer in two's complement form of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture).
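
The two's complement interpretation can be sketched as a conversion from the unsigned value of the same octets. UINT->SINT is a hypothetical helper, not part of this SRFI, and this simple version assumes the implementation can represent 2^(8*SIZE) exactly, so it is not suitable as-is for Schemes with a limited numeric range.

```scheme
;; Hypothetical helper: reinterpret the unsigned value U, read from SIZE
;; octets, as a two's complement signed integer.
(define (uint->sint size u)
  (let ((limit (expt 2 (* 8 size))))   ; one past the largest unsigned value
    (if (>= u (quotient limit 2))      ; high bit set: value is negative
        (- u limit)
        u)))

(uint->sint 1 255)     ; => -1
(uint->sint 2 #x8000)  ; => -32768
(uint->sint 2 #x7FFF)  ; => 32767
```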

Schemes are not required to support the full numeric tower, and in particular if they do not support bignums they are unlikely to be able to provide the full range of machine integer values. In this case care should be taken that when reading values, if the final result fits within the implementation's supported range the value should be read properly. In particular, small negative values should be supported, even though they may first be interpreted as large positive values before two's complement conversion.

If the resulting integer would not be supported by the Scheme's numeric range then the result should be the same as when an arithmetic operation produces a result outside the supported range, such as signalling an error or causing overflow.

Schemes that choose to use optimization strategies that limit their numeric range would be free to provide read procedures returning disjoint types. For instance, Bigloo could provide a read-binary-elong procedure to read an elong object (a Bigloo hardware integer).

Predefined Read Sizes

We provide the following predefined read sizes. Although the reference implementation defines them in terms of the general read-binary-uint above, significant performance gains are possible if you hand code them to the appropriate size.

read-binary-uint8 [port] [endian]
read-binary-uint16 [port] [endian]
read-binary-uint32 [port] [endian]
read-binary-uint64 [port] [endian]

Read and return an unsigned binary integer as in read-binary-uint, using the corresponding numeric suffix as the number of bits (i.e. 8x the value of SIZE for read-binary-uint).

read-binary-sint8 [port] [endian]
read-binary-sint16 [port] [endian]
read-binary-sint32 [port] [endian]
read-binary-sint64 [port] [endian]

Read and return a signed binary integer as in read-binary-sint, using the corresponding numeric suffix as the number of bits.

General Writing

write-binary-uint size int [port] [endian]

Write unsigned integer INT of SIZE octets to PORT (default current-output-port) with endianness ENDIAN (default to that of the local architecture).

write-binary-sint size int [port] [endian]

Write signed integer INT of SIZE octets to PORT (default current-output-port) with endianness ENDIAN (default to that of the local architecture) in two's complement form.

Predefined Write Sizes

write-binary-uint8 int [port] [endian]
write-binary-uint16 int [port] [endian]
write-binary-uint32 int [port] [endian]
write-binary-uint64 int [port] [endian]

Write an unsigned binary integer as in write-binary-uint, using the corresponding numeric suffix as the number of bits.

write-binary-sint8 int [port] [endian]
write-binary-sint16 int [port] [endian]
write-binary-sint32 int [port] [endian]
write-binary-sint64 int [port] [endian]

Write a signed binary integer as in write-binary-sint, using the corresponding numeric suffix as the number of bits.

It is an error to pass an integer which does not fit within SIZE bytes to any of the write procedures.
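
As a rough illustration of the octets the write procedures would emit, and of the range restriction just stated, the following sketch produces a list of octets instead of writing to a port. UINT->OCTETS and SINT->OCTETS are hypothetical helpers, and returning #f for an out-of-range integer is an assumption made for this example only; the SRFI says merely that such a call is an error.

```scheme
;; Hypothetical helper: the SIZE octets of unsigned INT in ENDIAN order,
;; or #f if INT does not fit (an assumption of this sketch).
(define (uint->octets size int endian)
  (if (or (< int 0) (>= int (expt 2 (* 8 size))))
      #f
      ;; Peel off the least significant octet each step; consing leaves
      ;; the most significant octet first, i.e. big-endian order.
      (let loop ((i 0) (n int) (acc '()))
        (if (= i size)
            (if (eq? endian 'little-endian) (reverse acc) acc)
            (loop (+ i 1) (quotient n 256) (cons (modulo n 256) acc))))))

;; Hypothetical helper: two's complement form of signed INT.
(define (sint->octets size int endian)
  (let ((half (expt 2 (- (* 8 size) 1))))
    (if (or (< int (- half)) (>= int half))
        #f
        (uint->octets size (if (< int 0) (+ int (* 2 half)) int) endian))))

(uint->octets 2 #x1234 'big-endian)     ; => (18 52)
(uint->octets 2 #x1234 'little-endian)  ; => (52 18)
(sint->octets 2 -2 'big-endian)         ; => (255 254)
(uint->octets 1 256 'big-endian)        ; => #f
```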

Predefined Network Encodings

For portability between different architectures it can be useful to use the standard "network" byte encoding (big-endian). On big-endian architectures these can simply be aliases for the general versions above.

read-network-uint16 [port]
read-network-uint32 [port]
read-network-uint64 [port]

read-network-sint16 [port]
read-network-sint32 [port]
read-network-sint64 [port]

write-network-uint16 int [port]
write-network-uint32 int [port]
write-network-uint64 int [port]

write-network-sint16 int [port]
write-network-sint32 int [port]
write-network-sint64 int [port]

Bignum Encodings

Since Schemes may support unlimited size bignums it is useful to support the binary encoding of such values.

A BER (Basic Encoding Rules from X.690) compressed integer is an unsigned integer in base 128, most significant digit first, where the high bit is set on all but the final (least significant) byte. Thus any size integer can be encoded, but the encoding is efficient and small integers don't take up any more space than they would in normal char/short/int encodings. This is commonly used to encode an unlimited length field, and can form the basis for other variable length encodings.

Examples of integers converted to BER byte sequences:

            3 => #x03
          555 => #x84 #x2B
    123456789 => #xBA #xEF #x9A #x15

read-ber-integer [port]

Reads and returns an exact integer, or the eof-object if no bytes without the high bit set (i.e. less than 128) are found.

write-ber-integer int [port]

Writes INT to the specified output port in BER format. It is an error if INT is not a positive integer.
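
The base-128 encoding described above can be sketched on lists of octets as follows. INTEGER->BER-OCTETS and BER-OCTETS->INTEGER are hypothetical helpers rather than procedures of this SRFI, and the decoder returns #f where read-ber-integer would return the eof-object.

```scheme
;; Hypothetical helper: BER compressed integer octets for a non-negative
;; exact integer, most significant base-128 digit first.
(define (integer->ber-octets int)
  (let loop ((n (quotient int 128))
             (acc (list (modulo int 128))))   ; final octet: high bit clear
    (if (= n 0)
        acc
        (loop (quotient n 128)
              (cons (+ 128 (modulo n 128)) acc)))))  ; set the high bit

;; Hypothetical helper: decode a list of octets, stopping at the first
;; octet below 128.  Returns #f if no such octet is found.
(define (ber-octets->integer octets)
  (let loop ((ls octets) (acc 0))
    (cond ((null? ls) #f)
          ((< (car ls) 128) (+ (* acc 128) (car ls)))
          (else (loop (cdr ls) (+ (* acc 128) (- (car ls) 128)))))))

(integer->ber-octets 555)          ; => (132 43), i.e. #x84 #x2B
(ber-octets->integer '(#x84 #x2B)) ; => 555
```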

IEEE Floating Point Encodings

Floating point binary formats are much more complicated than simple two's complement integer formats, typically divided into a sign bit, exponent field and mantissa field, optionally using a hidden bit and different rounding behavior. Because of this we do not define general purpose floating point operations but simply provide the most common formats, IEEE-754 single and double precision floats.

On some architectures floating point is handled by a separate co-processor and is not guaranteed to use the same endian as integer values. We therefore use a separate default endian for floating point numbers.

default-float-endian

Returns the default endianness used for floating point procedures as a symbol, using the same symbol names as above for integer endians.

read-ieee-float32 [port] [endian]
read-ieee-float64 [port] [endian]

Reads an IEEE float, single or double precision respectively, from PORT in the given ENDIAN, and returns the corresponding inexact real value, or the eof-object if insufficient data is present.

If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may return these values for the IEEE defined bit patterns on read-ieee-float.

write-ieee-float32 real [port] [endian]
write-ieee-float64 real [port] [endian]

Write REAL to PORT in the given ENDIAN using IEEE floating point representation, single or double precision respectively. It is an error if REAL is not a real value.

If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may accept these values for REAL and write the corresponding IEEE defined bit patterns.
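
For readers unfamiliar with the single precision layout, the following is a minimal sketch of decoding a 32-bit pattern that has already been assembled into an unsigned integer. FLOAT32-BITS->REAL is a hypothetical helper, not the reference implementation's method, and returning #f for the Infinity and NaN patterns is an assumption of this sketch, since support for those values is implementation-dependent.

```scheme
;; Hypothetical helper: interpret the unsigned 32-bit integer BITS as an
;; IEEE-754 single precision value and return an inexact real.
(define (float32-bits->real bits)
  (let* ((sign     (if (>= bits 2147483648) -1.0 1.0))   ; bit 31
         (exponent (modulo (quotient bits 8388608) 256)) ; bits 30..23
         (mantissa (modulo bits 8388608)))               ; bits 22..0
    (cond ((= exponent 255) #f)    ; Infinity or NaN: not handled here
          ((= exponent 0)          ; zero or denormal: no hidden bit
           (* sign (/ mantissa 8388608.0) (expt 2.0 -126)))
          (else                    ; normal: hidden leading 1 bit
           (* sign
              (+ 1.0 (/ mantissa 8388608.0))
              (expt 2.0 (- exponent 127)))))))

(float32-bits->real #x3F800000)  ; => 1.0
(float32-bits->real #xC0000000)  ; => -2.0
```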

Implementation

The reference implementation is available at

    http://srfi.schemers.org/srfi-56/srfi-56.scm

and has been placed under the standard SRFI license.

A corresponding test suite can be found at

    http://srfi.schemers.org/srfi-56/srfi-56-test.scm
    http://srfi.schemers.org/srfi-56/srfi-56-test.dat

The reference implementation has been tested with the following Schemes: Bigloo, Chez, Chicken, Gambit, Gauche, Guile, Kawa, KSI, MIT-Scheme, MzScheme, RScheme, Scheme48, SISC and Stklos. The *-float64 code turns out to be a very rigorous stress test for an implementation's numeric code. At the time of writing, Chicken 2.0 (with the optional numbers egg), KSI 3.4.2 and MzScheme (both 200 and 299 versions) are the only implementations to pass all tests. Petite Chez 6.0a is the next most complete, failing 6 tests, followed by Gambit4b14, failing 8. Any Scheme that implements floating point numbers internally as C floats rather than doubles will be fundamentally unable to pass all *-float64 tests.

The reference implementation uses only portable R5RS procedures and should work unmodified in any compliant Scheme. The API for a subset of SRFI-60 bitwise procedures was used, but a portable implementation of these procedures is included in the source itself, so the Scheme need not support SRFI-60 natively.

Care has been taken that intermediate values remain smaller than the final result, so that Schemes with limited numeric ranges will still read and write properly the values they do support.

The default endian for both integers and floating point numbers is set to 'little-endian, which is correct for x86 platforms. Most other architectures use 'big-endian, and the default will need to be changed accordingly on them.

Optimization

The fastest implementations will of course be native (C or otherwise compiled), especially for the floating point operations. However, because it is fairly extensive, as well as tested and portable, many Schemes will choose to use some or all of the reference implementation directly. In this case the following optimizations can be made:

  1. Use native equivalents of the SRFI-60 bitwise operators instead of the portable versions. At the very least the SLIB portable versions are likely to be better optimized.

  2. Use native equivalents of call-with-mantissa&exponent, such as decode-float from Chez Scheme and Gauche.

  3. Make read-byte and write-byte native.

  4. Drop the asserts.

  5. Specialize the predefined size procedures rather than define them in terms of the more general operations.

  6. Check the high bytes to determine if the end result fits within the Scheme's native fixnum range, and use specialized fixnum operations in that case. See the code in Oleg's TIFF library OLEG1.

I have not done any benchmarking but I suspect that in most cases the bottleneck is likely to be I/O rather than CPU, and extensive optimization (beyond 1 and 2 above) may not be worth the effort.

Acknowledgements

I would like to thank all those who have contributed to the design and discussion of this SRFI, both on list and off, including Per Bothner, Thomas Bushnell, Ray Dillinger, Sebastian Egner, Dale Jordan, Shiro Kawai, Oleg Kiselyov, Dave Mason, Hans Oesterholt-Dijkema, David Rush, Bradd W. Szonye and Felix Winkelmann. A special thanks goes to David Van Horn, the editor of this SRFI.

This is not to imply that these individuals necessarily endorse the final results, of course.

References

R5RS
      R. Kelsey, W. Clinger, J. Rees (eds.), Revised^5 Report on the
      Algorithmic Language Scheme, Higher-Order and Symbolic
      Computation, 11(1), September, 1998 and ACM SIGPLAN Notices,
      33(9), October, 1998.
      http://www.schemers.org/Documents/Standards/R5RS/.
CommonLisp
      Common Lisp: the Language
      Guy L. Steele Jr. (editor).
      Digital Press, Maynard, Mass., second edition 1990.
      http://www.elwood.com/alu/table/references.htm#cltl2.
      http://www.lispworks.com/documentation/HyperSpec/Front/index.htm.
ISO-C
      ISO Standard C ISO/IEC 9899:1999
      http://www.sics.se/~pd/ISO-C-FDIS.1999-04.pdf.
HOLY
      ON HOLY WARS AND A PLEA FOR PEACE
      Danny Cohen, IEN 137, April 1980.
      http://www.networksorcery.com/enp/ien/ien137.txt.
      http://www.isi.edu/in-notes/ien/ien137.txt.
IEEE-754
      Various IEEE-754 references and a calculator in JavaScript.
      http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html.
X.690
      ASN.1 encoding rules: Specification of Basic Encoding Rules
      (BER), Canonical Encoding Rules (CER) and Distinguished Encoding
      Rules (DER), February, 2002.
      http://www.itu.int/ITU-T/studygroups/com17/languages/.
      http://luca.ntop.org/Teaching/Appunti/asn1.html.
OLEG1
      Various binary parsing utilities for Scheme.
      http://okmij.org/ftp/Scheme/binary-io.html.
OLEG2
      Oleg Kiselyov, Reading IEEE binary floats in R5RS Scheme.
      Article from comp.lang.scheme, on 8 March, 2000,
      Message-ID: <8a4h56$oqu$1@nnrp1.deja.com>.
      http://okmij.org/ftp/Scheme/reading-IEEE-floats.txt.

Copyright

Copyright (C) Alex Shinn (2005). All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: David Van Horn
Last modified: Mon Oct 31 18:48:10 EST 2005