Binary I/O
Alex Shinn
This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-56@srfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
This SRFI extends Scheme with procedures to read and write binary data to and from ports, including utility procedures for writing various integer and floating point values in both big and little endian formats. Predicates are provided to test if binary I/O is allowed on a port, along with new procedures for creating such ports.
R5RS implicitly provides access only to character I/O ports, with no portable means of reading or writing binary data. Such access is a prerequisite for handling binary data formats, implementing databases, and creating encoding conversion libraries, among other uses typically required of programming languages.
binary-port? character-port? open-binary-input-file open-binary-output-file call-with-binary-input-file call-with-binary-output-file with-input-from-binary-file with-output-to-binary-file
read-byte write-byte peek-byte byte-ready?
default-endian default-float-endian
read-binary-uint read-binary-sint read-binary-uint8 read-binary-uint16 read-binary-uint32 read-binary-uint64 read-binary-sint8 read-binary-sint16 read-binary-sint32 read-binary-sint64
write-binary-uint write-binary-sint write-binary-uint8 write-binary-uint16 write-binary-uint32 write-binary-uint64 write-binary-sint8 write-binary-sint16 write-binary-sint32 write-binary-sint64
read-network-uint16 read-network-uint32 read-network-uint64 read-network-sint16 read-network-sint32 read-network-sint64 write-network-uint16 write-network-uint32 write-network-uint64 write-network-sint16 write-network-sint32 write-network-sint64
read-ber-integer write-ber-integer
read-ieee-float32 read-ieee-float64 write-ieee-float32 write-ieee-float64
We extend Scheme with the following two predicates to test for allowed operations on a port:
binary-port? obj
character-port? obj
These predicates return #t if OBJ allows binary or character port operations respectively, and #f otherwise. Much like INPUT-PORT? and OUTPUT-PORT? these predicates are not necessarily disjoint.
Character port operations are the input and output operations specified in R5RS: READ, READ-CHAR, PEEK-CHAR, CHAR-READY?, WRITE, DISPLAY, NEWLINE and WRITE-CHAR, plus library procedures that can be defined in terms of these. It is an error to use a character port operation on a port for which CHARACTER-PORT? returns #f.
Existing R5RS procedures that instantiate ports are implicitly character ports, including OPEN-INPUT-FILE, OPEN-OUTPUT-FILE, CALL-WITH-INPUT-FILE, CALL-WITH-OUTPUT-FILE, WITH-INPUT-FROM-FILE, WITH-OUTPUT-TO-FILE, and extensions thereof.
The following six new analogous procedures may be used to instantiate ports for which BINARY-PORT? returns #t:
open-binary-input-file path
open-binary-output-file path
call-with-binary-input-file path proc
call-with-binary-output-file path proc
with-input-from-binary-file path thunk
with-output-to-binary-file path thunk
Even if an implementation makes no distinction between binary and character ports, it is recommended that you use one of the above forms when exclusively using binary operations on a port, both for portability and to document intent more clearly.
Assuming no file-system errors, the following hold:
(call-with-input-file <file> character-port?) => #t
(call-with-input-file <file> binary-port?) => unspecified
(call-with-binary-input-file <file> character-port?) => unspecified
(call-with-binary-input-file <file> binary-port?) => #t
Both binary and character ports may be input and/or output ports, so the existing CLOSE-INPUT-PORT and CLOSE-OUTPUT-PORT work as expected on all ports.
Binary port operations are defined in terms of the following four new procedures:
read-byte [port]
write-byte int [port]
peek-byte [port]
byte-ready? [port]
These behave similarly to their R5RS -CHAR analogs except that they take and return integer values representing a single octet from the port. Specifically, an octet is 8 bits (one byte), with a resulting range of [0-255]. It is an error to pass a value outside this range to WRITE-BYTE. It is an error to use a binary port operation on a port for which BINARY-PORT? returns #f.
For implementations that use ASCII or any of the single byte encodings (e.g. ISO-8859-*) as the native character encoding, don't change the integer value of the characters from the native octet value, and don't distinguish between binary and character ports, these new procedures could be defined as follows:
(define (read-byte . opt) (let ((c (apply read-char opt))) (if (eof-object? c) c (char->integer c))))
(define (write-byte int . opt) (apply write-char (integer->char int) opt))
(define (peek-byte . opt) (let ((c (apply peek-char opt))) (if (eof-object? c) c (char->integer c))))
(define byte-ready? char-ready?)
Schemes that use multi-byte encodings or don't handle arbitrary octets in I/O ports will have to define these as primitives.
Note that CHAR-READY? should only return #t if a full character value is available. If the beginning of a valid multiple octet sequence is found but no additional octets are in the input port, then #f is returned. BYTE-READY? can be used if you only wish to test the availability of any data regardless of character validity.
The above extensions are sufficient to handle all forms of binary I/O; however, they are very low-level. We also provide the following library procedures, which can be defined in terms of the above, although Schemes concerned about efficiency will probably wish to implement them at a lower level.
Procedures are described below with their parameter lists. Parameters in [ brackets ] are optional and may be omitted or passed a value of #f to revert to the default value. The default value of an input port is always the result of (current-input-port) and of an output port is (current-output-port).
Most of the procedures below accept an optional ENDIAN parameter, which is a symbol defined to be either 'big-endian or 'little-endian. This interface allows for future addition of endian types such as 'middle-endian-3412 where needed, though this SRFI does not define them.
When not given, the ENDIAN parameter defaults to the appropriate value for the current system's architecture. This value can be queried with the procedure DEFAULT-ENDIAN, which returns it as one of the symbols above.
read-binary-uint size [port] [endian]
Read an unsigned integer of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture). If fewer than SIZE octets are available in the port, return the eof-object.
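As a rough illustration of how SIZE octets combine into one value under each endianness, here is a minimal sketch in portable Scheme. The helper OCTETS->UINT is hypothetical and not part of this SRFI; it works on a list of octets in the order they would appear on the port, rather than on a port itself.

```scheme
;; Hypothetical helper (not defined by this SRFI): combine a list of
;; octets, given in port order, into an unsigned integer.
(define (octets->uint octets endian)
  ;; For little-endian data the first octet is least significant,
  ;; so reverse the list to process the most significant octet first.
  (let loop ((bytes (if (eq? endian 'little-endian)
                        (reverse octets)
                        octets))
             (acc 0))
    (if (null? bytes)
        acc
        ;; Shift the accumulator left by one octet and add the next one.
        (loop (cdr bytes) (+ (* acc 256) (car bytes))))))

(octets->uint '(1 2) 'big-endian)     ; => 258 (#x0102)
(octets->uint '(1 2) 'little-endian)  ; => 513 (#x0201)
```

Accumulating from the most significant octet keeps every intermediate value no larger than the final result.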
read-binary-sint size [port] [endian]
Read a signed integer in two's complement form of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture).
Schemes are not required to support the full numeric tower, and in particular if they do not support bignums they are unlikely to be able to provide the full range of machine integer values. In this case care should be taken that when reading values, if the final result fits within the implementation's supported range the value should be read properly. In particular, small negative values should be supported, even though they may first be interpreted as large positive values before two's complement conversion.
If the resulting integer would not be supported by the Scheme's numeric range then the result should be the same as when an arithmetic operation produces a result outside the supported range, such as signalling an error or causing overflow.
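The two's complement conversion mentioned above can be sketched as follows. UINT->SINT is a hypothetical helper, not part of this SRFI, and this simple version assumes the implementation can represent 2^(8*SIZE) exactly; a Scheme with a limited numeric range would need a more careful strategy that never forms the large positive intermediate value.

```scheme
;; Hypothetical helper (not defined by this SRFI): reinterpret the
;; unsigned value U, read from SIZE octets, as a two's complement
;; signed integer.
(define (uint->sint u size)
  (let ((half (expt 2 (- (* 8 size) 1))))  ; 2^(bits - 1)
    (if (< u half)
        u                    ; high bit clear: value is non-negative
        (- u (* 2 half)))))  ; high bit set: subtract 2^bits

(uint->sint 127 1)    ; => 127
(uint->sint 255 1)    ; => -1
(uint->sint 65534 2)  ; => -2
(uint->sint 32768 2)  ; => -32768
```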
Schemes that choose to use optimization strategies that limit their numeric range would be free to provide read procedures returning disjoint types. For instance, Bigloo could provide a read-binary-elong procedure to read an elong object (a Bigloo hardware integer).
We provide the following predefined read sizes. Although the reference implementation defines them in terms of the general read-binary-uint above, significant performance gains are possible if you hand code them to the appropriate size.
read-binary-uint8 [port] [endian]
read-binary-uint16 [port] [endian]
read-binary-uint32 [port] [endian]
read-binary-uint64 [port] [endian]
Read and return an unsigned binary integer as in read-binary-uint, using the corresponding numeric suffix as the number of bits (i.e. 8x the value of SIZE for read-binary-uint).
read-binary-sint8 [port] [endian]
read-binary-sint16 [port] [endian]
read-binary-sint32 [port] [endian]
read-binary-sint64 [port] [endian]
Read and return a signed binary integer as in read-binary-sint, using the corresponding numeric suffix as the number of bits.
write-binary-uint size int [port] [endian]
Write unsigned integer INT of SIZE octets to PORT (default current-output-port) with endianness ENDIAN (default to that of the local architecture).
write-binary-sint size int [port] [endian]
Write signed integer INT of SIZE octets to PORT (default current-output-port) with endianness ENDIAN (default to that of the local architecture) in two's complement form.
write-binary-uint8 int [port] [endian]
write-binary-uint16 int [port] [endian]
write-binary-uint32 int [port] [endian]
write-binary-uint64 int [port] [endian]
Write an unsigned binary integer as in write-binary-uint, using the corresponding numeric suffix as the number of bits.
write-binary-sint8 int [port] [endian]
write-binary-sint16 int [port] [endian]
write-binary-sint32 int [port] [endian]
write-binary-sint64 int [port] [endian]
Write a signed binary integer as in write-binary-sint, using the corresponding numeric suffix as the number of bits.
It is an error to pass an integer which does not fit within SIZE bytes to any of the write procedures.
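The octet sequences produced by the write procedures, including the range restriction, can be illustrated with a small sketch. UINT->OCTETS and SINT->OCTETS are hypothetical helpers, not part of this SRFI; they return a list of octets in the order they would be written, and return #f for an out-of-range INT where a real implementation would treat the call as an error.

```scheme
;; Hypothetical helpers (not defined by this SRFI).
;; Split unsigned INT into SIZE octets in the order they would be written.
(define (uint->octets int size endian)
  (if (or (< int 0) (>= int (expt 256 size)))
      #f  ; does not fit in SIZE octets; an error per the text above
      (let loop ((n int) (count size) (acc '()))
        (if (= count 0)
            ;; ACC is most-significant-first, i.e. big-endian order.
            (if (eq? endian 'little-endian) (reverse acc) acc)
            (loop (quotient n 256)
                  (- count 1)
                  (cons (remainder n 256) acc))))))

;; Signed variant: map negative values to their two's complement form.
(define (sint->octets int size endian)
  (let ((half (quotient (expt 256 size) 2)))
    (if (or (< int (- half)) (>= int half))
        #f
        (uint->octets (if (< int 0) (+ int (* 2 half)) int)
                      size
                      endian))))

(uint->octets 258 2 'big-endian)     ; => (1 2)
(uint->octets 258 4 'little-endian)  ; => (2 1 0 0)
(sint->octets -2 2 'little-endian)   ; => (254 255)
(uint->octets 256 1 'big-endian)     ; => #f
```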
For portability between different architectures it can be useful to use the standard "network" byte encoding (big-endian). On big-endian architectures these can simply be aliases for the general versions above.
read-network-uint16 [port]
read-network-uint32 [port]
read-network-uint64 [port]
read-network-sint16 [port]
read-network-sint32 [port]
read-network-sint64 [port]
write-network-uint16 int [port]
write-network-uint32 int [port]
write-network-uint64 int [port]
write-network-sint16 int [port]
write-network-sint32 int [port]
write-network-sint64 int [port]
Since Schemes may support unlimited size bignums it is useful to support the binary encoding of such values.
A BER (Basic Encoding Rules from X.690) compressed integer is an unsigned integer in base 128, most significant digit first, where the high bit is set on all but the final (least significant) byte. Thus any size integer can be encoded, but the encoding is efficient and small integers don't take up any more space than they would in normal char/short/int encodings. This is commonly used to encode an unlimited length field, and can form the basis for other variable length encodings.
Examples of integers converted to BER byte sequences:
3 => #x03
555 => #x84 #x2B
123456789 => #xBA #xEF #x9A #x15
read-ber-integer [port]
Reads and returns an exact integer, or the eof-object if no bytes without the high bit set (i.e. less than 128) are found.
write-ber-integer int [port]
Writes INT to the specified output port in BER format. It is an error if INT is not a positive integer.
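The base-128 encoding described above can be sketched on lists of octets. INTEGER->BER-OCTETS and BER-OCTETS->INTEGER are hypothetical helpers, not part of this SRFI; the decoder returns #f where the port-based procedure would return the eof-object.

```scheme
;; Hypothetical helpers (not defined by this SRFI).
;; Encode a non-negative exact integer as BER compressed-integer octets.
(define (integer->ber-octets int)
  ;; The final (least significant) digit has its high bit clear.
  (let loop ((rest (quotient int 128))
             (acc (list (remainder int 128))))
    (if (= rest 0)
        acc
        ;; Every earlier digit has its high bit set (add 128).
        (loop (quotient rest 128)
              (cons (+ 128 (remainder rest 128)) acc)))))

;; Decode a list of octets; #f stands in for the eof-object when no
;; terminating octet (value below 128) is found.
(define (ber-octets->integer octets)
  (let loop ((bytes octets) (acc 0))
    (cond ((null? bytes) #f)
          ((< (car bytes) 128)
           (+ (* acc 128) (car bytes)))
          (else
           (loop (cdr bytes)
                 (+ (* acc 128) (- (car bytes) 128)))))))

(integer->ber-octets 3)          ; => (3)
(integer->ber-octets 555)        ; => (132 43), i.e. #x84 #x2B
(integer->ber-octets 123456789)  ; => (186 239 154 21), i.e. #xBA #xEF #x9A #x15
(ber-octets->integer '(132 43))  ; => 555
```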
Floating point binary formats are much more complicated than simple two's complement integer formats, typically divided into a sign bit, exponent field and mantissa field, optionally using a hidden bit and different rounding behavior. Because of this we do not define general purpose floating point operations but simply provide the most common formats, IEEE-754 single and double precision floats.
On some architectures floating point is handled by a separate co-processor and is not guaranteed to use the same endian as integer values. We therefore use a separate default endian for floating point numbers.
default-float-endian
Returns the default endianness used for floating point procedures as a symbol, using the same symbol names as above for integer endians.
read-ieee-float32 [port] [endian]
read-ieee-float64 [port] [endian]
Reads an IEEE float, single or double precision respectively, from PORT in the given ENDIAN, and returns the corresponding inexact real value, or the eof-object if insufficient data is present.
If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may return these values for the IEEE defined bit patterns on read-ieee-float.
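To make the sign, exponent and mantissa layout concrete, here is a minimal sketch that decodes an IEEE-754 single precision value from four octets in big-endian order. FLOAT32-OCTETS->REAL is a hypothetical helper, not part of this SRFI and not the reference implementation's algorithm. It assumes the Scheme can represent integers up to 2^32, and it returns #f for the Infinity and NaN bit patterns rather than choosing a representation for them.

```scheme
;; Hypothetical helper (not defined by this SRFI): decode four
;; big-endian octets as an IEEE-754 single precision float.
(define (float32-octets->real octets)
  (let* ((bits (+ (* (car octets) 16777216)   ; 2^24
                  (* (cadr octets) 65536)     ; 2^16
                  (* (caddr octets) 256)      ; 2^8
                  (cadddr octets)))
         ;; Bit 31 is the sign, bits 30-23 the biased exponent,
         ;; bits 22-0 the mantissa (fraction) field.
         (sign (if (>= bits 2147483648) -1 1))
         (exponent (quotient (remainder bits 2147483648) 8388608))
         (mantissa (remainder bits 8388608)))
    (cond ((= exponent 255) #f)  ; Infinity or NaN: left unhandled here
          ((= exponent 0)        ; zero or subnormal: no hidden bit
           (* sign (exact->inexact (* mantissa (expt 2 -149)))))
          (else                  ; normal: add hidden bit, remove bias
           (* sign
              (exact->inexact
               (* (+ mantissa 8388608)
                  (expt 2 (- exponent 150)))))))))

(float32-octets->real '(63 128 0 0))  ; => 1.0  (#x3F800000)
(float32-octets->real '(192 32 0 0))  ; => -2.5 (#xC0200000)
```

The exponent offset of 150 is the bias of 127 plus the 23 mantissa bits, since the mantissa is treated here as an integer rather than a fraction.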
write-ieee-float32 real [port] [endian]
write-ieee-float64 real [port] [endian]
Write REAL to PORT in the given ENDIAN using IEEE floating point representation, single or double precision respectively. It is an error if REAL is not a real value.
If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may accept these values for REAL and write the corresponding IEEE defined bit patterns.
The reference implementation is available at
http://srfi.schemers.org/srfi-56/srfi-56.scm
and has been placed under the standard SRFI license.
A corresponding test suite can be found at
http://srfi.schemers.org/srfi-56/srfi-56-test.scm
http://srfi.schemers.org/srfi-56/srfi-56-test.dat
The reference implementation has been tested with the following Schemes: Bigloo, Chez, Chicken, Gambit, Gauche, Guile, Kawa, KSI, MIT-Scheme, MzScheme, RScheme, Scheme48, SISC and Stklos. The *-float64 code turns out to be a very rigorous stress test for an implementation's numeric code. At the time of writing, Chicken 2.0 (with the optional numbers egg), KSI 3.4.2 and MzScheme (both 200 and 299 versions) are the only implementations to pass all tests. Petite Chez 6.0a is the next most complete, failing 6 tests, followed by Gambit4b14, failing 8. Any Scheme that implements floating point numbers internally as C floats rather than doubles will be fundamentally unable to pass all *-float64 tests.
The reference implementation uses only portable R5RS procedures and should work unmodified in any compliant Scheme. The API for a subset of SRFI-60 bitwise procedures was used, but a portable implementation of these procedures is included in the source itself, so the Scheme need not support SRFI-60 natively.
Care has been taken that intermediate values remain smaller than the final result, so that Schemes with limited numeric ranges will still read and write properly the values they do support.
The default endian for both integers and floating point numbers is set to 'little-endian, which is correct for x86 platforms. Most other architectures use 'big-endian and will need to be changed accordingly.
The fastest implementations will of course be native (C or otherwise compiled), especially for the floating point operations. However, because it is fairly extensive, as well as tested and portable, many Schemes will choose to use some or all of the reference implementation directly. In this case the following optimizations can be made:
I have not done any benchmarking but I suspect that in most cases the bottleneck is likely to be I/O rather than CPU, and extensive optimization (beyond 1 and 2 above) may not be worth the effort.
I would like to thank all those who have contributed to the design and discussion of this SRFI, both on list and off, including Per Bothner, Thomas Bushnell, Ray Dillinger, Sebastian Egner, Dale Jordan, Shiro Kawai, Oleg Kiselyov, Dave Mason, Hans Oesterholt-Dijkema, David Rush, Bradd W. Szonye and Felix Winkelmann. A special thanks goes to David Van Horn, the editor of this SRFI.
This is not to imply that these individuals necessarily endorse the final results, of course.
R. Kelsey, W. Clinger, J. Rees (eds.), Revised^5 Report on the Algorithmic Language Scheme, Higher-Order and Symbolic Computation, 11(1), September, 1998 and ACM SIGPLAN Notices, 33(9), October, 1998. http://www.schemers.org/Documents/Standards/R5RS/.
Common Lisp: the Language Guy L. Steele Jr. (editor). Digital Press, Maynard, Mass., second edition 1990. http://www.elwood.com/alu/table/references.htm#cltl2. http://www.lispworks.com/documentation/HyperSpec/Front/index.htm.
ISO Standard C ISO/IEC 9899:1999 http://www.sics.se/~pd/ISO-C-FDIS.1999-04.pdf.
ON HOLY WARS AND A PLEA FOR PEACE Danny Cohen, IEN 137, April 1980. http://www.networksorcery.com/enp/ien/ien137.txt. http://www.isi.edu/in-notes/ien/ien137.txt.
Various IEEE-754 references and a calculator in JavaScript. http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html.
ASN.1 encoding rules: Specification of Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and Distinguished Encoding Rules (DER), February, 2002. http://www.itu.int/ITU-T/studygroups/com17/languages/. http://luca.ntop.org/Teaching/Appunti/asn1.html.
Various binary parsing utilities for Scheme. http://okmij.org/ftp/Scheme/binary-io.html.
Oleg Kiselyov, Reading IEEE binary floats in R5RS Scheme. Article from comp.lang.scheme, on 8 March, 2000, Message-ID: <8a4h56$oqu$1@nnrp1.deja.com>. http://okmij.org/ftp/Scheme/reading-IEEE-floats.txt.
Copyright (C) Alex Shinn (2005). All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.