Title

Extended ports

Author

Marc Feeley

Status

This SRFI is currently in withdrawn status. To provide input on this SRFI, please send email to srfi-91@nospamsrfi.schemers.org. Previous messages are available in the mailing list archive.

Received: 2006-04-01
Draft: 2006-04-12--2006-06-11
Withdrawn: 2007-07-10

Abstract

This SRFI specifies an extension to the R5RS ports that supports several useful features: binary I/O and text I/O, bulk I/O, file opening attributes, and bidirectional ports. Binary I/O is provided through byte ports, which are ports whose fundamental I/O unit is an 8-bit byte. Because characters can be encoded with bytes using a character encoding such as ISO 8859-1, UTF-8, or UTF-16BE, any byte port is also a character port (a port that supports the character-level I/O of R5RS). A byte port's character encoding and various other attributes are specified when the port is opened. Because reasonable defaults exist, these attributes are specified using a named optional parameter syntax. All procedures that have the same name as in R5RS are compatible with R5RS but may provide additional functionality.

Rationale

The R5RS Scheme standard specifies a set of procedures to perform input and output operations. These operations have limited functionality. According to the R5RS:

To Scheme, an input port is a Scheme object that can deliver characters upon command, while an output port is a Scheme object that can accept characters.

The R5RS model of I/O does not provide for the following features which are very useful in modern applications:

- External encoding of the characters. In many operating systems files are stored as sequences of 8-bit bytes. Text files are in fact binary files where each character is encoded using a sequence of bytes according to some character encoding such as ISO 8859-1, UTF-8, UTF-16BE, etc. Normally the application must know in advance the character encoding used by a file, but sometimes this information is contained in the file.
- Binary input and output. Some files represent data using a custom encoding. To implement procedures that read and write such data, the ability to read and write individual bytes is necessary.
- Bulk input and output. Reading large amounts of data one byte or character at a time is not efficient. It is more efficient to read and write blocks of data.
- File opening attributes. Many operating systems allow attributes to be specified when a file is opened. These attributes affect the file opening operation and the I/O operations that follow. Examples include a flag to force the creation of the file if it does not exist, and a flag to indicate that all write operations should be done at the end of the file.
- Bidirectional ports. Many operating systems have I/O devices on which it is possible both to read and to write data, for example UNIX terminals, pipes and sockets. Bidirectional ports are a natural extension of the port model to support these devices.

The Gambit Scheme system currently provides an I/O model and an API very similar to the one specified in this SRFI. Gambit provides many additional I/O features that are not specified in this SRFI, such as nonblocking I/O, process ports, TCP client and server ports, object ports, double-ended string ports (a.k.a. pipes), `read-line', etc. This is an intentional omission. A more focused port SRFI, like this one, has a higher likelihood of achieving widespread consensus in the community. The goal of this SRFI is to specify a core I/O model and API that is compatible with R5RS but extensible. We hope that it will foster the development of other I/O SRFIs extending it to other I/O requirements.

This SRFI addresses all the ``binary I/O requirements'' which the members of the Scheme Language Editors Committee agreed were needed for the R6RS standard. This SRFI is a concrete proposal for an I/O subsystem that addresses all these requirements and that conforms to R5RS.

Specification

This SRFI assumes that SRFI 4 (Homogeneous numeric vector datatypes) and SRFI 88 (Keyword objects) are supported.

Port direction

A byte port can support unidirectional I/O (i.e. input port or output port), or bidirectional I/O (i.e. input-output port). Bidirectional ports can be viewed as an input port and an output port combined into a single port object. I/O operations on one side of a bidirectional port do not normally affect the other side. When an operation is said to apply to an input port it also applies to a bidirectional port. When an operation is said to apply to an output port it also applies to a bidirectional port.

procedure: (input-port? obj)
procedure: (output-port? obj)

As in R5RS. Note that when obj is a bidirectional port both of these procedures return #t.

procedure: (port? obj)

Returns #t if obj is an input port or an output port (or an input-output port), otherwise returns #f.

(port? 123)                          ==>  #f
(port? (current-input-port))         ==>  #t
(port? (current-output-port))        ==>  #t

Port class hierarchy

Byte ports support character I/O operations because each byte port has an attached character encoding that specifies how characters are encoded with bytes. It would be incorrect, however, to assume that all ports are byte ports. For example, the ``string ports'' of SRFI 6 (Basic String Ports) have no reason to be aware of the character-to-byte encoding because they only deal with sequences of characters, so they need not be byte ports. For this reason this SRFI views byte ports as a subtype of character ports. Character ports support character I/O operations, and byte ports support both character I/O operations and byte I/O operations. All I/O operations that are valid on a character port are also valid on a byte port. [Although not specified in this SRFI, a further generalization is ``object ports'', which are ports whose fundamental I/O unit is the Scheme object. Character ports are object ports because there is a standard encoding of (most) Scheme objects to characters.]

The following predicates test to which port class an object belongs.

procedure: (char-port? obj)

Returns #t if obj is a character port, otherwise returns #f.

(char-port? 123)                     ==>  #f
(char-port? (current-input-port))    ==>  #t (probably)

procedure: (byte-port? obj)

Returns #t if obj is a byte port, otherwise returns #f.

(byte-port? 123)                     ==>  #f
(byte-port? (current-input-port))    ==>  #t (probably)

Port buffering

I/O systems typically use buffering to enhance performance. Data to be output is accumulated in a local output buffer that is only drained when it is full or the application explicitly requests it. Draining the output buffer causes the buffer's content to be passed to the operating system which is then responsible for sending it off to the target device (note that it is common to have other buffering layers on the path to the device so draining an output buffer does not imply that the device has received it). An input buffer allows data to be obtained in blocks from the operating system. When the application requests a certain amount of data, the request is served from the input buffer if it contains all the data requested, otherwise it is necessary to call the operating system to fill the input buffer.

In many cases buffering can be fully transparent and the application need not be aware of its existence. When an application mixes text and binary I/O, or there is a dependence between the output and the input (such as in an interactive program), buffering gets in the way and must be taken into account by the application. Alternatively the simplicity of a non-buffered model can be regained by disabling the buffering.

Character ports need buffers at the character level. Byte ports need buffers at the character level and at the byte level, because there is a performance advantage to encoding and decoding characters in bulk to and from the byte buffer.
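
As a rough illustration of the output side of this model, the following sketch accumulates characters and hands them on only when the buffer is full or a flush is requested. It is a toy model in portable Scheme, not part of this SRFI: the names `make-buffered-writer', `put-char!' and `flush!' are hypothetical, and the `drain!' argument stands in for the call that passes data to the operating system.

```scheme
;; Toy model of an output buffer (hypothetical names, not part of this SRFI).
;; capacity : number of characters held before the buffer drains itself
;; drain!   : one-argument procedure standing in for the operating system call
(define (make-buffered-writer capacity drain!)
  (let ((pending '())   ; buffered characters, most recent first
        (count 0))      ; number of buffered characters
    (define (flush!)
      ;; Pass the buffered characters on as one string, oldest first.
      (if (> count 0)
          (begin
            (drain! (list->string (reverse pending)))
            (set! pending '())
            (set! count 0))))
    (define (put-char! c)
      (set! pending (cons c pending))
      (set! count (+ count 1))
      ;; A full buffer is drained without an explicit request.
      (if (>= count capacity) (flush!)))
    (cons put-char! flush!)))

;; Usage: record every drained block in a list, most recent first.
(define drained '())
(define w (make-buffered-writer
           4
           (lambda (s) (set! drained (cons s drained)))))
(for-each (car w) (string->list "abcdef"))
;; drained is now ("abcd"); the characters e and f are still buffered
((cdr w))
;; drained is now ("ef" "abcd")
```

The explicit call to the flush procedure plays the role that `force-output' plays for a real port.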

procedure: (force-output [port])

The port, which defaults to the value returned by `current-output-port', must be an output port or bidirectional port. This procedure drains the character output buffer of port, and then the byte output buffer of port if it is a byte port. This procedure has no effect if the port is closed. The value returned is unspecified.

(define (ask-yes-no)
  (display "yes or no? ")
  (force-output)
  (eqv? (read) 'yes))

procedure: (close-input-port port)
procedure: (close-output-port port)

As in R5RS. Note that `close-output-port' implicitly calls `force-output' before the port is closed.

procedure: (close-port port)

When port is a bidirectional port, both sides of the port are closed and the value returned is unspecified. When port is a unidirectional port, `(close-port port)' is equivalent to `(close-input-port port)' when port is an input port, and it is equivalent to `(close-output-port port)' when port is an output port.

Port settings

Port settings are attributes of ports which affect the I/O operations. Port settings for a given port class are also valid for its subclasses. The settings are specified in a port settings list when the port is created; those not specified default to a reasonable value. Keyword objects as specified in SRFI 88 (Keyword objects) are used to name the settings in the port settings list. The keyword object is followed by the value to assign to that setting. As a simple example, a port can be created for the file ``foo'' using the call

    (open-input-file "foo")

This will use default settings for the character encoding, buffering, etc. To force the use of the UTF-8 character encoding the port could be opened using the call

    (open-input-file
     (list path: "foo"
           char-encoding: 'UTF-8))

Here the argument of the procedure `open-input-file' has been replaced by a port settings list which specifies the value of each port setting that should not be set to the default value. Note that some port settings have no useful default and it is therefore required to specify a value for them, such as the `path:' in the case of the file opening procedures. All port creation procedures take a single argument that can either be a port settings list or a value of a type that depends on the kind of port being created.

The reason the settings are specified using a single required parameter rather than named optional parameters is so that the R5RS `with-...-file' and `call-with-...-file' family of I/O procedures can be extended with port settings while maintaining their R5RS API (i.e. their first argument is either a string naming a file or a port settings list). Passing the settings as named optional parameters after the second parameter would be awkward and hard to read when the second parameter is a long lambda-expression.
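
One way an implementation might process the single string-or-settings argument is sketched below. This is an assumption made for illustration, not something this SRFI prescribes: the procedure names `normalize-settings' and `setting-ref' are hypothetical, and plain symbols such as `path' stand in for the SRFI 88 keyword objects (`path:') so that the sketch runs in any Scheme.

```scheme
;; Sketch only. Symbols stand in for SRFI 88 keyword objects here.

;; Turn a bare file name into a settings list; pass a settings list through.
(define (normalize-settings string-or-settings path-key)
  (if (string? string-or-settings)
      (list path-key string-or-settings)
      string-or-settings))

;; Look up KEY in a settings list of alternating names and values,
;; returning DEFAULT when the setting is not mentioned.
(define (setting-ref settings key default)
  (cond ((or (null? settings) (null? (cdr settings))) default)
        ((eq? (car settings) key) (cadr settings))
        (else (setting-ref (cddr settings) key default))))

;; Usage
(define s1 (normalize-settings "foo" 'path))
(define s2 (normalize-settings (list 'path "foo" 'char-encoding 'UTF-8) 'path))
(setting-ref s1 'path #f)                      ; => "foo"
(setting-ref s1 'char-encoding 'ISO8859-1)     ; => ISO8859-1 (the default)
(setting-ref s2 'char-encoding 'ISO8859-1)     ; => UTF-8
```

With this shape, the two-argument procedures such as `call-with-input-file' can accept either form in their first argument without any change to their R5RS signature.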

Character ports

Character port settings

The following port settings are valid for character ports and byte ports.

direction: ( input | output | input-output )
This setting controls the direction of the port. The symbol `input' indicates a unidirectional input port, the symbol `output' indicates a unidirectional output port, and the symbol `input-output' indicates a bidirectional port. The default value of this setting depends on the port creation procedure (for example for `open-input-file' the default is `input' and for `open-output-file' the default is `output').
input-char-buffering: ( #f | #t )
This setting controls the input character buffering of the port. The value `#f' disables input buffering and the value `#t' enables input buffering. The default value of this setting is implementation dependent.
output-char-buffering: ( #f | #t | line )
This setting controls the output character buffering of the port. The value `#f' disables output buffering and the values `#t' and `line' enable output buffering. When the setting is `#f' an implicit call to `force-output' occurs after every character output to the port. When the setting is `line' an implicit call to `force-output' occurs after each `#\newline' character output to the port. The default value of this setting is implementation dependent.
char-buffering: ( #f | #t | line )
The setting `char-buffering' can be used to simultaneously set the input character buffering and the output character buffering. When `line' is used, the input character buffering is set to `#t'.

Character port operations

procedure: (read-char [port])
procedure: (peek-char [port])
procedure: (eof-object? obj)
procedure: (char-ready? [port])
procedure: (write-char char [port])

As in R5RS.

procedure: (read-substring string start end [port])

The ability to perform bulk input of characters is provided by the `read-substring' procedure. The string will receive the characters read, and start and end delimit the section of the string that is targeted (using the same indexing as the `substring' procedure). Characters will be read from the character input port port, which defaults to the value returned by `current-input-port'. The number of characters read, N, will not exceed end-start. N will be less than end-start only when the end of the input stream is reached. [Note that an extension to this SRFI may relax this constraint to provide nonblocking I/O.] The procedure returns N.

; This example assumes that this is typed at the REPL and that the
; current input port is the one used by the REPL.

> (define s (make-string 10 #\x))
> (read-substring s 2 5)123456789
3
> 456789
> s
"xx123xxxxx"

procedure: (write-substring string start end [port])

The ability to perform bulk output of characters is provided by the `write-substring' procedure. The string is the source of the characters to output, and start and end delimit the section of the string to output (using the same indexing as the `substring' procedure). Characters will be written to the character output port port, which defaults to the value returned by `current-output-port'. The number of characters written, N, will be equal to end-start. [Note that an extension to this SRFI may relax this constraint to provide nonblocking I/O.] The procedure returns N.

; This example assumes that this is typed at the REPL and that the
; current output port is the one used by the REPL.

> (define n (write-substring "1234567" 2 5))
345
> n
3
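
The blocking behaviour of the two bulk procedures can be modelled with the R5RS character operations alone. The definitions below are a sketch under that assumption, with the hypothetical names `simple-read-substring' and `simple-write-substring'; a real implementation would transfer characters in blocks rather than one at a time.

```scheme
;; Sketch only: models of the bulk procedures built on read-char and
;; write-char. The names are hypothetical, not part of this SRFI.

;; Fill str from index start (inclusive) to end (exclusive).
;; Returns the number of characters read, which is less than
;; end - start only if the end of the input stream is reached.
(define (simple-read-substring str start end . opt)
  (let ((port (if (pair? opt) (car opt) (current-input-port))))
    (let loop ((i start))
      (if (>= i end)
          (- i start)
          (let ((c (read-char port)))
            (if (eof-object? c)
                (- i start)
                (begin
                  (string-set! str i c)
                  (loop (+ i 1)))))))))

;; Write the characters of str from start (inclusive) to end (exclusive).
;; Returns the number of characters written, always end - start.
(define (simple-write-substring str start end . opt)
  (let ((port (if (pair? opt) (car opt) (current-output-port))))
    (let loop ((i start))
      (if (< i end)
          (begin
            (write-char (string-ref str i) port)
            (loop (+ i 1)))
          (- end start)))))
```

Applied to the same arguments as the REPL examples above, these definitions return the same counts.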

Byte ports

Byte port settings

The following port settings are valid for byte ports.

input-char-encoding: ( ISO8859-1 | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE )
output-char-encoding: ( ISO8859-1 | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE )
char-encoding: ( ISO8859-1 | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE )
These settings control the input character encoding and the output character encoding of the port. When char-encoding: is used, both the input encoding and the output encoding are set. The encoding is selected with a symbol:
- ISO8859-1. Each character is encoded using a single byte. Only Unicode characters with a code in the range 0 to 255 are allowed.
- UTF-8. Each character is encoded using a sequence of 1 to 6 bytes. The minimum length UTF-8 encoding is used. If a BOM is needed at the beginning of the stream then it must be explicitly written.
- UTF-16. Each character is encoded using 2 or 4 bytes. The minimum length UTF-16 encoding is used. If the port is an input port and the first two bytes read are a BOM (``Byte Order Mark'' character with hexadecimal code FEFF) then the BOM will be discarded and the endianness will be set accordingly, otherwise the endianness is implementation dependent. If the port is an output port then the endianness is set in an implementation dependent way and a BOM is automatically output at the beginning of the stream.
- UTF-16BE. Each character is encoded using 2 or 4 bytes like with UTF-16, however the endianness is set to big-endian and there is no automatic BOM processing. If a BOM is needed at the beginning of the stream then it must be explicitly written.
- UTF-16LE. Each character is encoded using 2 or 4 bytes like with UTF-16, however the endianness is set to little-endian and there is no automatic BOM processing. If a BOM is needed at the beginning of the stream then it must be explicitly written.
input-eol-encoding: ( lf | cr | cr-lf )
output-eol-encoding: ( lf | cr | cr-lf )
eol-encoding: ( lf | cr | cr-lf )
These settings control the input end-of-line encoding and the output end-of-line encoding of the port. When eol-encoding: is used, both the input encoding and the output encoding are set. The end-of-line input encoding specifies which sequence when read from an input port will be decoded as a `#\newline' character. The output end-of-line encoding is used when the `#\newline' character is output. The encoding is selected with a symbol:
- lf. The end-of-line is encoded using the linefeed character (U+000A).
- cr. The end-of-line is encoded using the carriage return character (U+000D).
- cr-lf. If the port is an output port, the end-of-line is encoded using a carriage return character (U+000D) followed by a linefeed character (U+000A). If the port is an input port, all three encodings are recognized as an end-of-line, with a carriage return followed by a linefeed having precedence.
input-byte-buffering: ( #f | #t )
This setting controls the input byte buffering of the port. The value `#f' disables input buffering and the value `#t' enables input buffering. The default value of this setting is implementation dependent.
output-byte-buffering: ( #f | #t )
This setting controls the output byte buffering of the port. The value `#f' disables output buffering and the value `#t' enables output buffering. The default value of this setting is implementation dependent.
byte-buffering: ( #f | #t )
The setting `byte-buffering' can be used to simultaneously set the input byte buffering and the output byte buffering.
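
The input behaviour of the `cr-lf' end-of-line encoding described above can be sketched as a pass over a list of characters. This is an illustration in portable Scheme with the hypothetical name `decode-eol-cr-lf'; it assumes that `#\newline' is the linefeed character, and a real port would perform the same decoding incrementally on its buffers.

```scheme
;; Sketch only (hypothetical name): decode end-of-line sequences the way
;; an input port with the cr-lf setting is described as doing.
(define cr (integer->char 13))   ; carriage return, U+000D
(define lf (integer->char 10))   ; linefeed, U+000A

(define (decode-eol-cr-lf chars)
  (let loop ((in chars) (out '()))
    (cond ((null? in) (reverse out))
          ((char=? (car in) cr)
           ;; A carriage return followed by a linefeed takes precedence
           ;; and yields a single newline.
           (if (and (pair? (cdr in)) (char=? (cadr in) lf))
               (loop (cddr in) (cons #\newline out))
               (loop (cdr in) (cons #\newline out))))
          ((char=? (car in) lf)
           (loop (cdr in) (cons #\newline out)))
          (else
           (loop (cdr in) (cons (car in) out))))))

;; Usage: the three end-of-line forms each decode to one newline.
(define decoded
  (list->string (decode-eol-cr-lf (list #\a cr lf #\b cr #\c lf))))
;; decoded holds six characters: a, newline, b, newline, c, newline
```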

Byte port operations

procedure: (read-subu8vector u8vector start end [port])

The ability to perform bulk input of bytes is provided by the `read-subu8vector' procedure. The u8vector will receive the bytes read, and start and end delimit the section of the u8vector that is targeted (using the same indexing as the `substring' procedure). Bytes will be read from the byte input port port, which defaults to the value returned by `current-input-port'. It is an error if the port's character input buffer is not empty. The number of bytes read, N, will not exceed end-start. N will be less than end-start only when the end of the input stream is reached. [Note that an extension to this SRFI may relax this constraint to provide nonblocking I/O.] The procedure returns N.

; This example assumes that this is typed at the REPL and that the
; current input port is the one used by the REPL, and that it uses
; the ISO8859-1 character encoding.

> (define v (make-u8vector 10 0))
> (read-subu8vector v 2 5)123456789
3
> 456789
> v
#u8(0 0 49 50 51 0 0 0 0 0)

procedure: (write-subu8vector u8vector start end [port])

The ability to perform bulk output of bytes is provided by the `write-subu8vector' procedure. The u8vector is the source of the bytes to output, and start and end delimit the section of the u8vector to output (using the same indexing as the `substring' procedure). Bytes will be written to the byte output port port, which defaults to the value returned by `current-output-port'. The port's character output buffer is first emptied by encoding the characters into bytes. The number of bytes written from u8vector, N, will be equal to end-start. [Note that an extension to this SRFI may relax this constraint to provide nonblocking I/O.] The procedure returns N.

; This example assumes that this is typed at the REPL and that the
; current output port is the one used by the REPL, and that it uses
; the ISO8859-1 character encoding.

> (define n (write-subu8vector '#u8(49 50 51 52 53 54) 2 5))
345
> n
3

procedure: (read-byte [port])

procedure: (write-byte byte [port])

These procedures respectively read a single byte from the port and write a single byte to the port. The procedure `write-byte' returns an unspecified value. These procedures could be defined as follows.

(define (read-byte . opt)
  (let* ((buf (u8vector 0))
         ;; opt is a rest list, which is never #f, so test it with pair?
         (port (if (pair? opt) (car opt) (current-input-port)))
         (n (read-subu8vector buf 0 1 port)))
    (if (= n 0)
        the-end-of-file-object ; stands for the implementation's end-of-file object
        (u8vector-ref buf 0))))

(define (write-byte byte . opt)
  (let* ((buf (u8vector byte))
         (port (if (pair? opt) (car opt) (current-output-port))))
    (write-subu8vector buf 0 1 port)))

File byte ports

procedure: (open-input-file string-or-settings)

procedure: (open-output-file string-or-settings)

procedure: (open-file string-or-settings)

These procedures open a file and return a byte port. In all cases the single parameter can be a string naming a file, or a settings list that indicates the file name and possibly other attributes. The following port settings are valid for file byte ports.

path: string
This setting indicates the location of the file in the filesystem. There is no default value for this setting.
append: ( #f | #t )
This setting controls whether output will be added to the end of the file each time the output byte buffer is drained. This is useful for writing to log files that might be open by more than one process. The default value of this setting is `#f'.
create: ( #f | #t | maybe )
This setting controls whether the file will be created when it is opened. A setting of `#f' requires that the file exist. A setting of `#t' requires that the file not exist. A setting of `maybe' will create the file if it does not exist. The default value of this setting is `maybe' for output ports and `#f' for input ports and bidirectional ports.
truncate: ( #f | #t )
This setting controls whether the file will be truncated when it is opened. For input ports, the default value of this setting is `#f'. For output ports, the default value of this setting is `#t' when the `append:' setting is `#f', and `#f' otherwise.

For all three procedures the settings list can specify a `direction:' setting to select the direction of the port. The default direction for `open-input-file' is `input', for `open-output-file' it is `output', and for `open-file' it is `input-output'.

For file byte ports, the default character encoding is `ISO8859-1' and the default end-of-line encoding is `lf'. This combination of settings matches the UNIX conventions, and it represents an identity encoding for all characters whose code is in the range 0 to 255.

(let ((p (open-file
          (list direction: 'output
                path: "mydata"
                char-encoding: 'UTF-8
                append: #t))))
  (display "this is a lambda: \u03BB\n" p)
  (close-port p))
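
The bytes that the example above appends to the file for the lambda character can be worked out by hand. The sketch below computes the minimum-length UTF-8 encoding of a code point using only integer arithmetic. The name `utf8-encode' is hypothetical, and the sketch is restricted to code points up to #x10FFFF, which need at most 4 bytes; it does not reject surrogate code points.

```scheme
;; Sketch only (hypothetical name): minimum-length UTF-8 encoding of a
;; code point in the range 0 to #x10FFFF, returned as a list of bytes.
(define (utf8-encode cp)
  ;; Continuation byte: binary 10xxxxxx carrying the low 6 bits of n.
  (define (cont n) (+ #x80 (remainder n 64)))
  (cond ((< cp #x80)                       ; 1 byte:  0xxxxxxx
         (list cp))
        ((< cp #x800)                      ; 2 bytes: 110xxxxx 10xxxxxx
         (list (+ #xC0 (quotient cp 64))
               (cont cp)))
        ((< cp #x10000)                    ; 3 bytes: 1110xxxx then 2 continuations
         (list (+ #xE0 (quotient cp 4096))
               (cont (quotient cp 64))
               (cont cp)))
        (else                              ; 4 bytes: 11110xxx then 3 continuations
         (list (+ #xF0 (quotient cp 262144))
               (cont (quotient cp 4096))
               (cont (quotient cp 64))
               (cont cp)))))

;; Usage
(utf8-encode #x61)      ; => (97)              the letter a
(utf8-encode #x3BB)     ; => (206 187)         the lambda in the example
(utf8-encode #x20AC)    ; => (226 130 172)     the euro sign
(utf8-encode #x1F600)   ; => (240 159 152 128)
```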

procedure: (call-with-input-file string-or-settings proc)

procedure: (call-with-output-file string-or-settings proc)

procedure: (with-input-from-file string-or-settings thunk)

procedure: (with-output-to-file string-or-settings thunk)

These procedures are as in R5RS except the first parameter can be a string naming a file, or a settings list which is interpreted as above.

(with-output-to-file
  (list path: "mydata"
        char-encoding: 'UTF-16)
  (lambda ()
    (for-each (lambda (i)
                (display i)
                (display " ")
                (display (* i i))
                (newline))
              '(1 2 3 4 5))))
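
To make the effect of the `UTF-16' setting in the example above concrete, the following sketch computes the bytes for a string under the assumption that the implementation chooses big-endian order (the choice is implementation dependent). The names `utf16be-encode' and `utf16be-bytes' are hypothetical, and the sketch assumes that `char->integer' returns the Unicode code point of a character.

```scheme
;; Sketch only (hypothetical names): big-endian UTF-16 encoding.

;; Encode one code point as 2 bytes, or as 4 bytes (a surrogate pair)
;; when it lies above #xFFFF.
(define (utf16be-encode cp)
  (define (unit->bytes u)              ; one 16-bit unit, high byte first
    (list (quotient u 256) (remainder u 256)))
  (if (< cp #x10000)
      (unit->bytes cp)
      (let ((v (- cp #x10000)))
        (append (unit->bytes (+ #xD800 (quotient v 1024)))
                (unit->bytes (+ #xDC00 (remainder v 1024)))))))

;; Encode a whole string, optionally preceded by the BOM (FEFF).
(define (utf16be-bytes str with-bom?)
  (let ((units (apply append
                      (map (lambda (c) (utf16be-encode (char->integer c)))
                           (string->list str)))))
    (if with-bom?
        (append (list #xFE #xFF) units)
        units)))

;; Usage
(utf16be-encode #x1F600)    ; => (216 61 222 0)
(utf16be-bytes "1 1" #t)    ; => (254 255 0 49 0 32 0 49)
```

The second result corresponds to the start of the file written by the example, before the first end-of-line, on an implementation that selects big-endian output.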

Changing port settings

procedure: (input-port-settings-set! port settings)

procedure: (output-port-settings-set! port settings)

These procedures change the settings of the port to settings. The settings of the port not mentioned in settings keep their current value. It is an error to mention in settings a setting that cannot change, that is: `direction:', `path:', `append:', `create:', or `truncate:'.

When changing the character encoding or end-of-line encoding of an output port the character output buffer is first emptied by encoding the characters into bytes. It is an error to change the character encoding or end-of-line encoding of an input port when the character input buffer is not empty.

(define (read-data path)
  (let ((p (open-input-file
            (list path: path
                  char-encoding: 'ISO8859-1
                  eol-encoding: 'lf
                  char-buffering: #f))))
    (let ((c (read-char p)))
      (input-port-settings-set!
       p
       (case c ; use first char as a char encoding selector
         ((#\L) ; Latin 1
          '(char-buffering: #t))
         ((#\8) ; UTF-8
          '(char-encoding: UTF-8 char-buffering: #t))
         (else
          (error "unknown character encoding")))))
    (let* ((s (make-string 10000)) ; read up to 10000 chars
           (n (read-substring s 0 10000 p)))
      (close-port p)
      (substring s 0 n))))

Implementation

A fairly portable implementation of this SRFI is planned for the near future. The Gambit sources can be used in the meantime as a sample implementation of this SRFI (and much more!).

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Donovan Kolbly