38: External Representation for Data With Shared Structure

by Ray Dillinger

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-38@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This SRFI proposes (write-with-shared-structure) and (read-with-shared-structure), procedures for writing and reading external representations of data containing shared structure. These procedures implement a proposed standard external notation for data containing shared structure.

This SRFI permits but does not require replacing the standard (write) and (read) functions. An implementation that leaves them unchanged can continue to provide them without the overhead in time and space required to detect and specify shared structure.

An implementation conforms to this SRFI if it provides procedures named (write-with-shared-structure) and (read-with-shared-structure), which produce and read the same notation as produced by the reference implementation. It may also provide (read/ss) and (write/ss), equivalent functions with shorter names.

Rationale

R5RS scheme and IEEE scheme provide the procedure (write), which prints machine-readable representations of lists and other objects. However, the printed representation does not preserve information about what parts of the structure are shared, and in the case of self-referential objects the behavior of (write) itself is undefined; it is permitted to go into an infinite loop or invoke the dreaded curse of the nasal demons.

For example, it is possible to have a list within which two or more members are the same string (in the sense of (eq?)), but when the list is written, there is not sufficient information in the representation to recover the (eq?) relationship. When the list is read back in, there will be two or more copies of the string which are (equal?) but not (eq?).
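A minimal sketch of this situation in portable R5RS scheme follows. The names shared, original and reread are illustrative only. Because R5RS has no string ports, reread is built by hand with (string-copy) to stand in for what (read) would return after (write).

```scheme
;; A list whose two elements are the very same string object.
(define shared (string #\a #\b #\c))
(define original (list shared shared))

;; (write original) prints ("abc" "abc").  Reading that text back
;; allocates a fresh string for each literal, which string-copy
;; imitates here.
(define reread (list (string-copy shared) (string-copy shared)))

(eq? (car original) (cadr original))   ; => #t, one object seen twice
(equal? original reread)               ; => #t, same contents
(eq? (car reread) (cadr reread))       ; => #f, the sharing is gone
```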

As an example of the second problem, the results of evaluating

(begin (define a (cons 'val1 'val2))
       (set-cdr! a a)
       (write a))

are undefined; in R5RS parlance, calling write on such a structure "is an error", but not one that is necessarily signalled. The routine is permitted to print a nonstandard notation such as the one proposed in this standard or a different one, fail silently, signal an error, go into an infinite loop, or make demons fly out of your nose. Some of these results are unacceptable in some cases. This SRFI hopes to provide a standard way of dealing with this problem by providing a method of writing data which is guaranteed to be well-behaved and predictable even on data containing shared structures.

The extended functionality described below in the implementation of (write-with-shared-structure) is already present in the (write) function of several major scheme implementations (PLT, SISC, Chez, Bigloo, MIT scheme, and possibly others).

Specification

Formal Grammar of the New External Representation

This SRFI creates an alternative external representation for data written and read under (write/ss) and (read/ss). It is identical to the grammar for external representation for data written and read under (write) and (read) given in section 7 of R5RS, except that the single production


<datum> --> <simple datum> | <compound datum>

is replaced by the following five productions.


<datum> --> <defining datum> | <nondefining datum> | <defined datum>

<defining datum> -->  #<indexnum>=<nondefining datum>

<defined datum> --> #<indexnum>#

<nondefining datum> --> <simple datum> | <compound datum>

<indexnum> --> <digit 10>+

New Procedures


[[library procedure]] (write-with-shared-structure obj)
[[library procedure]] (write-with-shared-structure obj port)
[[library procedure]] (write-with-shared-structure obj port optarg)

Writes a written representation of obj to the given port. Strings that appear in the written representation are enclosed in doublequotes, and within those strings backslash and doublequote characters are escaped by backslashes. Character objects are written using the #\ notation.

Objects which denote locations rather than values (cons cells, vectors, and non-zero-length strings in R5RS scheme; also mutable objects, records, or containers if provided by the implementation), if they appear at more than one point in the data being written, must be preceded by "#N=" the first time they are written and replaced by "#N#" all subsequent times they are written, where N is a natural number used to identify that particular object. If objects which denote locations occur only once in the structure, then (write-with-shared-structure) must produce the same external representation for those objects as (write).

Write-with-shared-structure must terminate in finite time and produce a finite representation when writing finite data.

Write-with-shared-structure returns an unspecified value. The port argument may be omitted, in which case it defaults to the value returned by (current-output-port). The optarg argument may also be omitted. If present, its effects on the output and return value are unspecified but (write-with-shared-structure) must still write a representation that can be read by (read-with-shared-structure). Some implementations may wish to use optarg to specify formatting conventions, numeric radixes, or return values. The reference implementation ignores optarg.

For example, the code


(begin (define a (cons 'val1 'val2))
       (set-cdr! a a)
       (write-with-shared-structure a))

should produce the output #1=(val1 . #1#). This shows a cons cell whose cdr is the cell itself.
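Sharing need not be circular. As a further sketch, assuming the (write-with-shared-structure) of this SRFI is loaded (the variable names are illustrative), a list that contains the same sublist twice is labelled at its first occurrence and referenced at its second:

```scheme
(define sub (list 'a 'b))          ; one list object ...
(define twice (list sub sub))      ; ... appearing at two positions

(write-with-shared-structure twice)
;; With the reference implementation this prints: (#1=(a b) #1#)

;; A list with equal contents but no sharing gets no labels,
;; so the output matches that of (write).
(write-with-shared-structure (list (list 'a 'b) (list 'a 'b)))
;; prints: ((a b) (a b))
```

The label numbers shown are those chosen by the reference implementation, which numbers parts from 1; the specification only requires N to be a natural number.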


[[library procedure]] (read-with-shared-structure)
[[library procedure]] (read-with-shared-structure  port)

(read-with-shared-structure) converts the external representations of Scheme objects produced by (write-with-shared-structure) into scheme objects. That is, it is a parser for the nonterminal <datum> in the augmented external representation grammar defined above. (read-with-shared-structure) returns the next object parsable from the given input port, updating port to point to the first character past the end of the external representation of the object.

If an end-of-file is encountered in the input before any characters are found that can begin an object, then an end-of-file object is returned. The port remains open, and further attempts to read it (by (read-with-shared-structure) or (read)) will also return an end-of-file object. If an end of file is encountered after the beginning of an object's external representation, but the external representation is incomplete and therefore not parsable, an error is signalled.

The port argument may be omitted, in which case it defaults to the value returned by (current-input-port). It is an error to read from a closed port.
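To illustrate, here is a sketch that reads two data with shared structure from a port. It assumes that (read-with-shared-structure) from this SRFI is loaded and that the implementation also provides (open-input-string) from SRFI 6, which is not part of R5RS. The variable names are illustrative.

```scheme
(define in (open-input-string "#1=(val1 . #1#) (#2=\"abc\" #2#)"))

;; First datum: a pair whose cdr is the pair itself.
(define a (read-with-shared-structure in))
(car a)                 ; => val1
(eq? a (cdr a))         ; => #t

;; Second datum: both elements come back as one string object
;; rather than as two copies.
(define b (read-with-shared-structure in))
(eq? (car b) (cadr b))  ; => #t

;; Nothing but end-of-file remains on the port.
(eof-object? (read-with-shared-structure in))   ; => #t
```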

Implementation

The reference implementation of (write-with-shared-structure) is based on an implementation provided by Al Petrofsky. If there are any errors in it, I probably introduced them when I was adding support for an optional port argument. The reference implementation of (read-with-shared-structure) is the implementation provided by Al Petrofsky. Both are used here with his generous permission.

Note that portability forces the reference implementation of (write-with-shared-structure) to be O(n^2), but an implementor who tracks objects through additional fields hidden from R5RS scheme can provide a more efficient implementation.

If all the locations in your scheme are mutable and you don't do any locking or multithreading, you can write an O(n) version that destructively marks locations as it goes and then restores them all when done. If locations are immutable, then there should be some fixed ordering of them that you can use to make an O(log n) lookup table, giving you an O(n log n) write-with-shared-structure. R5RS does not give the programmer access to mutability information nor to comparison of constant data's addresses, but both of these are trivial operations if you have access to the system's internals.
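One further possibility, not described above and offered only as a sketch: where an implementation supplies identity-keyed hash tables, the scan phase can record each location in expected constant time. The code below assumes the SRFI 69 procedures (make-hash-table), (hash-table-ref/default) and (hash-table-set!), which are not part of R5RS. The names (scan-shared) and (shared?) are hypothetical, and this is not how the reference implementation works.

```scheme
;; Same test as interesting? in the reference implementation.
(define (interesting? obj)
  (or (pair? obj)
      (and (vector? obj) (not (zero? (vector-length obj))))
      (and (string? obj) (not (zero? (string-length obj))))))

;; Returns a table mapping each interesting part of OBJ to
;; #f if it was reached once, #t if it was reached more than once.
(define (scan-shared obj)
  (let ((seen (make-hash-table eq?)))      ; assumption: SRFI 69
    (let scan ((obj obj))
      (cond ((not (interesting? obj)) #f)
            ;; First visit: record it, then descend into its parts.
            ((eq? 'unseen (hash-table-ref/default seen obj 'unseen))
             (hash-table-set! seen obj #f)
             (cond ((pair? obj)
                    (scan (car obj))
                    (scan (cdr obj)))
                   ((vector? obj)
                    (do ((i 0 (+ i 1)))
                        ((= i (vector-length obj)))
                      (scan (vector-ref obj i))))))
            ;; Later visit: mark as shared and do not descend again,
            ;; which is what makes circular data terminate.
            (else (hash-table-set! seen obj #t))))
    seen))

(define (shared? table obj)
  (hash-table-ref/default table obj #f))

;; Usage: a sublist that appears twice, and a circular pair.
(define sub (list 'a 'b))
(define table (scan-shared (list sub sub)))
(shared? table sub)          ; => #t
(shared? table (cdr sub))    ; => #f

(define loop (cons 'val1 'val2))
(set-cdr! loop loop)
(shared? (scan-shared loop) loop)   ; => #t
```

A writer built on such a table would still need a second pass to assign label numbers in output order, as (write-obj) does in the reference implementation.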

;;; A printer that shows all sharing of substructures.  Uses the Common
;;; Lisp print-circle notation: #n# refers to a previous substructure
;;; labeled with #n=.   Takes O(n^2) time.


  (define (write-with-shared-structure obj . optional-port)
    (define (acons key val alist)
      (cons (cons key val) alist))
    (define outport (if (eq? '() optional-port)
			(current-output-port)
			(car optional-port)))
    ;; We only track duplicates of pairs, vectors, and strings.  We
    ;; ignore zero-length vectors and strings because r5rs doesn't
    ;; guarantee that eq? treats them sanely (and they aren't very
    ;; interesting anyway).

    (define (interesting? obj)
      (or (pair? obj)
	  (and (vector? obj) (not (zero? (vector-length obj))))
	  (and (string? obj) (not (zero? (string-length obj))))))
    ;; (write-obj OBJ ALIST):
    ;; ALIST has an entry for each interesting part of OBJ.  The
    ;; associated value will be:
    ;;  -- a number if the part has been given one,
    ;;  -- #t if the part will need to be assigned a number but has not been yet,
    ;;  -- #f if the part will not need a number.
    ;; The cdr of ALIST's first element should be the most recently
    ;; assigned number.
    ;; Returns an alist with new shadowing entries for any parts that
    ;; had numbers assigned.
    (define (write-obj obj alist)
      (define (write-interesting alist)
	(cond ((pair? obj)
	       (display "(" outport)
	       (let write-cdr ((obj (cdr obj)) (alist (write-obj (car obj) alist)))
		 (cond ((and (pair? obj) (not (cdr (assq obj alist))))
			(display " " outport)
			(write-cdr (cdr obj) (write-obj (car obj) alist)))
		       ((null? obj)
			(display ")" outport)
			alist)
		       (else
			(display " . " outport)
			(let ((alist (write-obj obj alist)))
			  (display ")" outport)
			  alist)))))
	      ((vector? obj)
	       (display "#(" outport)
	       (let ((len (vector-length obj)))
		 (let write-vec ((i 1) (alist (write-obj (vector-ref obj 0) alist)))
		   (cond ((= i len) (display ")" outport) alist)
			 (else (display " " outport)
			       (write-vec (+ i 1)
					  (write-obj (vector-ref obj i) alist)))))))
	      ;; else it's a string
	      (else (write obj outport) alist)))
      (cond ((interesting? obj)
	     (let ((val (cdr (assq obj alist))))
	       (cond ((not val) (write-interesting alist))
		     ((number? val)
		      (begin (display "#" outport)
			     (write val outport)
			     (display "#" outport) alist))
		     (else
		      (let ((n (+ 1 (cdar alist))))
			(begin (display "#" outport)
			       (write n outport)
			       (display "=" outport))
			(write-interesting (acons obj n alist)))))))
	    (else (write obj outport) alist)))

    ;; Scan computes the initial value of the alist, which maps each
    ;; interesting part of the object to #t if it occurs multiple times,
    ;; #f if only once.
    (define (scan obj alist)
      (cond ((not (interesting? obj)) alist)
	    ((assq obj alist)
             => (lambda (p) (if (cdr p) alist (acons obj #t alist))))
	    (else
	     (let ((alist (acons obj #f alist)))
	       (cond ((pair? obj) (scan (car obj) (scan (cdr obj) alist)))
		     ((vector? obj)
		      (let ((len (vector-length obj)))
			(do ((i 0 (+ 1 i))
			     (alist alist (scan (vector-ref obj i) alist)))
			    ((= i len) alist))))
		     (else alist))))))
    (write-obj obj (acons 'dummy 0 (scan obj '())))
    ;; We don't want to return the big alist that write-obj just returned.
    (if #f #f))





(define (read-with-shared-structure . optional-port)
  (define port
    (if (null? optional-port) (current-input-port) (car optional-port)))

  (define (read-char*) (read-char port))
  (define (peek-char*) (peek-char port))

  (define (looking-at? c)
    (eqv? c (peek-char*)))

  (define (delimiter? c)
    (case c
      ((#\( #\) #\" #\;) #t)
      (else (or (eof-object? c)
		(char-whitespace? c)))))

  (define (not-delimiter? c) (not (delimiter? c)))

  (define (eat-intertoken-space)
    (define c (peek-char*))
    (cond ((eof-object? c))
	  ((char-whitespace? c) (read-char*) (eat-intertoken-space))
	  ((char=? c #\;)
	   (do ((c (read-char*) (read-char*)))
	       ((or (eof-object? c) (char=? c #\newline))))
	   (eat-intertoken-space))))

  (define (read-string)
    (read-char*)
    (let read-it ((chars '()))
      (let ((c (read-char*)))
	(if (eof-object? c)
	    (error "EOF inside a string")
	    (case c
	      ((#\") (list->string (reverse chars)))
	      ((#\\) (read-it (cons (read-char*) chars)))
	      (else (read-it (cons c chars))))))))

  ;; reads chars that match PRED and returns them as a string.
  (define (read-some-chars pred)
    (let iter ((chars '()))
      (let ((c (peek-char*)))
	(if (or (eof-object? c) (not (pred c)))
	    (list->string (reverse chars))
	    (iter (cons (read-char*) chars))))))

  ;; reads a character after the #\ has been read.
  (define (read-character)
    (let ((c (peek-char*)))
      (cond ((eof-object? c) (error "EOF inside a character"))
	    ((char-alphabetic? c)
	     (let ((name (read-some-chars char-alphabetic?)))
	       (cond ((= 1 (string-length name)) (string-ref name 0))
		     ((string-ci=? name "space") #\space)
		     ((string-ci=? name "newline") #\newline)
		     (else (error "Unknown named character: " name)))))
	    (else (read-char*)))))

  (define (read-number first-char)
    (let ((str (string-append (string first-char)
			      (read-some-chars not-delimiter?))))
      (or (string->number str)
	  (error "Malformed number: " str))))

  (define char-standard-case
    (if (char=? #\a (string-ref (symbol->string 'a) 0))
	char-downcase
	char-upcase))

  (define (string-standard-case str)
    (let* ((len (string-length str))
	   (new (make-string len)))
      (do ((i 0 (+ i 1)))
	  ((= i len) new)
	(string-set! new i (char-standard-case (string-ref str i))))))

  (define (read-identifier)
    (string->symbol (string-standard-case (read-some-chars not-delimiter?))))

  (define (read-part-spec)
    (let ((n (string->number (read-some-chars char-numeric?))))
      (let ((c (read-char*)))
	(case c
	  ((#\=) (cons 'decl n))
	  ((#\#) (cons 'use n))
	  (else (error "Malformed shared part specifier"))))))

  ;; Tokens: strings, characters, numbers, booleans, and
  ;; identifiers/symbols are represented as themselves.
  ;; Single-character tokens are represented as (CHAR), the
  ;; two-character tokens #( and ,@ become (#\#) and (#\@).
  ;; #NN= and #NN# become (decl . NN) and (use . NN).
  (define (read-optional-token)
    (eat-intertoken-space)
    (let ((c (peek-char*)))
      (case c
	((#\( #\) #\' #\`) (read-char*) (list c))
	((#\,)
	 (read-char*)
	 (if (looking-at? #\@)
	     (begin (read-char*) '(#\@))
	     '(#\,)))
	((#\") (read-string))
	((#\.)
	 (read-char*)
	 (cond ((delimiter? (peek-char*)) '(#\.))
	       ((not (looking-at? #\.)) (read-number #\.))
	       ((begin (read-char*) (looking-at? #\.)) (read-char*) '...)
	       (else (error "Malformed token starting with \"..\""))))
	((#\+) (read-char*) (if (delimiter? (peek-char*)) '+ (read-number c)))
	((#\-) (read-char*) (if (delimiter? (peek-char*)) '- (read-number c)))
	((#\#)
	 (read-char*)
	 (let ((c (peek-char*)))
	   (case c
	     ((#\() (read-char*) '(#\#))
	     ((#\\) (read-char*) (read-character))
	     ((#\t #\T) (read-char*) #t)
	     ((#\f #\F) (read-char*) #f)
	     (else (cond ((eof-object? c) (error "EOF inside a # token"))
			 ((char-numeric? c) (read-part-spec))
			 (else (read-number #\#)))))))
	(else (cond ((eof-object? c) c)
		    ((char-numeric? c) (read-char*) (read-number c))
		    (else (read-identifier)))))))

  (define (read-token)
    (let ((tok (read-optional-token)))
      (if (eof-object? tok)
	  (error "EOF where token was required")
	  tok)))

  ;; Parts-alist maps the number of each part to a thunk that returns the part.
  (define parts-alist '())

  (define (add-part-to-alist! n thunk)
    (set! parts-alist (cons (cons n thunk) parts-alist)))

  ;; Read-object returns a datum that may contain some thunks, which
  ;; need to be replaced with their return values.
  (define (read-object)
    (finish-reading-object (read-token)))

  ;; Like read-object, but may return EOF.
  (define (read-optional-object)
    (finish-reading-object (read-optional-token)))

  (define (finish-reading-object first-token)
    (if (not (pair? first-token))
	first-token
	(if (char? (car first-token))
	    (case (car first-token)
	      ((#\() (read-list-tail))
	      ((#\#) (list->vector (read-list-tail)))
	      ((#\. #\)) (error (string-append "Unexpected \"" (string (car first-token)) "\"")))
	      (else
	       (list (caadr (assv (car first-token)
				  '((#\' 'x) (#\, ,x) (#\` `x) (#\@ ,@x))))
		     (read-object))))
	    ;; We need to specially handle chains of declarations in
	    ;; order to allow #1=#2=x and #1=(#2=#1#) and not to allow
	    ;; #1=#2=#1# nor #1=#2=#1=x.
	    (let ((starting-alist parts-alist))
	      (let read-decls ((token first-token))
		(if (and (pair? token) (symbol? (car token)))
		    (let ((n (cdr token)))
		      (case (car token)
			((use)
			 ;; To use a part, it must have been
			 ;; declared before this chain started.
			 (cond ((assv n starting-alist) => cdr)
			       (else (error "Use of undeclared part " n))))
			((decl)
			 (if (assv n parts-alist)
			     (error "Double declaration of part " n))
			 ;; Letrec enables us to make deferred
			 ;; references to an object before it exists.
			 (letrec ((obj (begin
					 (add-part-to-alist! n (lambda () obj))
					 (read-decls (read-token)))))
			   obj))))
		    (finish-reading-object token)))))))

  (define (read-list-tail)
    (let ((token (read-token)))
      (if (not (pair? token))
	  (cons token (read-list-tail))
	  (case (car token)
	    ((#\)) '())
	    ((#\.) (let* ((obj (read-object))
			  (tok (read-token)))
		     (if (and (pair? tok) (eqv? #\) (car tok)))
			 obj
			 (error "Extra junk after a dot"))))
	    (else (let ((obj (finish-reading-object token)))
		    (cons obj (read-list-tail))))))))

  ;; Unthunk.
  ;; To dereference a part that was declared using another part,
  ;; e.g. #2=#1#, may require multiple dethunkings.  We were careful
  ;; in finish-reading-object to ensure that this won't loop forever:
  (define (unthunk thunk)
    (let ((x (thunk)))
      (if (procedure? x) (unthunk x) x)))

  (let ((obj (read-optional-object)))
    (let fill-in-parts ((obj obj))
      (cond ((pair? obj)
	     (if (procedure? (car obj))
		 (set-car! obj (unthunk (car obj)))
		 (fill-in-parts (car obj)))
	     (if (procedure? (cdr obj))
		 (set-cdr! obj (unthunk (cdr obj)))
		 (fill-in-parts (cdr obj))))
	    ((vector? obj)
	     (let ((len (vector-length obj)))
	       (do ((i 0 (+ i 1)))
		   ((= i len))
		 (let ((elt (vector-ref obj i)))
		   (if (procedure? elt)
		       (vector-set! obj i (unthunk elt))
		       (fill-in-parts elt))))))))
    obj))

Copyright

© 2003 Ray Dillinger.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: David Rush