This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
At Marc's request, I'm responding to the SRFI-91 list on the subject of binary ports. Overall, I think that this SRFI is quite good. That said, I think that the port hierarchy proposal is misconceived and needs to be abandoned. The design mistake lies in trying to unify binary ports and character ports (even partially). Since the typing issues have been addressed in prior discussion, I'm going to focus on semantic, implementation, and experience issues here. A note: I'm assuming in all of this that scheme will move to an international character set. The problems I am about to discuss do not manifest in a system implementing only a 7-bit or 8-bit character set. Historical Background: Historically, scheme has implicitly required that all ports be character ports. The requirement that port I/O is character based is implicit in the definitions of external representation and the relationship between external representation, READ, and WRITE. The assumption was reinforced with the introduction of string ports. So: whatever else is done, we are stuck with character ports as a fundamental abstraction provided by the language runtime. One consequence of this is that external encodings cannot be neutral. While it is possible to implement arbitrary external encodings on top of READ-BYTE and WRITE-BYTE, there must be at least one multibyte character encoding that is known to the runtime at primitive level in order to support the implementations of READ and WRITE: the encoding that is the one used for scheme program input. * Mixing Bytes and Characters: Semantic issues It is tempting to imagine that we might simply mix READ-BYTE and READ-CHAR calls on a port. This breaks down quickly in the fact of multibyte character sets. Consider the situation where the input stream is pointed at a well-formed multibyte character sequence: U+4467 U+27 Now given the sequence: (READ-BYTE port) (READ-CHAR port) what should READ-CHAR return? Note that the input cursor is not pointed at the start of a character. Either read-byte must resynchronize the stream (which is not possible in some character sets) or it must signal an error (which suggests that mixing READ-BYTE and READ-CHAR was an ill-conceived thing to permit). Resynchronizing is probably the right answer, for many reasons having to do with recovery from malformed input. Resynchronization behavior of READ-CHAR needs to be defined independent of whether it is conflated with READ-BYTE. If we conclude that resynchronizing is the right answer, then the SRFI should be augmented with something like: (SYNC-TO-CHAR port) which advances past bytes that cannot be the start of a character. Having such a procedure is probably a good idea in any case. But how about: (READ-BYTE port) (PEEK-CHAR port) (PEEK-BYTE port) Assume that we can resynchronize. The whole point of PEEK-CHAR is that it should not modify the input stream, but it is obliged to throw away the bad characters while resynchronizing. So what does READ-BYTE return? Should READ-CHAR do a non-destructive resynchronization? This is consistent, but *extremely* expensive for reasons I will discuss below under "implementation issues." Finally, consider operations like: (let ((port (open-input-string "some-string"))) (read-byte port)) does this imply the need to "explode" the string into bytes at the implementation level? As I have been re-implementing tinyscheme, I have been forced to conclude that it probably does. Aside: SRFI-6, and the scheme standard in general, are *extremely* sloppy about failing to specify interacting conditions. Things like: (let* ((s "abcdef") (p (open-input-string s))) (read-char p) (string-set! s 1 #\d) (read-char p) ; returns #\b or #\d?? ) should be well defined, but are not. Is the port opened on a *copy* of the string, or does it share state? If it shares state, note that exploding the string into its constituent bytes in order to implement READ-BYTE is a nightmare! And of course, matters get *really* fun when we consider WRITE-BYTE: (let ((s (open-output-string))) (write-byte s 255) ... (get-output-string s)) Unless the sequence of written bytes happens to lead to valid character encodings, it is unclear what get-output-string can safely return here. * Mixing Bytes and Character: Implementation Issues Assume, for the moment, a UTF-8 encoding for characters. Future scheme implementations might support other encodings, but this one is certainly one that should be able to work well. If we have a resynchronization mechanism, and the implementation chooses to provide legacy support for the ISO-10656 legacy character planes, then the implementation must provide up to 11 characters of "push back" (6 good ones and 5 bad ones). In the absence of legacy planes, I believe (but check me) that 7 characters of push-back is sufficient. The classic implementation of PEEK-CHAR in C has been ungetc(getc(f)) which is pleasant, because it avoids the need to re-implement STDIO from the ground up. This is also pleasant because it is possible to share the "file" abstraction with native code. If 7 or 11 characters of pushback are required, implementations are now forced to rebuild STDIO. The problem doesn't exist in the same way when bytes and characters are not mixed, because wide-character implementations of STDIO already exist, and have already extended the idea of push-back into wide character sets. Note, however, that these implementations *rely* on a modal separation between binary file descriptors and text file descriptors. * Implementation Experience One or two of you may have noticed that I've taken over tinyscheme. I've been working on various modernizations and fixes, among them the introduction of unicode character sets. Tinyscheme is primarily a scripting language, and so will deviate from scheme in some respects, but I'm trying to keep it as close as I can within the requirements of sensible and robust scripting practices. In the course of this, I decided to see if I could come up with a sensible implementation in which READ-BYTE and READ-CHAR could have sensible behavior when both can be applied to the same port. It is a complete mess. A clean implementation does not always imply a clean semantics, but a horrific implementation executed by a competent programmer (though perhaps I flatter myself) is usually a sign of a poor design choice. * A Caution Something I said earlier deserves stronger emphasis: not all character sets have the ability to resynchronize! Unicode, I promise you, will not be the last character set in the universe. It would be a great misfortune to tie the definition of the language to an attribute (re-synchronizability) that future character set designers are likely to get wrong. Summary: We need to add read-byte, write-byte, and friends, but we should firmly segregate character ports and byte ports. Byte ports should NOT support object I/O (in the form of READ/WRITE/DISPLAY, nor READ-CHAR). The atomic unit of transfer in a byte port should be the byte. The atomic unit of transfer in "classic" ports should be the character. shap