This page is part of the web mail archives of SRFI 91 from before July 7th, 2015.
[By the way, my profound apologies for failing to update the subject
line!]

On Tue, 2006-05-09 at 11:32 -0400, John Cowan wrote:
> A few nits to an otherwise well-reasoned argument:
>
> Jonathan S. Shapiro scripsit:
>
> > A note: I'm assuming in all of this that scheme will move to an
> > international character set. The problems I am about to discuss do not
> > manifest in a system implementing only a 7-bit or 8-bit character set.
>
> But they do manifest quite well in 16-bit and 24-bit national character
> sets, so even avoiding Unicode doesn't avoid the problem.

I agree completely.

> > We need to add read-byte, write-byte, and friends, but we should firmly
> > segregate character ports and byte ports. Byte ports should NOT support
> > object I/O (in the form of READ/WRITE/DISPLAY, nor READ-CHAR). The
> > atomic unit of transfer in a byte port should be the byte. The atomic
> > unit of transfer in "classic" ports should be the character.
>
> I agree absolutely, and would add:
>
> We need standard procedures that take a byte port and a representation of
> an encoding and return a character port.

I agree that this would be nice to have, but I think that the presence
of PEEK-BYTE and PEEK-CHAR makes this problematic because of the need
for multibyte lookahead. Further, I don't think that this can be
implemented correctly as a non-primitive mechanism. Here are the issues
that I see:

1. Can you suggest a feasible implementation that does not demand 7
   bytes of pushback on the byte port?

2. If the character port is an overlay on the byte port, then problems
   will arise in concurrent implementations. It will become necessary
   for the character port implementation to obtain a lock on the byte
   port so that no calls to READ-BYTE or PEEK-BYTE from a second thread
   are allowed to interleave.

The second point has exceptionally unpleasant consequences if the reader
in the lock-holding thread manages to exhaust heap space without
completing the operation.
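To make the lookahead concern in points 1 and 2 concrete, here is a
minimal sketch, written in Python purely for illustration. The class
Utf8CharPort and the helper utf8_sequence_length are hypothetical names,
not part of SRFI 91 or any Scheme report, and the sketch assumes a byte
port with no pushback and UTF-8 as the only encoding. Because the byte
port cannot un-read, peeking one character forces the overlay to consume
the whole multibyte sequence and hold the decoded character itself.

```python
import io


def utf8_sequence_length(lead: int) -> int:
    """Return the total length of a UTF-8 sequence, given its lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


class Utf8CharPort:
    """Hypothetical character port layered over a byte port (assumption)."""

    def __init__(self, byte_port):
        self._bytes = byte_port
        self._peeked = None  # one decoded character of lookahead, or None

    def _decode_next(self):
        first = self._bytes.read(1)
        if not first:
            return ""  # end of input
        rest_len = utf8_sequence_length(first[0]) - 1
        rest = self._bytes.read(rest_len)
        if len(rest) != rest_len:
            raise ValueError("truncated UTF-8 sequence")
        return (first + rest).decode("utf-8")

    def peek_char(self):
        # The bytes are already gone from the byte port after this call;
        # only this object still remembers the character.
        if self._peeked is None:
            self._peeked = self._decode_next()
        return self._peeked

    def read_char(self):
        ch = self.peek_char()
        self._peeked = None
        return ch


raw = io.BytesIO("€x".encode("utf-8"))  # the euro sign is 3 bytes in UTF-8
port = Utf8CharPort(raw)
print(port.peek_char())  # €
print(raw.tell())        # 3
print(raw.read(1))       # b'x'
```

After the peek, a direct byte read on the underlying port skips the
peeked character entirely. In this sketch nothing stops a second thread
from doing exactly that, which is the interleaving that a lock on the
byte port would have to prevent.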
In addition to this, there is another issue: we should not inadvertently
mandate that there should be no embedded Scheme implementations.
Realizing your desire implies that the Scheme runtime must carry some
*very* large compiled-in tables. Other cases might be omitted from a
given implementation, but the proposal to support UTF-8 encoded Unicode
drags in many *megabytes*.

This is not an issue with character ports per se, but it *is* an issue
raised by READ and case-insensitive symbol name matching. The downcasing
(or, if preferred, upcasing) rule tables are several megabytes. My
preference would be to resolve this by declaring that R6RS is going to
make a break and use case-sensitive symbol matching, but this will
undoubtedly provoke holy wars on both sides. In keeping with this, I
would actually like to remove the -ci- comparison operations from the
core and relocate these to a library.

So: if your proposal is to be implemented, I think that it should be in
a library, not in the core, and I think it demands some consideration of
a reconciliation of multithreading and PEEK-CHAR/PEEK-BYTE. My opinion
on that: don't reconcile them, acknowledge that the use case in which
the byte port will remain accessible is rare, and leave people who are
engaged in multithreading to implement their own thread-respecting
wrappers around raw byte ports.

Finally, do *not* allow the standard input and output ports to be byte
ports.

shap