This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Thu, 22 Jan 2004 18:06:24 -0800 (PST)

> > In fact, READ-OCTET and WRITE-OCTET would in that case become primitive,
> > since READ-CHAR and DISPLAY could be implemented in terms of them but
> > the reverse would not be true.
>
> > This neatly sidesteps the issue of needing character mappings for
> > every member of the range 128-255, and separates the ideas of octet
> > and character at the lowest level.
>
> Hmm. Well, an example of what it fails to sidestep is the issue of
> making the values representable by the C `char' type a subset of CHAR?
> It's also a fairly sorry approach to take for implementing many
> network protocols in a way that is simple, direct, "tolerant of what
> it receives".

Hm, I now see an advantage in Tom's approach.

I've written code for an email filter program (with Bayesian spam filtering, of course :-). It reads an RFC2822 message header and builds an assoc list of header fields, dealing with folded header lines. Although RFC2822 specifies that the field bodies of message headers should contain only US-ASCII characters (except CR and LF), there are messages that have other octets within the header.

With Tom's approach, in which a character can also be used to represent an octet, one can probably set the input port's encoding mode to "raw" or something similar (assuming the port has a character set conversion feature), then let read-char retrieve each octet. In that case, you need to do the "encoding conversion" over the string afterwards (potentially performing "encoding guessing" before that). A string that contains characters in the range 128-255 might be unprintable as is, but the implementation can have some escaped format for them.

The approach I'm taking is to read the header field as an octet stream and construct an octet string, which is a special type of string that can contain any octet sequence.
After I do the necessary processing, I convert the octet string to produce a valid string, which contains only legal characters. The benefits of an octet string are that (1) it is fast to convert to the underlying byte string, and (2) you can always tell whether the string is "safe and normal" or not. (1) is important for some applications, for example a program that does lots of UDP packet sending and receiving; such "block" read/write is done with either octet strings or uniform vectors in my Scheme, and they are fast because they directly grab the C buffer.

However, I do feel that the presence of octet strings is ad hoc. Tom's approach does have conceptual cleanness, although the programmer probably has to be careful about the state of the string object she is dealing with (i.e. whether it has been converted or not).

--shiro
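
To make the two approaches above concrete, here is a minimal sketch. It is written in Python rather than Scheme, so that it does not depend on any particular implementation's octet-string API, and it is only an illustration, not the code of the filter program. The sample header and the names parse_header, is_safe, to_string, read_raw and reconvert are all hypothetical, and the "encoding guessing" is assumed to be a simple "try UTF-8, else fall back to Latin-1" rule.

```python
import re

# Hypothetical sample: the Subject body contains a raw octet 0xE9,
# which RFC 2822 does not allow, and the field is folded over two lines.
RAW_HEADER = (
    b"From: someone@example.org\r\n"
    b"Subject: caf\xe9 menu\r\n"
    b"\tfor Friday\r\n"
    b"\r\n"
)


def parse_header(raw):
    """Octet-string approach: build an assoc list of (name, body) octet strings."""
    head = raw.split(b"\r\n\r\n", 1)[0]
    # Unfold: a CRLF followed by a space or tab continues the previous field.
    unfolded = re.sub(rb"\r\n(?=[ \t])", b"", head)
    fields = []
    for line in unfolded.split(b"\r\n"):
        if not line:
            continue
        name, _, body = line.partition(b":")
        fields.append((name.strip().lower(), body.strip()))
    return fields


def is_safe(octets):
    """True when the octet string holds only US-ASCII octets."""
    return octets.isascii()


def to_string(octets):
    """Convert an octet string to a character string (assumed guessing rule)."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode("latin-1")


def read_raw(raw):
    """Character-as-octet approach: each octet becomes the character 0-255."""
    return raw.decode("latin-1")


def reconvert(raw_string):
    """Later conversion of a raw-mode string into a properly decoded string."""
    return to_string(raw_string.encode("latin-1"))


fields = parse_header(RAW_HEADER)
subject = dict(fields)[b"subject"]
print(is_safe(subject), repr(to_string(subject)))
```

In the octet-string version the type itself (bytes versus str) tells you whether conversion has happened. In the raw-mode version both states are ordinary strings, so the caller has to remember whether reconvert has already been applied.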