
READ-OCTET (Re: strings draft)

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.



From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Thu, 22 Jan 2004 18:06:24 -0800 (PST)

>     > In fact, READ-OCTET and WRITE-OCTET would in that case become primitive,
>     > since READ-CHAR and DISPLAY could be implemented in terms of them but
>     > the reverse would not be true.
> 
>     > This neatly sidesteps the issue of needing character mappings for
>     > every member of the range 128-255, and separates the ideas of octet
>     > and character at the lowest level.
> 
> Hmm.  Well, an example of what it fails to sidestep is the issue of
> making the values representable by the C `char' type a subset of CHAR?
> It's also a fairly sorry approach to take for implementing many
> network protocols in a way that is simple, direct, "tolerant of what
> it receives".

Hm, I now see an advantage in Tom's approach.

I've written code for an email filter program (with Bayesian
spam filtering, of course :-).  I read an RFC2822 message header
and build an assoc list of header fields, dealing with folded
header lines.  Although RFC2822 specifies that the field bodies
of message headers should contain only US-ASCII characters
(except CR and LF), there are messages that have other octets
within the header.
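
To make the problem concrete, here is a rough sketch in Python
(not the Scheme code I describe above) of building such an assoc
list directly from octets.  The name parse_header is made up for
illustration, and the unfolding is simplified: a line starting
with a space or tab is appended to the previous field body.

```python
def parse_header(raw: bytes) -> list[tuple[bytes, bytes]]:
    """Return an assoc list of (field-name, field-body) pairs, kept as octets.

    Hypothetical helper for illustration; assumes CRLF line endings.
    """
    fields: list[tuple[bytes, bytes]] = []
    for line in raw.split(b"\r\n"):
        if line == b"":
            break  # an empty line ends the header section
        if line[:1] in (b" ", b"\t") and fields:
            # Folded continuation line: join it to the previous field body.
            name, body = fields[-1]
            fields[-1] = (name, body + line)
        else:
            name, _, body = line.partition(b":")
            fields.append((name, body.strip()))
    return fields
```

Because everything stays as octets, a stray byte such as 0xE9 in
a field body passes through untouched instead of raising a
decoding error while the header is being read.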

With Tom's approach, in which a character can also be used to
represent an octet, one can probably set the input port's encoding
mode to "raw" or something similar (assuming the port has a
character set conversion feature), then let read-char retrieve
each octet.  In that case, you need to do the "encoding conversion"
over the string afterwards (potentially performing "encoding
guessing" before that).  A string that contains characters in the
range 128-255 might be unprintable as is, but the implementation
can have some escaped format for them.
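
As a rough analogy in Python (not Scheme, and not something Tom
proposed), a "raw" port can be imitated with the latin-1 codec,
which maps each octet 0-255 to the character with the same code.
The function names below are made up, and the "guessing" is only
a naive try-each-candidate loop standing in for real detection.

```python
import io


def read_raw_chars(port) -> str:
    """Read every octet from a binary port as one character each ("raw" mode)."""
    text = io.TextIOWrapper(port, encoding="latin-1", newline="")
    chars = []
    while True:
        c = text.read(1)  # plays the role of read-char
        if c == "":
            break  # end of input
        chars.append(c)
    text.detach()  # leave the underlying port open for the caller
    return "".join(chars)


def convert_afterwards(raw_string: str, candidates=("utf-8", "iso-8859-1")) -> str:
    """Recover the octets from a raw string, then decode with the first fit."""
    octets = raw_string.encode("latin-1")  # exact inverse of the raw read
    for encoding in candidates:
        try:
            return octets.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")
```

Note that nothing in the type of the result tells you whether
convert_afterwards has already been applied to a given string.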

The approach I'm taking is to read the header field as an
octet stream and construct an octet string, which is a special
type of string that can contain any octet sequence.  After
I do the necessary processing, I convert the octet string
into a valid string, which contains only legal characters.
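
In Python terms (again only an analogy, not my Scheme code),
bytes plays the role of the octet string and str the role of
the valid string.  The helper name and the choice to escape
illegal octets rather than reject them are assumptions made for
this sketch.

```python
def field_to_string(octets: bytes, encoding: str = "us-ascii") -> str:
    """Convert an octet string to a valid string.

    Hypothetical policy: octets that are illegal in `encoding` are
    rendered as backslash escapes instead of causing an error.
    """
    return octets.decode(encoding, errors="backslashreplace")


def is_safe(value) -> bool:
    """The type alone says whether conversion has already happened."""
    return isinstance(value, str)
```

For example, is_safe(b"caf\xe9") is false, while the result of
field_to_string(b"caf\xe9") is an ordinary string for which
is_safe is true.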

The benefits of an octet string are that (1) it is fast to convert
to the underlying byte string, and (2) you can always tell whether
the string is "safe and normal" or not.
(1) is important for some applications, for example a program
that does lots of UDP packet sending and returning.  Such "block"
read/write is done with either an octet string or a uniform vector
in my Scheme, and it is fast because it directly grabs the C
buffer.

However, I do feel that the presence of the octet string is ad hoc.
Tom's approach does have conceptual cleanness, although the
programmer probably has to be careful about the state of the string
object she is dealing with (i.e. whether it has been converted
or not).

--shiro