Re: Specification vs. Implementation

Michael Sperber wrote:
> Per Bothner <per@xxxxxxxxxxx> writes:
>>I don't buy this at all.  Why can't I replace:
>>  (make-simple-reader id descriptor etc ...)
>>  (make-simple-input-port id descriptor etc ...)
>>I.e. why can't I merge the functionality of readers into
> You could do that, but you'd have eliminated only one procedure call.
> Moreover, it's unclear to me what you've gained.

You may be missing my point.  The "make-simple-input-port" function
wouldn't call make-simple-reader because there would be no simple
reader type anymore.  The input-port would be an object with the
necessary "methods".  What we're gaining is removing an extra level of
indirection and layering for *all* operations.

> It still wouldn't make the primitive layer obsolete.

I think it would.  The point is there is no need for a separate
primitive layer.  At least I haven't seen such a need, and there are
good reasons for avoiding it, primarily simplicity, both in API and
implementation.  The latter can allow better performance.

> You've stated that you think you can get their performance, and lack
> of buffering from the ports layer, but you haven't demonstrated how to
> do it---it's certainly not straightforward, given the buffering
> inherent in things like PEEK-CHAR.

Agreed.  If character operations are performed, then we require at
least some buffering.  And we can't just use the system block-boundary
buffers, since characters can straddle block boundaries.

> It'll take a concrete proposal to convince me to change the layout.

Here are some preliminary thoughts.

In pre-JDK-1.4 Java (i.e. without direct access to the translation API)
you'd still need multiple layers (i.e. a separate Reader for chars
and an InputStream for bytes) but the layers would be driven by
implementation considerations, and not constrained by the Scheme API.
I.e. an input-port might be an object that has both an InputStream
and a Reader.

I think we should require only these modes:
* A program can arbitrarily switch back and forth between bytes and
UTF-8 or Latin-1 chars.  In that case the Scheme "ports" can do the
conversion directly without depending on external conversion APIs.

Such a port would basically be a binary port with a byte buffer (a
blob).  (An unbuffered port still has a one-byte buffer.)  The
current position is a byte position, indicated by an offset into
the blob.  Reading bytes is obvious; reading chars is also more-or-less so.
Handling peek-char is trickier if the blob only contains a
partial character.  We can implement this using an extra short buffer,
which can be represented by a fixnum field.  We can also use negative
blob indexes for when the current position is in the short buffer.

Suppose the current position is before the last byte in the buffer,
and that byte is the first byte of a multi-byte character.
peek-char gets the byte, fills the buffer, and looks at enough
bytes in the new "block" to determine the character.  It saves
the byte from the previous block in the short buffer, and notes
that the offset in the current block is -1 - i.e. one byte *before*
the start of the current block.  A subsequent read or peek operation
notes this negative offset, and gets the data from the short buffer.
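
To make the block-boundary case concrete, here is a rough sketch in
Python (used only because it is compact; nothing here is tied to a
particular Scheme implementation).  The names BytePort, utf8_length
and block_size are made up for illustration, the short buffer is a
bytes object rather than a fixnum, and only UTF-8 is handled, with
no validation of continuation bytes.

```python
import io

def utf8_length(first_byte):
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")

class BytePort:
    """Hypothetical input port: one block buffer plus a short carry-over buffer."""

    def __init__(self, source, block_size=4):
        self.source = source        # any object with read(n) -> bytes
        self.block_size = block_size
        self.block = b""            # the current block (the "blob")
        self.short = b""            # unread bytes carried over from earlier blocks
        self.pos = 0                # offset into block; negative = inside short

    def _byte_at(self, offset):
        if offset < 0:
            return self.short[len(self.short) + offset]
        return self.block[offset]

    def _refill(self):
        """Load the next block, moving unread bytes into the short buffer."""
        new = self.source.read(self.block_size)
        if not new:
            return False            # end of input; state is left unchanged
        unread = bytes(self._byte_at(i) for i in range(self.pos, len(self.block)))
        self.short = unread
        self.block = new
        self.pos = -len(unread)     # e.g. -1: one byte before the new block
        return True

    def read_byte(self):
        if self.pos >= len(self.block) and not self._refill():
            return None
        b = self._byte_at(self.pos)
        self.pos += 1
        return b

    def peek_char(self):
        if self.pos >= len(self.block) and not self._refill():
            return None
        need = utf8_length(self._byte_at(self.pos))
        while len(self.block) - self.pos < need:
            if not self._refill():
                raise ValueError("truncated UTF-8 sequence")
        raw = bytes(self._byte_at(self.pos + i) for i in range(need))
        return raw.decode("utf-8")

    def read_char(self):
        ch = self.peek_char()
        if ch is not None:
            self.pos += len(ch.encode("utf-8"))
        return ch

# "é" is two bytes in UTF-8 and straddles the 4-byte block boundary here.
port = BytePort(io.BytesIO("abcé!".encode("utf-8")), block_size=4)
```

After three read_char calls, peek_char has to fetch the second block
to decode the character, and it leaves the position at -1, so byte
and char reads can still be mixed freely afterwards.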

Java input streams have a read-ahead mechanism, where you can mark
a position, read ahead, and then reset back to the mark.  This is
a useful feature for lexers/parsers.  It makes sense to combine
this read-ahead support with that needed for peek-char.
I.e. instead of a "short buffer" you have an extra "save buffer".

So I'd recommend using two buffers: the "system buffer" is block-sized,
properly aligned, etc, for actual I/O.  The system buffer is missing
(i.e. zero bytes long) for unbuffered files.  In addition, there is a
"save buffer" which is normally just a few bytes, but can grow
arbitrarily large, if we allow arbitrary look-ahead.  Conceptually, we
have a single buffer, consisting of the concatenation of the save
buffer and the system buffer.  (An implementation can use a single
buffer, but that will normally be less efficient.)  The current
position is an offset, where non-negative values point into the system
buffer, and negative values point into the save buffer.  The position
can always be reset to any point between the start of the save buffer
and the current position.  A peek is then a "mark current position as
a save-point", read-ahead, and then revert back to the saved position.
Relatively simple, efficient, and general.
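
The two-buffer scheme can be sketched the same way.  Again this is
Python for brevity only, and SavePort, mark, reset and block_size are
illustrative names, not a proposed API.  The sketch assumes UTF-8,
keeps a single mark slot (so peek_char cannot be used while a
caller's own mark is active), and always uses a system buffer, i.e.
it does not model the unbuffered case.

```python
import io

def utf8_length(first_byte):
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")

class SavePort:
    """Hypothetical input port with a system buffer and a growable save buffer."""

    def __init__(self, source, block_size=4096):
        self.source = source      # any object with read(n) -> bytes
        self.block_size = block_size
        self.system = b""         # block-sized buffer used for actual I/O
        self.save = b""           # bytes retained from earlier blocks
        self.pos = 0              # >= 0: index into system; < 0: index into save
        self.marked = None        # saved position, or None if no mark is set

    def _refill(self):
        new = self.source.read(self.block_size)
        if not new:
            return False          # end of input; state is left unchanged
        # Keep everything from the mark (or the current position) onward.
        keep_from = self.pos if self.marked is None else self.marked
        combined = self.save + self.system
        self.save = combined[keep_from + len(self.save):]
        # Old offsets are now relative to the start of the new system buffer.
        shift = len(self.system)
        self.pos -= shift
        if self.marked is not None:
            self.marked -= shift
        self.system = new
        return True

    def mark(self):
        self.marked = self.pos

    def reset(self):
        self.pos = self.marked
        self.marked = None

    def read_byte(self):
        if self.pos >= len(self.system) and not self._refill():
            return None
        # A negative pos counts back from the end of the save buffer.
        b = self.save[self.pos] if self.pos < 0 else self.system[self.pos]
        self.pos += 1
        return b

    def read_char(self):
        first = self.read_byte()
        if first is None:
            return None
        raw = bytes([first])
        for _ in range(utf8_length(first) - 1):
            nxt = self.read_byte()
            if nxt is None:
                raise ValueError("truncated UTF-8 sequence")
            raw += bytes([nxt])
        return raw.decode("utf-8")

    def peek_char(self):
        # A peek is: mark, read ahead, revert to the saved position.
        self.mark()
        try:
            return self.read_char()
        finally:
            self.reset()
```

With this layout the same mark/reset pair serves both peek-char and
longer look-ahead for a lexer; the save buffer only grows while a
mark is outstanding and shrinks back to nothing on the next refill
after the mark is released.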

Output ports don't have the same complications.  (They may have
other complications, such as pretty-printing and handling cycles,
but mixing binary and text in different encodings is at least
conceptually straight-forward.)

* A program can switch from reading/writing bytes to reading/writing
chars in an arbitrary supported encoding, but cannot necessarily switch
back, or switch to a different encoding, unless the encoding is UTF-8
or Latin-1.  In that case the implementation can layer an
implementation character stream on top of an implementation byte
stream.  This is fairly straightforward in Java, but some
implementations may have difficulty if there have been any byte
operations before the first char operations.  So it should be
permissible for an implementation to not support *any* byte/char mixing
(except for UTF-8 and Latin-1).
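
As an illustration of this second mode, the one-way switch looks
roughly like this in Python, using the standard io.TextIOWrapper as
the "implementation character stream".  The function name and the
two-byte header are invented for the example.

```python
import io

def read_header_then_text(stream, header_size, encoding):
    """Read some raw bytes, then switch (one way) to reading chars."""
    header = stream.read(header_size)
    # Layer a character stream on top of the byte stream, starting at
    # the current byte position.
    chars = io.TextIOWrapper(stream, encoding=encoding)
    return header, chars.read()

data = b"\x02\x00" + "héllo".encode("utf-16-le")
header, text = read_header_then_text(io.BytesIO(data), 2, "utf-16-le")
```

Switching back is the hard direction: the character layer may have
read ahead and buffered more bytes than it has decoded, so the byte
position of the underlying stream no longer says where the next
unread byte is.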
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/