[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: shared-text substrings



   I just doubt it's worth the tradeoff.  Is there hard data that the
   optimizations you envision actually give significant performance
   gains?  I've always found non-shared strings plenty fast.

Let us suppose we read in a line of text, allocating a fresh string for it.
We are working in a Scheme w/o shared-text substrings. We wish to trim
whitespace from both ends of the string. Then we will throw away the original
string, and further process the trimmed string.

Let us further suppose that most of the time, strings do not have extra
whitespace on the ends; we are trimming to ensure an invariant which is
rarely broken by the input data.

The string-trimming procedures are allowed simply to return the input
string if they do no trimming. Requiring them to redundantly copy the
data is bogus. It wastes space and it wastes time. It can waste a *lot*
of space, if I'm processing a long, long file of data. You get into doing
that, and suddenly you care that this extra copy happens. I've blown out
my heap in S48 because of the *one* extra copy of a really big mother
string (in cases where I read an entire file into a string, processed it
with regexp stuff, and blatted output into a string port... which keeps
lists of strings around, under the hood. Just one copy too many.)

Copying data from one buffer to another is something that happens a lot in
data processing. If you need a fresh copy, great. But if you are processing
the data in a linear fashion, then you'd like to allow the underlying system
to reuse storage. Which, you'll note from the above example, does *NOT*
require shared-text substrings! In some cases, you can simply return the
original string. This is not as generally applicable a win, but one that can
be provided even in Schemes w/o shared-text substring support. My portable
reference implementation of SRFI-13 has these optmisations built-in. Another
example of this is STRING-APPEND/SHARED applied to some strings, all but one
of which are empty. Or STRING-CONCATENATE/SHARED applied to a singleton
list. When singleton lists are the common or important case, it's nice to be
able to pick up the win, without having to lard your code up with explicit
checks.

I have code in scsh that does *exactly* this kind of thing. I have to
throw in a lot of obfuscatory code, which special-cases the basic op.
I'd love to be able to simplify that stuff, and be able to rely on the
string system. I can't rely on having shared-text substrings -- S48
doesn't have them, and it would destroy any possibility of porting to
other Schemes that lack them, in any event.

You don't *have* to use these procedures. You can always, in a GC'd,
functional language, play it safe and use pure functions. These procedures
simply give you a portable way to write efficient code. I like to write
efficient code. I have been burned using inefficient code.

I didn't throw these procedures into the library from scratch, hoping they'd
turn out useful. They arose from programming.
    -Olin