Update

This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 contain all messages, not just those from before July 7th, 2015.



I have altered the SRFI to make the following changes:
  - "prefix-count"/"suffix-count" lexeme changed to
    "prefix-length"/"suffix-length"
      These                         are now these
      ---------------------------   --------------------------
      string-prefix-count           string-prefix-length
      string-suffix-count           string-suffix-length
      string-prefix-count-ci        string-prefix-length-ci
      string-suffix-count-ci        string-suffix-length-ci
      substring-prefix-count        substring-prefix-length
      substring-suffix-count        substring-suffix-length
      substring-prefix-count-ci     substring-prefix-length-ci
      substring-suffix-count-ci     substring-suffix-length-ci

  - string-comparison functions now return a simple boolean

	string<>     string=    string<    string>    string<=    string>=
	string-ci<>  string-ci= string-ci< string-ci> string-ci<= string-ci>=

	substring=  substring<> substring-ci=  substring-ci<>
	substring<  substring>  substring-ci<  substring-ci>
	substring<= substring>= substring-ci<= substring-ci>=

    Note that these comparison functions still return a mismatch index:
	string-compare    substring-compare
	string-compare-ci substring-compare-ci

What follows is general discussion and replies to msgs from Oleg & Dan.
    -Olin

-------------------------------------------------------------------------------
    From: oleg@xxxxxxxxx

	    If I may I'd like to propose two more functions,
    string->integer  and string-split.

A general STRING-SPLIT is just too complicated for me. Here are the variants
it would have to support:
   - Variant grammars, e.g. tolerant-infix, strict-infix, and suffix
   - Optional substring indices
   - Number of fields to parse. You might want
	    - as many as exist,
	    - exactly N, or error,
	    - at least N, or error.
   - Do contiguous runs of delimiter chars make a single delimiter, or
     do they designate empty-string tokens?

Scheme makes it difficult to have different, independent sets of optional
args, since you have to order them. The "field parser" utilities I wrote
for scsh's awk utility (see the scsh manual) handle all this complexity,
and more -- you can specify tokens or delimiters with general regexps, not
just char sets. But this is much hairier machinery than I feel is appropriate
for a basic string library.

I dodged at least the grammar issue with STRING-TOKENIZE by having you
specify not the separator chars but the token chars -- contiguous runs
of token chars make a token. End of story. And I just punted the
number-of-fields issue, leaving only the substring indices as possible
optional args, so things worked out -- given my low ambitions.
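
The token-char idea above can be sketched in portable Scheme. This is only an
illustration under stated assumptions: TOKENIZE and TOKEN-CHAR? are
hypothetical names, a plain predicate stands in for a char set, and the
optional substring indices are left out.

```scheme
;; Sketch only: a predicate-based tokenizer in portable R5RS Scheme.
;; TOKENIZE and TOKEN-CHAR? are illustrative names, not the SRFI's API.
(define (tokenize s token-char?)
  (let ((len (string-length s)))
    (let loop ((i 0) (start #f) (acc '()))
      (cond ((= i len)
             ;; Flush a token that runs to the end of the string.
             (reverse (if start (cons (substring s start i) acc) acc)))
            ((token-char? (string-ref s i))
             ;; Inside a token: remember where it began.
             (loop (+ i 1) (or start i) acc))
            (start
             ;; First non-token char after a token: emit the token.
             (loop (+ i 1) #f (cons (substring s start i) acc)))
            (else
             ;; Still between tokens: skip this char.
             (loop (+ i 1) #f acc))))))

;; Usage: pick out the non-whitespace tokens.
(tokenize "  foo  bar baz "
          (lambda (c) (not (char-whitespace? c))))
```

Because a token is defined as a contiguous run of token chars, a run of
separators can never produce an empty-string token, so the grammar question
does not arise.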

    Some problems
    are more elegantly and efficiently expressed in terms of inclusion,
    some other are in terms of delimiting. I found for example that in
    Perl and Python split() is a rather often-used function.

Yeah, perl hackers use split() a lot, for sure. But the char-set SRFI provides
a CHAR-SET:GRAPHIC set, which makes it as easy to use STRING-TOKENIZE to pick
out non-whitespace tokens as it is for perl hackers to use split() to break
tokens at whitespace. So I really think STRING-TOKENIZE is going to take care
of you for the simple cases, and if you've got fancier requirements... then
you probably oughta code up a little custom parser for your app, anyway.

	    R5RS procedure string->number is far more generic than the
    proposed string->integer -- and this may be a problem IMHO.  For
    example, string->number will try to read strings like "1/2" "1S2"
    "1.34" and even "1/0" (the latter causing a zero-divide error). Note
    that to Gambit's string->number, "1S2" is a valid representation of an
    _inexact_ integer (100 to be precise).  Oftentimes we want to be more
    restrictive about what we consider a number; we want merely to read an
    integral label.

	    -- procedure+: string->integer STR START END

    Makes sure a substring of the STR from START (inclusive) till END
    (exclusive) is a representation of a non-negative integer in decimal
    notation. If so, this integer is returned. Otherwise -- when the
    substring contains non-decimal characters, or when the range from
    START till END is not within STR, the result is #f.

This is a can of worms. string->integer is undoubtedly useful. But so is
string->floating-point. What about base? Return #f or raise an error on
bad syntax?
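
For concreteness, here is one possible reading of the quoted proposal as a
portable Scheme sketch. It settles the open questions by assumption: base 10
only, #f rather than an error on bad syntax, and #f for an empty range (the
proposal does not say). It also assumes the digit characters have contiguous
char codes, as in ASCII.

```scheme
;; Sketch only: one reading of the proposed string->integer.
;; Assumptions: decimal only, #f on any failure, empty range gives #f,
;; and #\0 .. #\9 have contiguous codes (true for ASCII).
(define (string->integer str start end)
  (and (<= 0 start)
       (< start end)
       (<= end (string-length str))
       (let loop ((i start) (n 0))
         (if (= i end)
             n
             (let ((c (string-ref str i)))
               (and (char>=? c #\0)
                    (char<=? c #\9)
                    (loop (+ i 1)
                          (+ (* n 10)
                             (- (char->integer c)
                                (char->integer #\0))))))))))

;; Usage: read the integral label at the end of "abc123".
(string->integer "abc123" 3 6)
```

Each of those assumptions is a point where another caller could reasonably
want the opposite behaviour.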

Bornstein had a nice summary of the complexities involved:
    I don't like this particularly. I can think of a kabillion variants on
    parsing strings into numbers that I might find useful. The one that's
    built-in is the right one since it's about Scheme read form (which you
    gotta implement anyway). The moment you step into the territory of other
    number formats, you should be ready to define a full suite of procedures to
    deal with the plethora of possibilities.

    > [SRFI-13]
    > string-concatenate string-list -> string
    >     Append the elements of STRING-LIST together into a single _list_.
    >     Guaranteed to return a freshly allocated _list_.

    Did you mean to say a 'string' (instead of a _list_)?

Yes, you are quite correct. Thanks; I've fixed the text.

    SRFI-13 mentions that string-unfold is also called "anamorphism".
    Do you want to point out that a foldr combinator (e.g.,
    string-fold-right) is also called a "catamorphism"?

Excellent! Done.
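
A minimal sketch of the right-fold shape being referred to, in portable
Scheme. FOLD-RIGHT-OVER-STRING is a hypothetical name used only here, and the
sketch takes no optional substring indices.

```scheme
;; Sketch only: a right fold over the characters of a string.
;; KONS is applied to each char and the result accumulated so far,
;; working from the last character back to the first.
(define (fold-right-over-string kons knil s)
  (let loop ((i (- (string-length s) 1)) (acc knil))
    (if (< i 0)
        acc
        (loop (- i 1) (kons (string-ref s i) acc)))))

;; Usage: folding with CONS and '() rebuilds the string as a char list.
(fold-right-over-string cons '() "abc")
```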

-------------------------------------------------------------------------------
    From: Dan Bornstein <danfuzz@xxxxxxxx>

    Olin writes:
    >C'mon. Do you really think that people would use STRING-SET ?
    >STRING-FILL is an easier case to make. Let's see, that would be

    Actually, my suggestions come from actual use. The Scheme variant that I'm
    working on for work started out life as a functional-only system (that is,
    no mutable data *at all*), and I ended up implementing string-set and using
    it quite a bit. Do I have to rehash the issues of why working with
    immutable data can be a big win?

    Anyway, the straightforward implementation is simple:

	(define (string-set str k ch)
	  (set! str (string-copy str)) ; or substring or whatever
	  (string-set! str k ch)
	  str)

Careful with that axe, Eugene! Never use SET! unless you really need a 
true side-effect. Use LET:
	(define (string-set str k ch)
	  (let ((str (string-copy str)))
	    (string-set! str k ch)
	    str))

    and it (I know I harp on this) maintains the overall consistency of the
    library. More consistency means easier to learn and easier to understand.
    Big win.

I'm still maintaining that you are a freak with strange programming needs,
and that STRING-SET is really an uncommon op. Does anybody besides Dan
want to stand up for a pure-functional STRING-SET ?

    I'd actually just as soon drop string-fill! as add string-fill (I don't
    think I've ever had a compelling reason to use either), but I'm more in
    favor of doing one or the other than leaving the asymmetry. For the
    record:

	(define (string-fill str ch)
	   (make-string (string-length str) ch))

Uhh... I'll add STRING-FILL if I get more support for it. 

    >>[issue with string-copy and string-copy! not taking parallel args]
    >Yeah, you're right. However, your non-side-effecting STRING-COPY is subsumed
    >by the STRING-REPLACE Welsh proposes below. I think I'll leave things as-is.

    If by "as-is" you mean dropping the proposal for string-copy! then I'm for
    that. If you mean simply leaving your original proposal where the two
    procedures take different sets of args, then I'm against that. Again, I'm
    not against the particular functionality (which seems useful to me), just
    against calling two essentially different procedures by essentially similar
    names.

I mean (1) adding STRING-REPLACE, and (2) keeping both my STRING-COPY and
my STRING-COPY!. I recognise the non-parallelism, but do not think it's
a big deal. However,
  - I'm open to a better name for STRING-COPY or STRING-COPY!, to break
    the bogus parallelism. I've considered STRING-BLT and STRING-MOVE;
    don't think they're too good.
  - I'm open to being beaten on more by others who want to back Dan up.

    >[mismatch index with the (in)equality procedures] It turns out to be a
    >handy value to have around if you are comparing strings.

    However, requiring it means that implementations are precluded from using
    certain short-cut optimizations, in particular, = and <> can't return
    quickly based on the length of the arguments. I'm against returning
    mismatch indices in the standard (in)equality functions, but do see their
    benefit and would be in favor of specifying explicit
    mismatch-index-returning procedures, not just because of the above
    efficiency tweak but also because they would signal programmer intent. I
    don't have a strong opinion about what these functions would be named,
    "stringOP-mismatch-index" is an off-the-top-of-my-head suggestion.

	string=-mismatch-index
	string<-mismatch-index
	etc.

Oops. Precluding short-cut optimisations is a bad thing. Hmm. 

OK, here's my proposal: the string-length shortcuts are only available
for string= and string<>. So we will back out the mismatch-index
functionality from all the STRING=, STRING<, etc. functions -- they
now only return boolean values.

However, I know of no shortcuts for the STRING-COMPARE procedures that
are precluded by returning mismatch indices. So we'll leave that functionality
in place.
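
The trade-off can be sketched in portable Scheme. Both procedures below are
hypothetical illustrations (MY-STRING=? and MISMATCH-INDEX are made-up names),
not the SRFI's definitions.

```scheme
;; Sketch only: why a boolean result permits the length shortcut.
(define (my-string=? a b)
  ;; Boolean result: unequal lengths settle the answer
  ;; before any character is examined.
  (and (= (string-length a) (string-length b))
       (let loop ((i 0))
         (or (= i (string-length a))
             (and (char=? (string-ref a i) (string-ref b i))
                  (loop (+ i 1)))))))

(define (mismatch-index a b)
  ;; Index result: has to scan to the first differing position,
  ;; or to the end of the shorter string.
  (let ((n (min (string-length a) (string-length b))))
    (let loop ((i 0))
      (if (and (< i n) (char=? (string-ref a i) (string-ref b i)))
          (loop (+ i 1))
          i))))

;; Usage: the first call returns without reading a single char;
;; the second must walk the shared prefix to find the index.
(my-string=? "abc" "abcd")
(mismatch-index "abc" "abcd")
```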

Now programmers can choose what they want.

I *could* have restricted only = and <>, and left <, >, <=, >= alone --
but that seems a little ugly to me.

I have modified the SRFI to reflect this change.