Re: Strings/chars

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.




    > From: Alex Shinn <foof@xxxxxxxxxxxxx>

    > Shiro's proposal is well thought out, handles encoding simply,
    > and is based on real working practice in Gauche.

    > The main complication is that Scheme strings don't necessarily
    > have anything to do with C strings.  Shared substrings, in fact,
    > are not C strings as already acknowledged by the API and strings
    > as lists or Boehm cords aren't even consecutive memory
    > references.  Handle these issues and the only thing left for
    > Unicode is to specify the default encoding (and an advanced SRFI
    > could specify fetching w/ alternate encodings for efficiency).


My own thinking in this area isn't fully cooked yet but let me make a
few general observations.


* portable FFI vs. native FFI

  It's worth keeping clear the difference between an FFI for writing
  code portable across multiple implementations vs. an FFI exposing
  the full glory of a particular implementation.

  In a portable FFI, we can tolerate moderate inefficiencies, loss of
  generality, and all kinds of sins -- just so long as the result
  really is portable and really is enough to write useful code
  in a large number of cases.

  In terms of strings, I like the idea of ALLOCATE_COPY_OF rather than 
  EXTRACT:  function(s) that give you copies of strings or parts of
  strings, in whatever encoding you like (from a small set), but 
  which don't share state with the actual Scheme string and do have 
  to be explicitly freed.

  That's at least enough to be able to, for example, get the name of a
  file you're supposed to open.
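
  To make the ALLOCATE_COPY_OF idea concrete, here is a rough sketch in
  C.  Everything in it is hypothetical: the names demo_scheme_string and
  demo_allocate_utf8_copy_of are invented for illustration and are not
  part of SRFI-50 or of any implementation, and the "Scheme string" is
  assumed to be a flat array of code points, which real implementations
  need not use.  It only shows the shape of the contract: the caller
  gets a private copy in a chosen encoding (UTF-8 here) and must free it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a Scheme string: a flat run of code points.
   Real implementations may use ropes, shared substrings, and so on. */
typedef struct {
    const uint32_t *code_points;
    size_t length;                  /* in characters, not bytes */
} demo_scheme_string;

/* Return a freshly malloc'd, NUL-terminated UTF-8 copy of the characters
   in [start, end).  The copy shares no state with the Scheme string and
   the caller must free() it.  Returns NULL on a bad range, an invalid
   code point, or allocation failure.  (A U+0000 character would end the
   C string early; a real API would have to decide how to handle that.) */
char *demo_allocate_utf8_copy_of(const demo_scheme_string *s,
                                 size_t start, size_t end)
{
    if (start > end || end > s->length)
        return NULL;

    /* Worst case: 4 bytes per code point, plus the terminating NUL. */
    char *buf = malloc((end - start) * 4 + 1);
    if (buf == NULL)
        return NULL;

    char *p = buf;
    for (size_t i = start; i < end; i++) {
        uint32_t c = s->code_points[i];
        if (c < 0x80) {
            *p++ = (char)c;
        } else if (c < 0x800) {
            *p++ = (char)(0xC0 | (c >> 6));
            *p++ = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            if (c >= 0xD800 && c <= 0xDFFF) {   /* surrogates are invalid */
                free(buf);
                return NULL;
            }
            *p++ = (char)(0xE0 | (c >> 12));
            *p++ = (char)(0x80 | ((c >> 6) & 0x3F));
            *p++ = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x110000) {
            *p++ = (char)(0xF0 | (c >> 18));
            *p++ = (char)(0x80 | ((c >> 12) & 0x3F));
            *p++ = (char)(0x80 | ((c >> 6) & 0x3F));
            *p++ = (char)(0x80 | (c & 0x3F));
        } else {                                /* beyond Unicode range */
            free(buf);
            return NULL;
        }
    }
    *p = '\0';
    return buf;
}
```

  A caller that needs a file name would take the copy, hand it to
  fopen() or similar, and then free() it.  It never holds a pointer into
  the Scheme heap, so the collector is free to move or share the string.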



* indexes are a total nightmare

  Let's suppose a C function wants to hand Scheme the return value of
  mb_strlen.   Or that Scheme wants to hand C a "string index".

  Total train wreck.
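
  A small illustration of why, as a sketch only.  It assumes the C side
  holds well-formed UTF-8; the helper names are made up for this
  example.  The point is that a byte count, a character count, and a
  Scheme string index are three different numbers, and converting
  between them costs a scan of the string.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated, well-formed UTF-8 string by
   counting the bytes that are not continuation bytes (10xxxxxx). */
size_t demo_utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Translate a character index into a byte offset.  This is O(n) in the
   length of the string.  An index equal to the character count maps to
   the offset of the terminating NUL; anything larger yields (size_t)-1. */
size_t demo_utf8_byte_offset(const char *s, size_t char_index)
{
    size_t seen = 0;
    for (size_t i = 0; ; i++) {
        unsigned char b = (unsigned char)s[i];
        if (b == 0)
            return seen == char_index ? i : (size_t)-1;
        if ((b & 0xC0) != 0x80) {       /* first byte of a character */
            if (seen == char_index)
                return i;
            seen++;
        }
    }
}
```

  For the string "a\xC3\xA9z" (an "a", an e-acute, and a "z"), strlen()
  reports 4 bytes while the character count is 3, and character index 2
  lives at byte offset 3.  Neither side can use the other's number
  without knowing which kind it is and in which encoding it was
  measured.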



* the real problem is C and C libraries

  The standard C facilities for large character sets are fairly lame.
  The de facto standard practice of using UTF-8 for everything is
  limiting.   Indeed, there are no standard libraries for things such 
  as ropes, edit buffers, and so forth.

  It's beyond the scope of SRFI-50 but I think that in the longer
  term, as we build these next generation Schemes with good Unicode
  support, an interesting possibility is to aim for a run-time system
  that doubles as a next-generation C library for Unicode text
  manipulation.


-t