
Re: constant-time access to variable-width encodings



Shiro Kawai wrote:
> I feel a bit uncomfortable, though, with the fact that indexes and
> string-length differ among different implementations, or even in the
> same implementations with different character encodings.

I'm assuming a single character encoding per implementation: either UTF-8, UTF-16, or a plain array of 21-bit characters. Supporting general character encodings is problematic, since you cannot always tell whether a given byte is the initial byte of a character or a subsequent byte (that is, part of a character).
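
UTF-8 is a case where that test is easy, because subsequent bytes always fall in the range #x80 to #xBF. A minimal sketch, assuming the string's bytes are held in a vector of integers; utf8-continuation-byte? and utf8-char-start are hypothetical names used only for illustration:

```scheme
;; True if b is a UTF-8 continuation (non-initial) byte, i.e. 10xxxxxx.
(define (utf8-continuation-byte? b)
  (and (>= b #x80) (< b #xC0)))

;; Back up from an arbitrary byte offset to the offset of the
;; initial byte of the character containing it.
;; Assumes bytes holds valid UTF-8.
(define (utf8-char-start bytes off)
  (if (utf8-continuation-byte? (vector-ref bytes off))
      (utf8-char-start bytes (- off 1))
      off))
```

In an encoding where a subsequent byte can have the same value as an initial byte, no such local test exists.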

In explaining/specifying my proposal it might be useful to add:
(define (char-representation-size ch)
  ;; Implementations will do this more efficiently!
  (string-length (make-string 1 ch)))
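
As an illustration of "more efficiently": the size can be computed from the code point alone, without allocating a string. This is only a sketch, assuming code points are passed as integers; utf8-size and utf16-size are hypothetical names, not part of the proposal:

```scheme
;; Number of bytes needed to encode a Unicode code point in UTF-8.
(define (utf8-size code-point)
  (cond ((< code-point #x80) 1)
        ((< code-point #x800) 2)
        ((< code-point #x10000) 3)
        (else 4)))

;; Number of 16-bit units needed to encode a code point in UTF-16:
;; code points above #xFFFF need a surrogate pair.
(define (utf16-size code-point)
  (if (< code-point #x10000) 1 2))
```

For example, (utf8-size #x20AC) is 3 while (utf16-size #x20AC) is 1, which is exactly the difference in string-length between implementations that Shiro is pointing at.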

> It makes a datastructure that holds a string and its indexes non-portable, for example.

I can see an issue if you try to write that out using one implementation, and read it back in with another. Not sure how important that is.

> I'd agree with the proposal if it introduces a different means of
> indexing, other than the character count used for string-ref.  Call it
> 'offset' for now.  string-offset-ref, substring-offset etc. would
> provide offset-based operations, while string-ref, substring etc.
> work on character indexes.

That might be reasonable.  But ...

> Though it may be too cumbersome for
> the core language.

Well, the complication is that existing code will be less efficient, and people have a choice between using string-ref (portable to R5RS but potentially slow) and string-offset-ref (portable to R6RS only but fast).
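
To make the cost difference concrete, here is a rough sketch for a UTF-8 implementation. It assumes the string is a vector of byte values and returns code points as integers; the names utf8-offset-ref, utf8-index->offset and utf8-ref are hypothetical stand-ins for string-offset-ref and string-ref, not proposed API:

```scheme
;; Size in bytes of the character whose initial byte is b.
;; Assumes b is a valid UTF-8 initial byte.
(define (lead-byte-size b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; Offset-based access: decode the character starting at byte
;; offset off.  Constant time, at most four bytes are read.
(define (utf8-offset-ref bytes off)
  (let* ((b (vector-ref bytes off))
         (n (lead-byte-size b)))
    (let loop ((i 1)
               (acc (cond ((= n 1) b)
                          ((= n 2) (- b #xC0))
                          ((= n 3) (- b #xE0))
                          (else (- b #xF0)))))
      (if (= i n)
          acc
          (loop (+ i 1)
                (+ (* acc 64)
                   (- (vector-ref bytes (+ off i)) #x80)))))))

;; Convert a character index to a byte offset by scanning from the
;; start of the string.  Linear in k.
(define (utf8-index->offset bytes k)
  (let loop ((off 0) (k k))
    (if (= k 0)
        off
        (loop (+ off (lead-byte-size (vector-ref bytes off)))
              (- k 1)))))

;; Character-index access: linear, because of the scan.
(define (utf8-ref bytes k)
  (utf8-offset-ref bytes (utf8-index->offset bytes k)))
```

A loop that calls utf8-ref for every index is quadratic in the length of the string, while one that steps an offset forward by lead-byte-size each time is linear.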

An alternative idea is to have a cache that holds the most recent (char index, offset) mapping. One problem is that even an immutable string then requires a mutable cache, with possible synchronization issues.
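
A rough sketch of such a cache, again assuming UTF-8 bytes in a vector. The representation (a pair of the byte vector and a two-slot cache vector) and the names make-cached and cached-index->offset are hypothetical, chosen only to show the idea:

```scheme
;; Size in bytes of the character whose initial byte is b.
;; Assumes b is a valid UTF-8 initial byte.
(define (lead-byte-size b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; A cached string: the bytes plus a two-slot vector holding the
;; most recent (char index, offset) pair, initially (0, 0).
(define (make-cached bytes)
  (cons bytes (vector 0 0)))

;; Convert character index k to a byte offset.  The scan starts from
;; the cached position when k is at or beyond it, otherwise from the
;; start of the string.  The result is stored back in the cache.
(define (cached-index->offset cs k)
  (let* ((bytes (car cs))
         (cache (cdr cs))
         (use-cache? (>= k (vector-ref cache 0))))
    (let loop ((i (if use-cache? (vector-ref cache 0) 0))
               (off (if use-cache? (vector-ref cache 1) 0)))
      (if (= i k)
          (begin
            ;; This mutation is the synchronization hazard: two
            ;; threads could interleave these two stores.
            (vector-set! cache 0 i)
            (vector-set! cache 1 off)
            off)
          (loop (+ i 1)
                (+ off (lead-byte-size (vector-ref bytes off))))))))
```

Sequential forward access then costs one character step per call, but going backwards still rescans from the start in this version.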

> And this is too much of a variable-length-character-centric
> API, which fixed-length character implementations or other
> implementations (such as a tree of segments) wouldn't care much about.

Not sure I see your point. Certainly a more complex data structure is appropriate for (say) a text editor, especially once you support character "attributes".
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/