Re: constant-time access to variable-width encodings
Shiro Kawai wrote:
> I feel a bit uncomfortable, though, with the fact that indexes and
> string-length differ among different implementations, or even in the
> same implementations with different character encodings.
I'm assuming a single character encoding per implementation: either
UTF-8, UTF-16, or a plain array of 21-bit characters. Supporting
general character encodings is problematic, since you cannot always tell
whether a byte starts a character or continues one.
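For UTF-8 specifically the distinction is easy, since every continuation
byte has the bit pattern 10xxxxxx. A minimal sketch, assuming the string's
bytes are modeled as a vector of integers; the names
utf8-continuation-byte? and utf8-char-start are hypothetical and not part
of the proposal:

```scheme
;; A UTF-8 continuation byte lies in the range #x80..#xBF (10xxxxxx).
(define (utf8-continuation-byte? b)
  (= (quotient b 64) 2))

;; Step backward from an arbitrary byte position to the first byte of
;; the character that contains it.
(define (utf8-char-start bytes pos)
  (if (and (> pos 0)
           (utf8-continuation-byte? (vector-ref bytes pos)))
      (utf8-char-start bytes (- pos 1))
      pos))

;; "a", U+03BB (two bytes: 206 187), "b"
(define sample (vector 97 206 187 98))
(utf8-char-start sample 2) ; => 1
```

An encoding without this property would need a scan from the start of
the string to answer the same question.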
In explaining/specifying my proposal it might be useful to add:
  (define (char-representation-size ch)
    ;; Implementations will do this more efficiently!
    (string-length (make-string 1 ch)))
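As an illustration of the "more efficiently" remark, an implementation
that stores UTF-8 could compute the size from the code point directly.
This is only a sketch under that assumption; utf8-size-of-code-point and
utf8-representation-size are hypothetical names:

```scheme
;; Number of UTF-8 bytes needed for a code point (assumes UTF-8 storage).
(define (utf8-size-of-code-point cp)
  (cond ((< cp #x80) 1)
        ((< cp #x800) 2)
        ((< cp #x10000) 3)
        (else 4)))

;; Same answer as char-representation-size on a UTF-8 implementation,
;; without allocating a one-character string.
(define (utf8-representation-size ch)
  (utf8-size-of-code-point (char->integer ch)))

(utf8-size-of-code-point #x3BB) ; => 2
```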
> It makes a data structure that holds a string and its indexes
> non-portable, for example.
I can see an issue if you try to write that out using one
implementation, and read it back in with another. Not sure how
important that is.
> I'd agree with the proposal if it introduced a different means of
> indexing, other than the character count used for string-ref. Call it
> 'offset' for now. string-offset-ref, substring-offset etc. would
> provide offset-based operations, while string-ref, substring etc.
> would remain character-based.
That might be reasonable. But ...
> Though it may be too cumbersome for
> core language.
Well, the complication is that existing code will be less efficient, and
people have a choice between using string-ref (portable to R5RS but
potentially slow) and string-offset-ref (portable to R6RS only but fast).
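To illustrate the trade-off, here is a rough sketch assuming UTF-8
storage, with the bytes modeled as a vector of integers and characters
returned as code points. The names bytes-offset-ref, char-index->offset
and bytes-char-ref are made up for the example and are not the proposed
API:

```scheme
;; Length of the character whose lead byte is b (b must be a lead byte).
(define (utf8-lead-size b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; Offset-based access: decode the code point starting at a byte offset.
;; Cost does not depend on where in the string the offset lies.
(define (bytes-offset-ref bytes offset)
  (let* ((b0 (vector-ref bytes offset))
         (n (utf8-lead-size b0)))
    (let loop ((i 1)
               (cp (cond ((= n 1) b0)
                         ((= n 2) (- b0 #xC0))
                         ((= n 3) (- b0 #xE0))
                         (else (- b0 #xF0)))))
      (if (= i n)
          cp
          (loop (+ i 1)
                (+ (* cp 64)
                   (- (vector-ref bytes (+ offset i)) #x80)))))))

;; Character-index access: must skip k characters first, so O(k).
(define (char-index->offset bytes k)
  (let loop ((offset 0) (k k))
    (if (= k 0)
        offset
        (loop (+ offset (utf8-lead-size (vector-ref bytes offset)))
              (- k 1)))))

(define (bytes-char-ref bytes k)
  (bytes-offset-ref bytes (char-index->offset bytes k)))

;; "a", U+03BB (206 187), "b"
(define sample (vector 97 206 187 98))
(bytes-offset-ref sample 1) ; => 955, i.e. #x3BB
(bytes-char-ref sample 2)   ; => 98
```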
An alternative idea is to have a cache that holds the most recent (char
index, offset) pair. One problem is that even an immutable string
now requires a mutable cache, with possible synchronization issues.
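A rough sketch of such a cache, again assuming UTF-8 bytes modeled as a
vector of integers. The representation (a three-slot vector) and the
names make-cached-string and cached-char-index->offset are hypothetical:

```scheme
;; Length of the character whose lead byte is b (b must be a lead byte).
(define (utf8-lead-size b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; Slot 0: the bytes. Slots 1 and 2: last (char index, offset) pair.
(define (make-cached-string bytes)
  (vector bytes 0 0))

(define (cached-char-index->offset cs k)
  (let* ((bytes (vector-ref cs 0))
         (ci (vector-ref cs 1))
         (co (vector-ref cs 2))
         ;; Resume from the cached pair when it is not past k;
         ;; otherwise restart from the beginning of the string.
         (start-i (if (<= ci k) ci 0))
         (start-o (if (<= ci k) co 0)))
    (let loop ((i start-i) (o start-o))
      (if (= i k)
          (begin
            ;; Two separate writes: not atomic, which is exactly the
            ;; synchronization concern when threads share the string.
            (vector-set! cs 1 i)
            (vector-set! cs 2 o)
            o)
          (loop (+ i 1)
                (+ o (utf8-lead-size (vector-ref bytes o))))))))

;; "a", U+03BB (206 187), "b", U+20AC (226 130 172)
(define cs (make-cached-string (vector 97 206 187 98 226 130 172)))
(cached-char-index->offset cs 3) ; => 4
```

Sequential forward access then costs only the distance from the
previous index, while a backward jump falls back to a full scan.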
> And this API is too centered on variable-length characters;
> fixed-length character implementations, or other implementations
> (such as a tree of segments), wouldn't care much about it.
Not sure I see your point. Certainly a more complex data structure is
appropriate for (say) a text editor, especially once you support
character "attributes".
--
--Per Bothner
per@xxxxxxxxxxx http://per.bothner.com/