
Re: constant-time access to variable-width encodings

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: Shiro Kawai <shiro@xxxxxxxx>
Subject: Re: constant-time access to variable-width encodings
From: Per Bothner <per@xxxxxxxxxxx>
Date: Wed, 13 Jul 2005 13:35:59 -0700
Cc: srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <20050713.101557.865676685.shiro@xxxxxxxx>
References: <42D559A9.1080000@xxxxxxxxxxx> <20050713.101557.865676685.shiro@xxxxxxxx>
User-agent: Mozilla Thunderbird 1.0.2-6 (X11/20050513)

Shiro Kawai wrote:

> I feel a bit uncomfortable, though, with the fact that indexes and
> string-length differ among different implementations, or even in the
> same implementations with different character encodings.

I'm assuming a single character encoding per implementation: either UTF-8, UTF-16, or a plain array of 20-bit characters. Supporting general character encodings is problematic, since you cannot always tell if a byte is an initial or subsequent (partial) character.


In explaining/specifying my proposal it might be useful to add:
(define (char-representation-size ch)
  ;; Implementations will do this more efficiently!
  (string-length (make-string 1 ch)))
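
For a UTF-8 implementation, the same value could be computed directly from the code point, without allocating a string. A minimal sketch, assuming UTF-8 storage; the name utf8-representation-size is hypothetical and not part of the proposal:

```scheme
;; Hypothetical helper: number of bytes CH would occupy in UTF-8.
;; Assumes char->integer returns the Unicode code point.
(define (utf8-representation-size ch)
  (let ((cp (char->integer ch)))
    (cond ((< cp #x80) 1)       ; ASCII range
          ((< cp #x800) 2)
          ((< cp #x10000) 3)    ; rest of the Basic Multilingual Plane
          (else 4))))           ; supplementary planes
```

Under this sketch (utf8-representation-size #\a) is 1, while a character such as (integer->char #x20AC) takes 3.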

> It makes a data structure that holds a string and its indexes non-portable, for example.

I can see an issue if you try to write that out using one implementation, and read it back in with another. Not sure how important that is.

> I'd agree the proposal if it introduces a different means of
> indexing, other than character count used for string-ref.  Call it
> 'offset' for now.  string-offset-ref, substring-offset etc. would
> provide offset-based operation, while string-ref, substring etc.
> work on character-based op.


That might be reasonable.  But ...

> Though it may be too cumbersome for
> core language.

Well, the complication is that existing code will be less efficient, and people have a choice between using string-ref (portable to R5RS but potentially slow) and string-offset-ref (portable to R6RS only but fast).
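
The cost difference between the two kinds of access can be sketched as follows. This is only an illustration under stated assumptions: an R7RS bytevector of valid UTF-8 stands in for the string's internal storage, and the names utf8-width, utf8-offset-ref, utf8-index->offset and utf8-ref are hypothetical, not procedures from any proposal.

```scheme
(import (scheme base))

;; Bytes in the character whose lead byte is B (assumes valid UTF-8).
(define (utf8-width b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; Offset-based access: decode the character starting at byte OFFSET.
;; The work is bounded by the character's width, not the string length.
(define (utf8-offset-ref bv offset)
  (let* ((b0 (bytevector-u8-ref bv offset))
         (n (utf8-width b0)))
    (let loop ((i 1)
               (cp (case n
                     ((1) b0)
                     ((2) (- b0 #xC0))
                     ((3) (- b0 #xE0))
                     (else (- b0 #xF0)))))
      (if (= i n)
          (integer->char cp)
          ;; Each continuation byte contributes its low 6 bits.
          (loop (+ i 1)
                (+ (* cp 64)
                   (- (bytevector-u8-ref bv (+ offset i)) #x80)))))))

;; Index-based access has to skip K characters from the start: O(k).
(define (utf8-index->offset bv k)
  (let loop ((offset 0) (k k))
    (if (= k 0)
        offset
        (loop (+ offset (utf8-width (bytevector-u8-ref bv offset)))
              (- k 1)))))

(define (utf8-ref bv k)
  (utf8-offset-ref bv (utf8-index->offset bv k)))
```

With bv bound to (string->utf8 (string #\a (integer->char #x3BB) #\b)), character index 2 corresponds to byte offset 3, so (utf8-ref bv 2) and (utf8-offset-ref bv 3) both yield #\b, but only the former has to scan.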

An alternative idea is to have a cache that holds the most recent (char index, offset) mapping. One problem is that even an immutable string now requires a mutable cache, with possible synchronization issues.
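
A rough sketch of such a cache, again assuming valid UTF-8 held in an R7RS bytevector; the record type cached-string and all procedure names here are hypothetical. Note that the lookup mutates the record even though the text itself never changes, which is where the synchronization concern comes from.

```scheme
(import (scheme base))

;; Bytes in the character whose lead byte is B (assumes valid UTF-8).
(define (utf8-width b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; The bytes plus the most recently resolved (char index, offset) pair.
(define-record-type cached-string
  (make-cached-string* bytes last-index last-offset)
  cached-string?
  (bytes cs-bytes)
  (last-index cs-last-index set-cs-last-index!)
  (last-offset cs-last-offset set-cs-last-offset!))

(define (make-cached-string bv)
  (make-cached-string* bv 0 0))

;; Resolve character index K to a byte offset.  Scans forward from the
;; cached position when K is at or beyond it, otherwise from the start.
(define (cs-index->offset cs k)
  (let* ((bv (cs-bytes cs))
         (forward? (>= k (cs-last-index cs))))
    (let loop ((i (if forward? (cs-last-index cs) 0))
               (offset (if forward? (cs-last-offset cs) 0)))
      (if (= i k)
          (begin
            ;; Unsynchronized mutation on a read path.
            (set-cs-last-index! cs k)
            (set-cs-last-offset! cs offset)
            offset)
          (loop (+ i 1)
                (+ offset (utf8-width (bytevector-u8-ref bv offset))))))))
```

Sequential access (index 0, 1, 2, ...) then costs one step per call, while a backward jump falls back to a scan from the beginning.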

> And this is too much variable-length-character centric
> API, which fixed-length character implementation or other
> implementations (such as tree of segments) wouldn't care much.

Not sure your point. Certainly a more complex data structure is appropriate for (say) a text editor, especially once you support character "attributes".

--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/
