Surrogates and character representation

bear wrote:
(1) Are your "random" accesses into your corpus linguistics strings
really random, do they have significant locality, or could they be
arranged to have significant locality?

Speaking for myself, I would say they are as close to random as
makes no difference.

Thanks for your answer.

I think I'm convinced that plain UTF-8 is a losing string representation for this application. Or, generalizing, this application really needs strings with constant-time random access, not just linear-time traversal.

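If I wanted to rescue UTF-8 (because I really really really want to keep conversion to UTF-8 as a constant-time operation), I could maintain a vector of byte offsets to every Nth character. Indexing would then be a table lookup followed by a scan over at most N-1 characters, which is constant time for a fixed N.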
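
Here is a rough sketch of the offset-vector idea in Python, just to make it concrete. The class name `IndexedUtf8`, its method names, and the default stride of 64 are all made up for illustration, and "character" here means a Unicode code point; none of this is a settled design.

```python
from array import array


class IndexedUtf8:
    """UTF-8 bytes plus a table of byte offsets to every `stride`-th code point.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, text: str, stride: int = 64):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.offsets = array("Q")  # byte offset of code points 0, stride, 2*stride, ...
        count = 0
        for pos, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte, so a code point starts here
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count  # number of code points

    def _skip(self, pos: int) -> int:
        """Return the byte offset of the code point following the one at `pos`."""
        pos += 1
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def char_at(self, index: int) -> str:
        """Fetch one code point: a table lookup plus at most stride-1 skips."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        start = self.offsets[index // self.stride]
        for _ in range(index % self.stride):
            start = self._skip(start)
        return self.data[start:self._skip(start)].decode("utf-8")

    def to_utf8(self) -> bytes:
        """Conversion to UTF-8 stays constant time: the bytes are already stored."""
        return self.data


s = IndexedUtf8("naïve 日本語 𝄞 text", stride=4)
print(s.char_at(6), s.char_at(10))
```

The cost is one stored offset per N characters, so N trades memory against the length of the scan after each lookup.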


