Re: Surrogates and character representation
(1) Are your "random" accesses into your corpus linguistics strings
really random, do they have significant locality, or could they be
arranged to have significant locality?
Speaking for myself, I would say they are as close to random as
makes no difference.
Thanks for your answer.
I think I'm convinced that plain UTF-8 is a losing representation for
this application. Or, generalizing, this application really needs
strings that have constant-time random access and not just linear-time
traversal.
If I wanted to rescue UTF-8 (because I really really really want to keep
conversion to UTF-8 as a constant-time operation), I could maintain a
vector of byte offsets to every Nth character.
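
To make the idea concrete, here is a minimal sketch in Python, not a
proposal for any particular implementation. The class name IndexedUtf8,
the parameter stride (the N above), and the method names are all
hypothetical, and the sketch assumes the input bytes are already valid
UTF-8. An access jumps to the nearest recorded offset and then skips at
most N-1 characters, so the cost is bounded by N rather than by the
length of the string, and the index costs one integer per N characters.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a table of byte offsets to every Nth character.

    Hypothetical illustration; assumes `data` is valid UTF-8.
    """

    def __init__(self, data: bytes, stride: int = 16):
        if stride < 1:
            raise ValueError("stride must be at least 1")
        self.data = data
        self.stride = stride
        self.offsets = []  # offsets[k] = byte offset of character k * stride
        count = 0
        for pos, byte in enumerate(data):
            # Continuation bytes look like 10xxxxxx; anything else starts
            # a new character.
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count  # number of characters (code points)

    def _next(self, pos: int) -> int:
        """Return the byte offset of the character following the one at pos."""
        pos += 1
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def char_at(self, index: int) -> str:
        """Return the character at `index`, scanning at most stride - 1 characters."""
        if not 0 <= index < self.length:
            raise IndexError("character index out of range")
        pos = self.offsets[index // self.stride]
        for _ in range(index % self.stride):
            pos = self._next(pos)
        return self.data[pos:self._next(pos)].decode("utf-8")

    def to_utf8(self) -> bytes:
        """Conversion to UTF-8 stays constant-time: the bytes are stored as-is."""
        return self.data
```

With stride set to 1 this degenerates into a full offset table (fastest
access, largest index); larger values trade access time for space.
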
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México