bear wrote:
(1) Are your "random" accesses into your corpus linguistics strings really random, do they have significant locality, or could they be arranged to have significant locality?
Speaking for myself, I would say they are as close to random as makes no difference.
Thanks for your answer. I think I'm convinced that plain UTF-8 is a losing representation for strings in this application. Or, generalizing, this application really needs strings that have constant-time random access and not just linear-time traversal.
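If I wanted to rescue UTF-8 (because I really, really, really want to keep conversion to UTF-8 as a constant-time operation), I could maintain a vector of byte offsets to every Nth character. Random access would then cost one table lookup plus a scan over at most N-1 characters, which is constant time for a fixed N.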
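
To make the offset-vector idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name IndexedUtf8, the stride parameter (the "N" above), and the methods char_at and to_utf8 are all hypothetical names invented for this example, not part of any existing library or proposal.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a table of byte offsets to every Nth character.

    Hypothetical illustration: all names here are assumptions.
    """

    def __init__(self, text, stride=64):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.offsets = []          # offsets[k] = byte offset of character k * stride
        count = 0
        for pos, byte in enumerate(self.data):
            # Continuation bytes look like 10xxxxxx; anything else starts a character.
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count        # number of characters, not bytes

    def to_utf8(self):
        # The bytes are already UTF-8, so this returns the stored object as-is.
        return self.data

    def _next(self, pos):
        # Advance from the start of one character to the start of the next.
        pos += 1
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def char_at(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # One table lookup, then a scan over at most stride - 1 characters.
        pos = self.offsets[index // self.stride]
        for _ in range(index % self.stride):
            pos = self._next(pos)
        return self.data[pos:self._next(pos)].decode("utf-8")
```

The trade-off is in the choice of stride: a smaller value shortens the scan on each access but makes the offset table larger, and building the table is still a linear pass over the string.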
Regards,
Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México