Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



bear wrote:
(1) Are your "random" accesses into your corpus linguistics strings
really random, do they have significant locality, or could they be
arranged to have significant locality?


Speaking for myself, I would say they are as close to random as
makes no difference.

Thanks for your answer.

I think I'm convinced that plain UTF-8 is a losing string representation for this application. Or, generalizing, this application really needs strings that have constant-time random access and not just linear-time traversal.

If I wanted to rescue UTF-8 (because I really really really want to keep conversion to UTF-8 as a constant-time operation), I could maintain a vector of byte offsets to every Nth character.
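
To make the idea concrete, here is a rough sketch in Python. It is an illustration only, not anything proposed for the SRFI; the names IndexedUtf8, stride, and char_at are made up for the example, and it assumes the bytes are already valid UTF-8.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a sparse table of byte offsets (hypothetical sketch)."""

    def __init__(self, data: bytes, stride: int = 16):
        self.data = data          # the UTF-8 bytes, stored unchanged
        self.stride = stride      # N: one table entry per N characters
        self.offsets = []         # byte offset of characters 0, N, 2N, ...
        count = 0
        for pos, byte in enumerate(data):
            # Bytes of the form 10xxxxxx are continuation bytes;
            # any other byte starts a new character.
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count       # length in characters, not bytes

    def _next(self, pos: int) -> int:
        """Return the byte offset of the character following the one at pos."""
        pos += 1
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def char_at(self, i: int) -> str:
        """Return character i by jumping to the nearest indexed offset
        and then stepping over at most stride - 1 characters."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        pos = self.offsets[i // self.stride]
        for _ in range(i % self.stride):
            pos = self._next(pos)
        return self.data[pos:self._next(pos)].decode("utf-8")

    def to_utf8(self) -> bytes:
        """Conversion to UTF-8 stays constant-time: the bytes are already there."""
        return self.data
```

Each access costs one table lookup plus a scan over at most N - 1 characters, so for a fixed N it is bounded by a constant. The price is roughly one extra offset per N characters, and the table has to be rebuilt or patched whenever the string is modified.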

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Nacional Autónoma de México