
Re: Surrogates and character representation



Hi again,

The application of character indexes into a corpus is very interesting. Thanks for bringing it up.

However, I wonder how bad UTF-8 really is. For example, if I want to extract all of the prepositions, I can sort the character index ranges and then make a single pass through the string. This is linear in the string length, which is not as nice as random access into a UTF-32 vector, but isn't obviously a killer. (Especially when one thinks about memory cache hierarchies and their effect on random accesses.)
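To make the single-pass idea concrete, here is a minimal sketch in Python. It assumes the ranges are half-open (start, end) pairs of character indexes and that the input is valid UTF-8; the function name extract_ranges and the sample sentence are made up for illustration and are not from any particular library.

```python
def extract_ranges(data, ranges):
    """Return the substrings of UTF-8 bytes `data` named by `ranges`.

    `ranges` is a list of half-open (start, end) character-index pairs.
    The byte string is scanned once, front to back.
    """
    # Every character index whose byte offset we need, in ascending order.
    wanted = sorted({i for r in ranges for i in r})
    offsets = {}
    w = 0           # position in `wanted`
    char_index = 0  # index of the character starting at the current byte
    for byte_index, b in enumerate(data):
        if w == len(wanted):
            break  # nothing left to look up
        if (b & 0xC0) != 0x80:  # not a continuation byte: a character starts here
            if wanted[w] == char_index:
                offsets[char_index] = byte_index
                w += 1
            char_index += 1
    # An end index equal to the string length maps to the end of the bytes.
    if w < len(wanted) and wanted[w] == char_index:
        offsets[char_index] = len(data)
    # An index past the end of the string raises KeyError here.
    return [data[offsets[s]:offsets[e]].decode("utf-8") for s, e in ranges]


text = "el niño está en la casa de Ana".encode("utf-8")
print(extract_ranges(text, [(24, 26), (13, 15)]))
```

The cost is one sort of the requested indexes plus one scan of the bytes, regardless of how many ranges are requested.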

There is a difference between using character indexes into UTF-8 with locality (i.e., scanning forwards or backwards through a string, or using something like Boyer-Moore, which has a fair bit of locality) and real random access. If the implementation caches the last character-to-byte index conversion, the former can often be linear in the string length overall, whereas the latter is quadratic (each access can cost up to the string length, so the total is string length times the number of accesses).
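A minimal sketch of such a cache, in Python, might look like the following. The class name Utf8Cursor and its steps counter are hypothetical, introduced only to illustrate the point; the input is assumed to be valid UTF-8.

```python
class Utf8Cursor:
    """Character-to-byte index conversion that remembers the last result."""

    def __init__(self, data):
        self.data = data
        self.char = 0   # last character index converted
        self.byte = 0   # byte offset of that character
        self.steps = 0  # characters stepped over so far (for illustration)

    def byte_offset(self, char_index):
        if char_index < 0:
            raise IndexError("negative character index")
        data = self.data
        # Walk forwards from the cached position.
        while self.char < char_index:
            if self.byte >= len(data):
                raise IndexError("character index out of range")
            self.byte += 1
            # Skip the continuation bytes (10xxxxxx) of the current character.
            while self.byte < len(data) and (data[self.byte] & 0xC0) == 0x80:
                self.byte += 1
            self.char += 1
            self.steps += 1
        # Walk backwards from the cached position.
        while self.char > char_index:
            self.byte -= 1
            while (data[self.byte] & 0xC0) == 0x80:
                self.byte -= 1
            self.char -= 1
            self.steps += 1
        return self.byte


cursor = Utf8Cursor("añb€c".encode("utf-8"))
for i in range(6):           # sequential access: one step per character
    cursor.byte_offset(i)
print(cursor.steps)
```

With this scheme each lookup costs the distance from the previous lookup, so a forward or backward scan stays linear in total, while lookups that jump around the string pay for every jump.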

So, two questions:

(1) Are your "random" accesses into your corpus linguistics strings really random, do they have significant locality, or could they be arranged to have significant locality?

(2) Could you live with linear complexity to extract classes of substrings?

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México