
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: Shiro Kawai <shiro@xxxxxxxx>
Subject: Re: Surrogates and character representation
From: Alan Watson <a.watson@xxxxxxxxxxxxxxxx>
Date: Thu, 28 Jul 2005 12:02:24 -0500
Cc: bear@xxxxxxxxx, tree@xxxxxxxxxxxxx, per@xxxxxxxxxxx, srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <20050728.000652.1016278026.shiro@xxxxxxxx>
Organization: Centro de Radioastronomía y Astrofísica UNAM
References: <42E8546F.9000407@xxxxxxxxxxx> <17128.22540.135687.288180@xxxxxxxxxxxxxxxxxxxxxx> <Pine.LNX.4.58.0507280119280.28883@xxxxxxxxxxxxxx> <20050728.000652.1016278026.shiro@xxxxxxxx>
User-agent: Mozilla Thunderbird 1.0 (X11/20050317)

Hi again,

The application of character indexes into a corpus is very interesting. Thanks for bringing it up.

However, I wonder how bad UTF-8 really is. For example, if I want to extract all of the prepositions, I can sort the character index ranges and then make a single pass through the string. This is linear in the string length, which is not as nice as random accesses to a UTF-32 vector, but isn't obviously a killer. (Especially when one thinks about memory cache hierarchies and their effect on random accesses.)
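
A minimal sketch of that single pass, in Python for illustration only. The function name `extract_ranges` and the representation of ranges as (start, end) character-index pairs are assumptions, not anything defined in SRFI 75. It sorts every boundary index, walks the UTF-8 bytes once to find the matching byte offsets, and then slices.

```python
def extract_ranges(data: bytes, ranges):
    """Return the substring for each (start, end) character-index range.

    `data` is assumed to be valid UTF-8. The string is scanned once,
    whatever order the ranges arrive in.
    """
    # Every character index whose byte offset we need, in ascending order.
    wanted = sorted({index for pair in ranges for index in pair})
    offsets = {}
    char_idx = 0
    byte_idx = 0
    n = len(data)
    for target in wanted:
        # Advance one character at a time: skip the lead byte, then any
        # continuation bytes (those of the form 10xxxxxx).
        while char_idx < target and byte_idx < n:
            byte_idx += 1
            while byte_idx < n and (data[byte_idx] & 0xC0) == 0x80:
                byte_idx += 1
            char_idx += 1
        if char_idx < target:
            raise IndexError("character index past end of string")
        offsets[target] = byte_idx
    # Results come back in the caller's original order.
    return [data[offsets[s]:offsets[e]].decode("utf-8") for s, e in ranges]


text = "café en México de día"
# Character ranges of "de" and "en", deliberately given out of order.
print(extract_ranges(text.encode("utf-8"), [(15, 17), (5, 7)]))
```

The cost is one sort of the range boundaries plus one scan of the string, regardless of how many ranges are requested.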

There is a difference between using character indexes into UTF-8 with locality (i.e., scanning forwards or backwards through a string, or using something like Boyer-Moore, which has a fair bit of locality) and real random access. If the implementation caches the last character-to-byte index conversion, the former can often be linear overall, whereas the latter is quadratic (string length times the number of accesses).
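
To illustrate the caching idea, here is a rough Python sketch. The class name `CachedUtf8`, its methods, and the `steps` counter are all hypothetical and only serve the illustration. It remembers the last character-to-byte conversion and walks forwards or backwards from there, so nearby accesses cost only a few steps each.

```python
class CachedUtf8:
    """UTF-8 string that caches the last character-to-byte conversion."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self.char_pos = 0   # cached character index
        self.byte_pos = 0   # byte offset of that character
        self.steps = 0      # characters stepped over, for illustration

    @staticmethod
    def _is_continuation(byte: int) -> bool:
        return (byte & 0xC0) == 0x80

    def byte_offset(self, index: int) -> int:
        data = self.data
        if index < 0:
            raise IndexError("negative character index")
        # Restart from the beginning when that is closer than the cache.
        if index < self.char_pos - index:
            self.char_pos = self.byte_pos = 0
        while self.char_pos < index:          # walk forwards
            if self.byte_pos >= len(data):
                raise IndexError("character index past end of string")
            self.byte_pos += 1
            while (self.byte_pos < len(data)
                   and self._is_continuation(data[self.byte_pos])):
                self.byte_pos += 1
            self.char_pos += 1
            self.steps += 1
        while self.char_pos > index:          # walk backwards
            self.byte_pos -= 1
            while self._is_continuation(data[self.byte_pos]):
                self.byte_pos -= 1
            self.char_pos -= 1
            self.steps += 1
        return self.byte_pos

    def char_at(self, index: int) -> str:
        start = self.byte_offset(index)
        if start >= len(self.data):
            raise IndexError("character index past end of string")
        end = start + 1
        while end < len(self.data) and self._is_continuation(self.data[end]):
            end += 1
        return self.data[start:end].decode("utf-8")


s = CachedUtf8("añb€c")
print([s.char_at(i) for i in range(5)], s.steps)
```

A sequential scan of n characters costs about n steps in total here, while truly random indexes can each cost a walk proportional to the string length.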


So, two questions:

(1) Are your "random" accesses into your corpus linguistics strings really random, do they have significant locality, or could they be arranged to have significant locality?


(2) Could you live with linear complexity to extract classes of substrings?

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México

Follow-Ups:
- Re: Surrogates and character representation
  - From: bear

References:
- Re: Surrogates and character representation
  - From: Per Bothner
- Re: Surrogates and character representation
  - From: Tom Emerson
- Re: Surrogates and character representation
  - From: bear
- Re: Surrogates and character representation
  - From: Shiro Kawai
