Re: strings draft



From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Sun, 25 Jan 2004 19:27:58 -0800 (PST)

> I'm not aware enough about the details of Shiro's.  All I'm (sort of)
> aware of is that he's dealing with a EUCJP -- which sounds very
> challenging if you want to wind up with an implementation suitable for
> intensive string processing.   (Unicode is similarly challenging.)

To be precise, I'm working on a CES-independent multibyte string
representation, currently covering utf-8, EUC-JP, and Shift_JIS.
EUC-CN, EUC-TW and EUC-KR should be easy to support.  I exclude
stateful encodings, such as ISO2022 with stateful escape sequences.
(EUC is a subset of ISO2022, but it uses only single-shift escapes,
so it is effectively stateless.)

Tom, if you have a specific case that argues against mb strings,
I'd be curious to hear it, either off-list or on-list.

What follows is why I think mb strings are feasible.  Those who
aren't interested can skip it.

 * * *

What counts as "intensive string processing" varies by application
domain.  The domain I'm looking at has these properties:

  * very frequent use of regexps.
  * strings are rarely mutated.
  * lots of data passing to and from external programs/libraries.

A regexp engine can be implemented on multibyte strings almost as
efficiently as on "uniform character array" strings, by compiling the
regexp into an NFA/DFA over the octet stream.  Currently the only
penalty in my implementation arises when a character range covers a
large character set.  I think that can be optimized.
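
As a rough illustration of matching at the octet level (a Python
sketch under stated assumptions, not the implementation discussed
here): the character-level pattern "prefix, any one character, suffix"
is rewritten as a byte-level pattern for utf-8 and handed to a
byte-oriented regexp engine.  The names UTF8_CHAR and compile_char_dot
are hypothetical, and the byte classes are simplified.

```python
import re

# Byte-level alternation that matches one utf-8 encoded character.
# Simplified for illustration: it does not reject every ill-formed
# sequence (e.g. overlong 3-byte forms or encoded surrogates).
UTF8_CHAR = (
    rb"(?:[\x00-\x7f]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3})"
)

def compile_char_dot(prefix, suffix):
    """Compile 'prefix <any one character> suffix' into a bytes regexp."""
    return re.compile(
        re.escape(prefix.encode("utf-8"))
        + UTF8_CHAR
        + re.escape(suffix.encode("utf-8"))
    )

# Usage: the engine only ever sees octets, never decoded characters.
rx = compile_char_dot("日", "語")
match = rx.search("日本語".encode("utf-8"))
```

A character range would expand into a similar alternation of byte
sequences, which is presumably where a large range becomes costly.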

Regexps are heavily used to extract part of a string.  Returning the
substring directly is very efficient if it shares the string body.
Using string indices can actually be less efficient, even with
uniform character array strings.
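
To make the sharing idea concrete, here is a small Python sketch
(an assumption-laden illustration, not the representation described
above): the matched substring is returned as a memoryview slice, which
refers to the original byte buffer instead of copying it.  The function
name first_match_shared is hypothetical.

```python
import re

def first_match_shared(pattern, body):
    """Return the first match as a view that shares body's storage."""
    m = re.search(pattern, body)
    if m is None:
        return None
    # Slicing a memoryview does not copy the underlying bytes.
    return memoryview(body)[m.start():m.end()]

# Usage: extract the value part of "name=<value>" without copying.
body = b"name=" + "値".encode("utf-8")
sub = first_match_shared(rb"[^=]+$", body)
```

No character index is computed at any point; the match positions are
byte offsets and are used only to delimit the shared slice.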

A multibyte representation doesn't necessarily penalize work on
large corpora; e.g. a suffix array can be constructed and used
efficiently with byte indices (actually, with any kind of string
reference).
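
A minimal sketch of a suffix array keyed by byte offsets, assuming
utf-8 data (Python, for illustration only; the function names are
hypothetical and the construction is the naive sort, not an efficient
algorithm).  Only offsets that begin a character are indexed, and
lookups compare raw bytes.

```python
def char_starts_utf8(data):
    """Byte offsets that begin a character (not a 10xxxxxx byte)."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

def build_suffix_array(data):
    """Naive construction: sort character-start offsets by suffix bytes."""
    return sorted(char_starts_utf8(data), key=lambda i: data[i:])

def find_all(data, sa, needle):
    """Byte offsets of all occurrences of needle, via binary search."""
    n = len(needle)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + n] < needle:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and data[sa[lo]:sa[lo] + n] == needle:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

# Usage: search a utf-8 corpus without decoding it.
corpus = "東京都と京都".encode("utf-8")
sa = build_suffix_array(corpus)
offsets = find_all(corpus, sa, "京都".encode("utf-8"))
```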

Most external libraries and programs nowadays require strings
to be passed in some sort of multibyte format.  If you can use
the same multibyte format internally, sending and receiving
data has little overhead.  That may not help when you're writing
a program to be used in a wide variety of environments,
but it is an advantage when you're writing in-house tools and
you know which encoding the environment uses.

Of course I don't claim that multibyte strings are superior in general.
Actually I'm not 100% sure that multibyte strings don't have serious
problems.  But I was curious, so I started implementing them,
and I haven't seen a serious problem yet, though there are
some unresolved issues (like how to treat illegal byte sequences).

--shiro