
Re: TR29 word boundary use cases



On Mon, Dec 9, 2013 at 2:18 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:
Alex Shinn scripsit:

> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.

Basically true.  As you know, the word "word" does not really have a
language-independent meaning.

> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.

Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough, and needs to be supplemented by further information.
I think the fact that word breaks appear between every hiragana letter
is a reflection of the fact that each such place is a line break
opportunity; whereas in Thai, line break opportunities come only between
actual words, which you can only find (absent ZWSP characters) with a Thai
morphology engine.

I talked to the author of the report and he thought
the hiragana splitting was due to the tendency for
hiragana to be used for particles.  My own intuition
is that it would be better to chunk hiragana, but
it would depend on the corpus and neither approach
is perfect.

I was mistaken about Thai: it splits on every letter.
If ZWSP were still widely used, chunking Thai would
likely be better, but it's fallen out of use now that
Thai morphology engines have become more common.
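
To make the splitting behaviour concrete, here is a rough Python
sketch.  It is not the TR29 algorithm: the helper names are made up
for illustration, scripts are detected crudely from Unicode character
names, and combining marks and ideographs are not handled.  It only
imitates the outcome described above, where runs of ordinary letters
stay together while hiragana and Thai letters each stand alone.

```python
import unicodedata

def breaks_on_every_codepoint(ch):
    """Hypothetical helper: True for hiragana and Thai characters,
    which this toy model splits one codepoint at a time."""
    name = unicodedata.name(ch, "")
    return name.startswith("HIRAGANA") or name.startswith("THAI CHARACTER")

def toy_word_segments(text):
    """Toy approximation only, NOT the real TR29 rules.
    Runs of other letters/digits are kept together; hiragana, Thai,
    and any non-alphanumeric character become their own segment."""
    segments = []
    current = ""
    for ch in text:
        chunkable = ch.isalnum() and not breaks_on_every_codepoint(ch)
        if chunkable:
            current += ch
        else:
            if current:
                segments.append(current)
                current = ""
            segments.append(ch)
    if current:
        segments.append(current)
    return segments

print(toy_word_segments("hello \u3072\u3089\u304c\u306a"))
print(toy_word_segments("\u0e01\u0e02\u0e04"))
```

A corpus-driven segmenter (or a morphology engine for Thai) would
replace the per-codepoint branch here.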

The ideal algorithmic boundaries aside, it seems this
is almost entirely unimplemented in regex libraries.
The ICU regex lib supports TR29 word boundaries
for \b, but the behavior is surprising enough that
it isn't the default.

Another concern is that the `word' definition, i.e.

  (: bow (+ letter) eow)

wouldn't guarantee a single word anymore.  The
only way to make a TR29 `word' pattern work would
be with a new primitive definition, one that is not
defined by TR18 and wouldn't translate directly into
any existing PCRE rules.
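
The problem can be illustrated with a small Python sketch.  The
boundary function below is a made-up stand-in (a boundary around
every hiragana codepoint and wherever letter/non-letter status
changes), not TR29 itself, and it treats bow and eow simply as "at a
boundary".  Under those assumptions, a greedy match of boundary,
one or more letters, boundary runs straight across the internal
boundaries.

```python
import unicodedata

def is_hiragana(ch):
    return unicodedata.name(ch, "").startswith("HIRAGANA")

def toy_boundaries(text):
    """Hypothetical boundary set, NOT the TR29 rules: start and end of
    text, around every hiragana codepoint, and wherever the
    letter/non-letter status changes."""
    bounds = {0, len(text)}
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if is_hiragana(prev) or is_hiragana(cur) or prev.isalpha() != cur.isalpha():
            bounds.add(i)
    return bounds

def match_word_pattern(text, start, bounds):
    """Greedy match of: boundary, one or more letters, boundary.
    Returns the end index of the match, or None if there is none."""
    if start not in bounds:
        return None
    end = start
    while end < len(text) and text[end].isalpha():
        end += 1
    # Backtrack until the match ends on a boundary.
    while end > start and end not in bounds:
        end -= 1
    return end if end > start else None

def words_spanned(start, end, bounds):
    """Number of boundary-delimited segments inside text[start:end]."""
    return sum(1 for b in bounds if start < b <= end)

for sample in ("hello", "\u3072\u3089\u304c\u306a"):
    bounds = toy_boundaries(sample)
    end = match_word_pattern(sample, 0, bounds)
    print(sample, "matched up to", end, "spanning", words_spanned(0, end, bounds), "segment(s)")
```

With per-codepoint boundaries the single pattern match covers several
segments, which is why a separate primitive would be needed to match
exactly one.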

So because of the lack of implementation support
and the unintuitiveness of the algorithm, I'm dropping
the TR29 word boundary requirement.  I'm also
going to change the grapheme definitions to be
absolute, so they no longer change under `w/ascii'.  The
only things `w/ascii' will affect are the named char
sets and whether `w/nocase' is restricted to ASCII.
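
As a loose analogy only (this is Python's `re` module, not the
`w/ascii' semantics being specified here), the `re.ASCII` flag has a
similar two-part effect: it narrows the predefined classes such as \w
to ASCII, and it limits case-insensitive matching to ASCII letters.

```python
import re

E_ACUTE_LOWER = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
E_ACUTE_UPPER = "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE
word = "h" + E_ACUTE_LOWER + "llo"

# Default (Unicode) mode for str patterns: \w and case folding cover non-ASCII.
unicode_word = re.fullmatch(r"\w+", word)
unicode_fold = re.fullmatch(E_ACUTE_LOWER, E_ACUTE_UPPER, re.IGNORECASE)

# ASCII mode: \w is limited to [a-zA-Z0-9_] and case folding to ASCII letters.
ascii_word = re.fullmatch(r"\w+", word, re.ASCII)
ascii_fold = re.fullmatch(E_ACUTE_LOWER, E_ACUTE_UPPER, re.ASCII | re.IGNORECASE)
ascii_fold_plain = re.fullmatch("e", "E", re.ASCII | re.IGNORECASE)

print("unicode \\w+ matches:", unicode_word is not None)
print("ascii   \\w+ matches:", ascii_word is not None)
print("unicode nocase matches:", unicode_fold is not None)
print("ascii   nocase matches:", ascii_fold is not None)
print("ascii   nocase on plain e/E:", ascii_fold_plain is not None)
```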

-- 
Alex