
Re: TR29 word boundary use cases

Alex Shinn scripsit:

> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.

Basically true.  As you know, the word "word" does not really have a
language-independent meaning.

> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.

Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough and needs to be supplemented by further information.
I think the fact that word breaks appear between every hiragana letter
is a reflection of the fact that each such place is a line break
opportunity; whereas in Thai, line break opportunities come only between actual
words, which you can only find (absent ZWSP characters) with a Thai
morphology engine.
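
To make the contrast concrete, here is a toy Python sketch of the behavior
described above. It is not an implementation of TR29: the function name
toy_segments, the helper is_hiragana, and the hard-coded Hiragana block range
are assumptions made for illustration only. It treats every hiragana letter as
its own segment and breaks other text (Thai included) only where an explicit
ZWSP appears.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def is_hiragana(ch):
    # Assumption for this sketch: "hiragana" means the Hiragana block
    # U+3040..U+309F, not the real Unicode script property.
    return "\u3040" <= ch <= "\u309f"


def toy_segments(text):
    """Toy model only: one segment per hiragana letter, and breaks in
    other text only at explicit ZWSP characters."""
    segments = []
    current = ""
    for ch in text:
        if ch == ZWSP:
            # An explicit ZWSP ends the current segment and is dropped.
            if current:
                segments.append(current)
            current = ""
        elif is_hiragana(ch):
            # Flush whatever came before, then emit the letter by itself.
            if current:
                segments.append(current)
            segments.append(ch)
            current = ""
        else:
            # Everything else accumulates until a ZWSP or a hiragana letter.
            current += ch
    if current:
        segments.append(current)
    return segments


hiragana = "ひらがな"
thai_plain = "ภาษาไทย"          # no ZWSP: stays one chunk
thai_marked = "ภาษา\u200bไทย"   # ZWSP supplies the word boundary

print(toy_segments(hiragana))
print(toy_segments(thai_plain))
print(toy_segments(thai_marked))
```

Without the ZWSP the Thai string comes back as a single chunk; finding the
boundary inside it is the part that needs a dictionary or morphology engine,
which this sketch makes no attempt to provide.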

John Cowan  cowan@xxxxxxxx  http://ccil.org/~cowan
If he has seen farther than others,
        it is because he is standing on a stack of dwarves.
                --Mike Champion, describing Tim Berners-Lee (adapted)