Alex Shinn scripsit:
> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.
Basically true. As you know, the word "word" does not really have a
> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.
Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough, and needs to be supplemented by further information.
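
To make the contrast concrete, here is a toy Python sketch of the behaviour
described in the quoted text. It is not the TR29 algorithm: the names
`script_of` and `toy_segments` are hypothetical, and the script test is a
crude check on the Unicode character name, assumed here as a stand-in for
the real Script and Word_Break properties.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Rough script guess from the character name (an assumption for
    illustration, not the real Unicode Script property)."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("THAI"):
        return "thai"
    return "other"


def toy_segments(text):
    """Mimic the described behaviour: one segment per hiragana codepoint,
    one segment per run of consecutive Thai characters."""
    out = []
    for script, run in groupby(text, key=script_of):
        chunk = "".join(run)
        if script == "hiragana":
            out.extend(chunk)   # break on every codepoint
        else:
            out.append(chunk)   # keep the whole run together
    return out


if __name__ == "__main__":
    print(toy_segments("ひらがな"))
    print(toy_segments("ภาษาไทย"))
```

The Thai run comes back as one chunk, which is exactly the case where a
real implementation would need a dictionary or ZWSP hints to find the
actual words.
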
I think word breaks appear between every hiragana letter because each
such place is a line break opportunity; whereas in Thai, line break
opportunities come only between actual
words, which you can only find (absent ZWSP characters) with a Thai