This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.
Alex Shinn scripsit:
> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.
Basically true. As you know, the word "word" does not really have a
> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.
Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough, and needs to be supplemented by further information.
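To make the contrast concrete, here is a toy sketch in Python of the behaviour described in the quoted message: every hiragana codepoint becomes its own segment, while a run of Thai characters is kept as one chunk. It is not the TR29 rule set. The names `script_class` and `toy_segments` are hypothetical, and guessing the script from the Unicode character name is an assumption made only for this illustration.

```python
import unicodedata


def script_class(ch):
    """Rough script guess from the Unicode character name (assumption of this sketch)."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("THAI"):
        return "thai"
    if ch.isalnum():
        return "alnum"
    return "other"


def toy_segments(text):
    """Split text into segments: Thai and alphanumeric runs stay together,
    everything else (including each hiragana codepoint) stands alone."""
    segments = []
    prev = None
    for ch in text:
        cls = script_class(ch)
        # Extend the current segment only inside a Thai or alphanumeric run.
        if segments and cls == prev and cls in ("thai", "alnum"):
            segments[-1] += ch
        else:
            segments.append(ch)
        prev = cls
    return segments


if __name__ == "__main__":
    print(toy_segments("ひらがな"))
    print(toy_segments("ภาษาไทย"))
```

Neither result is a list of linguistic words, which is why a real implementation would need the further information mentioned above.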
I think the fact that word breaks appear between every hiragana letter
is a reflection of the fact that each such place is a line break
opportunity; whereas in Thai, line break opportunities come only between actual
words, which you can only find (absent ZWSP characters) with a Thai