This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.
Alex Shinn scripsit:

> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.

Basically true. As you know, the word "word" does not really have a
language-independent meaning.

> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.

Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough, and needs to be supplemented by further information.
I think the fact that word breaks appear between every hiragana letter
is a reflection of the fact that each such place is a line break
opportunity; whereas in Thai, line break oppos come only between actual
words, which you can only find (absent ZWSP characters) with a Thai
morphology engine.

--
John Cowan cowan@xxxxxxxx http://ccil.org/~cowan
If he has seen farther than others, it is because he is standing on a
stack of dwarves. --Mike Champion, describing Tim Berners-Lee (adapted)