TR29 word boundary use cases

I've been reviewing the TR29 word boundary
algorithm for implementation, and it strikes me
as a rather complicated way to do only part of
the job.  For example, it breaks sequences of
hiragana on every code point, but chunks all
consecutive Thai letters into a single word.  It
seems more useful to consistently split
aggressively and then use a separate step to
recompose as needed, or to split conservatively
and then use a separate step to segment further.
But the TR29 algorithm does neither.
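
To make the first alternative concrete, here is a
minimal Python sketch of "split aggressively, then
recompose".  It is not the TR29 algorithm, only an
illustration: the function names and the coarse
character classes (letter, digit, space, other) are
assumptions invented for this example, built on the
standard unicodedata module.

```python
import unicodedata

def split_aggressively(text):
    """Stage 1: break at every code point, with no exceptions."""
    return list(text)

def char_class(ch):
    """Hypothetical coarse classifier used only for this sketch."""
    cat = unicodedata.category(ch)
    if cat[0] in ("L", "M"):   # letters and combining marks
        return "letter"
    if cat == "Nd":            # decimal digits
        return "digit"
    if ch.isspace():
        return "space"
    return "other"             # punctuation, symbols, etc.

def recompose(pieces):
    """Stage 2: merge adjacent pieces of the same class.

    Pieces classed as "other" are never merged, so each
    punctuation mark stays a separate token.
    """
    words = []
    prev = None
    for piece in pieces:
        cls = char_class(piece)
        if words and cls == prev and cls != "other":
            words[-1] += piece
        else:
            words.append(piece)
        prev = cls
    return words

def segment(text):
    """Run both stages and drop whitespace-only tokens."""
    return [w for w in recompose(split_aggressively(text))
            if not w.isspace()]
```

Under these assumed classes, hiragana and Thai are
treated the same way: each run of letters comes back
as one chunk, and any finer segmentation is left to
a later, language-aware step.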

Indeed, at my company we do a lot of text
processing and split words in many ways, from
simplistic splitters that require post-processing
to very sophisticated natural-language-aware
segmenters, but to my knowledge we don't use
the TR29 algorithm anywhere.  Does anyone have
real-world uses of the TR29 word boundary
algorithm they could share?