
TR29 word boundary use cases

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.



I've been reviewing the TR29 word boundary
algorithm for implementation, and it strikes me
as a rather complicated way to do only part of
the job.  For example, it breaks sequences of
hiragana on every codepoint, but chunks all
consecutive Thai letters into a single word.  It
seems more useful to be consistent: either split
aggressively and recompose as needed in a
separate step, or split conservatively and
segment further in a separate step.  The TR29
algorithm does neither.
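
To make the first option concrete (split aggressively, then recompose
in a separate step), here is a minimal Python sketch.  It is not TR29
and not anything from SRFI 115: the names split_aggressively,
recompose, script_tag and same_script are hypothetical, and script_tag
only approximates a script label by taking the first word of the
Unicode character name, which is an assumption rather than the real
Script property.

```python
import unicodedata


def script_tag(ch):
    """Rough script label for one codepoint (hypothetical helper).

    Assumption: the first word of the Unicode character name is a good
    enough stand-in for the Script property in this illustration.
    """
    if ch.isspace():
        return "SPACE"
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_aggressively(text):
    """Stage 1: break at every codepoint, with no exceptions."""
    return list(text)


def recompose(pieces, keep_together):
    """Stage 2: merge adjacent pieces whenever keep_together says so.

    keep_together(left, right) receives the word built so far and the
    next single-codepoint piece.  Whitespace-only words are dropped.
    """
    words = []
    for piece in pieces:
        if words and keep_together(words[-1], piece):
            words[-1] += piece
        else:
            words.append(piece)
    return [w for w in words if not w.isspace()]


def same_script(left, right):
    """Example policy: rejoin runs that share a rough script label."""
    tag = script_tag(right)
    return tag != "SPACE" and script_tag(left[-1]) == tag


if __name__ == "__main__":
    pieces = split_aggressively("hello \u3072\u3089\u304c\u306a")
    print(recompose(pieces, same_script))
```

The point of the split is that stage 1 never needs per-script
exceptions; any language-specific knowledge lives in the keep_together
policy, which could just as well be a dictionary-based segmenter.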

Indeed, at my company we do a lot of text
processing and split words in many ways, from
simplistic splitters that require post-processing
to very sophisticated natural-language-aware
segmenters, but to my knowledge we don't use
the TR29 algorithm anywhere.  Does anyone have
real-world uses of the TR29 word boundary
algorithm they could share?

Thanks,

-- 
Alex