TR29 word boundary use cases

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

To: SRFI-115 discussion list <srfi-115@xxxxxxxxxxxxxxxxx>

Subject: TR29 word boundary use cases

From: Alex Shinn <alexshinn@xxxxxxxxx>

Date: Mon, 2 Dec 2013 09:53:07 +0900

I've been reviewing the TR29 word boundary algorithm for implementation, and it strikes me as a rather complicated way to do only part of the job. For example, it breaks sequences of hiragana on every codepoint, but chunks all consecutive Thai letters into a single word. It seems more useful to consistently split aggressively and then use a separate step to recompose as needed, or to split conservatively and then use a separate step to segment further. But the TR29 algorithm does neither.
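
To make the "split aggressively, then recompose" idea concrete, here is a minimal Python sketch of such a two-stage pipeline. It is not the TR29 algorithm and not anything from SRFI 115. The function names (`split_aggressively`, `recompose`, `same_script`, `words`) are hypothetical, and the merge rule is a crude assumption: the first word of a character's Unicode name stands in for its script.

```python
import unicodedata


def crude_script(ch):
    """Rough script tag (an assumption for illustration): the first word
    of the Unicode character name, e.g. 'LATIN', 'HIRAGANA', 'THAI'."""
    if ch.isspace():
        return "SPACE"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def split_aggressively(text):
    """Stage 1: break between every pair of code points."""
    return list(text)


def recompose(tokens, should_merge):
    """Stage 2: rejoin adjacent tokens whenever should_merge(left, right)
    holds. The rule is pluggable, so a language-aware segmenter could
    replace the crude one used below."""
    out = []
    for tok in tokens:
        if out and should_merge(out[-1], tok):
            out[-1] += tok
        else:
            out.append(tok)
    return out


def same_script(left, right):
    """Hypothetical merge rule: join tokens whose touching characters
    share the same crude script tag."""
    return crude_script(left[-1]) == crude_script(right[0])


def words(text):
    """Run both stages, then drop the whitespace-only tokens."""
    merged = recompose(split_aggressively(text), same_script)
    return [tok for tok in merged if not tok.isspace()]
```

With this rule, `words("hello world")` yields two tokens, and a run of hiragana that stage 1 split into single code points is rejoined into one token. A real recomposition step for Japanese or Thai would need a dictionary or a statistical model rather than a script test.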

Indeed, at my company we do a lot of text processing and split words in many ways, from simplistic splitting that requires post-processing to very sophisticated natural-language-aware segmenters, but to my knowledge we don't use the TR29 algorithm anywhere. Does anyone have real-world uses of the TR29 word boundary algorithm they could share?

Thanks,

Alex