Re: TR29 word boundary use cases

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

On Mon, Dec 9, 2013 at 2:18 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:

Alex Shinn scripsit:

> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.

Basically true. As you know, the word "word" does not really have a
language-independent meaning.

> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.

Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough, and needs to be supplemented by further information.
I think the fact that word breaks appear between every hiragana letter
is a reflection of the fact that each such place is a line break
opportunity; whereas in Thai, line break opportunities come only between actual
words, which you can only find (absent ZWSP characters) with a Thai
morphology engine.

I talked to the author of the report and he thought the hiragana
splitting was due to the tendency for hiragana to be used for
particles. My own intuition is that it would be better to chunk
hiragana, but it would depend on the corpus and neither approach
is perfect.

I was mistaken about Thai: it splits on every letter. If ZWSP were
still widely used, chunking Thai would likely be better, but it's
fallen out of use now that Thai morphology engines have become more
common.

The ideal algorithmic boundaries aside, it seems this is almost
entirely unimplemented in regex libraries. The ICU regex lib supports
TR29 word boundaries for \b, but the behavior is surprising enough
that it isn't the default.
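
As a rough illustration of what a typical regex library does instead,
here is a minimal sketch using Python's standard `re` module as a
stand-in (an assumption for illustration; it is not one of the
libraries discussed in this thread). Its \b is defined in terms of \w
rather than TR29, so the apostrophe in "can't" splits the word, which
TR29 would keep whole, while a run of hiragana is treated as a single
chunk of word characters rather than being split per codepoint.

```python
import re

# Python's \b is derived from \w (word characters), not from TR29.
# The apostrophe is not a \w character, so "can't" is split in two.
text = "can't stop"
simple_words = re.findall(r"\b\w+\b", text)
print(simple_words)

# Hiragana are letters, so \w+ chunks each consecutive run together;
# only the space separates the two runs here.
kana = "\u3072\u3089\u304c\u306a \u3067\u3059"  # "hiragana desu" in hiragana
kana_runs = re.findall(r"\w+", kana)
print(kana_runs)
```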

There's another concern which is that the `word' definition, i.e.

    (: bow (+ letter) eow)

wouldn't guarantee a single word anymore. The only way to make a
TR29 `word' pattern work would be with a new primitive definition,
not defined by TR18 and which wouldn't translate directly into any
existing PCRE rules.
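
The concern can be sketched with a toy model. The code below is not
TR29 and not any SRFI 115 implementation: `toy_boundaries`,
`match_bow_letters_eow` and `words_spanned` are hypothetical helpers,
and the boundary rule (a break at every position) only mimics the
per-codepoint splitting of hiragana described earlier. It shows that a
greedy "bow, one or more letters, eow" match can start and end on
boundaries while still running across several interior ones.

```python
def toy_boundaries(text):
    # ASSUMPTION: toy model only, not TR29. Every position is treated
    # as a word boundary, mimicking per-codepoint hiragana splitting.
    return set(range(len(text) + 1))

def match_bow_letters_eow(text, boundaries):
    # Emulates (: bow (+ letter) eow): leftmost start on a boundary,
    # greedy run of letters, backing off until the end is a boundary.
    for start in sorted(boundaries):
        end = start
        while end < len(text) and text[end].isalpha():
            end += 1
        while end > start and end not in boundaries:
            end -= 1
        if end > start:
            return (start, end)
    return None

def words_spanned(span, boundaries):
    # Each boundary inside (start, end] closes one word in this model.
    start, end = span
    return sum(1 for b in boundaries if start < b <= end)

text = "\u3072\u3089\u304c\u306a"  # four hiragana letters
bounds = toy_boundaries(text)
span = match_bow_letters_eow(text, bounds)
count = words_spanned(span, bounds)
print(span, count)
```

Under this model the single match covers all four letters, that is,
four "words" rather than one, which is why a TR29-aware `word' would
need a primitive of its own.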

So because of the lack of implementation support and the
unintuitiveness of the algorithm, I'm dropping the TR29 word
boundary requirement. I'm also going to change the grapheme
definitions to be absolute, so they don't change under `w/ascii'.
The only things `w/ascii' will affect are the named char sets and
`w/nocase', which it restricts to ASCII.
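
For a feel of what such a restriction looks like in practice, here is
a loose analogy using Python's `re.ASCII` flag. This is an assumption
made for illustration: it is Python's behavior, not the SRFI 115
definition of `w/ascii', and Python has no grapheme construct to
compare against. The flag narrows the named class \w to ASCII and
limits case-insensitive matching to ASCII letters.

```python
import re

sample = "na\u00efve caf\u00e9"  # "naive cafe" with i-diaeresis and e-acute

# Named class \w: Unicode-aware by default, ASCII-only under re.ASCII.
unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# Case-insensitive matching: e-acute folds with E-acute by default,
# but not once matching is restricted to ASCII.
unicode_fold = re.fullmatch("\u00e9", "\u00c9", flags=re.IGNORECASE) is not None
ascii_fold = re.fullmatch("\u00e9", "\u00c9", flags=re.IGNORECASE | re.ASCII) is not None

# Plain ASCII letters still fold under both settings.
ascii_plain_fold = re.fullmatch("e", "E", flags=re.IGNORECASE | re.ASCII) is not None

print(unicode_words, ascii_words)
print(unicode_fold, ascii_fold, ascii_plain_fold)
```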

Alex