Re: words, punctuation, and whitespace

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

On 7/20/05, Aubrey Jaffer <agj@xxxxxxxxxxxx> wrote:
> The first task in writing text-processing programs is to separate the
> input text into words, punctuation, and whitespace.  Could R6RS deal
> with Unicode text as words, punctuation, and whitespace?

Unfortunately, no.

>   Unicode-read port
> would return a word, punctuation, or whitespace object; or an
> eof-object.

This is an AI-complete problem.  Chinese, Japanese and Thai (at least)
don't use whitespace to separate words, and require dictionary lookups
and natural language processing.
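
To make that concrete, here is a rough sketch in Python (not part of the SRFI, and not a proposal for it) of the naive approach: classify each character by its Unicode general category and group runs of the same class. The names classify and tokenize are made up for illustration, and treating everything that is neither whitespace nor punctuation as "word" is a simplifying assumption.

```python
import unicodedata
from itertools import groupby

def classify(ch):
    """Coarse class of one character (hypothetical helper, illustration only)."""
    if ch.isspace():
        return "whitespace"
    # General categories Pc, Pd, Ps, Pe, Pi, Pf, Po are all punctuation.
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    # Simplifying assumption: everything else counts as part of a word.
    return "word"

def tokenize(text):
    """Group consecutive characters of the same class into (class, run) pairs."""
    return [(kind, "".join(run)) for kind, run in groupby(text, key=classify)]
```

On "Hello, world!" this gives the word, punctuation, and whitespace runs you would expect.  On the Japanese sentence "私は学生です" it returns one "word" run covering the whole sentence, because there is no whitespace or punctuation to split on; finding the real word boundaries needs a dictionary.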

Emacs' forward-word and related procedures use a simple hack to be
useful in Japanese (though not actually breaking at word boundaries),
but are useless in Chinese and Thai.

So yes, full multi-lingual processing is very difficult, but fortunately
you rarely need it.  Editors and translation software are about the only
examples I can think of where this is needed, and they will use
specialized libraries anyway.  We just need to specify in this SRFI
enough so that those libraries can be portable.