This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On 7/20/05, Aubrey Jaffer <agj@xxxxxxxxxxxx> wrote:
>
> The first task in writing text-processing programs is to separate the
> input text into words, punctuation, and whitespace.  Could R6RS deal
> with Unicode text as words, punctuation, and whitespace?

Unfortunately, no.

> Unicode-read port
>
>   would return a word, punctuation, or whitespace object; or an
>   eof-object.

This is an AI-complete problem.  Chinese, Japanese and Thai (at
least) don't use whitespace to separate words, and require dictionary
lookups and natural language processing.  Emacs' forward-word and
related procedures use a simple hack to be useful in Japanese (though
not actually breaking at word boundaries), but are useless in Chinese
and Thai.

So yes, full multi-lingual processing is very difficult, but
fortunately you rarely need it.  Editors and translation software are
about the only examples I can think of where this is needed, and they
will use specialized libraries anyway.  We just need to specify in
this SRFI enough so that those libraries can be portable.

--
Alex
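
The whitespace problem described above can be illustrated with a short sketch. This is only an illustration in Python, not anything proposed in the thread: the names `classify` and `naive_tokens` are hypothetical, and the reader simply groups adjacent characters by Unicode general category, which is roughly what a "word, punctuation, or whitespace" port could do without a dictionary.

```python
import unicodedata

def classify(ch):
    """Classify one character as 'space', 'punct', or 'word' (hypothetical helper)."""
    if ch.isspace():
        return "space"
    # Unicode general categories starting with "P" are punctuation.
    if unicodedata.category(ch).startswith("P"):
        return "punct"
    return "word"

def naive_tokens(text):
    """Group adjacent characters of the same class into (class, string) pairs."""
    tokens = []
    for ch in text:
        kind = classify(ch)
        if tokens and tokens[-1][0] == kind:
            # Same class as the previous character: extend the current token.
            tokens[-1] = (kind, tokens[-1][1] + ch)
        else:
            tokens.append((kind, ch))
    return tokens

english = naive_tokens("The cat sat.")
japanese = naive_tokens("猫が座った。")

# English splits into three words because spaces mark the boundaries.
print([text for kind, text in english if kind == "word"])
# The Japanese sentence contains several words but comes back as one run,
# since nothing in the character classes marks where one word ends.
print([text for kind, text in japanese if kind == "word"])
```

Under these assumptions the English input yields separate words while the Japanese sentence is returned as a single "word" token, so finding the real boundaries would need the dictionary lookups mentioned above.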