Re: words, punctuation, and whitespace

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Aubrey Jaffer scripsit:

> The first task in writing text-processing programs is to separate the
> input text into words, punctuation, and whitespace.  Could R6RS deal
> with Unicode text as words, punctuation, and whitespace?

Unfortunately, Chinese and Japanese do not use whitespace or anything
similar to divide text into words, nor does Thai.  In Chinese, the whole
concept of words is rather artificial; in Japanese, you can divide on
word boundaries based on fairly superficial rules; but in Thai, there
is no alternative to implementing a fairly complex morphological parser
just to do rendering, because line breaks can only be inserted between
words, and without understanding the rules of Thai word construction
in detail you cannot know where the word boundaries are.
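
To make the problem concrete, here is a minimal Python sketch of what
happens when you split on whitespace alone.  The sample sentences and
the helper name whitespace_tokens are my own illustrative assumptions,
not anything from R6RS or the SRFI; the point is only that the Chinese
and Thai samples come back as a single undivided token.

```python
# Illustrative only: naive whitespace tokenization, not a real word segmenter.
# The sample sentences below are assumed examples chosen for this sketch.
samples = {
    "English": "I like reading books",
    "Chinese": "我喜欢读书",        # roughly "I like reading"; no spaces between words
    "Thai": "ฉันรักภาษาไทย",        # roughly "I love the Thai language"; no spaces
}


def whitespace_tokens(text):
    """Split on runs of Unicode whitespace, as str.split() does by default."""
    return text.split()


for language, text in samples.items():
    tokens = whitespace_tokens(text)
    print(f"{language}: {len(tokens)} token(s): {tokens}")
```

Finding the real word boundaries in the second and third samples needs
dictionary or rule-based segmentation, which plain whitespace splitting
cannot provide.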

The ICU library (which has C, C++, and Java flavors) encodes all this
knowledge and a great deal more; it would be well worthwhile, IMHO,
to have an ICU-based SRFI.

"Internationalization is twice as hard as you think, even when you take
this rule into account."

-- 
John Cowan  jcowan@xxxxxxxxxxxxxxxxx  www.reutershealth.com  www.ccil.org/~cowan
"The exception proves the rule."  Dimbulbs think: "Your counterexample proves
my theory."  Latin students think "'Probat' means 'tests': the exception puts
the rule to the proof."  But legal historians know it means "Evidence for an
exception is evidence of the existence of a rule in cases not excepted from."