
words, punctuation, and whitespace

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Having written many text-processing applications in Scheme, I have
found plain R5RS poorly suited to "bespoke" parsers, so I use several
SLIB modules for string-level infrastructure:

  (require 'string-search)
  (require 'string-port)
  (require 'string-case)
  (require 'line-i/o)

These SRFI-75 discussions dealing with character attributes are
leading me to believe that, knowing only one language well, I will be
unable to write language-portable programs.  But why are we working at
the character or even the string level?

The first task in writing text-processing programs is to separate the
input text into words, punctuation, and whitespace.  Could R6RS deal
with Unicode text as words, punctuation, and whitespace?

  Unicode-read port

would return a word, punctuation, or whitespace object; or an
eof-object.
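
As a rough sketch of what such a reader might look like: the code
below is not part of any SRFI or SLIB.  It assumes an R7RS-style
implementation, and the helper `char-class' and the tagged-pair
representation (class . text) are hypothetical choices made only for
illustration.  It splits on the standard character predicates, which
would not segment text in languages written without spaces between
words.

```scheme
(import (scheme base) (scheme char))

;; Hypothetical helper: map a character to one of three classes.
(define (char-class c)
  (cond ((char-whitespace? c) 'whitespace)
        ((or (char-alphabetic? c) (char-numeric? c)) 'word)
        (else 'punctuation)))

;; Sketch of the proposed reader.  Returns a pair (class . string),
;; or an eof-object when the port is exhausted.  Runs of word or
;; whitespace characters are grouped; each punctuation character is
;; returned as its own object.
(define (Unicode-read port)
  (let ((c0 (peek-char port)))
    (if (eof-object? c0)
        c0
        (let ((class (char-class c0)))
          (let loop ((chars '()))
            (let ((c (peek-char port)))
              (if (and (not (eof-object? c))
                       (eq? (char-class c) class)
                       (or (null? chars)
                           (not (eq? class 'punctuation))))
                  (loop (cons (read-char port) chars))
                  (cons class (list->string (reverse chars))))))))))
```

Reading from (open-input-string "Hello, world") with this sketch
yields a word, a punctuation, a whitespace, and a word object in
turn, and then an eof-object.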

A procedure named `Unicode-write' or `Unicode-display' would write a
word, punctuation, or whitespace object to a port.  Perhaps `display'
can serve this purpose.

With case-sensitivity, symbols look like good candidates for word
objects.  Words as symbols would seem to make multilingual Scheme
programs possible.

Lists or vectors of these objects would represent multilingual text
compactly without character size or encoding issues.
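
To make the list representation concrete, here is a minimal sketch
under some assumptions of my own: words are case-sensitive symbols,
punctuation objects are characters, and whitespace objects are
strings.  The names `Unicode-display-all' and `tokens->string' are
hypothetical, and an R7RS-style implementation is assumed.  Because
`display' already handles all three kinds of object, it suffices for
output here.

```scheme
(import (scheme base) (scheme write))

;; A short text as a list of word, punctuation, and whitespace
;; objects.  The representation of each kind is an assumption.
(define text (list 'Hello #\, " " 'world #\!))

;; Write every object in TOKENS to PORT using plain `display'.
(define (Unicode-display-all tokens port)
  (for-each (lambda (tok) (display tok port)) tokens))

;; Convenience wrapper: render the token list back into a string.
(define (tokens->string tokens)
  (let ((out (open-output-string)))
    (Unicode-display-all tokens out)
    (get-output-string out)))
```

In a case-sensitive implementation, (tokens->string text) gives back
"Hello, world!"; a case-folding reader would alter the word symbols,
which is why case-sensitivity matters for this choice.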

As evidence that one can deal with multilingual text at a high level,
consider http://swiss.csail.mit.edu/~jaffer/Scheme.html.jis.  Although
I know no Japanese, I cobbled together this Japanese and English page
by cutting and pasting from Japanese web pages.