Re: Why are byte ports "ports" as such?

This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.



Jonathan S. Shapiro scripsit:

> The underlying issue within UNICODE is the existence of the so-called
> "combining characters". There exist characters that have no single
> defining codepoint. These exist primarily in Asian languages, for
> example in the form of multiple code points that together form a single
> "glyph".

In fact they are all over the place: you cannot write such a very
European language as Lithuanian, which uses the Latin script, without
employing them.  (Well, you can write memos or to-do lists, but not
poetry or dictionaries.)

However, whether a "default grapheme cluster" (the Unicode name for a base
character together with its combining characters) is a "character" in the
non-technical sense depends on the culture.  Is an "o" with a dot-above
accent and a macron accent a single "character"?  Sure.  How about a Hindi
consonant letter with associated vowel mark?  Not at all: one sense of
"character" in Hindi covers consonants and vowels separately just as in
Latin, another sense is "run of consonants up to and including the next
vowel."  What about Korean?  Is a Hangul syllable one character or 2-3?
Depends on the context: sometimes one, sometimes the other.
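
To make the ambiguity concrete, here is a rough Python sketch. The helper
simple_clusters is a hypothetical name of my own, and it only attaches
combining marks to the preceding base character; it is a simplification,
not the full default grapheme cluster rules of Unicode Standard Annex #29.
The Hangul part relies on canonical decomposition (NFD) from Python's
standard unicodedata module.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only: it is a simplification and
    does not implement the full UAX #29 grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero canonical combining
        # class for combining marks such as U+0307 and U+0304.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "o" followed by COMBINING DOT ABOVE and COMBINING MACRON.
o_dot_macron = "o\u0307\u0304"

# One precomposed Hangul syllable (U+D55C) and its canonical
# decomposition into conjoining jamo.
hangul = "\uD55C"
jamo = unicodedata.normalize("NFD", hangul)

print(len(o_dot_macron), len(simple_clusters(o_dot_macron)))
print(len(hangul), len(jamo))
```

Under these assumptions the accented "o" is three code points but one
cluster, while the Hangul syllable is one code point or three depending on
which form you look at. Neither count is "the" number of characters.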

"Character" is not a technical term in Unicode because it can't be; it
would have to match too many contradictory expectations.  The Unicode
Glossary, which is not normative, says:

	Character. (1) The smallest component of written language that
	has semantic value; refers to the abstract meaning and/or shape,
	rather than a specific shape (see also glyph), though in code
	tables some form of visual representation is essential for the
	reader's understanding.  (2) Synonym for abstract character
	[defined as "A unit of information used for the organization,
	control, or representation of textual data."]. (3) The basic
	unit of encoding for the Unicode character encoding.  (4) The
	English name for the ideographic written elements of Chinese
	origin. (See ideograph(2).)

There *are* technical terms in Unicode, like code unit, code point,
default grapheme cluster, and so on.  Which of these should be mapped
to a given programming culture's pre-existing concept of "characters"
is a question which Unicode by itself cannot answer.  So far, C has gone
for the 8-bit code unit interpretation, Java for the 16-bit code unit
interpretation, and XML for the code point interpretation.
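
As a rough illustration of how far these interpretations diverge, the
following Python sketch counts the same text three ways. The function name
count_units and the sample string are my own choices, not anything defined
by Unicode; the counts correspond only loosely to what a C program sees in
UTF-8 data, what Java's String.length() reports, and what XML treats as
characters.

```python
def count_units(text):
    """Count one string under three different notions of "character"."""
    return {
        # 8-bit code units: the bytes of the UTF-8 encoding.
        "utf8_code_units": len(text.encode("utf-8")),
        # 16-bit code units: two bytes per unit in UTF-16 (no BOM).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Code points: what len() reports for a Python 3 str.
        "code_points": len(text),
    }


# "o" + COMBINING DOT ABOVE + COMBINING MACRON, then U+1D11E
# (MUSICAL SYMBOL G CLEF), which lies outside the Basic Multilingual Plane.
sample = "o\u0307\u0304\U0001D11E"
counts = count_units(sample)
print(counts)
```

For this sample the three counts are 9, 5, and 4, although a reader would
most likely see two "characters" on the page.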

(The Glossary is at http://www.unicode.org/glossary/ .)

-- 
Andrew Watt on Microsoft:                       John Cowan
Never in the field of human computing           cowan@ccil.org
has so much been paid by so many                http://www.ccil.org/~cowan
to so few! (pace Winston Churchill)