
Re: terminology



At Tue, 10 Feb 2004 13:06:28 -0800 (PST), Tom Lord wrote:
> 
> There is an easy example of why such a category is desirable in
> computing.  Let's suppose that I'm going to specify the lexical syntax
> of identifiers in a programming language.  As part of that
> specification, I'll need to identify this category.  (For an example,
> see "Unicode Technical Report #31: Identifier and Pattern Syntax",
> http://www.unicode.org/reports/tr31/tr31-2.html)

We may want to take that report with a grain of salt for Scheme.  A
simpler approach would be to define Scheme identifiers as everything
_excluding_ the reserved punctuation characters, optionally allowing
Unicode variations on those characters and extending the definition of
whitespace.  Most Schemes already work this way, even though R5RS
defines identifiers with an inclusive list.  In a quick check, Kawa was
the *only* Scheme I found that wouldn't let me enter and use arbitrary
high-bit UTF-8 identifier names; the rest accepted them regardless of
their internal encoding.

[checked Bigloo, Chez, Chicken, Gambit, Gauche, Guile, MIT Scheme,
MzScheme, SCM and SISC]
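
A rough sketch of the exclusion approach, written in Python rather than
Scheme so it doesn't lean on any one implementation's Unicode support.
The reserved set is my reading of the R5RS delimiters and reserved
characters, and is_delimiter/is_identifier are hypothetical helpers, not
taken from any real reader.  It also ignores the rule that a token which
reads as a number is not an identifier.

```python
import unicodedata

# Assumed reserved punctuation: the R5RS delimiters plus quote-like
# characters, brackets, braces, '|' and '#'. A real reader may differ.
RESERVED = set('()[]{}";\'`,|#')

def is_delimiter(ch):
    """True for reserved punctuation and for any Unicode whitespace."""
    if ch in RESERVED or ch.isspace():
        return True
    # Extended whitespace: Unicode separators (Zs, Zl, Zp).
    return unicodedata.category(ch).startswith("Z")

def is_identifier(token):
    """An identifier is any non-empty token containing no delimiter.

    Simplification: tokens that would read as numbers are not rejected.
    """
    return len(token) > 0 and not any(is_delimiter(ch) for ch in token)
```

With this definition is_identifier("list->string") and
is_identifier("\u6728") both hold, while "foo bar", "(x)" and
"a\u3000b" (ideographic space) are rejected.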

> In their wisdom (or absence of wisdom) the Unicode consortium chose a
> name for this category: they call these characters "letters".  That
> _is_ an overloading of the term "letter" -- but it is an overloading
> that pervades the Unicode specifications and data tables.  For
> example, every assigned Unicode codepoint has a property called "the
> major class of its General Category".  The class of alphabetic,
> syllabic, and ideographic characters has the major class "L" (short
> for "letter").

I apologize, I was mistaken.  I was mostly going by the official names
of the characters, which consistently use "letter" only for alphabets.
It seems strange to me to call an ideograph a letter, but if Unicode
officially uses that definition I'm not going to fight it.

Unicode also uses "alphabetic" to describe syllabic characters, and does
not provide any "syllabic" property.
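
For reference, the major class is just the first letter of a code
point's General Category, so it is easy to inspect.  A small sketch
using Python's standard unicodedata module; major_class is a name I made
up, and results follow whatever Unicode version the interpreter bundles.

```python
import unicodedata

def major_class(ch):
    """First letter of the General Category, e.g. 'L' for Lu, Ll, Lo."""
    return unicodedata.category(ch)[0]

# A few sample characters, written as escapes to stay ASCII-safe.
samples = {
    "A": "alphabetic (Latin capital A)",
    "\u3042": "syllabic (Hiragana A)",
    "\u6728": "Han (tree)",
    "\u3007": "IDEOGRAPHIC NUMBER ZERO",
}

for ch, label in samples.items():
    # Print code point, full General Category, and a description.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {label}")
```

The first three samples land in major class L (the Hiragana and Han
characters as Lo), while U+3007 is a letter-like number, Nl.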

> Alex also writes:
> 
>     > "Ideograph" applied to all Han characters is technically
>     > incorrect.  Linguists prefer the term "sinogram" which refers to
>     > Chinese-derived characters.  "Sinogram" fits all uses being
>     > applied to the term "ideograph" in these discussions (at least
>     > until Unicode adds hieroglyphs).  Since the usage of ideograph
>     > is fairly ubiquitous, however, it may not be worth fighting it.
> 
> I have an intellectual curiosity about why you say that "ideograph"
> is inaccurate.

There are four general classifications of Chinese characters (from
Kenneth Henshall's _A_Guide_To_Remembering_Japanese_Characters_):

  # cut&paste into utf-8 terminal for reference
  gosh -E'map(lambda(x y)(format #t"~A (~04X): ~A\n"(ucs->char x)x y))
   `(#x6728 #x5C71 #x99AC #x4E0A #x56DE #x5CE0 #x6CE8)
   `(tree mountain horse up around mountain-pass pour)' -Eexit

1) Pictograph.  U+6728 and U+5C71 are simple stylized pictures of a tree
   and a mountain, respectively.  Though these are simple, some pictographs
   are stylized beyond easy recognition, such as U+99AC (horse).

2) Sign or Symbol.  U+4E0A is a symbol showing the direction up.  U+56DE
   is a stylized form of two concentric circles meaning "around".

3) Ideograph. U+5CE0 shows a mountain on the left (the "radical") with
   the symbols up and down stacked on the right, leading to the idea of
   "mountain pass".

4) Phonetic-Ideograph (or Semasio-Phonetic).  Something like 85% of all
   modern Chinese characters fall into this group.  U+6CE8 (pour) is
   made from the radical for water on the left, plus a character with
   the same sound as a character meaning continuous, thus continuous
   flow of water, a reference to pouring.

It's not always clear which category a character falls into, and this is
mostly of interest to historians anyway.  Unicode itself consistently
refers to all Chinese characters as ideographs, even though most of them
are much more complex, so I'm not really objecting to the term; I was
just nit-picking.  Also, the reference in section 11.1 of the Unicode
standard is the only place I've seen the term "sinogram" (most
references just use "Chinese character" or "Kanji").

> I do note that Han characters are not the only ideographic letters
> encoded in Unicode -- although I'm not sure there is a huge future in
> writing Scheme programs whose identifiers are spelled using the Linear
> B script :-)

Now this gets weird.  The Unicode standard consistently refers to the
Linear B characters as "ideograms", which means the same thing as
"ideograph", but for no apparent reason it uses a different word.  And
they don't have the ideographic property:

gosh> (any (cut char-set-contains? char-set:ideographic <>)
           (map integer->char (map (cut + #x10080 <>) (iota #x100))))
#f

Indeed, the only characters with the ideographic property are the Han
characters (from PropList-4.0.0.txt):

------------------------------------------------------------------------
3006          ; Ideographic # Lo       IDEOGRAPHIC CLOSING MARK
3007          ; Ideographic # Nl       IDEOGRAPHIC NUMBER ZERO
3021..3029    ; Ideographic # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A    ; Ideographic # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
3400..4DB5    ; Ideographic # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FA5    ; Ideographic # Lo [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
F900..FA2D    ; Ideographic # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA2D
20000..2A6D6  ; Ideographic # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
2F800..2FA1D  ; Ideographic # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D

# Total code points: 71053
------------------------------------------------------------------------
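
The total is easy to sanity-check from the ranges themselves.  The
sketch below (Python, names mine) copies the ranges from the listing
above, sums their sizes, and confirms that none of them overlaps the
span probed in the gosh session, U+10080..U+1017F.  The ranges are the
Unicode 4.0.0 ones; other versions of the data file may differ.

```python
# Ideographic ranges as inclusive (first, last) pairs, copied from the
# PropList-4.0.0.txt excerpt.
IDEOGRAPHIC = [
    (0x3006, 0x3006),
    (0x3007, 0x3007),
    (0x3021, 0x3029),
    (0x3038, 0x303A),
    (0x3400, 0x4DB5),
    (0x4E00, 0x9FA5),
    (0xF900, 0xFA2D),
    (0x20000, 0x2A6D6),
    (0x2F800, 0x2FA1D),
]

def total_code_points(ranges):
    """Number of code points covered by inclusive (first, last) ranges."""
    return sum(last - first + 1 for first, last in ranges)

def is_ideographic(cp):
    """True if code point cp falls inside one of the listed ranges."""
    return any(first <= cp <= last for first, last in IDEOGRAPHIC)

# Same span as the gosh check: #x100 code points starting at U+10080.
probed = range(0x10080, 0x10080 + 0x100)

print(total_code_points(IDEOGRAPHIC))            # 71053
print(any(is_ideographic(cp) for cp in probed))  # False
```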

Perhaps we should consider this a bug in the Unicode specification?

-- 
Alex