Re: Unicode and Scheme

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.



Re: "This SRFI is based in part on the presumption that one should be able
to write a portable Scheme program which can accurately read and manipulate
source texts in any implementation, even if those source texts contain
characters specific to that implementation."

Personally, I believe that it's a mistake to extend the interpretation of
Scheme's standard character type and its associated functions in a more
abstract direction. They are presently specified in a way that lets an
implementation base them on the host platform's native 8-bit byte character
encoding, which may be synonymous with the platform's raw octet data
interfaces. That is what has historically enabled various Scheme
implementations to manipulate raw data byte streams as character sequences,
which may actually encode whatever one needs them to. These proposals begin
to break that indirectly, by prohibiting the ability to maintain that
equivalence without offering an alternative.

However, it is likely true that Scheme's character set and the
specification of its associated functions should be tightened up a little
even in this regard; so, as feedback on this aspect of the proposals:

- character-set and lexical ordering could be improved along these lines:

  digit:        0 .. 9

  letter:       A a .. Z z           ;; where A a .. F f are also hex digits

  symbol:       ( ) # ' ` , @ . "    ;; for consistency lexical ordering
                ; $ % & * / : + -    ;; could/should be defined/improved
                ^ _ ~ \ < = > ?
                { } [ ] | !          ;; which should also be included

  space:        space tab newline    ;; as well as tab


- lexical ordering should be refined as above to be more typically useful:

  (char<? #\A #\a ... #\Z #\z) -> #t

  (char<? <digit> <letter> <symbol> <space>) -> #t

- only <letter> characters have different upper/lower case representations;
  all other character encodings, including those unspecified, are unaltered
  by upper-case, lower-case, and read/write-port functions:
  
  (char-upper-case? <digit> #\A..#\Z <symbol> <space>) -> #t
  (char-lower-case? <digit> #\a..#\z <symbol> <space>) -> #t

  (char-upper-case? #\a..#\z) -> #f
  (char-lower-case? #\A..#\Z) -> #f

  (char=? (char-upcase (char-downcase x)) (char-upcase x)) -> #t
  (char=? (char-downcase (char-upcase x)) (char-downcase x)) -> #t

  for all <letter> characters x:
  (char=? (char-upcase x) (char-downcase x)) -> #f

  for all non-<letter> characters x:
  (char=? (char-upcase x) (char-downcase x)) -> #t

  for all characters x:
  (char-ci=? (char-upcase x) (char-downcase x)) -> #t

- all characters are assumed to be encoded as bytes using the host's
  native encoding representation, thereby enabling equivalence between
  the host's native raw byte data I/O and storage, and an implementation's
  character-set encoding.

- portability of the native platform's encoded text is the responsibility
  of the host platform and/or other external utilities aware of the
  transliteration requirements between the various encoding formats.

- implementations which wish to support specific character set encodings
  that may require I/O port transliteration between Scheme's presumed
  platform-neutral character/byte encoding and that of its native host
  may do so by defining a collection of functions which map an arbitrary
  specific character set encoding into Scheme's neutral character/byte
  sequences as required; and/or may extend the definitions of standard
  functions, as long as they do not alter the presumed neutrality and
  binary equivalence between Scheme's character/byte data sequence
  representation and that of its host.
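
As a rough illustration of the ordering proposed in the first two items
above, here is a minimal sketch of a ranking function. It is only a sketch:
every name in it (proposed-rank, proposed-char<?, and the helpers) is
hypothetical, it assumes an ASCII-compatible native encoding, and the
symbols are ranked simply in the order the table lists them.

```scheme
;; Sketch only: every name defined here is hypothetical.
;; Assumes an ASCII-compatible native encoding.

;; Symbols in the order given in the table above.
(define proposed-symbols
  (string->list "()#'`,@.\";$%&*/:+-^_~\\<=>?{}[]|!"))

;; space, tab, newline (9 is tab in ASCII)
(define proposed-spaces
  (list #\space (integer->char 9) #\newline))

(define (between? c lo hi)
  (and (char>=? c lo) (char<=? c hi)))

(define (offset c base)
  (- (char->integer c) (char->integer base)))

;; Position of c in lst, or #f if absent.
(define (index-of c lst)
  (let loop ((l lst) (i 0))
    (cond ((null? l) #f)
          ((char=? c (car l)) i)
          (else (loop (cdr l) (+ i 1))))))

;; digits 0..9, letters 10..61 (A a B b .. Z z),
;; symbols 62..93, spaces 94..96, anything else unranked (#f).
(define (proposed-rank c)
  (cond ((between? c #\0 #\9) (offset c #\0))
        ((between? c #\A #\Z) (+ 10 (* 2 (offset c #\A))))
        ((between? c #\a #\z) (+ 11 (* 2 (offset c #\a))))
        ((index-of c proposed-symbols) => (lambda (i) (+ 62 i)))
        ((index-of c proposed-spaces) => (lambda (i) (+ 94 i)))
        (else #f)))

;; #t only when both characters are ranked and a sorts before b.
(define (proposed-char<? a b)
  (let ((ra (proposed-rank a))
        (rb (proposed-rank b)))
    (and ra rb (< ra rb))))
```

Under those assumptions, (proposed-char<? #\A #\a), (proposed-char<? #\a #\B),
(proposed-char<? #\9 #\A), (proposed-char<? #\z #\() and
(proposed-char<? #\! #\space) should each evaluate to #t.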

(lastly, the notion of enabling Scheme symbols to be composed of arbitrary
 extended character set characters, which may not be portably displayed or
 easily manipulated on arbitrary platforms, is clearly antithetical to
 achieving portability; so the suggestion should just be dropped.)

Although I know that these views may not be shared by many, I don't believe
that Scheme should be indirectly restricted to interfacing only with a
text-only world (regardless of its encoding). I hope that some recognize
that these proposals begin to restrict the applicability of Scheme in just
that way, without providing an alternative mechanism for accessing and
manipulating raw binary data, which is what all truly flexible programming
languages with any legs must be able to do; the computing world is a tad
larger than the assumption that everything to be processed and interfaced
with is text encoded in some specific way.
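
To make the byte/character equivalence argued for above concrete, here is
a minimal sketch of the round trip it relies on. The helper names are
hypothetical, and the sketch assumes an implementation in which
integer->char and char->integer round-trip every value from 0 to 255 and
in which strings may hold any such character; neither is guaranteed by the
standard.

```scheme
;; Sketch only: hypothetical helpers, assuming chars 0..255 map one-to-one
;; onto octets and that strings may hold any of them.

;; Treat a list of octets (integers 0..255) as a character string.
(define (octets->string octets)
  (list->string (map integer->char octets)))

;; Recover the octets from such a string.
(define (string->octets s)
  (map char->integer (string->list s)))

;; True when the octets survive the trip through the character type.
(define (octets-round-trip? octets)
  (equal? octets (string->octets (octets->string octets))))
```

For example, (octets-round-trip? '(1 127 128 255)) is the kind of check
that holds on an implementation with a byte-equivalent character type and
that a stricter character model is not obliged to preserve.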

Thanks for your patience, and hopeful consideration,

-paul-