Specifying character sets with numeric ranges in SRFI-14

This page is part of the web mail archives of SRFI 14 from before July 7th, 2015. The new archives for SRFI 14 contain all messages, not just those from before July 7th, 2015.



I have been reviewing the character-set SRFI in light of my recent study
of Unicode and internationalisation. The main difference is that I have
killed ASCII-RANGE->CHAR-SET and replaced it with UNICODE-RANGE->CHAR-SET.

The general design principles are:

  - I don't want to *require* conformant Schemes to use Unicode.
    I specifically want to allow "small character" implementations
    such as ASCII or Latin-1.

  - However, I do want code to be portable across conformant implementations.
    So, the one routine in SRFI-14 that exposes encodings commits to a 
    Unicode interface as the uber-spec for character encodings. This is
    *independent* of how chars are stored/represented "under the hood,"
    and the API lets the caller choose what happens when a program requests,
    via Unicode, a character that the implementation does not provide.

More elaborate hackery would need a *character* SRFI, with routines for
encoding and decoding characters; that's beyond the scope of SRFI-14.

I append the spec for UNICODE-RANGE->CHAR-SET below. Comments?
    -Olin

unicode-range->char-set  lower upper [error? base-cs] -> char-set
unicode-range->char-set! lower upper  error? base-cs  -> char-set
    Returns a character set containing every character whose Unicode
    code lies in the half-open range [LOWER,UPPER).

    The [LOWER,UPPER) range must lie completely within the general Unicode
    space: 0 <= LOWER <= UPPER <= 2^32 - 1. If the requested range includes
    unassigned Unicode values, these are silently ignored (the current Unicode
    specification has "holes" in the space of assigned codes). If the
    requested range includes "private" or "user space" codes, these are
    handled in an implementation-specific manner; however, a Unicode-based
    Scheme implementation should pass them through transparently.

    If any code from the requested range specifies a valid, assigned Unicode
    character but has no corresponding representative in the implementation's
    character type, then (1) an error is raised if ERROR? is true, and (2) the
    code is ignored if ERROR? is false (the default). This might happen, for
    example, if the implementation uses ASCII characters, and the requested
    range includes non-ASCII characters.
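
The range and ERROR? rules above can be modelled with a short sketch. Python is used here only as an executable model, not as a proposed implementation. The names `unicode_range_to_char_set`, `is_ascii` and `representable`, and the use of a Python set as the character-set type, are assumptions made for illustration. The pretend implementation is ASCII-only, and dropping surrogate and private-use codes is this model's own choice; the proposal leaves private-use handling to the implementation.

```python
import unicodedata

MAX_UPPER = 2**32 - 1      # bound on UPPER taken from the spec text
UNICODE_LIMIT = 0x110000   # Python's chr() stops here; nothing above is assigned


def is_ascii(code):
    """Hypothetical 'small character' implementation: only ASCII is representable."""
    return code < 128


def unicode_range_to_char_set(lower, upper, error=False, representable=is_ascii):
    """Model of the half-open range [lower, upper) together with the ERROR? flag."""
    if not (0 <= lower <= upper <= MAX_UPPER):
        raise ValueError("range must satisfy 0 <= lower <= upper <= 2^32 - 1")
    result = set()
    for code in range(lower, min(upper, UNICODE_LIMIT)):
        category = unicodedata.category(chr(code))
        if category == "Cn":
            continue   # unassigned code: silently ignored, as the spec says
        if category in ("Cs", "Co"):
            continue   # surrogate / private use: this model's assumed choice
        if representable(code):
            result.add(chr(code))
        elif error:
            raise ValueError(f"U+{code:04X} has no native representative")
        # else: assigned but not representable and ERROR? is false, so skip it
    return result
```

With these assumptions, `unicode_range_to_char_set(0x41, 0x44)` yields the set containing A, B and C, because the upper bound is exclusive. A range that crosses out of ASCII, such as `(0x7E, 0x100)`, keeps only the ASCII members when `error` is false and raises when `error` is true.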

    If character set BASE-CS is provided, the characters specified by the
    range are added to it. UNICODE-RANGE->CHAR-SET! is allowed, but not
    required, to side-effect and reuse the storage in BASE-CS;
    UNICODE-RANGE->CHAR-SET produces a fresh character set.
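
The difference between the two variants when BASE-CS is supplied can be sketched as follows. This is a model in Python, not the proposed Scheme API: `range_to_set` and `range_to_set_destructive` are hypothetical names, a Python set stands in for a character set, and the assigned/representable filtering described earlier is deliberately left out to keep the focus on storage reuse.

```python
def range_to_set(lower, upper, base_cs=None):
    """Non-destructive variant: always returns a fresh set."""
    fresh = set() if base_cs is None else set(base_cs)  # copy, never alias
    fresh.update(chr(code) for code in range(lower, upper))
    return fresh


def range_to_set_destructive(lower, upper, base_cs):
    """Destructive variant: permitted (not required) to reuse base_cs's storage."""
    base_cs.update(chr(code) for code in range(lower, upper))
    return base_cs


vowels = {"a", "e", "i", "o", "u"}
extended = range_to_set(0x30, 0x33, vowels)            # vowels is left untouched
untouched = set(vowels)                                # snapshot for comparison
reused = range_to_set_destructive(0x30, 0x33, vowels)  # vowels may now be modified
```

Since the destructive variant is only allowed, not required, to modify its argument, portable callers should use the returned value rather than rely on the side effect.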

    Note that ASCII codes are a subset of the Latin-1 codes, which are in turn
    a subset of the 16-bit Unicode codes, which are themselves a subset of the
    32-bit Unicode codes. We commit to a specific encoding in this routine,
    regardless of the underlying representation of characters, so that client
    code using this library will be portable. I.e., a conformant Scheme
    implementation may use EBCDIC or SHIFT-JIS or even 6BIT to encode
    characters; it must simply map the Unicode characters from the given range
    into the native representation (when possible).
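
To illustrate the last point, here is a minimal sketch of mapping Unicode codes from a range into a non-Unicode native encoding. It uses Python's built-in `cp037` codec as a stand-in for an EBCDIC-based implementation; `native_code` and `native_range` are hypothetical helper names, and the sketch assumes codes below 0x110000.

```python
def native_code(unicode_code, codec="cp037"):
    """Return the native (here EBCDIC) byte for a Unicode code, or None if unmappable."""
    try:
        encoded = chr(unicode_code).encode(codec)
    except UnicodeEncodeError:
        return None   # no representative in the native character type
    return encoded[0] if len(encoded) == 1 else None


def native_range(lower, upper, codec="cp037"):
    """Map each code in [lower, upper) to its native byte, dropping unmappable codes."""
    mapping = {}
    for code in range(lower, upper):
        native = native_code(code, codec)
        if native is not None:
            mapping[code] = native
    return mapping


# Client code names characters by Unicode code; the native bytes differ.
letters = native_range(0x41, 0x44)   # {0x41: 0xC1, 0x42: 0xC2, 0x43: 0xC3}
```

The client asks for U+0041 through U+0043 and never needs to know that the implementation stores them as 0xC1 through 0xC3, which is the portability property the paragraph above argues for.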