This page is part of the web mail archives of SRFI 14 from before July 7th, 2015. The new archives for SRFI 14 contain all messages, not just those from before July 7th, 2015.
I have been reviewing the character-set SRFI in light of my recent study of Unicode and internationalisation. The main difference is that I have killed ASCII-RANGE->CHAR-SET and replaced it with UNICODE-RANGE->CHAR-SET.

The general design principles are:

- I don't want to *require* conformant Schemes to use Unicode. I specifically want to allow "small character" implementations such as ASCII or Latin-1.

- However, I do want code to be portable across conformant implementations.

So, the one routine in SRFI-14 that exposes encodings commits to a Unicode interface as the uber-spec for character encodings. This is *independent* of how chars are stored/represented "under the hood," and the API lets the user choose what happens when a program requests, via its Unicode code, a character that the implementation does not provide.

More elaborate hackery would need a *character* SRFI, with routines for encoding and decoding characters; that's beyond the scope of SRFI-14.

I append the spec for UNICODE-RANGE->CHAR-SET below. Comments?
    -Olin

unicode-range->char-set  lower upper [error? base-cs] -> char-set
unicode-range->char-set! lower upper error? base-cs -> char-set

    Returns a character set containing every character whose Unicode code lies in the half-open range [LOWER,UPPER).

    The [LOWER,UPPER) range must lie completely within the general Unicode space: 0 <= LOWER <= UPPER <= 2^32 - 1.

    If the requested range includes unassigned Unicode values, these are silently ignored (the current Unicode specification has "holes" in the space of assigned codes).

    If the requested range includes "private" or "user space" codes, these are handled in an implementation-specific manner; however, a Unicode-based Scheme implementation should pass them through transparently.

    If any code from the requested range specifies a valid, assigned Unicode character but has no corresponding representative in the implementation's character type, then (1) an error is raised if ERROR? is true, and (2) the code is ignored if ERROR? is false (the default). This might happen, for example, if the implementation uses ASCII characters, and the requested range includes non-ASCII characters.

    If character set BASE-CS is provided, the characters specified by the range are added to it. UNICODE-RANGE->CHAR-SET! is allowed, but not required, to side-effect and reuse the storage in BASE-CS; UNICODE-RANGE->CHAR-SET produces a fresh character set.

    Note that ASCII codes are a subset of the Latin-1 codes, which are in turn a subset of the 16-bit Unicode codes, which are themselves a subset of the 32-bit Unicode codes. We commit to a specific encoding in this routine, regardless of the underlying representation of characters, so that client code using this library will be portable. That is, a conformant Scheme implementation may use EBCDIC or SHIFT-JIS or even 6BIT to encode characters; it must simply map the Unicode characters from the given range into the native representation (when possible).
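
The behaviour specified above can be modelled roughly as follows. This is only a sketch, written in Python rather than Scheme, and it rests on several assumptions that are not part of the proposal: the implementation is taken to be ASCII-only (codes below 128), character sets are modelled as frozensets of one-character strings, and Python's unicodedata tables stand in for "the current Unicode specification". The names unicode_range_to_char_set, UnrepresentableCharacter, and native_limit are hypothetical. Private-use codes and surrogates are simply dropped, which is one possible implementation-specific choice among several.

```python
import unicodedata

MAX_CODE = 2**32 - 1        # upper bound of the code space named in the spec
PYTHON_LIMIT = 0x110000     # chr() only accepts codes below this value


class UnrepresentableCharacter(Exception):
    """Hypothetical error: an assigned code has no native character."""


def unicode_range_to_char_set(lower, upper, error=False,
                              base_cs=frozenset(), native_limit=128):
    """Model of UNICODE-RANGE->CHAR-SET for an ASCII-only implementation.

    native_limit is an assumption of this sketch: codes at or above it
    have no representative in the implementation's character type.
    """
    if not (0 <= lower <= upper <= MAX_CODE):
        raise ValueError("need 0 <= lower <= upper <= 2^32 - 1")

    result = set(base_cs)   # copy, so base_cs itself is never mutated

    # Codes at or above PYTHON_LIMIT are unassigned, so they are ignored.
    for code in range(lower, min(upper, PYTHON_LIMIT)):
        ch = chr(code)
        # Cn = unassigned (silently ignored, per the spec).
        # Co = private use, Cs = surrogate: dropped here by assumption.
        if unicodedata.category(ch) in ("Cn", "Co", "Cs"):
            continue
        if code >= native_limit:
            # Assigned character with no native representative.
            if error:
                raise UnrepresentableCharacter(hex(code))
            continue
        result.add(ch)

    return frozenset(result)   # always a fresh set


# The range is half-open: 0x44 ("D") is excluded.
print(sorted(unicode_range_to_char_set(0x41, 0x44)))   # ['A', 'B', 'C']
```

With error left false, a request such as unicode_range_to_char_set(0x7E, 0x100) keeps only the two ASCII characters in the range; with error=True the same call raises at the first assigned non-ASCII code. The linear-update variant is not modelled, since it may legitimately behave exactly like the pure one.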