This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
Handling large character sets in a 'proper' way is a difficult task, but I think we can assume there will be layers (e.g. octets, codepoints, graphemes, ...), and the current level of string/character abstraction may be retained in some way. So I focus on how to interface with R5RS-level Scheme string/character objects.

Before going into details, I point out that if we want maximum portability, we probably should stick to one encoding, "ucs4 character and utf8 string", and let implementations handle all the conversion work. I'll discuss it later.

First, I'll go through the APIs with the assumption that we want efficient treatment of native Scheme strings.

- A Scheme character may be just an octet, or an immediate object that fits in a word, or a multi-word object. (In Gauche, a Scheme character is an immediate object, and fits in a machine word.) Thus SCHEME_STRING_REF should return an implementation-dependent type, scheme_char or something. So should SCHEME_EXTRACT_CHAR. SCHEME_ENTER_CHAR and SCHEME_MAKE_STRING would take scheme_char.

  If a Scheme character is a multi-word object, SCHEME_STRING_REF may invoke GC. Or, if a scheme_char can be allocated on the C stack, SCHEME_STRING_REF might receive a region to store the result, as in

      SCHEME_STRING_REF(scheme_value str, long k, scheme_char *ch)

  to avoid unnecessary allocation on such implementations. But that is less efficient in implementations that use just an octet per character.

- A Scheme string may not be an array of scheme_char objects. (In Gauche, a Scheme string uses a multibyte encoding, i.e. each character may occupy a different number of bytes.) So it is a good question whether SCHEME_EXTRACT_STRING and SCHEME_ENTER_STRING should use scheme_char* or char*. In wide-character string implementations, scheme_char* would be much more efficient; in multi-byte implementations, char* would be much more efficient.
- The body of a Scheme string may be read-only (it is so in Gauche, and it may be shared by many Scheme strings), and/or it may consist of chunks of memory. In such implementations:

  -- SCHEME_STRING_SET may invoke GC, and is potentially very inefficient.

  -- To return a mutable (char *) string, SCHEME_EXTRACT_STRING may need to allocate memory and copy the content to it. Returning (const char *) can be cheaper.

- Preventing SCHEME_ENTER_STRING from creating a string that includes a NUL character seems an unnecessary restriction. Passing the length as well makes it possible to include NUL characters. However, we need to specify whether the "length" is the number of octets or the number of characters.

- If the implementation has a sharable string body, it is useful to tell SCHEME_ENTER_STRING whether it should copy the content or not, so that it can avoid an unnecessary copy.

- SCHEME_GET_IMPORTED_BINDING and SCHEME_DEFINE_EXPORTED_BINDING take char*. An implementation may use internal Scheme strings to represent the names of symbols, so we need to specify a clear mapping between them. The safest way is to limit binding names to ASCII. (A bit off-topic, but why do these APIs take char* instead of const char*?)

For me, native Scheme string representations can vary too much to have one single efficient C API. However, if we think of this SRFI as a way to ease writing portable "bindings" to other existing C libraries, then we have to convert internal Scheme strings to C char* in well-known encodings anyway. If so, the most practical choice of encoding would be ucs4 character and utf8 string (although it isn't the case in my daily working environment).
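To make the "ucs4 character and utf8 string" pairing concrete, here is a minimal sketch in C of encoding one UCS-4 code point as UTF-8, and of counting characters in a length-delimited UTF-8 buffer. It also shows why the octet count and the character count of the same buffer differ, and that an embedded NUL is harmless once a length is passed. The names ucs4char, ucs4_to_utf8 and utf8_char_count are hypothetical illustrations, not part of the SRFI draft or of any implementation, and the counting routine assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical type for illustration: one Unicode code point. */
typedef long ucs4char;

/* Encode one code point as UTF-8 into buf, which must hold at least 4 octets.
   Returns the number of octets written, or 0 if cp is not a valid scalar
   value (negative, above U+10FFFF, or a surrogate). */
size_t ucs4_to_utf8(ucs4char cp, unsigned char *buf)
{
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Count the characters in a UTF-8 buffer of noctets octets by counting
   every octet that is not a continuation octet (10xxxxxx).
   Assumes well-formed input. An embedded NUL counts as one character,
   because the length is passed explicitly. */
size_t utf8_char_count(const unsigned char *s, size_t noctets)
{
    size_t n = 0;
    size_t i;
    for (i = 0; i < noctets; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With this encoding, a buffer holding 'a', NUL and U+3042 occupies five octets but contains three characters, which is the ambiguity a "length" argument has to resolve one way or the other.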
So, suppose we have something like these:

    typedef long ucs4char;
    typedef char utf8string;

And the APIs are:

    ucs4char      SCHEME_EXTRACT_CHAR(scheme_value);                  /* may GC */
    scheme_value  SCHEME_ENTER_CHAR(ucs4char);                        /* may GC */
    utf8string   *SCHEME_EXTRACT_STRING(scheme_value);                /* may GC */
    scheme_value  SCHEME_ENTER_STRING(const utf8string *, long);      /* may GC; always copies the passed string */
    ucs4char      SCHEME_STRING_REF(scheme_value, long);              /* may GC */
    void          SCHEME_STRING_SET(scheme_value, long, ucs4char);    /* may GC */
    scheme_value  SCHEME_MAKE_STRING(long, ucs4char);                 /* may GC */

Note that SCHEME_EXTRACT_CHAR may invoke GC as well, since the implementation may need to run a character-code conversion routine which needs a dynamic buffer.

Furthermore, I think there needs to be a way to extract the "raw" byte stream of the internal string body; the above API assumes all Scheme strings are convertible to Unicode strings, and in reality that is not true. So something like these would help:

    char         *SCHEME_EXTRACT_STRING_RAW(scheme_value);            /* may GC */
    scheme_value  SCHEME_ENTER_STRING_RAW(const char *, long);        /* may GC */

And I also think it's reasonable to have read-only references. (It is debatable whether we should make this the default and have mutable references optional.)

    const utf8string *SCHEME_EXTRACT_STRING_CONST(scheme_value);      /* may GC */
    const char       *SCHEME_EXTRACT_STRING_RAW_CONST(scheme_value);  /* may GC */

--shiro