Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

Thomas Lord scripsit:

> My plan (and stalled code) works that way.  If a
> string contains only codepoints in 0..255, store it as bytes.
> 0..ffff, use 16-bits, otherwise, use 32.
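The width choice in the quoted plan might be sketched as follows. This is only an illustration, not Thomas Lord's actual code; the function name storage_width is hypothetical, and treating the empty string as one byte per character is an assumption.

```python
def storage_width(s):
    """Return bytes per character (1, 2, or 4) under the quoted plan.

    storage_width is a hypothetical name used only for this sketch.
    """
    # Largest code point in the string; an empty string is assumed to
    # take the narrowest representation.
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return 1      # every code point fits in 0..255: store as bytes
    if top <= 0xFFFF:
        return 2      # every code point fits in 0..ffff: 16-bit units
    return 4          # otherwise 32-bit units
```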

This is a plausible design.  If you are willing to pay more time to save
some more space, you could have multiple flavors of single-byte strings
based on SCSU dynamic windows.  Keep a single overhead byte T with each
single-byte string that indicates the meaning of the byte range 80-FF:

Value of T (x)  Unicode offset  Comment
01..67          x*80            half-blocks from U+0080 to U+3380 
68..A7          x*80+AC00       half-blocks from U+E000 to U+FF80 
F9              00C0            Latin-1 letters + half of Latin Extended-A 
FA              0250            IPA Extensions
FB              0370            Greek 
FC              0530            Armenian 
FD              3040            Hiragana 
FE              30A0            Katakana
FF              FF60            Halfwidth Katakana

So your byte strings (range U+0000..U+00FF) would have a T byte of 01.
Of course there is no requirement to implement this entire scheme;
you can cherry-pick particular T values that make sense.
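
A rough sketch of how the T byte could be interpreted, offered as an illustration rather than a definitive implementation. The names FIXED_OFFSETS, window_offset, decode and encode are hypothetical. The sketch assumes bytes 00-7F always mean U+0000..U+007F, and it rejects any T value not listed in the table.

```python
# Fixed offsets for T values F9..FF, copied from the table above.
FIXED_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + half of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}

def window_offset(t):
    """Return the code point that byte 80 maps to for overhead byte t."""
    if 0x01 <= t <= 0x67:
        return t * 0x80
    if 0x68 <= t <= 0xA7:
        return t * 0x80 + 0xAC00
    if t in FIXED_OFFSETS:
        return FIXED_OFFSETS[t]
    raise ValueError("T value %02X is not in the table" % t)

def decode(t, data):
    """Expand a single-byte string to text using overhead byte t."""
    base = window_offset(t)
    # Assumption: bytes below 80 are taken as U+0000..U+007F unchanged.
    return "".join(chr(b) if b < 0x80 else chr(base + b - 0x80) for b in data)

def encode(t, text):
    """Pack text into single bytes, or raise if it does not fit window t."""
    base = window_offset(t)
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif base <= cp < base + 0x80:
            out.append(0x80 + cp - base)
        else:
            raise ValueError("U+%04X is outside window %02X" % (cp, t))
    return bytes(out)
```

An implementation that cherry-picks T values would simply accept fewer of them in window_offset and fall back to the 16-bit or 32-bit representation whenever encode fails.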

As you read this, I don't want you to feel      John Cowan 
sorry for me, because, I believe everyone       jcowan@xxxxxxxxxxxxxxxxx
will die someday.                               http://www.reutershealth.com
        --From a Nigerian-type scam spam        http://www.ccil.org/~cowan