Re: Surrogates and character representation

Thomas Lord scripsit:

> My plan (and stalled code) works that way.  If a
> string contains only codepoints in 0..255, store it as bytes.
> 0..ffff, use 16-bits, otherwise, use 32.

This is a plausible design.  If you are willing to pay more time to save
some more space, you could have multiple flavors of single-byte strings
based on SCSU dynamic windows.  Keep a single overhead byte T with each
single-byte string that indicates the meaning of the byte range 80-FF
(all values below are hex, and x stands for the value of T):

Value of T      Unicode offset  Comment
01..67          x*80            half-blocks from U+0080 to U+3380 
68..A7          x*80+AC00       half-blocks from U+E000 to U+FF80 
F9              00C0            Latin-1 letters + half of Latin Extended-A 
FA              0250            IPA Extensions
FB              0370            Greek 
FC              0530            Armenian 
FD              3040            Hiragana 
FE              30A0            Katakana
FF              FF60            Halfwidth Katakana

So your byte strings (range U+0000..U+00FF) would have a T byte of 01.
Of course there is no requirement to implement this entire scheme;
you can cherry-pick particular T values that make sense.
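
For concreteness, here is a rough Python sketch of how the T byte could
be interpreted.  It is only an illustration, not code from anyone's
implementation: the names window_offset, decode and encode are made up,
and it assumes that bytes 00-7F always stand for U+0000..U+007F, which
is what a T byte of 01 covering U+0000..U+00FF implies.

```python
# Fixed offsets for T values F9..FF, copied from the table above.
FIXED_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + half of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_offset(t):
    """Return the code point that byte 0x80 maps to for tag byte t."""
    if 0x01 <= t <= 0x67:
        return t * 0x80
    if 0x68 <= t <= 0xA7:
        return t * 0x80 + 0xAC00
    if t in FIXED_OFFSETS:
        return FIXED_OFFSETS[t]
    raise ValueError(f"T value not in the table: {t:#04x}")


def decode(t, data):
    """Expand a single-byte string with tag t into a Python str."""
    offset = window_offset(t)
    # Assumption: bytes below 0x80 are plain U+0000..U+007F.
    return "".join(
        chr(b) if b < 0x80 else chr(offset + b - 0x80) for b in data
    )


def encode(t, text):
    """Pack text into bytes under tag t, or return None if it does not fit."""
    offset = window_offset(t)
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif offset <= cp < offset + 0x80:
            out.append(cp - offset + 0x80)
        else:
            return None  # caller falls back to a 16- or 32-bit string
    return bytes(out)
```

A string constructor could try encode() with whichever T values it has
chosen to support and fall back to the 16-bit or 32-bit representation
when every attempt returns None.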

As you read this, I don't want you to feel      John Cowan 
sorry for me, because, I believe everyone       jcowan@xxxxxxxxxxxxxxxxx
will die someday.                               http://www.reutershealth.com
        --From a Nigerian-type scam spam        http://www.ccil.org/~cowan