
Re: encoding strings in memory



bburger@xxxxxxxxxxx wrote:
1. Strings will almost certainly have to be represented as arrays of 32-bit entities, since string-set! allows one to whack any character. This representation wastes memory, since the overwhelmingly common case is to use characters only from the Basic Multilingual Plane (0x0000 to 0xFFFF). For applications we write, the majority of characters are ASCII, even though our software is used around the world. Consequently, we use UTF-8 for storing strings, even though we run on Microsoft Windows (UTF-16-LE).

We have the same problem in the Java world. Native strings and characters are 16-bit Unicode. This would be fine 99% of the time. However, using characters above 0xFFFF requires surrogate pairs.

The problem is string-ref and string-set!. Existing Java-String-based encodings have string-ref return *half* of a surrogate pair. This is no problem for most applications, where you just want to print or copy strings. It's not really a problem for intelligent code that deals with composed characters, which needs to work with variable-length strings anyway. It is a problem for the code in between, which does something with each individual character.
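To make the "half of a surrogate pair" point concrete, here is what Java itself does: charAt (the analogue of an index-based string-ref) hands back one UTF-16 code unit, while codePointAt reassembles the pair into the real character.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP, so Java
        // stores it as a surrogate pair occupying two char slots.
        String s = new StringBuilder().appendCodePoint(0x1D11E).toString();

        System.out.println(s.length());                       // 2 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 character

        // charAt(0) returns only the high surrogate -- not a character.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true

        // codePointAt(0) returns the full code point.
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e
    }
}
```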

Note that even these applications don't actually need a linear mapping from indexes to characters. I.e. arithmetic on indexes in a string is never (well, hardly ever) useful or meaningful. All we need is a "position" magic cookie, similar to stdio's fpos_t.
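A sketch of the "position cookie" idea, using Java's existing API: the int below happens to be a code-unit offset, but the loop only ever treats it as an opaque cursor, advanced by Character.charCount rather than by pos + 1, so it never lands in the middle of a surrogate pair.

```java
public class CursorDemo {
    public static void main(String[] args) {
        // "A", then U+1F600 (an astral character), then "B".
        String s = "A" + new StringBuilder().appendCodePoint(0x1F600) + "B";

        // pos is a magic cookie: no arithmetic on it except "advance".
        int pos = 0;
        while (pos < s.length()) {
            int cp = s.codePointAt(pos);
            System.out.printf("U+%04X%n", cp);
            pos += Character.charCount(cp); // 1 for BMP, 2 for astral
        }
    }
}
```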

One solution is to have multiple "modes". A string may start out in 8-bit mode, switch to 16-bit mode when a 16-bit character is inserted, and then switch to 32-bit mode when a still larger character is inserted. This means the entire string has to be copied when such a character is inserted, but the amortized cost per character is constant. It also means that we need 32 bits per character for the entire string, even if there is only a single character > 0xFFFF.
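A minimal sketch of such a mode-switching string (the class and method names are hypothetical, not any existing library): the backing array starts as byte[] and is copied to char[] or int[] the first time a wider character is stored, so each widening copies the whole string exactly once.

```java
// Hypothetical mode-switching mutable string: widens 8 -> 16 -> 32 bits
// on demand; after widening to 32 bits, every slot costs 32 bits.
public class AdaptiveString {
    private Object data; // byte[], char[], or int[]

    public AdaptiveString(int length) {
        data = new byte[length]; // start in 8-bit mode
    }

    public int length() {
        if (data instanceof byte[]) return ((byte[]) data).length;
        if (data instanceof char[]) return ((char[]) data).length;
        return ((int[]) data).length;
    }

    public int get(int i) { // string-ref analogue: returns a code point
        if (data instanceof byte[]) return ((byte[]) data)[i] & 0xFF;
        if (data instanceof char[]) return ((char[]) data)[i];
        return ((int[]) data)[i];
    }

    public void set(int i, int codePoint) { // string-set! analogue
        if (codePoint > 0xFFFF && !(data instanceof int[])) widenTo32();
        else if (codePoint > 0xFF && data instanceof byte[]) widenTo16();
        if (data instanceof byte[]) ((byte[]) data)[i] = (byte) codePoint;
        else if (data instanceof char[]) ((char[]) data)[i] = (char) codePoint;
        else ((int[]) data)[i] = codePoint;
    }

    private void widenTo16() {
        byte[] old = (byte[]) data;
        char[] wide = new char[old.length];
        for (int i = 0; i < old.length; i++) wide[i] = (char) (old[i] & 0xFF);
        data = wide;
    }

    private void widenTo32() {
        int n = length();
        int[] wide = new int[n];
        for (int i = 0; i < n; i++) wide[i] = get(i); // read via old mode
        data = wide;
    }
}
```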

2. Changing strings to use 32-bit characters will make foreign function interfaces difficult, since the major platforms use UTF-16-LE and UTF-8. It will also break all existing foreign-function code that relies on strings being 8-bit bytes.

The "mode-switching" solution doesn't solve that problem - it makes it worse.

It seems to me that keeping char 8-bit and string as an array of 8-bit bytes would be the least disruptive change.

But what does string-ref return?

I have an idea; see next message.
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/