This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
From: Tom Lord <lord@xxxxxxx>
Subject: Re: strings draft
Date: Sun, 25 Jan 2004 19:27:58 -0800 (PST)

> I'm not aware enough about the details of Shiro's. All I'm (sort of)
> aware of is that he's dealing with a EUCJP -- which sounds very
> challenging if you want to wind up with an implementation suitable for
> intensive string processing. (Unicode is similarly challenging.)

To be precise, what I'm dealing with is a CES-independent multibyte string representation, currently including utf-8, EUC-JP, and Shift_JIS. EUC-CN, EUC-TW and EUC-KR should be easy to support. I exclude stateful encodings, like ISO2022 with stateful escape sequences. (EUC is a subset of ISO2022, but it only uses single-shift escapes, so it is effectively stateless.)

Tom, if you have a specific instance that discourages mb strings, I'm curious to hear it, either off-list or on-list.

The following is a discussion of why I think mb strings are feasible. Those who aren't interested can skip it.

* * *

"Intensive string processing" varies across application domains. The domain I'm looking at has these properties:

* very frequent use of regexps.
* strings are hardly ever mutated.
* lots of data passing to and from external programs/libraries.

A regexp engine can be implemented on multibyte strings almost as efficiently as on "uniform character array" strings, by compiling the regexp into an octet-stream NFA/DFA. Currently the only penalty in my implementation occurs when you use a character range that covers a large character set. It can be optimized, I think.

Regexps are heavily used to extract part of a string. Returning the substring directly is very efficient if you share the string body. Using string indices can actually be less efficient, even if you use uniform character array strings.

Multibyte representation doesn't necessarily impose a penalty on working with large corpora; e.g. a suffix array can be constructed and used efficiently using byte indices (actually, any kind of string reference).
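The suffix array remark can be illustrated with a small sketch. This is not taken from the implementation discussed in this message; it is a minimal Python example, assuming UTF-8 as the encoding, and the function names (`char_starts`, `build_suffix_array`, `find_all`) are hypothetical. It builds a suffix array whose entries are byte offsets of character boundaries, then searches it by comparing raw bytes, so no character indices are ever computed. The construction is deliberately naive (a plain sort); a real implementation would use a faster algorithm, and other encodings such as EUC-JP or Shift_JIS would need a different boundary test.

```python
def char_starts(data: bytes) -> list[int]:
    """Byte offsets where a UTF-8 character begins.

    Assumption: data is valid UTF-8, so every byte that is not a
    continuation byte (10xxxxxx) starts a character.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def build_suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: character-start byte offsets sorted by suffix."""
    return sorted(char_starts(data), key=lambda i: data[i:])


def find_all(data: bytes, sa: list[int], pattern: bytes) -> list[int]:
    """Byte offsets of every occurrence of pattern, via binary search on sa."""
    n = len(pattern)

    # Lower bound: first suffix whose n-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + n] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "日本語の文字列と日本語の正規表現".encode("utf-8")
    sa = build_suffix_array(text)
    # Every character here takes 3 bytes, so the second match is at byte 24.
    print(find_all(text, sa, "日本語".encode("utf-8")))  # [0, 24]
```

The offsets returned are byte positions, which can be used directly to slice or share the underlying byte buffer without converting to character indices.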
Most external libraries and programs nowadays require strings to be passed in some sort of multibyte format. If you can use the same multibyte format internally, sending and receiving data has little overhead. It may not help when you're writing a program that will be used in a wide variety of environments, but it is an advantage if you're writing in-house tools where you know which encoding is used in the environment.

Of course I don't insist that multibyte strings are generally superior. Actually I'm not 100% sure multibyte strings don't have serious problems. But I was curious, so I started implementing them, and I haven't seen a serious problem yet, though there are some unresolved issues (like how to treat illegal byte sequences).

--shiro