This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Per Bothner writes:
> If you have the luxury of reading your entire file into memory (and in
> the process expanding its size by a good bit) you can of course do all
> kinds of processing and index-building.

I have text files containing 100MB worth of UTF-8 encoded text with character offsets in supplemental files. This happens regularly in corpus linguistics.

> It appears (from http://www.jorendorff.com/articles/unicode/python.html)
> that Python unicode strings are UTF-16 strings, so character offsets
> will break as soon as you go beyond the Basic Multilingual Plane.
> Scheme implementations can of course fix this, though it means using
> 4 bytes per character. Hence the discussion.

Yes, it falls apart with astral-plane characters, but these are fortunately rare. When you build the Python interpreter you can set the size of internal Unicode characters: 2 bytes or 4 bytes. I use a 4-byte Unicode build of the interpreter when I deal with astral-plane characters.

--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
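
For readers following the offset discussion in this message, here is a minimal sketch of why character offsets and UTF-16 code-unit offsets diverge once a string contains a character outside the Basic Multilingual Plane. It assumes Python 3.3 or later, where `str` is always indexed by code point regardless of how the interpreter was built, so the UTF-16 view is simulated by encoding. The helper names `utf16_units` and `code_point_to_utf16_offset` are hypothetical, chosen for illustration only.

```python
def utf16_units(text: str) -> int:
    """Return how many 16-bit code units UTF-16 needs to encode text."""
    # utf-16-le emits no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


def code_point_to_utf16_offset(text: str, cp_offset: int) -> int:
    """Translate a code-point offset into the matching UTF-16 code-unit offset."""
    return utf16_units(text[:cp_offset])


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
sample = "a\U0001D11Eb"

print(len(sample))                            # 3 code points
print(utf16_units(sample))                    # 4 code units (surrogate pair)
print(code_point_to_utf16_offset(sample, 2))  # 3: where 'b' starts in UTF-16
```

An offset of 2 recorded against code points would have to be stored as 3 for a consumer that indexes by UTF-16 code units, which is the kind of mismatch described above for offsets kept in supplemental files.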