This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
Bradd, what prompted my comments were your own following comments:

> ... Storing data in non-canonical form is not "broken." Also, there's
> more than one canonical form. ... Programs which disagree on the form
> of the I/O will need to translate between the two.
>
> ... That wouldn't help unless they agree to write the *same* canonical
> format. ...

I agree that programs, platforms, and even users often cannot agree on any single encoding form; it therefore becomes the obligation of the programming language to enable the specification of program code which can access and process data encoded in arbitrary forms. Given that, I can't see how adopting a standard encoding specification for text (or for data of any type, for that matter) accomplishes anything other than preventing the language from accessing and manipulating data stored in other formats, a capability you yourself seem to have recognized as necessary.

As a simple example (which Scheme should be capable of handling): let's presume I have a text file encoded in the IBM PC's old character set, which defines the upper 128 code values as graphic characters, and which for whatever reason I need to read and do something with. If Scheme presumes that all characters are UTF-8-encoded Unicode, then when it encounters an upper-128 IBM PC byte in a position where it expects a new code point to begin, it may mistakenly merge that byte with N successive bytes of data, continuing to misinterpret them as a UTF-8 sequence. At that point, all I've got is a mess: what began as a simple sequence of bytes representing distinct 8-bit non-Unicode characters has been collapsed into Unicode code points, each consuming a varying number of source bytes depending on their values, none of which had any relationship to Unicode to begin with.
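This failure mode is easy to demonstrate. The following is a sketch in Python, using its `cp437` codec to stand in for the IBM PC character set: two bytes that are distinct graphic characters under CP437 happen to form one valid UTF-8 sequence, so a UTF-8 decoder silently merges them.

```python
# Two bytes that are distinct graphic characters in the IBM PC
# character set (CP437), but together form one valid UTF-8 sequence.
data = bytes([0xC3, 0xA9])

as_cp437 = data.decode("cp437")  # two characters: '├' and '⌐'
as_utf8 = data.decode("utf-8")   # one character: 'é' (U+00E9)

print(len(as_cp437))  # 2 -- the original byte boundaries survive
print(len(as_utf8))   # 1 -- the two source bytes were merged
```

Worse, many CP437 byte sequences are not valid UTF-8 at all, so a decoder that presumes UTF-8 would reject the file outright rather than merely garble it.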
- correspondingly, the same problem would occur when attempting to read any arbitrary non-Unicode data; which is likely significant, as most program code doesn't process text at all, it processes binary sampled data (consider the tens of processors likely in your own car, versus the single processor sitting on your desk). I don't believe that Scheme's intent was to be restricted to natively accessing and processing only text, much less only text encoded in any particular format, do you?

(I honestly think you're viewing Scheme, its potential applicability within the broader computing industry, and its corresponding practical requirements too narrowly, with no disrespect intended. Maybe I'm missing the boat, but from the best I can tell, all the discussions seem to be leading to the erroneous presumption that it's adequate for Scheme to restrict itself to exclusively processing data that originates, and is destined, as Unicode-encoded text, which would be most unfortunate.)

-paul-

> From: "Bradd W. Szonye" <bradd+srfi@xxxxxxxxxx>
> Date: Thu, 12 Feb 2004 22:35:57 -0800
> To: srfi-52@xxxxxxxxxxxxxxxxx
> Subject: Re: Encodings.
> Resent-From: srfi-52@xxxxxxxxxxxxxxxxx
> Resent-Date: Fri, 13 Feb 2004 07:36:06 +0100 (NFT)
>
> Paul Schlie wrote:
>> I apologize if my tone was interpreted as being antagonistic.
>>
>> Although I may have abused the use of "canonical", my intent was to
>> suggest that raw data I/O represents the fundamental basis required to
>> support arbitrarily encoded data access ....
>
> I agree. But have we actually advocated arbitrarily encoded data?
> There are two levels here: how you encode the code points (UTF-8, UTF-16,
> UTF-32, something else) and how you normalize the code points. The first
> choice need not be arbitrary or even standardized. But no matter how you
> do it, you'll need to deal with normalization if you're using Unicode.
>
>> and in that respect, tried to suggest that null-encoding may be
>> thought of as the root canonical encoded form (where a null encoding
>> transform does nothing, and is therefore lossless, fully preserving all
>> the originally encoded data states in their native form) ....
>
> That's not generally possible with Unicode. There is no single, standard
> normalization form, and applications must be prepared to deal with that.
> It's one of the consequences of using Unicode. If you ignore it, you
> will not be able to process text efficiently. Fortunately, it's not a
> big deal to normalize graphemes.
>
>> However, under no circumstances should Scheme I/O be presumed to be
>> based on any particular character encoding which may differ from the
>> host platform's presumption ....
>
> Again, I don't think anyone has proposed this. The whole SRFI is
> intended for platforms that *are* well-suited to using Unicode. But even
> in that case, you need to deal with normalization issues.
> --
> Bradd W. Szonye
> http://www.szonye.com/bradd
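[Editorial note on the normalization point above: the issue Bradd raises is that the same visible character can be represented by different code-point sequences, so programs comparing Unicode text must normalize first. A minimal sketch in Python, using the standard `unicodedata` module:]

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (NFC form)
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (NFD form)

# Same grapheme, different code-point sequences:
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# Normalizing both strings to a common form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

This is why a "null encoding" is not by itself sufficient for text processing under Unicode: byte-for-byte preservation does not guarantee that logically identical strings compare equal.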