
Re: Encodings.


Bradd, what prompted my comments were your own following comments:

> ... Storing data in non-canonical form is not "broken." Also, there's
> more than one canonical form. ... Programs which disagree on the form
> of the I/O will need to translate between the two.
>
> ... That wouldn't help unless they agree to write the *same* canonical
> format. ...

And I agree that programs, platforms, and even users often cannot agree on
the use of any single encoding form; it therefore seems to become the
obligation of the programming language to enable the specification of
program code which can access and process data encoded in arbitrary forms.

So I can't see how you can conclude that adopting a standard encoding
specification for text (or for data of any type, for that matter)
accomplishes anything other than preventing that programming language from
accessing and manipulating data stored in other formats, an ability you
yourself seem to have recognized is necessary.

As a simple example (one which scheme ought to be able to handle):

- let's presume I have a text file encoded in the IBM-PC's old character
  set, which defines the upper 128 characters as graphics characters, and
  which for whatever reason I need to read and do something with. If scheme
  presumes that all characters are UTF-8 encoded Unicode characters, then
  whenever it encounters one of those upper-128 8-bit IBM-PC bytes where it
  expects the start of a new code point, it may mistakenly merge that byte
  with some number of the bytes that follow, depending on how they continue
  to be misinterpreted as UTF-8; at which point all I've got is a mess. What
  began as a simple sequence of bytes, each representing a distinct 8-bit
  non-Unicode character, has been dispersed into Unicode code points that
  each consumed a varying number of source bytes, bytes whose values had no
  relationship to Unicode to begin with (see the sketch after this list).

- correspondingly, the same problem occurs when attempting to read any
  arbitrary non-Unicode data; which is significant, since most program code
  doesn't process text at all, it processes binary sampled data; think of
  the tens of processors likely in your own car, versus the single
  processor sitting on your desk.
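
For concreteness, here's a rough sketch of the first case, assuming
byte-level procedures in the style of R6RS's (rnrs bytevectors) and
(rnrs io ports) libraries (portable scheme offers no such byte-level I/O,
which is rather the point); the exact behaviour on malformed input varies
by implementation:

  (import (rnrs))

  ;; Four bytes from an IBM-PC (code page 437) file: the box-drawing
  ;; characters 0xC9 0xCD 0xCD 0xBB (the top edge of a double-line box).
  (define cp437-bytes (u8-list->bytevector '(#xC9 #xCD #xCD #xBB)))

  ;; Read as raw bytes, every 8-bit value survives untouched.
  (display (bytevector->u8-list cp437-bytes)) (newline)
  ;; => (201 205 205 187)

  ;; Decode the same bytes as UTF-8 text, substituting U+FFFD for
  ;; malformed sequences ("replace" error-handling mode).
  (define (decode-as-utf8 bv)
    (get-string-all
     (open-bytevector-input-port
      bv
      (make-transcoder (utf-8-codec) (eol-style none)
                       (error-handling-mode replace)))))

  ;; 0xC9 and 0xCD look like lead bytes of two-byte sequences: the first
  ;; two bytes each fail to decode and become U+FFFD, while 0xCD 0xBB
  ;; happens to form a valid sequence and is merged into the single Greek
  ;; code point U+037B.  Four distinct bytes become three characters,
  ;; none of them the originals, and the byte values are unrecoverable.
  (display (decode-as-utf8 cp437-bytes)) (newline)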

I don't believe scheme was ever intended to be restricted to natively
accessing and processing only text, much less only text encoded in some
particular format, do you?

(I honestly think you're viewing scheme, its potential applicability within
the broader computing industry, and its corresponding practical requirements
too narrowly, with no disrespect intended. Maybe I'm missing the boat, but
as best I can tell, all the discussion seems to be leading to the erroneous
presumption that it's adequate for scheme to restrict itself exclusively to
processing data that originates, and is destined, as Unicode-encoded text,
which would be most unfortunate.)

-paul-

> From: "Bradd W. Szonye" <bradd+srfi@xxxxxxxxxx>
> Date: Thu, 12 Feb 2004 22:35:57 -0800
> To: srfi-52@xxxxxxxxxxxxxxxxx
> Subject: Re: Encodings.
> Resent-From: srfi-52@xxxxxxxxxxxxxxxxx
> Resent-Date: Fri, 13 Feb 2004 07:36:06 +0100 (NFT)
> 
> Paul Schlie wrote:
>> I apologize if my tone was interpreted as being antagonistic.
>> 
>> Although I may have abused the use of "canonical", my intent was to
>> suggest that raw data I/O represents the fundamental basis required to
>> support arbitrarily encoded data access ....
> 
> I agree. But have we actually advocated arbitrarily encoded data?
> There's two levels here: How you encode the codepoints (UTF-8, UTF-16,
> UTF-32, something else) and how you normalize the codepoints. The first
> choice need not be arbitrary or even standardized. But no matter how you
> do it, you'll need to deal with normalization if you're using Unicode.
> 
>> and in that respect, tried to suggest that null-encoding may be
>> thought of as root canonical encoded form (where a null encoding
>> transform does nothing, therefore lossless, fully preserving all the
>> originally encoded data states in their native form) ....
> 
> That's not generally possible with Unicode. There is no single, standard
> normalization form, and applications must be prepared to deal with that.
> It's one of the consequences of using Unicode. If you ignore it, you
> will not be able to process text efficiently. Fortunately, it's not a
> big deal to normalize graphemes.
> 
>> However under no circumstances should scheme I/O be presumed to be
>> based on any particular character encoding which may be different than
>> the host platforms presumption ....
> 
> Again, I don't think anyone has proposed this. The whole SRFI is
> intended for platforms that *are* well-suited to using Unicode. But even
> in that case, you need to deal with normalization issues.
> -- 
> Bradd W. Szonye
> http://www.szonye.com/bradd
>
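
(A minimal illustration of the normalization point above, assuming the
string-normalize-* procedures of R6RS's (rnrs unicode) library: the same
grapheme can arrive either as one precomposed code point or as a base
character plus a combining mark, and only after normalizing both to a
common form do the two compare equal.)

  (import (rnrs))

  ;; "é" written two ways: precomposed U+00E9, and "e" followed by the
  ;; combining acute accent U+0301.
  (define nfc-form (string #\xE9))        ; one code point
  (define nfd-form (string #\e #\x301))   ; two code points

  ;; As raw code-point sequences they differ...
  (display (string=? nfc-form nfd-form)) (newline)   ; => #f

  ;; ...but after normalizing both to NFC they compare equal.
  (display (string=? (string-normalize-nfc nfc-form)
                     (string-normalize-nfc nfd-form)))
  (newline)                                          ; => #t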