[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: vector compare

This page is part of the web mail archives of SRFI 67 from before July 7th, 2015. The new archives for SRFI 67 contain all messages, not just those from before July 7th, 2015.




Per Bothner wrote:
> Do you have any examples or use cases or algorithms that are simpler
> when using this "natural order"?  What prior art is there for
> using this order?

Built-in lexicographical orders on vectors have hurt me more than once
for ordering actual 'vectors', i.e. elements of vector spaces (abelian group
with a field acting on it) of different dimension. By using the lex order you
effectively fix an embedding of the lower-dimensional vector spaces into
the higher---and that is often the wrong thing to do.

Concerning prior art: Are you really asking me why anybody anywhere would
like to order sequences by length and then lexicographically, instead just
lexicographically, and where you can find published references to programs
that do that?

Actually, I do not intend to settle this question by digging up 'prior art' and
looking where the bigger pile is. Let us focus on which problem we would
like to solve, and how the solutions fit with the structures that are present in
Scheme. The advantage of ordering LIST and VECTOR the same is that
it is easier to remember. The advantage of ordering them differently is that
you get the choice---based on the basic operations.

> > If I understand correctly, your argument is that LIST and VECTOR should
> > both be interpreted as mere implementations of some abstract SEQUENCE
> > data type,
>
> I'm saying that some Scheme implementations (at least Kawa) and at
> least one SRFI (44) does this.  I'm also pointing out that Common Lisp
> does this.  I.e. this is not a far-fetched concept.


SRFI 44 defines a lot and implements nothing. I think SRFI 44 represents a good deal of
good work. Yet it is still to be seen that this is really the way data structures should be
organized in Scheme. Comparing containers, for example, is a problem: SRFI 44 defines
nothing for that but its design excludes several possible approaches already.

Concerning Kawa and Common Lisp, the problem is the same: The object system brings
structure to the world of data types---but it also imposes a certain point of view.
This point of is certainly plausible, but it is not the only one, and its benefits are overrated.

My point of view is that the user should be enabled to construct exactly the data structures
required for an application from prefab building blocks---instead of finding a fully organized
world already. Initially the organized world is nice, but sooner or later you hit brick walls,
and the more complex it gets the more that hurts.

Now, if you have sequence data that is naturally ordered lexicographically than you can
use LISTs or VECTORs ordered as lists to represent the data. If you have sequences
that are naturally ordered by length and then lexicographically you can use VECTORs
or LISTs ordered as vectors. And in both cases it is likely that the concrete default
representation has something to do with the application already. (Which is the result of
the 'natural ordering based on the available primitive operations' thing.)

> It is true that if an ordering is defined for sequence,
> then it would be bad to define inconsistent orderings for different
> sub-types of sequence.  I'm making a weaker claim: if you define an
> operation F on type A and on type B that are both sub-types of a common
> super-type S then good object-oriented design suggests you should
> define the operation F to have common properties (axioms) for both
> A and B.


And this what this SRFI is doing: Vector, List, Number, String, Symbol, Char, etc. are all
subtypes of the overall Scheme object type---and they all provide a compare procedure
that satisfies certain axioms. Furthermore the orderings of number types are required
to be compatible with the tower of numeric data types in R5RS, even though the tower
does not really define a subtype hierarchy in the strong sense, because that makes
sense mathematically. Finally, default-compare refines pair-compare. And that is as far
as the subtype thing goes in this SRFI. In particular, LIST and VECTOR are *not* seen
as two concrete realizations of a common container supertype.

Now your point of view is that Vector and List should be interpreted as subtypes
of  a Sequence type---because SRFI 44, Kawa, and Common Lisp do that way
and it is "not a far-fetched concept." (Quite natural even, I would say.)
But I do not find this concept back in the definition of Scheme as it is now,
and there are alternative approaches that give you more choice.

Don't get me wrong, I am in fact a fan of the Sequence approach. I even experimented
with my own Sequence data type as an add-on for Scheme, but not as an abstract
type (or interface) to unify LIST and VECTOR. The sequences I experimented with
were self-optimizing (amortized) data structures that eventually adapt to their use
case (in particular LIFO, FIFO, or random-access) and are reasonably efficient
(i.e. O(n log n) with small constants) over an extremely large scale of size, including
external memory algorithms. In the end, the implementation went out of hand for
a hobby project and I had to stop. Now these sequences do not fit SRFI 44, nor
Kawa, nor Common LISP and they coexist with LIST and VECTOR.

> Even if you view vectors and lists as both sequences, it doesn't
> follow that on operation F must be defined in terms of sequences.
> However, it's usually a good idea.
>
> Moreover, some schemes provide "uniform vectors", and there are SRFIs
> that define them.  What is the natural order for a uniform vector?
> By your argument, and I think "common sense" (admittedly unreliable)
> it should be the same order as for normal vectors.  Certainly if you
> believe uniform vectors "inherit" from an "abstract vector" type.
>
> Now consider strings.  Is a string a vector?  If you have uniform
> vectors, then it seems very strange to not view a string as a uniform
> vector.  It follows that the ordering for strings should be the same
> as for uniform vectors.
>
> Strings use lexicographic order.
> Hence uniform vectors should use lexicographic order.
> Hence vectors should use lexicographic order.


My whole point is that the order to be defined for a data type does not follow
from the data type or its supertypes but is part of what the data type means.
Now if the uniform vectors represent elements of vector spaces than you might
want a different ordering than if they represent strings.

Btw., strings are one the most exceptional things around. The lex. ordering is
historic and not at all 'God-given'. Japanese dictionaries contain several
alternative orderings: By stroke number, by radical (prototypes of parts of the
actual characters, most are also characters in themselves), or by pronounciation.
"The natural order" of strings is a matter of taste. In Germany the treatment of
Umlauts has changed during the past 30 years: Today ä is inserted where
'ae' would be. But I have lots of dictionaries at home that insert ä where
'a' would be. Using Unicode scalar values ä has a high scalar value
so it would be inserted after 'z'. Ordering strings by length>lex can also make
sense in some applications.
 
So I have difficulties following you argumentation above.
My point of view is:

Yes, strings use lex. order because that makes sense as a general default.
No, that does not imply that uniform vectors should also use lex. order.
They can better provide one or more orders that make sense for
their intended applications. And no, that still does not imply that the built-in
VECTOR type should be ordered in any particular way for the sake of
uniformity.

Historically, "uniform vectors" do not allow you to specify the ordering you
want to use---but they should. Lukily, in the majority of the cases it is sufficient
to define any ordering at all. If u8vector or anything provides a default order,
I would make it length>lex because that is the most natural and efficient for
this type, unless you tell me that you always use them in hashing in which
case I would make the default order by hashcode, then length>lex. ;-)

Sebastian.