I used this technique to show that libc++'s implementation of std::sort has an O(n^2) worst case. There's still an open bug report for this in libc++: https://llvm.org/bugs/show_bug.cgi?id=20837.
On a related note, I'm talking to the libc++ and libstdc++ developers about replacing introsort with pdqsort (https://github.com/orlp/pdqsort) as the implementation for std::sort. It solves the worst case in a subtly different (and potentially more efficient) way than introsort. Although that's not the real selling point.
EDIT: I forgot I still use my old username on hacker news. To prevent confusion, I'm also the author of pdqsort.
I don't get how you could create an adversarial input in advance for an implementation that uses randomized partitioning (the index of the pivot drawn from a uniform distribution).
Can anyone fill me in?
What this article describes (for that case, as far as I can tell) is an adversary that modifies the input while quicksort is working. How realistic is that?
That algorithm is only used to create an input pattern. When that pattern is fed back to the same sorting function, it will cause it to go quadratic. If your target is using the same sorting function, it will do the same on their side.
This attack is realistic. I have tried it on the glibc qsort, a generic quicksort, and the Visual Studio qsort, and they were all vulnerable.
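For anyone curious how such an input can be generated without ever seeing the library's source, here is a minimal sketch in the spirit of M. D. McIlroy's "A Killer Adversary for Quicksort" (the article's exact program may differ; the array size and names here are illustrative). The comparator decides the values lazily: whenever the sort commits to a pivot candidate, that candidate is frozen to the next smallest value, so every pivot the sort picks ends up nearly minimal.

    // Sketch of a comparison adversary (illustrative, not the article's code).
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    static std::vector<int> val;  // val[i]: frozen value of element i, or 'gas'
    static int gas;               // sentinel: value not decided yet
    static int nsolid;            // next small value to hand out
    static int candidate;         // index currently being treated like a pivot

    static int adversary_cmp(const void* px, const void* py) {
        const int x = *static_cast<const int*>(px);
        const int y = *static_cast<const int*>(py);
        if (val[x] == gas && val[y] == gas)        // both undecided: freeze one
            val[x == candidate ? x : y] = nsolid++;
        if (val[x] == gas) candidate = x;           // remember the surviving unknown
        else if (val[y] == gas) candidate = y;
        return val[x] - val[y];
    }

    int main() {
        const int n = 100000;
        gas = n;                  // larger than any frozen value
        nsolid = 0;
        candidate = 0;
        val.assign(n, gas);

        std::vector<int> idx(n);
        for (int i = 0; i < n; ++i) idx[i] = i;

        // Run the target sort against the adversarial comparator; the frozen
        // values it leaves behind are the "input pattern" to feed back in.
        std::qsort(idx.data(), idx.size(), sizeof(int), adversary_cmp);

        std::printf("generated a killer input of %d values\n", n);
        return 0;
    }

Sorting the indices once against this comparator assigns a value to nearly every position; writing those values out and sorting them again with a plain comparator is the "feed it back" step described above. Note that a comparator with side effects is outside the formal contract of qsort/std::sort, so treat this as a testing tool, not production code.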
I'm interested in the way you use the word "attack."
I was looking at it from the perspective that you could never run into an input that would consistently sort in O(n^2) using randomized quicksort.
But your perspective (maybe this article's perspective) is that some malicious adversary could force the sorter to do extra work by knowing in advance the steps that it will take. Am I getting that right? Like from a flops/compute-cost security standpoint?
If so, why care about this situation? It only seems like a realistic scenario in very weird domains, like Amazon Web Services messing with everyone's calls to some standard sorting code to charge them a few more cents of compute time or something...
If I'm not mistaken, the attack is aimed at a quicksort implementation where the pivot is chosen as the median of the first, middle, and last values. It would not work on randomized quicksort, of course.
What's the problem? From D. Knuth's The Art of Computer Programming: Sorting and Searching: (1) when sorting n records by comparing pairs of keys, the Gleason bound shows that on average you can't sort faster than O( n ln(n) ); (2) heap sort sorts n records in both the worst case and the average case in time O( n ln(n) ). Net, when sorting by comparing pairs of keys, in main memory, on a simple machine architecture, with a single processor, ..., you can't improve on heap sort.
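(For reference, the bound being invoked here is the information-theoretic one: a comparison sort must distinguish all n! input orderings, and each comparison yields at most one bit, so by Stirling's approximation

    \log_2(n!) \;=\; \sum_{k=1}^{n} \log_2 k \;\ge\; n\log_2 n - n\log_2 e \;=\; \Omega(n \log n),

and heapsort's worst case of roughly 2 n \log_2 n comparisons matches this bound up to a constant factor.)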
Big O hides all constants. A hybrid algorithm using quicksort for the heavy lifting is O(n log n) in the worst case, but is significantly faster than heapsort - mainly due to cache effects.
Now, and for a long time, the usual criterion of performance has been big-O notation, and that is what I considered, with the Gleason bound and heap sort, and thus showed that, with various common assumptions, you can't do better than heap sort. As far as I know, I was correct.
So, to claim to beat heap sort, you have to set aside some of the usual assumptions and the big-O criterion.
Quicksort is great with a cache since it has great locality of reference when working with the partitions that fit into the cache. But the worst case of quicksort is still O( n^2 ), so what happens if quicksort encounters such a partition?
Just intuitively, it appears that heap sort has a tough time exploiting usual cache designs. We know that. There is a modification of heap sort that helps it work better when it has to do virtual memory paging.
I used to get torqued at considering big-O, and did this just because it ignores constants; as a programmer, I worked hard to do well with such constants. But in the end I gave in and gave up and just accepted big-O notation as the most appropriate, simple, single criterion. And, for judging algorithms by how well they exploit the cache, I gave up on that, too: broadly, it's the responsibility of such speedup techniques to work on ordinary code written for a simple processor architecture, not the responsibility of such code to do tricky things to exploit tricky architectures.
In the land of big-O, a factor of only 2 is considered not very exciting.
If you relax some of the assumptions, then you can beat heap sort and O( n ln(n) ): use radix sort. That was the sorting algorithm of the old punch card sorting machines. E.g., if you have 100 billion records to sort where each key is an alphanumeric string 10 characters long, then radix sort does all the work in just 10 passes over the 100 billion records. Or, for keys m characters long, you can sort n records in O( nm ). So that's faster when m < ln(n) or, to make the arithmetic easier, m < log(n). But log(100 billion) = 11, so radix sort should be faster with m = 10 and n = 100 billion.
So, for a hybrid sort routine, consider also radix sort.
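For concreteness, here is a minimal LSD radix sort sketch for fixed-width string keys along the lines of the punch-card example (the function name and key length are illustrative, and every key is assumed to be exactly KeyLen bytes long):

    #include <array>
    #include <cstddef>
    #include <string>
    #include <vector>

    template <std::size_t KeyLen>
    void lsd_radix_sort(std::vector<std::string>& keys) {
        const std::size_t n = keys.size();
        std::vector<std::string> out(n);
        // One stable counting-sort pass per character position, least
        // significant position first: KeyLen passes over n records.
        for (std::size_t pos = KeyLen; pos-- > 0;) {
            std::array<std::size_t, 256> count{};              // byte histogram
            for (const auto& k : keys)
                ++count[static_cast<unsigned char>(k[pos])];
            std::size_t sum = 0;                               // offsets via prefix sums
            for (auto& c : count) { const std::size_t t = c; c = sum; sum += t; }
            for (const auto& k : keys)                         // stable scatter
                out[count[static_cast<unsigned char>(k[pos])]++] = k;
            keys.swap(out);
        }
    }

Each pass is a stable counting sort on one character position, so the total work is O( n * KeyLen ) plus a small fixed amount of bookkeeping per pass, which is exactly the m-versus-log(n) trade-off described above.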
> But the worst case of quicksort is still O( n^2 ), so what happens if quicksort encounters such a partition?
I'd strongly suggest you read the readme I wrote for https://github.com/orlp/pdqsort, which explains in detail how I beat heapsort (and other sorting algorithms). It also shows how I prevent the O(n^2) worst case.
> I used to get torqued at considering big-O, and did this just because it ignores constants; as a programmer, I worked hard to do well with such constants. But in the end I gave in and gave up and just accepted big-O notation as the most appropriate, simple, single criterion.
We have already established that O(n log n) is the lower bound, and we have reached this lower bound. This is where big O stops being useful. A metric that compares equal for every relevant algorithm (mergesort, introsort, pdqsort, timsort, smoothsort, heapsort) is not a useful metric.
> In the land of big-O, a factor of only 2 is considered not very exciting.
I live in the land of real-world performance, powered by actual benchmarks. In this land a factor of 2 is very exciting.
> So, for a hybrid sort routine, consider also radix sort.
I worked in the scope of std::sort, which is strictly a comparison sort.
> We have already established that O(n log n) is the lower bound, and we have reached this lower bound. This is where big O stops being useful. A metric that compares equal for every relevant algorithm (mergesort, introsort, pdqsort, timsort, smoothsort, heapsort) is not a useful metric.
With the explosion of cores and special purpose hardware, it is interesting to think of how long before sorting network routines are more practical. Dropping the runtime to O(lg n) would be enticing. :)
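For a sense of what that hardware-friendly option looks like, here is a sketch of a bitonic sorting network for power-of-two sizes (illustrative code, not from the thread). Each (k, j) pair below is one layer of independent compare-exchanges, and there are O(log^2 n) layers in total, so with enough comparators the layer count, not the n log n comparisons, bounds the elapsed time; the O(log n)-depth AKS network exists but its constants make it impractical.

    #include <cstddef>
    #include <utility>
    #include <vector>

    void bitonic_sort(std::vector<int>& a) {
        const std::size_t n = a.size();                  // assumed: power of two
        for (std::size_t k = 2; k <= n; k <<= 1)          // size of bitonic runs
            for (std::size_t j = k >> 1; j > 0; j >>= 1)  // one layer of the network
                for (std::size_t i = 0; i < n; ++i) {     // independent compare-exchanges
                    const std::size_t l = i ^ j;          // partner of position i
                    if (l > i &&
                        (((i & k) == 0 && a[i] > a[l]) || // ascending block
                         ((i & k) != 0 && a[i] < a[l])))  // descending block
                        std::swap(a[i], a[l]);
                }
    }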
I assumed that we were talking about in place sorting.
"I feel your pain." As I tried
to explain, the first time I saw big-O notation taken seriously as the main metric for evaluating the actual running time of actual code on actual data on an actual computer, I had your reaction: the constants are important, and the big-O criterion is not good.
There's another point: the silent assumption of the big-O criterion is that we are interested in (A) something fundamental about algorithms, just the algorithms, and not (B) actual running time of actual code on actual data on actual computers. And big progress in algorithms is also important and, as we know, in some cases can give much better performance gains in real computing than anything a programmer can do without the big progress.
And we have a great example: there were days before heap sort, shell sort, and quicksort when the in-place sorting algorithm people coded was bubble sort or something closely related and also O(n^2). Then, in practice, the difference was just huge.
Early in my career, I ran into that at Georgetown University: a prof had coded up some software for teaching statistics and had used the IBM Scientific Subroutine Package (SSP) for most of the actual calculations. One of the SSP routines was to find ranks, and it had two loops, each essentially bubble sort. In testing, the thing ran all lunch time.
Well, the data to be ranked was 32-bit integers, and an array index was a 16-bit integer. So, I was able to do a tricky overlay of two 16-bit integers on the 32-bit data, use heap sort, and get O( n ln(n) ). In actual running time on nearly any possible real data, my code was so much faster than IBM's SSP that I blew the doors off the SSP.
Lesson 1: O( n ln(n) ) instead of O( n^2 ) can be much better in practice.
Lesson 2: progress in big-O in algorithms can be terrific stuff.
Back to practice: early in writing the software for my startup, I tried to write a polymorphic version of heap sort. I wrote in Microsoft's Visual Basic .NET. For the data type to be sorted, I passed just the type object. The software ran correctly right away. Then I did some careful timings -- the software commonly ran ballpark 10 times longer than I would have guessed. What the heck? Well, the run time was smart enough to figure out the actual data type and do the comparison, but the overhead was enormous. Then I learned that for polymorphic code I was supposed to write some interfaces, one for each data type to be sorted, and for sorting a particular data type pass the corresponding interface.
Okay. My confusion was that that use of an interface didn't sound polymorphic to me and sounded no different than what I'd long done passing entry variables in PL/I and Fortran. Okay, for polymorphism, use interfaces. And in the code of the interface, you have to work essentially without strong typing, essentially with untyped pointers. Okay -- been there, done that, in both PL/I and Fortran; I was surprised that a serious object-oriented language with polymorphism would have essentially just syntactic sugar over what I'd been doing in PL/I and Fortran. Okay.
Lesson learned: there are cases of hype in parts of programming languages.
Heck, in PL/I my approach to sorting an array of PL/I structures would be to pass a PL/I structure that had an element variable that was an entry variable to a procedure that would do the comparisons. That is, you could have some conventions that would let PL/I structures be self-describing. Then you could write code that was more general. Maybe under the covers that's just what's going on in the Microsoft common language runtime (CLR) that Visual Basic .NET used for me without letting me know -- more syntactic sugar?
So, when I rewrote my polymorphic heap sort to do the comparisons with a passed interface, the performance got okay. Still, I have non-polymorphic versions for the more important element data types, and they are faster in just the way you emphasized. Yup, been there, done that.
So, I, too, was concerned about actual running time on actual data. But for the algorithm itself, I was still using just heap sort and didn't try to write a hybrid routine that sometimes might exploit, say, radix sort.
On the pros and cons of using big-O as the only criterion, there's a big example in linear programming: in practice, the simplex algorithm is fantastically fast. Intuitively, in practice, the early iterations in effect focus on what to do to get to the optimal solution (assuming feasibility, no unboundedness, etc.). But due to some work of Klee and Minty, the worst case of simplex is exponential and just awful. Intuitively, the Klee and Minty examples trick simplex into making really bad choices!
So, there was research to find a polynomial algorithm for linear programming. Eventually there was an ellipsoid algorithm that was polynomial, but the constant was so high that whenever in practice ellipsoid beat simplex, both ran too long to be practical. So, really, the concern about simplex and the effort for ellipsoid were progress just in algorithms and in the big-O criterion. And in practice, right away, the constant was so large that ellipsoid was a flop.
Still, in algorithms, people are straining to do better in the sense of the big-O criterion. Of course the biggest effort here is to find an algorithm that shows that P = NP, maybe one of the most important problems in both math and computer science, based on big-O for some polynomial.
With all of that, I decided in my own work, for something as simple as sorting, just to go with heap sort, maybe even a polymorphic heap sort. Indeed, with all the emphasis on polymorphism, people can't be very concerned about constants! :-)!
I'm impressed you can write such a long story yet address none of my points. You should consider becoming a politician.
The only relevant piece of information I could find in your anecdote is this:
> I assumed that we were talking about in place sorting.
If we limit ourselves to in-place sorting algorithms, then we are still left with block sort, pdqsort, introsort, and heap sort, all of which run in O(n log n). My argument still holds.
I wrote that long post trying to explain my attitude on sorting software from 50,000 or so feet up and, in particular, why I have been happy just to use heap sort and forget about looking for anything faster on average.
I was reluctant to read the documentation in your GitHub submission if only because in my software development I have never needed or used GitHub.
Also, I was reluctant to believe that there could be any version of quicksort that would have worst case running time of O( n ln(n) ).
Why reluctant to believe?
First, sure, the usual version of quicksort has us select a pivot value for a partition from a "median of three" keys from that partition.
Second, sure: for a good pivot value, we want essentially the median of the keys in the partition we are trying to split into two partitions. And, sure, the median-of-three approach to selecting a pivot value is an approximation to the median we really want and, thus, a relatively good pivot value. So this approach to selecting a pivot value promises, on average, faster running times. Fine. Faster on average is good. Terrific.
But this pivot selection rule says next to nothing about worst case running time, and, for claiming worst case running time of O( n ln(n) ), that is a big issue.
Third, with this median-of-three approach to selecting pivot values, we can think of bad arrays of input keys that will cause quicksort to run in time no faster than O( n^2 ). Then maybe we can think of other ways to select pivot values that result in a new version of quicksort, a version that runs in O( n ln(n) ) again on those bad arrays. But with this change, here's the problem: we are lacking a proof that the new version of quicksort has worst case running time of O( n ln(n) ).
That is, our new version of quicksort is good on some old cases of bad arrays but may have created for itself some new cases of bad arrays. So, with no proof, the new version of quicksort may also have worst case running time of O( n^2 ), like the old version did. The main difference might be that the bad arrays for the new version are more complicated to construct than the bad arrays for the old version.
But I just looked at the file README.MD you suggested. I see two issues:
First Issue:
> When a new pivot is chosen it's compared to the greatest element in the partition before it. If they compare equal we can derive that there are no elements smaller than the chosen pivot. When this happens we switch strategy for this partition, and filter out all elements equal to the pivot.
If I understand this statement correctly, I suspect that for some arrays of input keys, the resulting algorithm won't actually sort the keys. That is, we won't have a sorting algorithm.
Second Issue:
> While this technically still allows for a quadratic worst case, the chances of it happening are astronomically small.
For proving worst case running time of O( n ln(n) ), "astronomically small" is irrelevant. Maybe your remark "astronomically small" is just a side comment and not relevant to PDQSORT having worst case running time of O( n ln(n) ), but from your documentation I can't easily tell.
> A bad partition occurs when the position of the pivot after partitioning is under the 12.5% (1/8th) percentile or over the 87.5% percentile - the partition is highly unbalanced. When this happens we will shuffle four elements at fixed locations for both partitions. This effectively breaks up many patterns. If we encounter more than log(n) bad partitions we will switch to heap sort.
For this and the associated documentation, I don't see a solid argument that PDQSORT has worst case running time of O( n ln(n) ).
For some work, a big-O guarantee on worst case performance is a biggie, as in crucial to human life. Such a guarantee is essentially a theorem in the applied math called computer science. A good example of how to prove such theorems is, of course, in D. Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching.
Again, I have been happy just to use heap sort and forget about looking for anything faster on average.
> If I understand this statement correctly, I suspect that for some arrays of input keys, the resulting algorithm won't actually sort the keys. That is, we won't have a sorting algorithm.
You don't understand the statement correctly. However, this is not a trivial realization.
Because pdqsort always recurses on the left partition first, every element to the left of the current recursive call is less than or equal to every element within the current recursive call. So if we find that the selected pivot is equal to an element before the current recursive call, we can conclude that there can only be elements equal to the pivot, and not smaller, in the current recursive call.
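A tiny self-contained illustration of the "filter out all elements equal to the pivot" step that the quoted readme text refers to (a hypothetical helper, not pdqsort's actual partition_left), under the precondition just derived that nothing in the range is smaller than the pivot:

    #include <algorithm>

    template <typename Iter, typename T>
    Iter skip_elements_equal_to_pivot(Iter first, Iter last, const T& pivot) {
        // Precondition: no element in [first, last) is smaller than 'pivot',
        // so !(pivot < x) means x == pivot. Equal elements are moved to the
        // front and are already in their final place; only the returned range
        // [result, last) needs further sorting.
        return std::partition(first, last, [&](const T& x) { return !(pivot < x); });
    }

This is why long runs of equal keys drop out of the recursion instead of producing badly unbalanced partitions.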
> Maybe your remark "astronomically small" is just a side comment and not relevant to PDQSORT having worst case running time of O( n ln(n) ).
Correct.
> For this and the associated documentation, I don't see a solid argument that PDQSORT has worst case running time of O( n ln(n) ).
There is a very simple argument. The 'bad partition counter' is initially log(n). If we encounter a bad partition, we can do up to O(n) work. This can happen log(n) times. So we can do up to O(n log n) work on bad partitions. If we then encounter another bad partition, we switch to heapsort, which is O(n log n).
If we encounter fewer than log(n) bad partitions, quicksort performed well enough on its own to be O(n log n).
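For readers who want the shape of that argument in code, here is a structural sketch (illustrative, not pdqsort's source; the pattern-breaking shuffle of four elements is omitted): a budget of about log2(n) bad partitions, after which heapsort finishes the range.

    #include <algorithm>
    #include <vector>

    template <typename Iter>
    void guarded_sort_impl(Iter first, Iter last, int bad_allowed) {
        const auto n = last - first;
        if (n <= 24) {                          // tiny range: any simple sort is fine
            std::sort(first, last);             // stand-in for insertion sort
            return;
        }
        if (bad_allowed <= 0) {                 // budget spent: heapsort is O(n log n)
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        // Median-of-three pivot.
        Iter mid = first + n / 2;
        if (*mid < *first) std::iter_swap(mid, first);
        if (*(last - 1) < *first) std::iter_swap(last - 1, first);
        if (*(last - 1) < *mid) std::iter_swap(last - 1, mid);
        const auto pivot = *mid;
        // Three-way split: [first, lt) < pivot, [lt, ge) == pivot, [ge, last) > pivot.
        Iter lt = std::partition(first, last, [&](const auto& x) { return x < pivot; });
        Iter ge = std::partition(lt, last, [&](const auto& x) { return !(pivot < x); });
        // "Bad" partition: the split lands in the outer eighths of the range.
        if (lt - first < n / 8 || last - ge < n / 8)
            --bad_allowed;
        guarded_sort_impl(first, lt, bad_allowed);
        guarded_sort_impl(ge, last, bad_allowed);
    }

    template <typename Iter>
    void guarded_sort(Iter first, Iter last) {
        int log_n = 0;
        for (auto n = last - first; n > 1; n >>= 1) ++log_n;   // about floor(log2(n))
        guarded_sort_impl(first, last, log_n);
    }

Every decrement of bad_allowed corresponds to at most O(n) partitioning work, it can happen at most about log2(n) times along any path before the heapsort branch takes over, and the heapsort and small-range fallbacks are themselves within O(n log n), which is the bound argued above.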
> In the land of big-O, a factor of only 2 is considered not very exciting
In the land of solving real-world problems a factor of 2 is extremely significant, especially when you're trying to get the most performance out of a power-constrained system. In fact, even a 1% increase in performance is significant.
> Now, and for a long time, the usual criterion of performance has been big-O notation, and that is what I considered, with the Gleason bound and heap sort, and thus showed that, with various common assumptions, you can't do better than heap sort. As far as I know, I was correct.
That's a little short-sighted for real-life performance. Memory access patterns, cache performance, and the hidden constant factor are all relevant for real performance. For example, there's a ~200x difference between main memory and cache access times. log2(1000000000000) is ~40, meaning poor cache performance can cost more than the factor of log(n) in the big-O bound.
There are also many cases where algorithms with better Big-O performance have constant factors so large that they perform worse for common problem sizes. Matrix multiplication is a good example. Strassen's algorithm has a better Big-O than the straightforward algorithm, but the constant factor is large enough that it only becomes faster on matrices larger than maybe 100x100. And there are algorithms even better than Strassen's, but their constant factors are so big they're effectively never better than Strassen's on real hardware.
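(For reference, the asymptotics there come straight from the recurrence: Strassen trades 8 half-size multiplications for 7 plus extra additions, so

    T(n) = 7\,T(n/2) + \Theta(n^2) \;\Rightarrow\; T(n) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.81}),

versus \Theta(n^3) for the schoolbook method; the extra additions and worse memory behaviour per level are exactly the constant factor that pushes the crossover point up.)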
In theory, there's no significant difference between theory and practice. In practice, things are never so simple.
When you use big-O notation, O(f(n)), you claim that there exists a function F(n) that accurately characterizes the running time of an algorithm based on the size of the input, and that f(n) is the highest-order component of that function. It looks something like this:
F(n) = A*f(n) + B*g(n) + C*h(n) + ... + K
where A,B,C,..K are constants and f,g,h,... are functions.
You approximate F(n) ~= A*f(n) because, for n big enough, this component will dominate the growth of the function. But A is different for different algorithms... so, in the case of sorting, you have several options - quicksort, mergesort, heapsort - that are on paper all "the same", but each has a different constant, and there are real advantages to using the one with the smallest constant.
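As a purely hypothetical illustration of how those constants play out: suppose one routine costs F_1(n) = 2 n \log_2 n operations and a "worse" one costs F_2(n) = 0.01 n^2. Then

    0.01\,n^2 < 2\,n\log_2 n \iff n < 200\,\log_2 n,

which holds up to roughly n \approx 2200, so below that size the O(n^2) routine with the tiny constant wins, and only above it does the O(n log n) routine take over.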
Furthermore, if you are not trying to sort infinitely big data (most real-world cases), you must consider that other components of the actual F(n) function, such as g and h, may dominate the performance of the algorithm in the range of data sizes you are most likely to encounter. There are other measures of algorithmic complexity that deal with this (I recall there is a small-oh notation, among others), but they don't teach you that at school unless you are a postgraduate student doing highly relevant work in complexity theory. I suspect most guys, like myself, find out by not taking Knuth as holy gospel and reading further material.
And then there are the degenerate cases, where your data can get sorted in quasi-linear time if you apply one of the slow dumb O(n^2) dinosaurs.