
/Objects/listsort.txt

http://unladen-swallow.googlecode.com/

  1Intro
  2-----
  3This describes an adaptive, stable, natural mergesort, modestly called
  4timsort (hey, I earned it <wink>).  It has supernatural performance on many
  5kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
  6as few as N-1), yet as fast as Python's previous highly tuned samplesort
  7hybrid on random arrays.
  8
  9In a nutshell, the main routine marches over the array once, left to right,
 10alternately identifying the next run, then merging it into the previous
 11runs "intelligently".  Everything else is complication for speed, and some
 12hard-won measure of memory efficiency.
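
As a roadmap, here is a rough Python-shaped sketch of that main loop.  The
helper names are illustrative (most of them echo the C functions discussed
later in this file), and this is an outline of the algorithm's shape, not
the real implementation in listobject.c:

"""
def timsort_outline(a):
    # Roadmap only; each helper named here is sketched in the sections
    # that follow.
    n = len(a)
    minrun = merge_compute_minrun(n)            # see "Computing minrun"
    pending = []                                # stack of (start, length) runs
    lo = 0
    while lo < n:
        run_len, descending = count_run(a, lo, n)   # see "Runs"
        if descending:
            a[lo:lo + run_len] = a[lo:lo + run_len][::-1]
        if run_len < minrun:                    # boost a short natural run
            forced = min(minrun, n - lo)
            binary_insertion_sort(a, lo, lo + forced, lo + run_len)
            run_len = forced
        pending.append((lo, run_len))
        merge_collapse(a, pending)              # see "The Merge Pattern"
        lo += run_len
    while len(pending) > 1:                     # merge whatever runs remain
        merge_at(a, pending, len(pending) - 2)
"""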
 13
 14
 15Comparison with Python's Samplesort Hybrid
 16------------------------------------------
 17+ timsort can require a temp array containing as many as N//2 pointers,
 18  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
 19  expected to require a temp array this large when sorting random data; on
 20  data with significant structure, it may get away without using any extra
 21  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes per element worst-case
  (also expected-case for random data) doesn't scare me much.
 24
 25  It turns out that Perl is moving to a stable mergesort, and the code for
 26  that appears always to require a temp array with room for at least N
 27  pointers. (Note that I wouldn't want to do that even if space weren't an
 28  issue; I believe its efforts at memory frugality also save timsort
 29  significant pointer-copying costs, and allow it to have a smaller working
 30  set.)
 31
 32+ Across about four hours of generating random arrays, and sorting them
 33  under both methods, samplesort required about 1.5% more comparisons
 34  (the program is at the end of this file).
 35
 36+ In real life, this may be faster or slower on random arrays than
 37  samplesort was, depending on platform quirks.  Since it does fewer
 38  comparisons on average, it can be expected to do better the more
 39  expensive a comparison function is.  OTOH, it does more data movement
 40  (pointer copying) than samplesort, and that may negate its small
 41  comparison advantage (depending on platform quirks) unless comparison
 42  is very expensive.
 43
 44+ On arrays with many kinds of pre-existing order, this blows samplesort out
 45  of the water.  It's significantly faster than samplesort even on some
 46  cases samplesort was special-casing the snot out of.  I believe that lists
 47  very often do have exploitable partial order in real life, and this is the
 48  strongest argument in favor of timsort (indeed, samplesort's special cases
 49  for extreme partial order are appreciated by real users, and timsort goes
 50  much deeper than those, in particular naturally covering every case where
 51  someone has suggested "and it would be cool if list.sort() had a special
 52  case for this too ... and for that ...").
 53
 54+ Here are exact comparison counts across all the tests in sortperf.py,
 55  when run with arguments "15 20 1".
 56
 57  Column Key:
 58      *sort: random data
 59      \sort: descending data
 60      /sort: ascending data
 61      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replaced 1% of elements with random values
 63      ~sort: many duplicates
 64      =sort: all equal
 65      !sort: worst case scenario
 66
 67  First the trivial cases, trivial for samplesort because it special-cased
 68  them, and trivial for timsort because it naturally works on runs.  Within
 69  an "n" block, the first line gives the # of compares done by samplesort,
 70  the second line by timsort, and the third line is the percentage by
 71  which the samplesort count exceeds the timsort count:
 72
 73      n   \sort   /sort   =sort
 74-------  ------  ------  ------
 75  32768   32768   32767   32767  samplesort
 76          32767   32767   32767  timsort
 77          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort
 78
 79  65536   65536   65535   65535
 80          65535   65535   65535
 81          0.00%   0.00%   0.00%
 82
 83 131072  131072  131071  131071
 84         131071  131071  131071
 85          0.00%   0.00%   0.00%
 86
 87 262144  262144  262143  262143
 88         262143  262143  262143
 89          0.00%   0.00%   0.00%
 90
 91 524288  524288  524287  524287
 92         524287  524287  524287
 93          0.00%   0.00%   0.00%
 94
 951048576 1048576 1048575 1048575
 96        1048575 1048575 1048575
 97          0.00%   0.00%   0.00%
 98
 99  The algorithms are effectively identical in these cases, except that
  timsort does one fewer compare in \sort.
101
102  Now for the more interesting cases.  lg(n!) is the information-theoretic
103  limit for the best any comparison-based sorting algorithm can do on
104  average (across all permutations).  When a method gets significantly
105  below that, it's either astronomically lucky, or is finding exploitable
106  structure in the data.
107
108      n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
109-------  -------   ------   -------  -------  ------  -------  --------
110  32768   444255   453096   453614    32908   452871   130491    469141 old
111                   448885    33016    33007    50426   182083     65534 new
112                    0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new
113
114  65536   954037   972699   981940    65686   973104   260029   1004607
115                   962991    65821    65808   101667   364341    131070
116                    1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%
117
118 131072  2039137  2101881  2091491   131232  2092894   554790   2161379
119                  2057533   131410   131361   206193   728871    262142
120                    2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%
121
122 262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
123                  4377402   262437   262459   416347  1457945    524286
124                    1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%
125
126 524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
127                  9278734   524580   524633   837947  2916107   1048574
128                   1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%
129
1301048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
131                 19606028  1048958  1048941  1694896  5832445   2097150
132                    1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%
133
134  Discussion of cases:
135
136  *sort:  There's no structure in random data to exploit, so the theoretical
137  limit is lg(n!).  Both methods get close to that, and timsort is hugging
138  it (indeed, in a *marginal* sense, it's a spectacular improvement --
139  there's only about 1% left before hitting the wall, and timsort knows
140  darned well it's doing compares that won't pay on random data -- but so
141  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
142  quicksort does about 39% more compares than the limit, and the median-of-3
143  variant about 19% more.
144
145  3sort, %sort, and !sort:  No contest; there's structure in this data, but
146  not of the specific kinds samplesort special-cases.  Note that structure
147  in !sort wasn't put there on purpose -- it was crafted as a worst case for
148  a previous quicksort implementation.  That timsort nails it came as a
149  surprise to me (although it's obvious in retrospect).
150
  +sort:  samplesort special-cases this data, and does a few fewer compares
152  than timsort.  However, timsort runs this case significantly faster on all
153  boxes we have timings for, because timsort is in the business of merging
154  runs efficiently, while samplesort does much more data movement in this
155  (for it) special case.
156
157  ~sort:  samplesort's special cases for large masses of equal elements are
158  extremely effective on ~sort's specific data pattern, and timsort just
159  isn't going to get close to that, despite that it's clearly getting a
160  great deal of benefit out of the duplicates (the # of compares is much less
161  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
162  distinct values, and as the distribution gets more skewed, samplesort's
163  equal-element gimmicks become less effective, while timsort's adaptive
164  strategies find more to exploit; in a database supplied by Kevin Altis, a
165  sort on its highly skewed "on which stock exchange does this company's
166  stock trade?" field ran over twice as fast under timsort.
167
168  However, despite that timsort does many more comparisons on ~sort, and
169  that on several platforms ~sort runs highly significantly slower under
170  timsort, on other platforms ~sort runs highly significantly faster under
171  timsort.  No other kind of data has shown this wild x-platform behavior,
172  and we don't have an explanation for it.  The only thing I can think of
173  that could transform what "should be" highly significant slowdowns into
174  highly significant speedups on some boxes are catastrophic cache effects
175  in samplesort.
176
177  But timsort "should be" slower than samplesort on ~sort, so it's hard
178  to count that it isn't on some boxes as a strike against it <wink>.
179
180+ Here's the highwater mark for the number of heap-based temp slots (4
181  bytes each on this box) needed by each test, again with arguments
182  "15 20 1":
183
184   2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
185  32768  16384     0     0   6256      0  10821  12288      0  16383
186  65536  32766     0     0  21652      0  31276  24576      0  32767
187 131072  65534     0     0  17258      0  58112  49152      0  65535
188 262144 131072     0     0  35660      0 123561  98304      0 131071
189 524288 262142     0     0  31302      0 212057 196608      0 262143
1901048576 524286     0     0 312438      0 484942 393216      0 524287
191
192  Discussion:  The tests that end up doing (close to) perfectly balanced
193  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
194  also ends up doing balanced merges, but systematically benefits a lot from
195  the preliminary pre-merge searches described under "Merge Memory" later.
196  %sort approaches having a balanced merge at the end because the random
197  selection of elements to replace is expected to produce an out-of-order
198  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
199  cases, needing no merging at all.  +sort ends up having one very long run
200  and one very short, and so gets all the temp space it needs from the small
201  temparray member of the MergeState struct (note that the same would be
202  true if the new random elements were prefixed to the sorted list instead,
203  but not if they appeared "in the middle").  3sort approaches N//3 temp
204  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.
206
207
208A detailed description of timsort follows.
209
210Runs
211----
212count_run() returns the # of elements in the next run.  A run is either
213"ascending", which means non-decreasing:
214
215    a0 <= a1 <= a2 <= ...
216
217or "descending", which means strictly decreasing:
218
219    a0 > a1 > a2 > ...
220
221Note that a run is always at least 2 long, unless we start at the array's
222last element.
223
224The definition of descending is strict, because the main routine reverses
225a descending run in-place, transforming a descending run into an ascending
226run.  Reversal is done via the obvious fast "swap elements starting at each
227end, and converge at the middle" method, and that can violate stability if
228the slice contains any equal elements.  Using a strict definition of
229descending ensures that a descending run contains distinct elements.
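
Here is a Python sketch of count_run() under those definitions; the real
function is C, and the "descending" flag it reports is what tells the caller
to reverse the run:

"""
def count_run(a, lo, hi):
    # Length of the run beginning at a[lo], plus a "descending" flag.
    # Ascending means non-decreasing; descending means strictly decreasing.
    if hi - lo <= 1:
        return hi - lo, False
    n = 2
    if a[lo + 1] < a[lo]:                   # strictly decreasing so far
        while lo + n < hi and a[lo + n] < a[lo + n - 1]:
            n += 1
        return n, True
    else:                                   # non-decreasing so far
        while lo + n < hi and a[lo + n] >= a[lo + n - 1]:
            n += 1
        return n, False
"""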
230
231If an array is random, it's very unlikely we'll see long runs.  If a natural
232run contains less than minrun elements (see next section), the main loop
233artificially boosts it to minrun elements, via a stable binary insertion sort
234applied to the right number of array elements following the short natural
235run.  In a random array, *all* runs are likely to be minrun long as a
236result.  This has two primary good effects:
237
2381. Random data strongly tends then toward perfectly balanced (both runs have
239   the same length) merges, which is the most efficient way to proceed when
240   data is random.
241
2422. Because runs are never very short, the rest of the code doesn't make
243   heroic efforts to shave a few cycles off per-merge overheads.  For
244   example, reasonable use of function calls is made, rather than trying to
245   inline everything.  Since there are no more than N/minrun runs to begin
246   with, a few "extra" function calls per merge is barely measurable.
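
The boosting step can be pictured with this stable binary insertion sort
sketch, which grows the sorted prefix one element at a time (bisect_right
places each new element after any equals, which is what keeps it stable);
the real routine is a small C function, not this:

"""
from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    # a[lo:start] is already sorted (the short natural run); extend the
    # sorted region to cover a[lo:hi].
    for i in range(start, hi):
        pivot = a[i]
        pos = bisect_right(a, pivot, lo, i)     # binary search for the slot
        a[pos + 1:i + 1] = a[pos:i]             # shift the tail right by one
        a[pos] = pivot
"""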
247
248
249Computing minrun
250----------------
251If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
252array then; it's hard to beat that given the overheads of trying something
253fancier.
254
255When N is a power of 2, testing on random data showed that minrun values of
25616, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
257in binary insertion sort clearly hurt, and at 8 the increase in the number
258of function calls clearly hurt.  Picking *some* power of 2 is important
259here, so that the merges end up perfectly balanced (see next section).  We
260pick 32 as a good value in the sweet range; picking a value at the low end
261allows the adaptive gimmicks more opportunity to exploit shorter natural
262runs.
263
264Because sortperf.py only tries powers of 2, it took a long time to notice
265that 32 isn't a good choice for the general case!  Consider N=2112:
266
267>>> divmod(2112, 32)
268(66, 0)
269>>>
270
271If the data is randomly ordered, we're very likely to end up with 66 runs
272each of length 32.  The first 64 of these trigger a sequence of perfectly
273balanced merges (see next section), leaving runs of lengths 2048 and 64 to
274merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
275compares, but it's still more compares than necessary, and-- mergesort's
276bugaboo relative to samplesort --a lot more data movement (O(N) copies just
277to get 64 elements into place).
278
279If we take minrun=33 in this case, then we're very likely to end up with 64
280runs each of length 33, and then all merges are perfectly balanced.  Better!
281
282What we want to avoid is picking minrun such that in
283
284    q, r = divmod(N, minrun)
285
286q is a power of 2 and r>0 (then the last merge only gets r elements into
287place, and r < minrun is small compared to N), or q a little larger than a
288power of 2 regardless of r (then we've got a case similar to "2112", again
289leaving too little work for the last merge to do).
290
291Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
292power of 2, or if that isn't possible, is close to, but strictly less than,
293a power of 2.  This is easier to do than it may sound:  take the first 6
294bits of N, and add 1 if any of the remaining bits are set.  In fact, that
295rule covers every case in this section, including small N and exact powers
296of 2; merge_compute_minrun() is a deceptively simple function.
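
In Python, "take the first 6 bits of N, and add 1 if any of the remaining
bits are set" comes out as roughly:

"""
def merge_compute_minrun(n):
    r = 0                       # becomes 1 if any bit is shifted out
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

# For n < 64 this returns n itself (whole-array binary insertion sort), and
# merge_compute_minrun(2112) == 33, giving the 64 perfectly balanced runs
# discussed above.
"""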
297
298
299The Merge Pattern
300-----------------
301In order to exploit regularities in the data, we're merging on natural
302run lengths, and they can become wildly unbalanced.  That's a Good Thing
303for this sort!  It means we have to find a way to manage an assortment of
304potentially very different run lengths, though.
305
306Stability constrains permissible merging patterns.  For example, if we have
3073 consecutive runs of lengths
308
309    A:10000  B:20000  C:10000
310
311we dare not merge A with C first, because if A, B and C happen to contain
312a common element, it would get out of order wrt its occurrence(s) in B.  The
313merging must be done as (A+B)+C or A+(B+C) instead.
314
315So merging is always done on two consecutive runs at a time, and in-place,
316although this may require some temp memory (more on that later).
317
318When a run is identified, its base address and length are pushed on a stack
319in the MergeState struct.  merge_collapse() is then called to see whether it
320should merge it with preceding run(s).  We would like to delay merging as
321long as possible in order to exploit patterns that may come up later, but we
322like even more to do merging as soon as possible to exploit that the run just
323found is still high in the memory hierarchy.  We also can't delay merging
324"too long" because it consumes memory to remember the runs that are still
325unmerged, and the stack has a fixed size.
326
327What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost not-yet
329merged slices:
330
3311.  A > B+C
3322.  B > C
333
334Note that, by induction, #2 implies the lengths of pending runs form a
335decreasing sequence.  #1 implies that, reading the lengths right to left,
336the pending-run lengths grow at least as fast as the Fibonacci numbers.
337Therefore the stack can never grow larger than about log_base_phi(N) entries,
338where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
339for very large arrays.
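
A quick way to see how small the stack stays:  the shortest runs that can
legally sit on the stack, reading from the top down, are 1, 2, 4, 7, 12, ...
(from the third on, each at least one more than the sum of the two above it),
so the N needed to force a given depth grows Fibonacci-fast.  A throwaway
check of that claim, assuming the invariants hold all the way down:

"""
def min_elements_for_depth(d):
    # Smallest run lengths satisfying B > C and A > B+C, top of stack first.
    runs = [1, 2]
    while len(runs) < d:
        runs.append(runs[-1] + runs[-2] + 1)
    return sum(runs[:d])

# min_elements_for_depth(40) is already in the hundreds of millions, so a
# few dozen stack slots cover any list that fits in memory.
"""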
340
341If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
342freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
343e.g., if the last 3 entries are
344
345    A:30  B:20  C:10
346
347then B is merged with C, leaving
348
349    A:30  BC:30
350
351on the stack.  Or if they were
352
    A:500  B:400  C:1000
354
355then A is merged with B, leaving
356
357    AB:900  C:1000
358
359on the stack.
360
361In both examples, the stack configuration after the merge still violates
362invariant #2, and merge_collapse() goes on to continue merging runs until
363both invariants are satisfied.  As an extreme case, suppose we didn't do the
364minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
365and 2.  Nothing would get merged until the final 2 was seen, and that would
366trigger 7 perfectly balanced merges.
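
In Python-shaped pseudocode, with merge_at() as a deliberately wasteful
stand-in for the real merge (sorted() is stable, so on two adjacent sorted
runs it produces the same result; the real merge is the subject of "Merge
Memory" below), the collapsing rule looks roughly like:

"""
def merge_at(a, pending, i):
    # Merge pending runs i and i+1 in place and fuse their stack entries.
    (s1, l1), (s2, l2) = pending[i], pending[i + 1]
    a[s1:s2 + l2] = sorted(a[s1:s2 + l2])       # placeholder stable merge
    pending[i:i + 2] = [(s1, l1 + l2)]

def merge_collapse(a, pending):
    # Re-establish invariants #1 (A > B+C) and #2 (B > C) on the run stack.
    while len(pending) > 1:
        n = len(pending)
        B, C = pending[-2][1], pending[-1][1]
        if n > 2 and pending[-3][1] <= B + C:
            # invariant #1 violated: merge the smaller of A and C with B
            # (ties favor C, the fresher run)
            if pending[-3][1] < C:
                merge_at(a, pending, n - 3)
            else:
                merge_at(a, pending, n - 2)
        elif B <= C:
            merge_at(a, pending, n - 2)         # invariant #2 violated
        else:
            break
"""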
367
368The thrust of these rules when they trigger merging is to balance the run
369lengths as closely as possible, while keeping a low bound on the number of
370runs we have to remember.  This is maximally effective for random data,
371where all runs are likely to be of (artificially forced) length minrun, and
372then we get a sequence of perfectly balanced merges (with, perhaps, some
373oddballs at the end).
374
375OTOH, one reason this sort is so good for partly ordered data has to do
376with wildly unbalanced run lengths.
377
378
379Merge Memory
380------------
381Merging adjacent runs of lengths A and B in-place is very difficult.
382Theoretical constructions are known that can do it, but they're too difficult
383and slow for practical use.  But if we have temp memory equal to min(A, B),
384it's easy.
385
386If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
387and then we can do the obvious merge algorithm left to right, from the temp
388area and B, starting the stores into where A used to live.  There's always a
389free area in the original area comprising a number of elements equal to the
390number not yet merged from the temp array (trivially true at the start;
391proceed by induction).  The only tricky bit is that if a comparison raises an
392exception, we have to remember to copy the remaining elements back in from
393the temp area, lest the array end up with duplicate entries from B.  But
394that's exactly the same thing we need to do if we reach the end of B first,
395so the exit code is pleasantly common to both the normal and error cases.
396
397If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
398much the same, except that we need to merge right to left, copying B into a
399temp array and starting the stores at the right end of where B used to live.
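
Stripped of galloping and of the error-recovery copy-back, merge_lo amounts
to this sketch (a real merge_at in the earlier pseudocode would dispatch to
something like it instead of the sorted() placeholder):

"""
def merge_lo(a, lo, na, nb):
    # Merge adjacent runs A = a[lo:lo+na] and B = a[lo+na:lo+na+nb], where
    # na <= nb: copy A aside, leave B in place, and merge left to right
    # into the space A used to occupy.
    temp = a[lo:lo + na]                  # the only extra memory needed
    i = j = 0                             # next unmerged element of temp, B
    dest = lo
    while i < na and j < nb:
        if a[lo + na + j] < temp[i]:      # strict <, so ties go to A (stable)
            a[dest] = a[lo + na + j]
            j += 1
        else:
            a[dest] = temp[i]
            i += 1
        dest += 1
    a[dest:dest + na - i] = temp[i:]      # whatever remains of A
"""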
400
401A refinement:  When we're about to merge adjacent runs A and B, we first do
402a form of binary search (more on that later) to see where B[0] should end up
403in A.  Elements in A preceding that point are already in their final
404positions, effectively shrinking the size of A.  Likewise we also search to
405see where A[-1] should end up in B, and elements of B after that point can
406also be ignored.  This cuts the amount of temp memory needed by the same
407amount.
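
With plain bisect standing in for the galloping search of the later sections,
the trimming step looks like this sketch (trim_for_merge is an illustrative
name, not a real function):

"""
from bisect import bisect_left, bisect_right

def trim_for_merge(A, B):
    # A and B are adjacent sorted runs, A first.  Elements of A that are
    # <= B[0] already precede all of B, and elements of B that are >= A[-1]
    # already follow all of A; only the middle needs merging, and the merge
    # then needs only min() of the trimmed lengths in temp slots.
    k = bisect_right(A, B[0])           # A[:k] is already in final position
    j = bisect_left(B, A[-1])           # B[j:] is already in final position
    return A[k:], B[:j]
"""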
408
409These preliminary searches may not pay off, and can be expected *not* to
410repay their cost if the data is random.  But they can win huge in all of
411time, copying, and memory savings when they do pay, so this is one of the
412"per-merge overheads" mentioned above that we're happy to endure because
413there is at most one very short run.  It's generally true in this algorithm
414that we're willing to gamble a little to win a lot, even though the net
415expectation is negative for random data.
416
417
418Merge Algorithms
419----------------
420merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
421deals with runs where A <= B, and merge_hi where A > B.  They don't know
422whether the data is clustered or uniform, but a lovely thing about merging
423is that many kinds of clustering "reveal themselves" by how many times in a
424row the winning merge element comes from the same run.  We'll only discuss
425merge_lo here; merge_hi is exactly analogous.
426
427Merging begins in the usual, obvious way, comparing the first element of A
428to the first of B, and moving B[0] to the merge area if it's less than A[0],
429else moving A[0] to the merge area.  Call that the "one pair at a time"
430mode.  The only twist here is keeping track of how many times in a row "the
431winner" comes from the same run.
432
433If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
434we *search* B for where A[0] belongs, and move over all the B's before
435that point in one chunk to the merge area, then move A[0] to the merge
436area.  Then we search A for where B[0] belongs, and similarly move a
437slice of A in one chunk.  Then back to searching B for where A[0] belongs,
438etc.  We stay in galloping mode until both searches find slices to copy
439less than MIN_GALLOP elements long, at which point we go back to one-pair-
440at-a-time mode.
441
442A refinement:  The MergeState struct contains the value of min_gallop that
443controls when we enter galloping mode, initialized to MIN_GALLOP.
444merge_lo() and merge_hi() adjust this higher when galloping isn't paying
445off, and lower when it is.
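
Here is a self-contained illustrative sketch of that mode switching on two
plain lists.  bisect approximates the gallop searches described in the next
section, and galloping_merge is an invented name; this is not the real
merge, but it shows the win counting, the chunked copies, and the adaptive
min_gallop:

"""
from bisect import bisect_left, bisect_right

MIN_GALLOP = 7

def galloping_merge(A, B, min_gallop=MIN_GALLOP):
    # Stable merge of sorted lists A and B.  One-pair-at-a-time mode counts
    # consecutive wins; galloping mode copies whole winning slices at a time.
    out = []
    i = j = 0
    acount = bcount = 0
    while i < len(A) and j < len(B):
        if B[j] < A[i]:                       # strict <, so ties go to A
            out.append(B[j])
            j += 1
            bcount += 1
            acount = 0
        else:
            out.append(A[i])
            i += 1
            acount += 1
            bcount = 0
        if acount < min_gallop and bcount < min_gallop:
            continue                          # stay in one-pair-at-a-time mode
        # galloping mode: stay until both copied slices come up short
        while i < len(A) and j < len(B):
            k = bisect_right(A, B[j], i) - i  # A-elements that precede B[j]
            out.extend(A[i:i + k])
            i += k
            if i == len(A):
                break
            m = bisect_left(B, A[i], j) - j   # B-elements that precede A[i]
            out.extend(B[j:j + m])
            j += m
            if k < MIN_GALLOP and m < MIN_GALLOP:
                min_gallop += 1               # galloping isn't paying: leave
                break
            min_gallop = max(1, min_gallop - 1)   # it is paying: stay eager
        acount = bcount = 0
    out.extend(A[i:])
    out.extend(B[j:])
    return out, min_gallop
"""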
446
447
448Galloping
449---------
450Still without loss of generality, assume A is the shorter run.  In galloping
451mode, we first look for A[0] in B.  We do this via "galloping", comparing
452A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
453the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
454roughly lg(B) comparisons, and, unlike a straight binary search, favors
455finding the right spot early in B (more on that later).
456
457After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
458consecutive elements, and a straight binary search requires exactly k-1
459additional comparisons to nail it.  Then we copy all the B's up to that
460point in one chunk, and then copy A[0].  Note that no matter where A[0]
461belongs in B, the combination of galloping + binary search finds it in no
462more than about 2*lg(B) comparisons.
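
In Python, ignoring the "hint" argument described later (so probing always
starts at B[0]), the search goes roughly like this:

"""
from bisect import bisect_left

def gallop_left(key, b):
    # Leftmost index at which key could be inserted in sorted b: probe
    # b[0], b[1], b[3], b[7], ..., b[2**j - 1] until overshooting, then
    # binary-search the remaining region of uncertainty.
    if not b or key <= b[0]:
        return 0
    last = 0                          # last probed index with b[last] < key
    ofs = 1
    while ofs < len(b) and b[ofs] < key:
        last = ofs
        ofs = 2 * ofs + 1             # 1, 3, 7, 15, ...
    ofs = min(ofs, len(b))
    return bisect_left(b, key, last + 1, ofs)
"""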
463
464If we did a straight binary search, we could find it in no more than
465ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
466comparisons no matter where A[0] belongs.  Straight binary search thus loses
467to galloping unless the run is quite long, and we simply can't guess
468whether it is in advance.
469
470If data is random and runs have the same length, A[0] belongs at B[0] half
471the time, at B[1] a quarter of the time, and so on:  a consecutive winning
472sub-run in B of length k occurs with probability 1/2**(k+1).  So long
473winning sub-runs are extremely unlikely in random data, and guessing that a
474winning sub-run is going to be long is a dangerous game.
475
476OTOH, if data is lopsided or lumpy or contains many duplicates, long
477stretches of winning sub-runs are very likely, and cutting the number of
478comparisons needed to find one from O(B) to O(log B) is a huge win.
479
480Galloping compromises by getting out fast if there isn't a long winning
481sub-run, yet finding such very efficiently when they exist.
482
483I first learned about the galloping strategy in a related context; see:
484
485    "Adaptive Set Intersections, Unions, and Differences" (2000)
486    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
487
488and its followup(s).  An earlier paper called the same strategy
489"exponential search":
490
491   "Optimistic Sorting and Information Theoretic Complexity"
492   Peter McIlroy
493   SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
494   467-474, Austin, Texas, 25-27 January 1993.
495
496and it probably dates back to an earlier paper by Bentley and Yao.  The
497McIlroy paper in particular has good analysis of a mergesort that's
498probably strongly related to this one in its galloping strategy.
499
500
501Galloping with a Broken Leg
502---------------------------
503So why don't we always gallop?  Because it can lose, on two counts:
504
5051. While we're willing to endure small per-merge overheads, per-comparison
506   overheads are a different story.  Calling Yet Another Function per
507   comparison is expensive, and gallop_left() and gallop_right() are
508   too long-winded for sane inlining.
509
2. Galloping can-- alas --require more comparisons than linear one-at-a-time
511   search, depending on the data.
512
513#2 requires details.  If A[0] belongs before B[0], galloping requires 1
514compare to determine that, same as linear search, except it costs more
515to call the gallop function.  If A[0] belongs right before B[1], galloping
516requires 2 compares, again same as linear search.  On the third compare,
517galloping checks A[0] against B[3], and if it's <=, requires one more
518compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
519of 4 compares, but if A[0] does belong at B[2], linear search would have
520discovered that in only 3 compares, and that's a huge loss!  Really.  It's
521an increase of 33% in the number of compares needed, and comparisons are
522expensive in Python.
523
524index in B where    # compares linear  # gallop  # binary  gallop
525A[0] belongs        search needs       compares  compares  total
526----------------    -----------------  --------  --------  ------
527               0                    1         1         0       1
528
529               1                    2         2         0       2
530
531               2                    3         3         1       4
532               3                    4         3         1       4
533
534               4                    5         4         2       6
535               5                    6         4         2       6
536               6                    7         4         2       6
537               7                    8         4         2       6
538
539               8                    9         5         3       8
540               9                   10         5         3       8
541              10                   11         5         3       8
542              11                   12         5         3       8
543                                        ...
544
545In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
546to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
547The advantage of galloping is unbounded as i grows, but it doesn't win at
548all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
549at the other values.  At and after i=6, galloping always wins.
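
The table and the i=6 crossover are easy to reproduce (a throwaway check,
not part of the sort):

"""
from math import floor, log2

for i in range(12):
    linear = i + 1                                    # linear search compares
    gallop = 1 if i == 0 else 2 * floor(log2(i)) + 2  # gallop + binary compares
    print(i, linear, gallop)

# gallop exceeds linear only at i = 2 and i = 4, they tie at i = 0, 1, 3, 5,
# and from i = 6 on linear falls further and further behind.
"""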
550
551We can't guess in advance when it's going to win, though, so we do one pair
552at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
553is 7, and that's pretty strong evidence.  However, if the data is random, it
554simply will trigger galloping mode purely by luck every now and again, and
555it's quite likely to hit one of the losing cases next.  On the other hand,
556in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
557"should be" then.  So the MergeState struct keeps a min_gallop variable
558that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
559the smaller min_gallop gets, making it easier to transition back to
560galloping mode (if we ever leave it in the current merge, and at the
561start of the next merge).  But whenever the gallop loop doesn't pay,
562min_gallop is increased by one, making it harder to transition back
563to galloping mode (and again both within a merge and across merges).  For
564random data, this all but eliminates the gallop penalty:  min_gallop grows
565large enough that we almost never get into galloping mode.  And for cases
566like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
567but in all it's a minor improvement over using a fixed MIN_GALLOP value.
568
569
570Galloping Complication
571----------------------
572The description above was for merge_lo.  merge_hi has to merge "from the
573other end", and really needs to gallop starting at the last element in a run
574instead of the first.  Galloping from the first still works, but does more
575comparisons than it should (this is significant -- I timed it both ways).
576For this reason, the gallop_left() and gallop_right() functions have a
577"hint" argument, which is the index at which galloping should begin.  So
578galloping can actually start at any index, and proceed at offsets of 1, 3,
5797, 15, ... or -1, -3, -7, -15, ... from the starting index.
580
In the code as I type this, it's always called with either 0 or n-1 (where n is
582the # of elements in a run).  It's tempting to try to do something fancier,
583melding galloping with some form of interpolation search; for example, if
584we're merging a run of length 1 with a run of length 10000, index 5000 is
585probably a better guess at the final result than either 0 or 9999.  But
586it's unclear how to generalize that intuition usefully, and merging of
587wildly unbalanced runs already enjoys excellent performance.
588
589~sort is a good example of when balanced runs could benefit from a better
590hint value:  to the extent possible, this would like to use a starting
591offset equal to the previous value of acount/bcount.  Doing so saves about
59210% of the compares in ~sort.  However, doing so is also a mixed bag,
593hurting other cases.
594
595
596Comparing Average # of Compares on Random Arrays
597------------------------------------------------
598[NOTE:  This was done when the new algorithm used about 0.1% more compares
599 on random data than does its current incarnation.]
600
601Here list.sort() is samplesort, and list.msort() this sort:
602
603"""
604import random
605from time import clock as now
606
607def fill(n):
608    from random import random
609    return [random() for i in xrange(n)]
610
611def mycmp(x, y):
612    global ncmp
613    ncmp += 1
614    return cmp(x, y)
615
616def timeit(values, method):
617    global ncmp
618    X = values[:]
619    bound = getattr(X, method)
620    ncmp = 0
621    t1 = now()
622    bound(mycmp)
623    t2 = now()
624    return t2-t1, ncmp
625
626format = "%5s  %9.2f  %11d"
627f2     = "%5s  %9.2f  %11.2f"
628
629def drive():
630    count = sst = sscmp = mst = mscmp = nelts = 0
631    while True:
632        n = random.randrange(100000)
633        nelts += n
634        x = fill(n)
635
636        t, c = timeit(x, 'sort')
637        sst += t
638        sscmp += c
639
640        t, c = timeit(x, 'msort')
641        mst += t
642        mscmp += c
643
644        count += 1
645        if count % 10:
646            continue
647
648        print "count", count, "nelts", nelts
649        print format % ("sort",  sst, sscmp)
650        print format % ("msort", mst, mscmp)
651        print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)
652
653drive()
654"""
655
656I ran this on Windows and kept using the computer lightly while it was
657running.  time.clock() is wall-clock time on Windows, with better than
658microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
659disadvantage, fell quickly to 1.48%, and then fluctuated within that small
660range.  Here's the last chunk of output before I killed the job:
661
662count 2630 nelts 130906543
663 sort    6110.80   1937887573
664msort    6002.78   1909389381
665            1.80         1.49
666
667We've done nearly 2 billion comparisons apiece at Python speed there, and
668that's enough <wink>.
669
670For random arrays of size 2 (yes, there are only 2 interesting ones),
671samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
672samplesort special-casing at most one ascending run at the start, then
673falling back to the general case if it doesn't find an ascending run
674immediately.  The consequence is that it ends up using two compares to sort
675[2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
676taught how to deal with mixtures of ascending and descending runs
677efficiently in all cases.