Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently". Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes. It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory. This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also
  expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.
  Since it does fewer comparisons on average, it can be expected to do
  better the more expensive a comparison function is. OTOH, it does more
  data movement (pointer copying) than samplesort, and that may negate its
  small comparison advantage (depending on platform quirks) unless
  comparison is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water. It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of. I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.
  Within an "n" block, the first line gives the # of compares done by
  samplesort, the second line by timsort, and the third line is the
  percentage by which the samplesort count exceeds the timsort count:

      n    \sort    /sort    =sort
-------  -------  -------  -------
  32768    32768    32767    32767  samplesort
           32767    32767    32767  timsort
           0.00%    0.00%    0.00%  (samplesort - timsort) / timsort

  65536    65536    65535    65535
           65535    65535    65535
           0.00%    0.00%    0.00%

 131072   131072   131071   131071
          131071   131071   131071
           0.00%    0.00%    0.00%

 262144   262144   262143   262143
          262143   262143   262143
           0.00%    0.00%    0.00%

 524288   524288   524287   524287
          524287   524287   524287
           0.00%    0.00%    0.00%

1048576  1048576  1048575  1048575
         1048575  1048575  1048575
           0.00%    0.00%    0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases. lg(n!) is the information-theoretic
  limit for the best any comparison-based sorting algorithm can do on
  average (across all permutations). When a method gets significantly
  below that, it's either astronomically lucky, or is finding exploitable
  structure in the data.

      n    lg(n!)     *sort     3sort     +sort     %sort     ~sort     !sort
-------  --------  --------  --------  --------  --------  --------  --------
  32768    444255    453096    453614     32908    452871    130491    469141  old
                     448885     33016     33007     50426    182083     65534  new
                      0.94%  1273.92%    -0.30%   798.09%   -28.33%   615.87%  %ch from new

  65536    954037    972699    981940     65686    973104    260029   1004607
                     962991     65821     65808    101667    364341    131070
                      1.01%  1391.83%    -0.19%   857.15%   -28.63%   666.47%

 131072   2039137   2101881   2091491    131232   2092894    554790   2161379
                    2057533    131410    131361    206193    728871    262142
                      2.16%  1491.58%    -0.10%   915.02%   -23.88%   724.51%

 262144   4340409   4464460   4403233    262314   4445884   1107842   4584560
                    4377402    262437    262459    416347   1457945    524286
                      1.99%  1577.82%    -0.06%   967.83%   -24.01%   774.44%

 524288   9205096   9453356   9408463    524468   9441930   2218577   9692015
                    9278734    524580    524633    837947   2916107   1048574
                      1.88%  1693.52%    -0.03%  1026.79%   -23.92%   824.30%

1048576  19458756  19950272  19838588   1048766  19912134   4430649  20434212
                   19606028   1048958   1048941   1694896   5832445   2097150
                      1.76%  1791.27%    -0.02%  1074.83%   -24.03%   874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!). Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid). For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases. Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.
  That timsort nails it came as a surprise to me (although it's obvious
  in retrospect).

  +sort: samplesort special-cases this data, and does a few less compares
  than timsort. However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)). ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort. No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it. The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

     2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
    32768   16384       0       0    6256       0   10821   12288       0   16383
    65536   32766       0       0   21652       0   31276   24576       0   32767
   131072   65534       0       0   17258       0   58112   49152       0   65535
   262144  131072       0       0   35660       0  123561   98304       0  131071
   524288  262142       0       0   31302       0  212057  196608       0  262143
  1048576  524286       0       0  312438       0  484942  393216       0  524287

  Discussion: The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint. \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all. +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle"). 3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run. A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.
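
As a rough illustration of the two run definitions above (this is not the C
code; the name count_run_sketch and the (length, descending) return pair are
assumptions made up for the example), run detection could be sketched in
Python like so:

```python
def count_run_sketch(a, lo, hi):
    """Return (n, descending) for the run starting at a[lo] in a[lo:hi].

    Hypothetical helper for illustration only.  Uses nothing but "<",
    and treats a run as descending only if it is *strictly* decreasing.
    """
    if hi - lo < 2:
        # Zero or one element left: the only way a run is shorter than 2.
        return hi - lo, False
    i = lo + 1
    descending = a[i] < a[i - 1]
    i += 1
    if descending:
        # a0 > a1 > a2 > ...  (strict)
        while i < hi and a[i] < a[i - 1]:
            i += 1
    else:
        # a0 <= a1 <= a2 <= ...  (non-decreasing)
        while i < hi and not (a[i] < a[i - 1]):
            i += 1
    return i - lo, descending
```

Note that equal neighbors extend an ascending run but end a descending one,
so [5, 4, 4, 3] yields a descending run of length 2, not 4.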

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run. Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements. Using a strict definition of
descending ensures that a descending run contains distinct elements.

If an array is random, it's very unlikely we'll see long runs. If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run. In a random array, *all* runs are likely to be minrun long as a
result. This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads. For
   example, reasonable use of function calls is made, rather than trying to
   inline everything. Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.


Computing minrun
----------------
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.
Picking *some* power of 2 is important here, so that the merges end up
perfectly balanced (see next section). We pick 32 as a good value in the
sweet range; picking a value at the low end allows the adaptive gimmicks
more opportunity to exploit shorter natural runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2. This is easier to do than it may sound: take the first 6
bits of N, and add 1 if any of the remaining bits are set. In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
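
The "first 6 bits, plus 1 if any remaining bit is set" rule can be sketched
in a few lines of Python. This is an illustrative rendering of the rule as
stated above, not a copy of the C function; the _sketch suffix marks the
name as made up for the example:

```python
def merge_compute_minrun_sketch(n):
    """Return minrun for an array of n elements, per the rule above."""
    r = 0  # becomes 1 if any bit shifted off the low end was set
    while n >= 64:
        r |= n & 1
        n >>= 1
    # n now holds the leading 6 bits (or all of n, if n < 64 to begin with).
    return n + r
```

For example, merge_compute_minrun_sketch(2112) returns 33, matching the
N=2112 discussion, while any N < 64 comes back unchanged and an exact power
of 2 such as 1024 gives 32.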


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced. That's a Good Thing
for this sort! It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns. For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B. The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct. merge_collapse() is then called to see whether it
should merge it with preceding run(s). We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy. We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet merged slices:

1. A > B+C
2. B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence. #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618.
Thus a small # of stack slots suffice for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack. Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied. As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2. Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember. This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use. But if we have temp memory equal to min(A, B),
it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.
There's always a free area in the original area comprising a number of
elements equal to the number not yet merged from the temp array (trivially
true at the start; proceed by induction). The only tricky bit is that if a
comparison raises an exception, we have to remember to copy the remaining
elements back in from the temp area, lest the array end up with duplicate
entries from B. But that's exactly the same thing we need to do if we reach
the end of B first, so the exit code is pleasantly common to both the normal
and error cases.

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

A refinement: When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A. Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A. Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored. This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random. But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run. It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
deals with runs where A <= B, and merge_hi where A > B.
They don't know whether the data is clustered or uniform, but a lovely thing
about merging is that many kinds of clustering "reveal themselves" by how
many times in a row the winning merge element comes from the same run. We'll
only discuss merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area. Call that the "one pair at a time"
mode. The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area. Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk. Then back to searching B for where A[0] belongs,
etc. We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to
one-pair-at-a-time mode.

A refinement: The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run. In galloping
mode, we first look for A[0] in B. We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).
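
The probing sequence just described can be sketched in Python as below. This
is a simplified illustration, not the C gallop_left(): it always starts at
index 0, the name gallop_left_sketch is made up for the example, and it
pins down the final position inside the bracketed region with the standard
library's bisect.bisect_left rather than a hand-written binary search:

```python
from bisect import bisect_left

def gallop_left_sketch(key, b):
    """Return the leftmost index in sorted list b where key belongs.

    Probes b[0], b[1], b[3], b[7], ..., b[2**j - 1] until the key is
    bracketed, then finishes with a binary search over that bracket.
    """
    n = len(b)
    if n == 0 or not (b[0] < key):
        return 0                # key belongs before everything in b
    last_ofs = 0                # largest probed index with b[index] < key
    ofs = 1                     # next index to probe
    while ofs < n and b[ofs] < key:
        last_ofs = ofs
        ofs = 2 * ofs + 1       # 1, 3, 7, 15, ...
    if ofs > n:
        ofs = n                 # ran off the end: key may belong at len(b)
    # Now b[last_ofs] < key, and either ofs == n or key <= b[ofs].
    return bisect_left(b, key, last_ofs + 1, ofs)
```

With b = [0, 2, 4, ..., 98] and key 7, the probes hit b[0], b[1], b[3] and
b[7], and the binary search over the remaining gap returns index 4.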

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it. Then we copy all the B's up to that
point in one chunk, and then copy A[0]. Note that no matter where A[0]
belongs in B, the combination of galloping + binary search finds it in no
more than about 2*lg(B) comparisons.

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs. Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1). So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s). An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao. The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop? Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story. Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-time
   search, depending on the data.

#2 requires details. If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function. If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search. On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss! Really. It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
                                        ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
at the other values. At and after i=6, galloping always wins.

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 7, and that's pretty strong evidence. However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next. On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then. So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge). But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges). For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode. And for cases
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.


Galloping Complication
----------------------
The description above was for merge_lo. merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first. Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).
For this reason, the gallop_left() and gallop_right() functions have a
"hint" argument, which is the index at which galloping should begin. So
galloping can actually start at any index, and proceed at offsets of 1, 3,
7, 15, ... or -1, -3, -7, -15, ... from the starting index.

In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run). It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999. But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.

~sort is a good example of when balanced runs could benefit from a better
hint value: to the extent possible, this would like to use a starting
offset equal to the previous value of acount/bcount. Doing so saves about
10% of the compares in ~sort. However, doing so is also a mixed bag,
hurting other cases.


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in xrange(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2 = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort", sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80  1937887573
msort   6002.78  1909389381
           1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately.
The consequence is that it ends up using two compares to sort [2, 1].
Gratifyingly, timsort doesn't do any special-casing, so had to be taught how
to deal with mixtures of ascending and descending runs efficiently in all
cases.