/Objects/listsort.txt


Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.
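
Before the details, here's a minimal Python sketch of that main routine.
This is an illustration, not the C code; the helper names (count_run(),
binary_insertion_sort(), merge_compute_minrun(), merge_collapse(),
merge_at()) are the ones used in the sketches later in this document:

    def timsort_sketch(a):
        n = len(a)
        minrun = merge_compute_minrun(n)   # see "Computing minrun"
        pending = []                       # stack of (base, length) runs
        lo = 0
        while lo < n:
            run_len = count_run(a, lo, n)  # see "Runs"
            if run_len < minrun:
                # Boost a short natural run to length minrun via a
                # stable binary insertion sort.
                forced = min(minrun, n - lo)
                binary_insertion_sort(a, lo, lo + forced, lo + run_len)
                run_len = forced
            pending.append((lo, run_len))
            merge_collapse(a, pending)     # see "The Merge Pattern"
            lo += run_len
        while len(pending) > 1:            # merge whatever runs remain
            merge_at(a, pending, len(pending) - 2)

The final loop just merges whatever runs remain; everything interesting
happens inside the helpers.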
Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2*N temp bytes worst-case (also
  expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers.  (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)
+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").
+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario
  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

        n   \sort   /sort   =sort
  -------  ------  ------  ------
    32768   32768   32767   32767  samplesort
            32767   32767   32767  timsort
            0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

    65536   65536   65535   65535
            65535   65535   65535
            0.00%   0.00%   0.00%

   131072  131072  131071  131071
           131071  131071  131071
            0.00%   0.00%   0.00%

   262144  262144  262143  262143
           262143  262143  262143
            0.00%   0.00%   0.00%

   524288  524288  524287  524287
           524287  524287  524287
            0.00%   0.00%   0.00%

  1048576 1048576 1048575 1048575
          1048575 1048575 1048575
            0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.
  Now for the more interesting cases.  lg(n!) is the information-theoretic
  limit for the best any comparison-based sorting algorithm can do on
  average (across all permutations).  When a method gets significantly
  below that, it's either astronomically lucky, or is finding exploitable
  structure in the data.

        n   lg(n!)    *sort     3sort    +sort    %sort    ~sort     !sort
  -------  -------  -------  --------  -------  -------  -------  --------
    32768   444255   453096    453614    32908   452871   130491    469141  old
                     448885     33016    33007    50426   182083     65534  new
                      0.94%  1273.92%   -0.30%  798.09%  -28.33%   615.87%  %ch from new

    65536   954037   972699    981940    65686   973104   260029   1004607
                     962991     65821    65808   101667   364341    131070
                      1.01%  1391.83%   -0.19%  857.15%  -28.63%   666.47%

   131072  2039137  2101881   2091491   131232  2092894   554790   2161379
                    2057533    131410   131361   206193   728871    262142
                      2.16%  1491.58%   -0.10%  915.02%  -23.88%   724.51%

   262144  4340409  4464460   4403233   262314  4445884  1107842   4584560
                    4377402    262437   262459   416347  1457945    524286
                      1.99%  1577.82%   -0.06%  967.83%  -24.01%   774.44%

   524288  9205096  9453356   9408463   524468  9441930  2218577   9692015
                    9278734    524580   524633   837947  2916107   1048574
                      1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%

  1048576 19458756 19950272  19838588  1048766 19912134  4430649  20434212
                   19606028   1048958  1048941  1694896  5832445   2097150
                      1.76%  1791.27%   -0.02% 1074.83%  -24.03%   874.38%
  Discussion of cases:

  *sort:  There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.
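
  The lg(n!) column is easy to reproduce, by the way; this little Python
  check (an aside, not part of sortperf.py) computes the limit via the
  log-gamma function, using the identity n! = gamma(n+1):

      import math

      def lg_factorial(n):
          # lg(n!) in base 2, computed without overflow via lgamma.
          return math.lgamma(n + 1) / math.log(2)

      print(int(round(lg_factorial(32768))))   # 444255, as in the table

  Random data can't be sorted, on average, in fewer compares than that.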
  3sort, %sort, and !sort:  No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort:  samplesort special-cases this data, and does a few less compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort:  samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.
+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

     2**i  *sort  \sort  /sort   3sort  +sort   %sort   ~sort  =sort   !sort
    32768  16384      0      0    6256      0   10821   12288      0   16383
    65536  32766      0      0   21652      0   31276   24576      0   32767
   131072  65534      0      0   17258      0   58112   49152      0   65535
   262144 131072      0      0   35660      0  123561   98304      0  131071
   524288 262142      0      0   31302      0  212057  196608      0  262143
  1048576 524286      0      0  312438      0  484942  393216      0  524287
  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.
A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
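
Here's a minimal Python sketch of count_run (the argument names are mine,
and, unlike the C code, it reverses a descending run itself instead of
telling the caller to):

    def count_run(a, lo, hi):
        # Return the length of the run beginning at a[lo], reversing it
        # in place first if it's descending.  Assumes lo < hi.
        i = lo + 1
        if i == hi:
            return 1
        if a[i] < a[lo]:                      # strictly descending run
            while i + 1 < hi and a[i + 1] < a[i]:
                i += 1
            a[lo:i + 1] = a[lo:i + 1][::-1]   # reverse the run in place
        else:                                 # non-decreasing run
            while i + 1 < hi and a[i + 1] >= a[i]:
                i += 1
        return i + 1 - lo

Because the descending test is strict, the reversed slice contains no
equal elements, so the reversal can't violate stability.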
If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs
   have the same length) merges, which is the most efficient way to proceed
   when data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
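
The boosting step itself is simple; here's a Python sketch of the stable
binary insertion sort it uses (bisect_right keeps equal elements in their
original order, which is exactly what stability requires):

    from bisect import bisect_right

    def binary_insertion_sort(a, lo, hi, start):
        # a[lo:start] is already sorted; extend the sorted region until
        # it covers a[lo:hi].
        for i in range(start, hi):
            pivot = a[i]
            pos = bisect_right(a, pivot, lo, i)
            # Shift a[pos:i] right one slot and drop the pivot in place.
            a[pos + 1:i + 1] = a[pos:i]
            a[pos] = pivot

The binary search costs only about lg(minrun) compares per inserted
element; it's the data movement in the shift that limits how large minrun
can profitably get.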
Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

    >>> divmod(2112, 32)
    (66, 0)
    >>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
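
In Python the rule is just a few lines (this mirrors the C function's
logic; only the spelling is mine):

    def merge_compute_minrun(n):
        # Take the 6 most significant bits of n, and add 1 if any of the
        # remaining bits are set.  For n < 64 this just returns n.
        r = 0
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r

For example, merge_compute_minrun(2112) is 33:  2112 is 0b100001000000,
its first 6 bits are 0b100001 = 33, no other bits are set, so nothing is
added, and divmod(2112, 33) == (64, 0) -- a perfect power of 2 runs.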
The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct.  merge_collapse() is then called to see whether it
should merge it with preceding run(s).  We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy.  We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet merged slices:

1.  A > B+C
2.  B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence.  #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N)
entries, where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots
suffice for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack.  Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied.  As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2.  Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
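
Here's a Python sketch of merge_collapse over the pending stack of
(base, length) pairs.  merge_at() merges the runs in stack slots i and
i+1; one way to write it is with the merge_lo sketch given under "Merge
Memory" below:

    def merge_at(a, pending, i):
        base1, len1 = pending[i]
        base2, len2 = pending[i + 1]
        merge_lo(a, base1, len1, base2, len2)  # see "Merge Memory"
        pending[i] = (base1, len1 + len2)
        del pending[i + 1]

    def merge_collapse(a, pending):
        # Merge until the top of the stack satisfies A > B+C and B > C.
        while len(pending) > 1:
            n = len(pending) - 2
            if n > 0 and (pending[n - 1][1] <=
                          pending[n][1] + pending[n + 1][1]):
                if pending[n - 1][1] < pending[n + 1][1]:
                    n -= 1                  # A < C:  merge A with B
                merge_at(a, pending, n)
            elif pending[n][1] <= pending[n + 1][1]:
                merge_at(a, pending, n)     # B <= C:  merge B with C
            else:
                break                       # both invariants hold

Note how the tie-break falls out:  when A == C, the "A < C" test fails,
so B is merged with C, the fresher-in-cache pair.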
The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember.  This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.
Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use.  But if we have temp memory equal to min(A, B),
it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.
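
Stripped of galloping and error handling, merge_lo looks like this in
Python (the index names are mine; ssa/ssb are the bases of runs A and B):

    def merge_lo(a, ssa, na, ssb, nb):
        # Merge a[ssa:ssa+na] with the adjacent a[ssb:ssb+nb], na <= nb.
        tmp = a[ssa:ssa + na]       # copy the smaller run A aside
        i = 0                       # next element of tmp (run A)
        j = ssb                     # next element of B, still in place
        dest = ssa                  # next free slot; never catches j
        while i < len(tmp) and j < ssb + nb:
            if a[j] < tmp[i]:       # strict <, so ties go to A: stable
                a[dest] = a[j]
                j += 1
            else:
                a[dest] = tmp[i]
                i += 1
            dest += 1
        # Whatever remains of A is copied back; B's leftovers are
        # already in place.
        a[dest:dest + len(tmp) - i] = tmp[i:]

Note the comparison:  B's element wins only if it's strictly less than
A's, so equal elements are taken from A first, preserving stability.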
A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.
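
In Python terms, with plain binary search from bisect standing in for the
galloping search described later, the pre-merge trimming amounts to:

    from bisect import bisect_left, bisect_right

    def trim(a, b):
        # a and b are adjacent sorted runs, conceptually a + b.
        # Elements of a before where b[0] belongs are already placed;
        # bisect_right keeps a's equal elements ahead of b[0] (stable).
        k = bisect_right(a, b[0])
        # Elements of b at or after where a[-1] belongs are already
        # placed; bisect_left keeps them behind a's equal elements.
        j = bisect_left(b, a[-1])
        return a[k:], b[:j]         # the parts that still need merging

If either trimmed run comes back empty, the runs were already in order
and no merging (or temp memory) is needed at all.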
These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run.  It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.
Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi where A > B.  They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run.  We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.  We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.
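
In sketch form, this extends the merge_lo sketch above with the win
counters; where the real code would enter galloping mode, this stand-in
just resets them:

    MIN_GALLOP = 7

    def merge_lo_counting(a, ssa, na, ssb, nb, min_gallop=MIN_GALLOP):
        tmp = a[ssa:ssa + na]
        i, j, dest = 0, ssb, ssa
        acount = bcount = 0
        while i < len(tmp) and j < ssb + nb:
            if a[j] < tmp[i]:
                a[dest] = a[j]
                j += 1
                bcount += 1; acount = 0    # B won this compare
            else:
                a[dest] = tmp[i]
                i += 1
                acount += 1; bcount = 0    # A won this compare
            dest += 1
            if acount >= min_gallop or bcount >= min_gallop:
                # Strong evidence of clustering:  the real merge_lo
                # switches to galloping mode here (next sections).
                acount = bcount = 0
        a[dest:dest + len(tmp) - i] = tmp[i:]

The counters cost almost nothing per compare, and they're the only signal
needed to detect clustering on the fly.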
A refinement:  The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.
Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it.  Then we copy all the B's up to that
point in one chunk, and then copy A[0].  Note that no matter where A[0]
belongs in B, the combination of galloping + binary search finds it in no
more than about 2*lg(B) comparisons.
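
Here's a Python sketch of the search.  The real gallop_left() also takes
a "hint" index (see "Galloping Complication" below), and bisect_left
stands in for the final hand-rolled binary search:

    from bisect import bisect_left

    def gallop_left(key, b):
        # Leftmost index i such that all of b[:i] < key <= all of b[i:].
        n = len(b)
        if n == 0 or key <= b[0]:
            return 0
        lastofs, ofs = 0, 1              # invariant: b[lastofs] < key
        while ofs < n and b[ofs] < key:  # probe b[1], b[3], b[7], ...
            lastofs = ofs
            ofs = ofs * 2 + 1
        hi = ofs if ofs < n else n       # key <= b[ofs], or ran off end
        # Binary search the remaining region of uncertainty.
        return bisect_left(b, key, lastofs + 1, hi)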
If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on:  a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.
Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).  An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms),
    pp 467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao.  The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.
Galloping with a Broken Leg
---------------------------
So why don't we always gallop?  Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story.  Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-time
   search, depending on the data.

#2 requires details.  If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function.  If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search.  On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss!  Really.  It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

    index in B where    # compares linear  # gallop  # binary  gallop
    A[0] belongs        search needs       compares  compares  total
    ----------------    -----------------  --------  --------  ------
                   0                    1         1         0       1
                   1                    2         2         0       2
                   2                    3         3         1       4
                   3                    4         3         1       4
                   4                    5         4         2       6
                   5                    6         4         2       6
                   6                    7         4         2       6
                   7                    8         4         2       6
                   8                    9         5         3       8
                   9                   10         5         3       8
                  10                   11         5         3       8
                  11                   12         5         3       8
    ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
at the other values.  At and after i=6, galloping always wins.
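
The formulas are easy to check against the table (i=0 is a special case,
since lg(0) is undefined and a single compare settles it):

    import math

    def linear_compares(i):
        return i + 1

    def gallop_compares(i):
        if i == 0:
            return 1
        return 2 * int(math.log2(i)) + 2

    for i in range(12):
        print(i, linear_compares(i), gallop_compares(i))

Galloping pulls ahead for good once 2*floor(lg(i))+2 < i+1, which first
holds at i=6, matching the table.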
We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
is 7, and that's pretty strong evidence.  However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next.  On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then.  So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge).  But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges).  For
random data, this all but eliminates the gallop penalty:  min_gallop grows
large enough that we almost never get into galloping mode.  And for cases
like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
Galloping Complication
----------------------
The description above was for merge_lo.  merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first.  Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).
For this reason, the gallop_left() and gallop_right() functions have a
"hint" argument, which is the index at which galloping should begin.  So
galloping can actually start at any index, and proceed at offsets of 1, 3,
7, 15, ... or -1, -3, -7, -15, ... from the starting index.

In the code as I type, it's always called with either 0 or n-1 (where n is
the # of elements in a run).  It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999.  But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.

~sort is a good example of when balanced runs could benefit from a better
hint value:  to the extent possible, this would like to use a starting
offset equal to the previous value of acount/bcount.  Doing so saves about
10% of the compares in ~sort.  However, doing so is also a mixed bag,
hurting other cases.
Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE:  This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in xrange(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2     = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort",  sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""
I ran this on Windows and kept using the computer lightly while it was
running.  time.clock() is wall-clock time on Windows, with better than
microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range.  Here's the last chunk of output before I killed the job:

    count 2630 nelts 130906543
     sort   6110.80  1937887573
    msort   6002.78  1909389381
             1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately.  The consequence is that it ends up using two compares to sort
[2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.