Hello everyone!

We've wrapped up the Warsaw sprint, so I would like to describe some
branches which have been recently merged and which improved the I/O and the
GC: `gc_no_cleanup_nursery`_ and `gc-incminimark-pinning`_.

.. _`gc_no_cleanup_nursery`: https://bitbucket.org/pypy/pypy/commits/9e2f7a37c1e2
.. _`gc-incminimark-pinning`: https://bitbucket.org/pypy/pypy/commits/64017d818038
The first branch was started by Wenzhu Man for her Google Summer of Code
and finished by Maciej Fijałkowski and Armin Rigo.

The PyPy GC works by allocating new objects in the young object
area (the nursery), simply by incrementing a pointer. After each minor
collection, the nursery has to be cleaned up. For simplicity, the GC used
to do it by zeroing the whole nursery.
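The two collection strategies can be sketched as follows. This is a toy model in plain Python, not PyPy's actual RPython code; the names and sizes are mine, chosen for illustration:

```python
NURSERY_SIZE = 1024  # bytes; tiny, for illustration only


class Nursery:
    def __init__(self):
        self.memory = bytearray(NURSERY_SIZE)  # nursery backing store
        self.top = 0                           # the bump pointer

    def allocate(self, nbytes):
        # Allocation is just a pointer increment.
        if self.top + nbytes > NURSERY_SIZE:
            return None  # nursery full: a minor collection would run here
        addr = self.top
        self.top += nbytes
        return addr

    def minor_collection_old(self):
        # Old behaviour: zero the *whole* nursery, touching every byte
        # even where the next allocations (e.g. large strings) would
        # overwrite it anyway.
        for i in range(NURSERY_SIZE):
            self.memory[i] = 0
        self.top = 0

    def minor_collection_new(self):
        # New behaviour: just reset the bump pointer; fields that must
        # start out as zero are zeroed per-object on allocation instead.
        self.top = 0


nursery = Nursery()
first = nursery.allocate(64)   # -> 0
second = nursery.allocate(64)  # -> 64
nursery.minor_collection_new()
reused = nursery.allocate(16)  # -> 0, no wholesale zeroing happened
```

The point of the sketch: the old cleanup is O(nursery size) per minor collection regardless of what survives, while the new one is O(1).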
This approach has bad effects on the cache, since you zero a large piece of
memory at once and do unnecessary work for things that don't require zeroing,
like large strings. We mitigated the first problem somewhat with incremental
nursery zeroing, but this branch removes the zeroing completely, thus
improving string handling and recursive code (since jitframes don't
require zeroed memory either). I measured the effect on two examples:
a recursive implementation of `fibonacci`_ and `gcbench`_,
to measure GC performance.

.. _`fibonacci`: https://bitbucket.org/pypy/benchmarks/src/69152c2aee7766051aab15735b0b82a46b82b802/own/fib.py?at=default
.. _`gcbench`: https://bitbucket.org/pypy/benchmarks/src/69152c2aee7766051aab15735b0b82a46b82b802/own/gcbench.py?at=default
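The fibonacci benchmark is along these lines (see the linked fib.py for the exact code; this is a sketch). Every call allocates a fresh frame, which no longer needs to be handed out pre-zeroed:

```python
def fib(n):
    # Naive recursion on purpose: the workload is frame allocation,
    # which stresses exactly the nursery path changed by the branch.
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)
```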
The results for fibonacci and gcbench are below (normalized to CPython
2.7). Benchmarks were run 50 times each (note that the big standard
deviation comes mostly from the warmup at the beginning; the true figures
are smaller):

+-----------+------------------+--------------------+--------------------+
| benchmark | CPython          | PyPy 2.4           | PyPy non-zero      |
+-----------+------------------+--------------------+--------------------+
| fibonacci | 4.8+-0.15 (1.0x) | 0.59+-0.07 (8.1x)  | 0.45+-0.07 (10.6x) |
+-----------+------------------+--------------------+--------------------+
| gcbench   | 22+-0.36 (1.0x)  | 1.34+-0.28 (16.4x) | 1.02+-0.15 (21.6x) |
+-----------+------------------+--------------------+--------------------+
The second branch was done by Gregor Wegberg for his master thesis and finished
by Maciej Fijałkowski and Armin Rigo. Because of the way it works, the PyPy GC from
time to time moves objects in memory, meaning that their addresses can change.
Therefore, if you want to pass pointers to some external C function (for
example, write(2) or read(2)), you need to ensure that the objects they
point to will not be moved by the GC (e.g. while a different thread is
running). PyPy up to 2.4 solved the problem by copying the data into or out
of a non-movable buffer, which is obviously inefficient.
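Conceptually, the pre-pinning strategy looks like this. The sketch below is my own illustration at the application level using ctypes, not PyPy's internal RPython code; inside PyPy the copy happens automatically and invisibly:

```python
import ctypes
import os


def write_via_copy(fd, data):
    # Copy the (movable) string into a non-movable ctypes buffer and
    # hand only that buffer to write(2).  The GC cannot move C-level
    # memory, so the pointer stays valid during the call -- but the
    # copy itself is the overhead that pinning removes.
    buf = ctypes.create_string_buffer(data)   # copies data, appends a NUL
    view = memoryview(buf)[:len(data)]        # drop the trailing NUL
    return os.write(fd, view)


fd = os.open(os.devnull, os.O_WRONLY)
written = write_via_copy(fd, b"hello world")
os.close(fd)
```

With pinning, the GC is told to leave the string in place for the duration of the call, and the buffer copy disappears entirely.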
The branch introduces the concept of "pinning", which allows us to inform the
GC that it is not allowed to move a certain object for a short period of time.
This introduces a bit of extra complexity
in the garbage collector, but improves the I/O performance quite drastically,
because we no longer need the extra copy to and from the non-movable buffers.
In `this benchmark`_, which does I/O in a loop,
we either write a number of bytes from a freshly allocated string into
/dev/null or read a number of bytes from /dev/full. I'm showing the results
for PyPy 2.4, PyPy with non-zero-nursery, and PyPy with non-zero-nursery and
object pinning. Those are wall times for cases using ``os.read``/``os.write``
and ``file.read``/``file.write``, normalized against CPython 2.7.

Benchmarks were done using PyPy 2.4 and revisions ``85646d1d07fb`` for
non-zero-nursery and ``3d8fe96dc4d9`` for non-zero-nursery and pinning.
The benchmarks were run once, since the standard deviation was small.
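The write side of the loop being timed is roughly the following (a sketch in the style of the linked iobasic.py; function and variable names are mine, not the benchmark's):

```python
import os
import time


def bench_write(nbytes, iterations):
    # Write a freshly allocated string of `nbytes` bytes to /dev/null
    # in a loop.  Each round allocates a new string, so both the
    # nursery-zeroing change and pinning are on the hot path.
    fd = os.open(os.devnull, os.O_WRONLY)
    t0 = time.time()
    for _ in range(iterations):
        buf = b"x" * nbytes  # fresh allocation every iteration
        os.write(fd, buf)
    elapsed = time.time() - t0
    os.close(fd)
    return elapsed


elapsed = bench_write(100, 1000)  # wall time in seconds
```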
  56. XXXX
What we can see is that ``os.read`` and ``os.write`` both improved greatly
and now outperform CPython in each combination. ``file`` operations are
a little more tricky, and while those branches improved the situation a bit,
the improvement is not as drastic as in the ``os`` versions. It really should not
be the case, and it showcases how our ``file`` buffering is inferior to CPython's.
We plan on removing our own buffering and using ``FILE*`` in C in the near future,
so we should outperform CPython on those too (since our allocations are cheaper).
If you look carefully at the benchmark, the write function is copied three times.
This hack is intended to avoid the JIT overspecializing the assembler code, which happens
because the buffering code was written way before the JIT was done. In fact, our buffering
is hilariously bad, but if the stars align correctly it can be JIT-compiled to something
that's not half bad. Try removing the hack and see how the performance of the last
benchmark drops :-) Again, this hack should be absolutely unnecessary once we remove
our own buffering, so stay tuned for more.
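The shape of the hack is this (a sketch; the names are mine, not the benchmark's). Each benchmark variant calls its own byte-for-byte copy of the loop, so each copy gets its own code object and the JIT traces it for a single kind of file object, instead of one shared trace accumulating guards for all three:

```python
import io


def write_loop_a(f, data, n):
    for _ in range(n):
        f.write(data)


def write_loop_b(f, data, n):  # identical on purpose: separate code object
    for _ in range(n):
        f.write(data)


def write_loop_c(f, data, n):  # identical on purpose: separate code object
    for _ in range(n):
        f.write(data)


# Each variant would be handed a different kind of file object; here we
# just demonstrate with an in-memory buffer.
buf = io.BytesIO()
write_loop_a(buf, b"x", 3)
```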
Cheers,
fijal

.. _`this benchmark`: https://bitbucket.org/pypy/benchmarks/src/69152c2aee7766051aab15735b0b82a46b82b802/io/iobasic.py?at=default