/doc/notebooks/dplyr.md

https://bitbucket.org/breisfeld/rpy2_w32_fix
Possible License(s): GPL-2.0, BSD-3-Clause
# dplyr in Python

We need two things for this:

1- A data frame (using one of R's demo datasets).

```python
from rpy2.robjects.packages import importr, data
datasets = importr('datasets')
mtcars_env = data(datasets).fetch('mtcars')
mtcars = mtcars_env['mtcars']
```

In addition to that, and because this tutorial is in a notebook,
we initialize HTML rendering for R objects (pretty display of
R data frames).

```python
import rpy2.ipython.html
rpy2.ipython.html.init_printing()
```

2- dplyr

```python
from rpy2.robjects.lib.dplyr import DataFrame
```

With this we have the choice of chaining (D3-style)

```python
dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))

dataf
```
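
To make the chaining style concrete, here is a minimal pure-Python sketch of the pattern, in which every method returns a new object so that calls can be strung together. `MiniFrame` and the toy rows are hypothetical, introduced only for illustration. This is not how rpy2 or dplyr implement it, and the rows are not taken from `mtcars`.

```python
from collections import defaultdict


class MiniFrame:
    """Hypothetical toy table: a list of row dicts with chainable methods."""

    def __init__(self, rows, group_key=None):
        self.rows = list(rows)
        self.group_key = group_key

    def filter(self, predicate):
        # Keep only the rows for which the predicate is true.
        return MiniFrame([r for r in self.rows if predicate(r)], self.group_key)

    def mutate(self, **new_columns):
        # Add one computed column per keyword argument.
        rows = [dict(r, **{name: f(r) for name, f in new_columns.items()})
                for r in self.rows]
        return MiniFrame(rows, self.group_key)

    def group_by(self, key):
        # Only record the grouping column; summarize() uses it later.
        return MiniFrame(self.rows, key)

    def summarize(self, **aggregates):
        # Assumes group_by() was called first.
        groups = defaultdict(list)
        for r in self.rows:
            groups[r[self.group_key]].append(r)
        out = []
        for key in sorted(groups):
            summary = {self.group_key: key}
            for name, f in aggregates.items():
                summary[name] = f(groups[key])
            out.append(summary)
        return MiniFrame(out)


# Illustrative values only (not real mtcars rows).
toy = [
    {"gear": 3, "hp": 100.0, "wt": 3.0},
    {"gear": 4, "hp": 100.0, "wt": 2.0},
    {"gear": 4, "hp": 200.0, "wt": 2.0},
    {"gear": 5, "hp": 300.0, "wt": 3.0},
]

result = (MiniFrame(toy)
          .filter(lambda r: r["gear"] > 3)
          .mutate(powertoweight=lambda r: r["hp"] * 36 / r["wt"])
          .group_by("gear")
          .summarize(mean_ptw=lambda rows: sum(r["powertoweight"] for r in rows) / len(rows)))

print(result.rows)
# [{'gear': 4, 'mean_ptw': 2700.0}, {'gear': 5, 'mean_ptw': 3600.0}]
```

The sketch takes Python callables where the rpy2 wrapper takes R expression strings; the chaining mechanics are the same.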

or piping (magrittr style).

```python
from rpy2.robjects.lib.dplyr import (filter,
                                     mutate,
                                     group_by,
                                     summarize)

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)'))

dataf
```
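
For readers wondering how `>>` can act as a pipe in Python, one possible mechanism is the reflected right-shift method `__rrshift__`. The sketch below shows that technique in isolation. `Verb`, `keep_if` and `transform` are hypothetical names, and nothing here is claimed to match rpy2's actual implementation.

```python
class Verb:
    """A deferred operation: remembers a function and its extra arguments."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Python calls this for `data >> verb` when the left operand
        # does not implement `>>` itself (a plain list does not).
        return self.func(data, *self.args, **self.kwargs)


def keep_if(predicate):
    """Hypothetical verb: keep the items for which predicate is true."""
    return Verb(lambda rows, p: [r for r in rows if p(r)], predicate)


def transform(func):
    """Hypothetical verb: apply func to every item."""
    return Verb(lambda rows, f: [f(r) for r in rows], func)


# `>>` is left-associative, so the data flows from left to right.
result = [1, 2, 3, 4, 5] >> keep_if(lambda x: x > 3) >> transform(lambda x: x * 10)
print(result)  # [40, 50]
```

Each verb call only builds a small object; the work happens when data is piped into it.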

The strings passed to the dplyr functions are evaluated as expressions,
just as they are when using dplyr in R. This means that
when writing `mean(powertoweight)` the R function `mean()` is used.

Using a Python function is not too difficult though. We can just
call Python back from R:

```python
from rpy2.rinterface import rternalize

# Wrap the Python function so that R can call it.
@rternalize
def mean_np(x):
    import numpy
    return numpy.mean(x)

from rpy2.robjects import globalenv
globalenv['mean_np'] = mean_np

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)',
                   mean_np_ptw='mean_np(powertoweight)'))

dataf
```

It is also possible to carry this out without having to
place the custom function in R's global environment.
First, remove it from there:

```python
del(globalenv['mean_np'])
```

```python
from rpy2.robjects.lib.dplyr import StringInEnv
from rpy2.robjects import Environment

# Keep the custom function in its own environment.
my_env = Environment()
my_env['mean_np'] = mean_np

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)',
                   mean_np_ptw=StringInEnv('mean_np(powertoweight)',
                                           my_env)))

dataf
```
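
The idea of tying an expression string to the environment where its names are looked up can be shown with a pure-Python analogy. This is only a sketch of the concept. `ExprInEnv` is a hypothetical class, the numbers are made up, and the lookup order (data columns first, then the environment) is an assumption of this sketch, not a statement about how `StringInEnv` works internally.

```python
class ExprInEnv:
    """Hypothetical: an expression string paired with a lookup environment."""

    def __init__(self, expr, env):
        self.expr = expr
        self.env = env

    def evaluate(self, columns):
        # Names are resolved in the data columns first, then in the
        # environment attached to the expression (assumption of this sketch).
        namespace = dict(self.env)
        namespace.update(columns)
        # Builtins are hidden so that only the names above are visible.
        return eval(self.expr, {"__builtins__": {}}, namespace)


def mean_py(values):
    return sum(values) / len(values)


# The custom function lives in its own environment, not in the globals
# seen by the expression.
my_env = {"mean_py": mean_py}

expr = ExprInEnv("mean_py(powertoweight)", my_env)
value = expr.evaluate({"powertoweight": [10.0, 20.0, 30.0]})
print(value)  # 20.0
```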

**Note**: rpy2's interface to dplyr implements a fix for the (non-?)issue 1323
(https://github.com/hadley/dplyr/issues/1323).

dplyr's seamless translation of transformations to SQL, whenever the
data are in a database table, can be used directly. Since we are reusing
the original implementation of `dplyr`, it *just works*.

```python
from rpy2.robjects.lib.dplyr import dplyr
# in-memory SQLite database broken in dplyr's src_sqlite
# db = dplyr.src_sqlite(":memory:")
import tempfile

with tempfile.NamedTemporaryFile() as db_fh:
    db = dplyr.src_sqlite(db_fh.name)
    # copy the table to that database
    dataf_db = DataFrame(mtcars).copy_to(db, name="mtcars")
    res = (dataf_db >>
           filter('gear>3') >>
           mutate(powertoweight='hp*36/wt') >>
           group_by('gear') >>
           summarize(mean_ptw='mean(powertoweight)'))
    print(res)
```

Since we are manipulating R objects, anything available to R is also available
to us. If we want to see the generated SQL code, that's:

```python
print(res.rx2("query")["sql"])
```
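
To give an idea of what such a translation amounts to, the sketch below runs a hand-written SQL query with Python's standard `sqlite3` module. The query is an assumption about the general shape of the result, not the SQL text that dplyr emits, and the table `cars` and its rows are illustrative values, not `mtcars`.

```python
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (gear INTEGER, hp REAL, wt REAL)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [(3, 100.0, 3.0), (4, 100.0, 2.0), (4, 200.0, 2.0), (5, 300.0, 3.0)],
)

# filter -> WHERE, mutate -> computed column,
# group_by + summarize -> GROUP BY with an aggregate.
query = """
SELECT gear, AVG(powertoweight) AS mean_ptw
FROM (
    SELECT gear, hp * 36 / wt AS powertoweight
    FROM cars
    WHERE gear > 3
)
GROUP BY gear
ORDER BY gear
"""

rows = conn.execute(query).fetchall()
print(rows)  # [(4, 2700.0), (5, 3600.0)]
conn.close()
```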

And if the starting point is a pandas data frame,
do the following and start over again.

```python
from rpy2.robjects import pandas2ri
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import localconverter

with localconverter(default_converter + pandas2ri.converter) as cv:
    mtcars = mtcars_env['mtcars']
    # Note: ri2py is the rpy2 2.x name; rpy2 3.x renamed it to rpy2py.
    mtcars = pandas2ri.ri2py(mtcars)
print(type(mtcars))
```

Using a local converter also lets us go from the pandas data frame to our dplyr-augmented R data frame.

```python
with localconverter(default_converter + pandas2ri.converter) as cv:
    dataf = (DataFrame(mtcars).
             filter('gear>=3').
             mutate(powertoweight='hp*36/wt').
             group_by('gear').
             summarize(mean_ptw='mean(powertoweight)'))

dataf
```

**Reuse. Get things done. Don't reimplement.**