
/doc/source/comparison_with_r.rst

http://github.com/pydata/pandas
Possible License(s): BSD-3-Clause, Apache-2.0
.. currentmodule:: pandas
.. _compare_with_r:

.. ipython:: python
   :suppress:

   import pandas as pd
   import numpy as np
   from pandas import DataFrame
   pd.options.display.max_rows = 15

Comparison with R / R libraries
*******************************

Since ``pandas`` aims to provide a lot of the data manipulation and analysis
functionality that people use `R <http://www.r-project.org/>`__ for, this page
was started to provide a more detailed look at the `R language
<http://en.wikipedia.org/wiki/R_(programming_language)>`__ and its many third
party libraries as they relate to ``pandas``. In comparisons with R and CRAN
libraries, we care about the following things:

- **Functionality / flexibility**: what can and cannot be done with each tool
- **Performance**: how fast operations are. Hard numbers/benchmarks are
  preferable
- **Ease-of-use**: is one tool easier or harder to use (you may have to be
  the judge of this, given side-by-side code comparisons)

This page is also here to offer a bit of a translation guide for users of these
R packages.

Base R
------

Slicing with R's |c|_
~~~~~~~~~~~~~~~~~~~~~

R makes it easy to access ``data.frame`` columns by name:

.. code-block:: r

   df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
   df[, c("a", "c", "e")]

or by integer location:

.. code-block:: r

   df <- data.frame(matrix(rnorm(1000), ncol=100))
   df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in ``pandas`` is straightforward:

.. ipython:: python

   df = DataFrame(np.random.randn(10, 3), columns=list('abc'))
   df[['a', 'c']]
   df.loc[:, ['a', 'c']]

Selecting multiple noncontiguous columns by integer location can be achieved
with a combination of the ``iloc`` indexer attribute and ``numpy.r_``:

.. ipython:: python

   named = list('abcdefg')
   n = 30
   columns = named + np.arange(len(named), n).tolist()
   df = DataFrame(np.random.randn(n, n), columns=columns)

   df.iloc[:, np.r_[:10, 24:30]]

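
As a rough sketch of how the R integer selection ``c(1:10, 25:30, 40, 50:100)``
might translate, the following assumes a 100-column frame with hypothetical
column names ``X1`` to ``X100``. R positions are 1-based and include both
endpoints, whereas ``np.r_`` slices are 0-based and exclude the end:

.. code-block:: python

   import numpy as np
   import pandas as pd

   # 100 columns, loosely mirroring data.frame(matrix(rnorm(1000), ncol=100));
   # the names X1..X100 are illustrative only
   df = pd.DataFrame(np.random.randn(10, 100),
                     columns=["X%d" % i for i in range(1, 101)])

   # R: c(1:10, 25:30, 40, 50:100)
   # Shift each start down by one; the stop stays, since slices exclude it
   positions = np.r_[0:10, 24:30, 39, 49:100]
   subset = df.iloc[:, positions]

The result should contain 68 columns, the same count the R expression selects.
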
|aggregate|_
~~~~~~~~~~~~

In R you may want to split data into subsets and compute the mean for each.
Using a data.frame called ``df`` and splitting it into groups ``by1`` and
``by2``:

.. code-block:: r

   df <- data.frame(
     v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
     by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
     by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
   aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The :meth:`~pandas.DataFrame.groupby` method is similar to the base R
``aggregate`` function.

.. ipython:: python

   from pandas import DataFrame
   df = DataFrame({
     'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
     'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
     'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
     'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
             np.nan]
   })

   g = df.groupby(['by1','by2'])
   g[['v1','v2']].mean()

For more details and examples see :ref:`the groupby documentation
<groupby.split>`.

|match|_
~~~~~~~~

A common way to select data in R is using ``%in%``, which is defined using the
function ``match``. The operator ``%in%`` returns a logical vector
indicating whether there is a match or not:

.. code-block:: r

   s <- 0:4
   s %in% c(2,4)

The :meth:`~pandas.DataFrame.isin` method is similar to the R ``%in%``
operator:

.. ipython:: python

   s = pd.Series(np.arange(5),dtype=np.float32)
   s.isin([2, 4])

The ``match`` function returns a vector of the positions of matches
of its first argument in its second:

.. code-block:: r

   s <- 0:4
   match(s, c(2,4))

The ``pd.match`` function can be used to replicate this:

.. ipython:: python

   s = pd.Series(np.arange(5),dtype=np.float32)
   pd.Series(pd.match(s,[2,4],np.nan))

For more details and examples see :ref:`the indexing documentation
<indexing.basics.indexing_isin>`.

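
``pd.match`` is not available in recent pandas releases. As a hedged
alternative, the same positions can be sketched with ``Index.get_indexer``,
which reports 0-based positions and ``-1`` for values with no match. The
shift to 1-based positions and the conversion of ``-1`` to ``NaN`` below are
our own additions to mimic R's output, not something pandas does by default:

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series(np.arange(5))
   table = pd.Index([2, 4])

   # 0-based position of each value of ``s`` in ``table``; -1 means no match
   pos = table.get_indexer(s.to_numpy())

   # Mimic R's match(): NaN for non-matches, 1-based positions otherwise
   matched = pd.Series(pos, index=s.index).where(pos >= 0) + 1
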
|tapply|_
~~~~~~~~~

``tapply`` is similar to ``aggregate``, but data can be in a ragged array,
since the subclass sizes are possibly irregular. Using a data.frame called
``baseball``, and retrieving information based on the array ``team``:

.. code-block:: r

   baseball <-
     data.frame(team = gl(5, 5,
                labels = paste("Team", LETTERS[1:5])),
                player = sample(letters, 25),
                batting.average = runif(25, .200, .400))

   tapply(baseball$batting.average, baseball$team,
          max)

In ``pandas`` we may use the :meth:`~pandas.pivot_table` method to handle this:

.. ipython:: python

   import random
   import string

   baseball = DataFrame({
      'team': ["team %d" % (x+1) for x in range(5)]*5,
      'player': random.sample(list(string.ascii_lowercase),25),
      'batting avg': np.random.uniform(.200, .400, 25)
      })
   baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)

For more details and examples see :ref:`the reshaping documentation
<reshaping.pivot>`.

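
The same per-team maximum can also be sketched with ``groupby``, which returns
a ``Series`` indexed by team rather than a one-row table. This is an
alternative formulation, not the approach the example above uses; the data
below is regenerated so that the snippet stands on its own:

.. code-block:: python

   import numpy as np
   import pandas as pd

   baseball = pd.DataFrame({
       "team": ["team %d" % (x + 1) for x in range(5)] * 5,
       "batting avg": np.random.uniform(0.200, 0.400, 25),
   })

   # One maximum per team, indexed by team name
   best = baseball.groupby("team")["batting avg"].max()
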
|subset|_
~~~~~~~~~

.. versionadded:: 0.13

The :meth:`~pandas.DataFrame.query` method is similar to the base R ``subset``
function. In R you might want to get the rows of a ``data.frame`` where one
column's values are less than or equal to another column's values:

.. code-block:: r

   df <- data.frame(a=rnorm(10), b=rnorm(10))
   subset(df, a <= b)
   df[df$a <= df$b,]  # note the comma

In ``pandas``, there are a few ways to perform subsetting. You can use
:meth:`~pandas.DataFrame.query`, which takes the condition as a string
expression, or standard boolean indexing:

.. ipython:: python

   df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
   df.query('a <= b')
   df[df.a <= df.b]
   df.loc[df.a <= df.b]

For more details and examples see :ref:`the query documentation
<indexing.query>`.

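
A small sketch of combining conditions in ``query`` follows. The ``@`` prefix
refers to a variable in the surrounding Python scope; ``threshold`` and the
fixed data are hypothetical values chosen only to make the result easy to
check:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"a": [1.0, -2.0, 0.5, 3.0],
                      "b": [2.0, -3.0, 0.5, 1.0]})
   threshold = 0.0  # hypothetical cutoff, not part of the original example

   # Rows where a <= b and a exceeds the local variable ``threshold``
   rows = df.query("a <= b and a > @threshold")

   # The same selection written with boolean indexing
   same = df[(df.a <= df.b) & (df.a > threshold)]
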
|with|_
~~~~~~~

.. versionadded:: 0.13

An expression using a data.frame called ``df`` in R with the columns ``a`` and
``b`` would be evaluated using ``with`` like so:

.. code-block:: r

   df <- data.frame(a=rnorm(10), b=rnorm(10))
   with(df, a + b)
   df$a + df$b  # same as the previous expression

In ``pandas`` the equivalent expression, using the
:meth:`~pandas.DataFrame.eval` method, would be:

.. ipython:: python

   df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
   df.eval('a + b')
   df.a + df.b  # same as the previous expression

In certain cases :meth:`~pandas.DataFrame.eval` will be much faster than
evaluation in pure Python. For more details and examples see :ref:`the eval
documentation <enhancingperf.eval>`.

zoo
---

xts
---

plyr
----

``plyr`` is an R library for the split-apply-combine strategy for data
analysis. The functions revolve around three data structures in R, ``a``
for ``arrays``, ``l`` for ``lists``, and ``d`` for ``data.frame``. The
table below shows how these data structures could be mapped in Python.

+------------+-------------------------------+
| R          | Python                        |
+============+===============================+
| array      | list                          |
+------------+-------------------------------+
| lists      | dictionary or list of objects |
+------------+-------------------------------+
| data.frame | dataframe                     |
+------------+-------------------------------+

|ddply|_
~~~~~~~~

An expression using a data.frame called ``df`` in R where you want to
summarize ``x`` by ``month`` and ``week``:

.. code-block:: r

   require(plyr)
   df <- data.frame(
     x = runif(120, 1, 168),
     y = runif(120, 7, 334),
     z = runif(120, 1.7, 20.7),
     month = rep(c(5,6,7,8),30),
     week = sample(1:4, 120, TRUE)
   )

   ddply(df, .(month, week), summarize,
         mean = round(mean(x), 2),
         sd = round(sd(x), 2))

In ``pandas`` the equivalent expression, using the
:meth:`~pandas.DataFrame.groupby` method, would be:

.. ipython:: python

   df = DataFrame({
       'x': np.random.uniform(1., 168., 120),
       'y': np.random.uniform(7., 334., 120),
       'z': np.random.uniform(1.7, 20.7, 120),
       'month': [5,6,7,8]*30,
       'week': np.random.randint(1, 5, 120)  # upper bound is exclusive
   })

   grouped = df.groupby(['month','week'])
   grouped['x'].agg([np.mean, np.std])

For more details and examples see :ref:`the groupby documentation
<groupby.aggregate>`.

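
The R call also rounds to two decimals and names the result columns ``mean``
and ``sd``. One way this could be sketched in ``pandas`` is shown below, using
a small fixed data set (our own, chosen so the numbers can be checked by hand)
instead of random draws. Both R's ``sd`` and the pandas ``std`` aggregation
use the sample (n - 1) denominator, so a group with a single row yields a
missing value:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({
       "x": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
       "month": [5, 5, 5, 6, 6, 6],
       "week": [1, 1, 2, 1, 1, 1],
   })

   summary = (df.groupby(["month", "week"])["x"]
                .agg(["mean", "std"])
                .round(2)
                .rename(columns={"std": "sd"})  # match the R column name
                .reset_index())                  # flat frame, like ddply
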
reshape / reshape2
------------------

|meltarray|_
~~~~~~~~~~~~

An expression using a 3 dimensional array called ``a`` in R where you want to
melt it into a data.frame:

.. code-block:: r

   a <- array(c(1:23, NA), c(2,3,4))
   data.frame(melt(a))

In Python, since ``a`` is a NumPy array, you can iterate over it with
``np.ndenumerate`` in a list comprehension.

.. ipython:: python

   a = np.array(list(range(1,24))+[np.nan]).reshape(2,3,4)
   DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])

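
The variant below is a sketch of how the result could be brought closer to
what R reports. It assumes that matching R's layout matters: R fills arrays in
column-major order and indexes from 1, so the array is reshaped with
``order="F"`` and each index is shifted by one. The column names ``dim1``,
``dim2`` and ``dim3`` are hypothetical labels, not the names ``melt`` produces:

.. code-block:: python

   import numpy as np
   import pandas as pd

   values = list(range(1, 24)) + [np.nan]
   # Column-major fill, as R's array() does
   a = np.array(values).reshape((2, 3, 4), order="F")

   # ndenumerate yields 0-based index tuples; add 1 to mimic R's indexing
   rows = [tuple(i + 1 for i in idx) + (val,)
           for idx, val in np.ndenumerate(a)]
   melted = pd.DataFrame(rows, columns=["dim1", "dim2", "dim3", "value"])
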
|meltlist|_
~~~~~~~~~~~

An expression using a list called ``a`` in R where you want to melt it
into a data.frame:

.. code-block:: r

   a <- as.list(c(1:4, NA))
   data.frame(melt(a))

In Python, this list would be a list of tuples, so the
:class:`~pandas.DataFrame` constructor would convert it to a dataframe as
required.

.. ipython:: python

   a = list(enumerate(list(range(1,5))+[np.nan]))
   DataFrame(a)

For more details and examples see :ref:`the Intro to Data Structures
documentation <basics.dataframe.from_items>`.

|meltdf|_
~~~~~~~~~

An expression using a data.frame called ``cheese`` in R where you want to
reshape the data.frame:

.. code-block:: r

   cheese <- data.frame(
     first = c('John', 'Mary'),
     last = c('Doe', 'Bo'),
     height = c(5.5, 6.0),
     weight = c(130, 150)
   )
   melt(cheese, id=c("first", "last"))

In Python, the :func:`~pandas.melt` function is the equivalent of R's ``melt``:

.. ipython:: python

   cheese = DataFrame({'first' : ['John', 'Mary'],
                       'last' : ['Doe', 'Bo'],
                       'height' : [5.5, 6.0],
                       'weight' : [130, 150]})
   pd.melt(cheese, id_vars=['first', 'last'])
   cheese.set_index(['first', 'last']).stack()  # alternative way

For more details and examples see :ref:`the reshaping documentation
<reshaping.melt>`.

|cast|_
~~~~~~~

In R, ``acast`` casts a data.frame, here called ``df``, into a higher
dimensional array:

.. code-block:: r

   df <- data.frame(
     x = runif(12, 1, 168),
     y = runif(12, 7, 334),
     z = runif(12, 1.7, 20.7),
     month = rep(c(5,6,7),4),
     week = rep(c(1,2), 6)
   )

   mdf <- melt(df, id=c("month", "week"))
   acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of :meth:`~pandas.pivot_table`:

.. ipython:: python

   df = DataFrame({
        'x': np.random.uniform(1., 168., 12),
        'y': np.random.uniform(7., 334., 12),
        'z': np.random.uniform(1.7, 20.7, 12),
        'month': [5,6,7]*4,
        'week': [1,2]*6
   })
   mdf = pd.melt(df, id_vars=['month', 'week'])
   pd.pivot_table(mdf, values='value', index=['variable','week'],
                    columns=['month'], aggfunc=np.mean)

Similarly for ``dcast``, which uses a data.frame called ``df`` in R to
aggregate information based on ``Animal`` and ``FeedType``:

.. code-block:: r

   df <- data.frame(
     Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
                'Animal2', 'Animal3'),
     FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
     Amount = c(10, 7, 4, 2, 5, 6, 2)
   )

   dcast(df, Animal ~ FeedType, sum, fill=NaN)
   # Alternative method using base R
   with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above,
using :meth:`~pandas.pivot_table`:

.. ipython:: python

   df = DataFrame({
       'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
                  'Animal2', 'Animal3'],
       'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
       'Amount': [10, 7, 4, 2, 5, 6, 2],
   })

   df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')

The second approach is to use the :meth:`~pandas.DataFrame.groupby` method:

.. ipython:: python

   df.groupby(['Animal','FeedType'])['Amount'].sum()

For more details and examples see :ref:`the reshaping documentation
<reshaping.pivot>` or :ref:`the groupby documentation <groupby.split>`.

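
The ``groupby`` result above is a long ``Series`` with a two-level index,
whereas ``dcast`` and ``pivot_table`` return a wide table. As a sketch,
appending ``unstack`` to the ``groupby`` result should give the same wide
layout, with ``NaN`` where an ``Animal`` / ``FeedType`` pair has no rows:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({
       "Animal": ["Animal1", "Animal2", "Animal3", "Animal2", "Animal1",
                  "Animal2", "Animal3"],
       "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
       "Amount": [10, 7, 4, 2, 5, 6, 2],
   })

   wide_pivot = df.pivot_table(values="Amount", index="Animal",
                               columns="FeedType", aggfunc="sum")

   # Move the inner index level (FeedType) into the columns
   wide_groupby = df.groupby(["Animal", "FeedType"])["Amount"].sum().unstack()
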
.. |c| replace:: ``c``
.. _c: http://stat.ethz.ch/R-manual/R-patched/library/base/html/c.html

.. |aggregate| replace:: ``aggregate``
.. _aggregate: http://finzi.psych.upenn.edu/R/library/stats/html/aggregate.html

.. |match| replace:: ``match`` / ``%in%``
.. _match: http://finzi.psych.upenn.edu/R/library/base/html/match.html

.. |tapply| replace:: ``tapply``
.. _tapply: http://finzi.psych.upenn.edu/R/library/base/html/tapply.html

.. |with| replace:: ``with``
.. _with: http://finzi.psych.upenn.edu/R/library/base/html/with.html

.. |subset| replace:: ``subset``
.. _subset: http://finzi.psych.upenn.edu/R/library/base/html/subset.html

.. |ddply| replace:: ``ddply``
.. _ddply: http://www.inside-r.org/packages/cran/plyr/docs/ddply

.. |meltarray| replace:: ``melt.array``
.. _meltarray: http://www.inside-r.org/packages/cran/reshape2/docs/melt.array

.. |meltlist| replace:: ``melt.list``
.. _meltlist: http://www.inside-r.org/packages/cran/reshape2/docs/melt.list

.. |meltdf| replace:: ``melt.data.frame``
.. _meltdf: http://www.inside-r.org/packages/cran/reshape2/docs/melt.data.frame

.. |cast| replace:: ``cast``
.. _cast: http://www.inside-r.org/packages/cran/reshape2/docs/cast