---
title: Machine Learning Analytics with IPython Notebook
---

[IPython Notebook](http://ipython.org/notebook.html) is a very powerful
interactive computational environment, and with
[PredictionIO](https://prediction.io),
[PySpark](http://spark.apache.org/docs/latest/api/python/), and [Spark
SQL](https://spark.apache.org/sql/), you can easily analyze your collected
events when you are developing or tuning your engine.

## Prerequisites

Before you begin, please make sure you have the latest stable IPython
installed, and that the command `ipython` can be accessed from your shell's
search path.

<%= partial 'shared/datacollection/parquet' %>
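
For reference, the examples below assume your event data has been exported to
`/tmp/movies` in Parquet format, which `pio export` can produce. The app ID of
1 here is a placeholder; substitute your own:

```
$ pio export --appid 1 --output /tmp/movies --format parquet
```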

## Preparing IPython Notebook

Launch IPython Notebook with PySpark using the following command, with
`$SPARK_HOME` replaced by the location of Apache Spark.

```
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark
```

By default, you should be able to access your IPython Notebook via web browser
at http://localhost:8888.
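
Because the notebook was launched through `pyspark`, the driver sets up a
`SparkContext` for you and exposes it as `sc`. A quick sanity check that the
kernel is wired up correctly:

```python
# `sc` is created automatically by the pyspark launcher; evaluating it
# here confirms the notebook kernel is connected to Spark.
sc
```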

Let's initialize our notebook by running the following code in the first cell.

```python
import pandas as pd

def rows_to_df(rows):
    # Convert a list of Spark SQL Row objects into a pandas DataFrame.
    return pd.DataFrame(map(lambda e: e.asDict(), rows))

from pyspark.sql import SQLContext
sqlc = SQLContext(sc)

# Load the exported events and register them as a temporary table.
rdd = sqlc.parquetFile("/tmp/movies")
rdd.registerTempTable("events")
```

![Initialization for IPython Notebook](/images/datacollection/ipynb-01.png)

`rows_to_df(rows)` will come in handy when we want to dump the results from
Spark SQL using IPython Notebook's native table rendering.
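
Before writing queries, it can help to inspect the schema of the loaded
events. A minimal sketch, assuming the Spark 1.x behavior where `parquetFile`
returns a SchemaRDD:

```python
# Print the column names and types of the exported events;
# the exact columns depend on the data you collected.
rdd.printSchema()
```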

## Performing Analysis with Spark SQL

If all steps above ran successfully, you should have a ready-to-use analytics
environment by now. Let's try a few examples to see if everything is
functional. In the second cell, put in this piece of code and run it.

```python
summary = sqlc.sql("SELECT "
                   "entityType, event, targetEntityType, COUNT(*) AS c "
                   "FROM events "
                   "GROUP BY entityType, event, targetEntityType").collect()
rows_to_df(summary)
```

You should see the following screen.

![Summary of Events](/images/datacollection/ipynb-02.png)
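
The `events` table supports arbitrary SQL, so you can slice the data however
you like. As an illustrative sketch, assuming the standard PredictionIO event
fields `entityId` and `event`, this lists the ten most active entities:

```python
# Hypothetical follow-up query: top ten entities by number of events.
top_entities = sqlc.sql("SELECT entityId, COUNT(*) AS c "
                        "FROM events "
                        "GROUP BY entityId "
                        "ORDER BY c DESC "
                        "LIMIT 10").collect()
rows_to_df(top_entities)
```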

We can also plot our data in the next two cells.

```python
import matplotlib.pyplot as plt

# Label each slice with the event name and its count; matplotlib cycles
# through the colors if there are more slices than colors.
count = map(lambda e: e.c, summary)
event = map(lambda e: "%s (%d)" % (e.event, e.c), summary)
colors = ['gold', 'lightskyblue']
plt.pie(count, labels=event, colors=colors, startangle=90, autopct="%1.1f%%")
plt.axis('equal')
plt.show()
```

![Summary in Pie Chart](/images/datacollection/ipynb-03.png)

```python
# Break down ratings stored in event properties, ignoring events
# that carry no rating.
ratings = sqlc.sql("SELECT properties.rating AS r, COUNT(*) AS c "
                   "FROM events "
                   "WHERE properties.rating IS NOT NULL "
                   "GROUP BY properties.rating "
                   "ORDER BY r").collect()
count = map(lambda e: e.c, ratings)
rating = map(lambda e: "%s (%d)" % (e.r, e.c), ratings)
colors = ['yellowgreen', 'plum', 'gold', 'lightskyblue', 'lightcoral']
plt.pie(count, labels=rating, colors=colors, startangle=90,
        autopct="%1.1f%%")
plt.axis('equal')
plt.show()
```

![Breakdown of Ratings](/images/datacollection/ipynb-04.png)
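
Since `ratings` is just a list of rows, you can keep crunching it with pandas.
A small sketch computing the average rating from the grouped counts:

```python
# Weighted mean of the rating values, using the per-rating counts
# returned by the GROUP BY query above.
df = rows_to_df(ratings)
print (df.r * df.c).sum() / float(df.c.sum())
```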

Happy analyzing!