/docs/manual/source/datacollection/analytics-ipynb.html.md.erb
Ruby HTML | 98 lines | 74 code | 24 blank | 0 comment | 3 complexity | a5dd0f80650e5dea9a6abb5e4c3f44ec MD5 | raw file
- ---
- title: Machine Learning Analytics with IPython Notebook
- ---
- [IPython Notebook](http://ipython.org/notebook.html) is a very powerful
- interactive computational environment, and with
- [PredictionIO](https://prediction.io),
- [PySpark](http://spark.apache.org/docs/latest/api/python/) and [Spark
- SQL](https://spark.apache.org/sql/), you can easily analyze your collected
- events when you are developing or tuning your engine.
- ## Prerequisites
- Before you begin, please make sure you have the latest stable IPython installed,
- and that the command `ipython` can be accessed from your shell's search path.
- <%= partial 'shared/datacollection/parquet' %>
- ## Preparing IPython Notebook
- Launch IPython Notebook with PySpark using the following command, with
- `$SPARK_HOME` replaced by the location of Apache Spark.
- ```
- $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark
- ```
- By default, you should be able to access your IPython Notebook via web browser
- at http://localhost:8888.
- Let's initialize our notebook for the following code in the first cell.
- ```python
- import pandas as pd
- def rows_to_df(rows):
- return pd.DataFrame(map(lambda e: e.asDict(), rows))
- from pyspark.sql import SQLContext
- sqlc = SQLContext(sc)
- rdd = sqlc.parquetFile("/tmp/movies")
- rdd.registerTempTable("events")
- ```
- ![Initialization for IPython Notebook](/images/datacollection/ipynb-01.png)
- `rows_to_df(rows)` will come in handy when we want to dump the results from
- Spark SQL using IPython Notebook's native table rendering.
- ## Performing Analysis with Spark SQL
- If all steps above ran successfully, you should have a ready-to-use analytics
- environment by now. Let's try a few examples to see if everything is functional.
- In the second cell, put in this piece of code and run it.
- ```python
- summary = sqlc.sql("SELECT "
- "entityType, event, targetEntityType, COUNT(*) AS c "
- "FROM events "
- "GROUP BY entityType, event, targetEntityType").collect()
- rows_to_df(summary)
- ```
- You should see the following screen.
- ![Summary of Events](/images/datacollection/ipynb-02.png)
- We can also plot our data, in the next two cells.
- ```python
- import matplotlib.pyplot as plt
- count = map(lambda e: e.c, summary)
- event = map(lambda e: "%s (%d)" % (e.event, e.c), summary)
- colors = ['gold', 'lightskyblue']
- plt.pie(count, labels=event, colors=colors, startangle=90, autopct="%1.1f%%")
- plt.axis('equal')
- plt.show()
- ```
- ![Summary in Pie Chart](/images/datacollection/ipynb-03.png)
- ```python
- ratings = sqlc.sql("SELECT properties.rating AS r, COUNT(*) AS c "
- "FROM events "
- "WHERE properties.rating IS NOT NULL "
- "GROUP BY properties.rating "
- "ORDER BY r").collect()
- count = map(lambda e: e.c, ratings)
- rating = map(lambda e: "%s (%d)" % (e.r, e.c), ratings)
- colors = ['yellowgreen', 'plum', 'gold', 'lightskyblue', 'lightcoral']
- plt.pie(count, labels=rating, colors=colors, startangle=90,
- autopct="%1.1f%%")
- plt.axis('equal')
- plt.show()
- ```
- ![Breakdown of Ratings](/images/datacollection/ipynb-04.png)
- Happy analyzing!