/docs/manual/obsolete/tutorials/enginebuilders/local-helloworld.html.md

https://gitlab.com/admin-github-cloud/incubator-predictionio
---
title: Building the "HelloWorld" Engine
---

# Building the "HelloWorld" Engine

This is a step-by-step guide to building your first predictive engine on
PredictionIO. The engine uses historical temperature data to predict the
temperature on a given day of the week.

> You need to build PredictionIO from source in order to build your own engine.
> Please follow the instructions to build from source
> [here](/install/install-sourcecode.html).

Completed source code can also be found at
`$PIO_HOME/examples/scala-local-helloworld` and
`$PIO_HOME/examples/java-local-helloworld`, where `$PIO_HOME` is the root
directory of the PredictionIO source code tree.

## Data Set

This engine reads historical daily temperatures as its training data set. A very simple data set is prepared for you.

First, create a directory somewhere and copy the data set over. Replace `path/to/data.csv` with the path where you want to store the training data.

```console
$ cp $PIO_HOME/examples/data/helloworld/data1.csv path/to/data.csv
```

## 1. Create a new Engine

```console
$ $PIO_HOME/bin/pio new HelloWorld
$ cd HelloWorld
```

A new engine project directory `HelloWorld` is created. You should see the following files inside this new project directory:

```
build.sbt
engine.json
params/
project/
src/
```

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

You can find the Scala engine template in <code>src/main/scala/Engine.scala</code>. Please follow the instructions below to edit this file.

</div>
<div data-tab="Java" data-lang="java">

<strong>NOTE:</strong>
The template is created for Scala code. For Java, run the following under the <code>HelloWorld</code> directory:

```bash
$ rm -rf src/main/scala
$ mkdir -p src/main/java
```

</div>
</div>

## 2. Define Data Types

### Define Training Data

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
class MyTrainingData(
  val temperatures: List[(String, Double)]
) extends Serializable
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyTrainingData.java</code>:

```java
package myorg;

import java.io.Serializable;
import java.util.List;

public class MyTrainingData implements Serializable {
  List<DayTemperature> temperatures;

  public MyTrainingData(List<DayTemperature> temperatures) {
    this.temperatures = temperatures;
  }

  public static class DayTemperature implements Serializable {
    String day;
    Double temperature;

    public DayTemperature(String day, Double temperature) {
      this.day = day;
      this.temperature = temperature;
    }
  }
}
```

</div>
</div>

### Define Query

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
class MyQuery(
  val day: String
) extends Serializable
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyQuery.java</code>:

```java
package myorg;

import java.io.Serializable;

public class MyQuery implements Serializable {
  String day;

  public MyQuery(String day) {
    this.day = day;
  }
}
```

</div>
</div>

### Define Model

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
import scala.collection.immutable.HashMap

class MyModel(
  val temperatures: HashMap[String, Double]
) extends Serializable {
  override def toString = temperatures.toString
}
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyModel.java</code>:

```java
package myorg;

import java.io.Serializable;
import java.util.Map;

public class MyModel implements Serializable {
  Map<String, Double> temperatures;

  public MyModel(Map<String, Double> temperatures) {
    this.temperatures = temperatures;
  }

  @Override
  public String toString() {
    return temperatures.toString();
  }
}
```

</div>
</div>

### Define Predicted Result

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
class MyPredictedResult(
  val temperature: Double
) extends Serializable
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyPredictedResult.java</code>:

```java
package myorg;

import java.io.Serializable;

public class MyPredictedResult implements Serializable {
  Double temperature;

  public MyPredictedResult(Double temperature) {
    this.temperature = temperature;
  }
}
```

</div>
</div>

## 3. Implement the Data Source

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
import scala.io.Source

class MyDataSource extends LDataSource[EmptyDataSourceParams, EmptyDataParams,
  MyTrainingData, MyQuery, EmptyActualResult] {

  override def readTraining(): MyTrainingData = {
    // each line holds a day and a temperature separated by a comma
    val lines = Source.fromFile("path/to/data.csv").getLines()
      .toList.map { line =>
        val data = line.split(",")
        (data(0), data(1).toDouble)
      }
    new MyTrainingData(lines)
  }
}
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyDataSource.java</code>:

```java
package myorg;

import org.apache.predictionio.controller.java.*;

import java.util.List;
import java.util.ArrayList;
import java.io.FileReader;
import java.io.BufferedReader;

public class MyDataSource extends LJavaDataSource<
  EmptyDataSourceParams, EmptyDataParams, MyTrainingData, MyQuery, EmptyActualResult> {

  @Override
  public MyTrainingData readTraining() {
    List<MyTrainingData.DayTemperature> temperatures =
      new ArrayList<MyTrainingData.DayTemperature>();

    try {
      BufferedReader reader =
        new BufferedReader(new FileReader("path/to/data.csv"));
      String line;
      // each line holds a day and a temperature separated by a comma
      while ((line = reader.readLine()) != null) {
        String[] tokens = line.split(",");
        temperatures.add(
          new MyTrainingData.DayTemperature(tokens[0],
            Double.parseDouble(tokens[1])));
      }
      reader.close();
    } catch (Exception e) {
      // report the failure before exiting instead of failing silently
      System.err.println("Failed to read training data: " + e);
      System.exit(1);
    }

    return new MyTrainingData(temperatures);
  }
}
```

</div>
</div>

**NOTE**: You need to update `path/to/data.csv` in this code with the correct path that stores the training data.
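
Judging from the parsing code above, each line of the data file is expected to hold a day name and a temperature separated by a comma. The following stand-alone sketch shows that parsing step in isolation, with a check for malformed lines. The class `TemperatureCsv`, its `Reading` type and the sample values are hypothetical and are not part of PredictionIO or of the bundled data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of PredictionIO): parses "day,temperature" lines. */
class TemperatureCsv {

  /** One parsed record; plays the role of MyTrainingData.DayTemperature. */
  static final class Reading {
    final String day;
    final double temperature;

    Reading(String day, double temperature) {
      this.day = day;
      this.temperature = temperature;
    }
  }

  /** Parses one line; rejects lines that do not have exactly two fields. */
  static Reading parseLine(String line) {
    String[] tokens = line.split(",");
    if (tokens.length != 2) {
      throw new IllegalArgumentException("Expected 'day,temperature' but got: " + line);
    }
    return new Reading(tokens[0].trim(), Double.parseDouble(tokens[1].trim()));
  }

  /** Parses every non-blank line and skips blank ones. */
  static List<Reading> parseAll(List<String> lines) {
    List<Reading> readings = new ArrayList<>();
    for (String line : lines) {
      if (!line.trim().isEmpty()) {
        readings.add(parseLine(line));
      }
    }
    return readings;
  }

  public static void main(String[] args) {
    // Made-up sample lines for illustration only; not the contents of data1.csv.
    List<Reading> readings = parseAll(Arrays.asList("Mon,75.5", "", "Tue,80.5"));
    for (Reading r : readings) {
      System.out.println(r.day + " -> " + r.temperature);
    }
  }
}
```

Unlike the tutorial's data source, this sketch fails with a descriptive exception on a malformed line rather than an `ArrayIndexOutOfBoundsException`. Whether that is the right behavior depends on how clean your data is.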

## 4. Implement an Algorithm

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
class MyAlgorithm extends LAlgorithm[EmptyAlgorithmParams, MyTrainingData,
  MyModel, MyQuery, MyPredictedResult] {

  override
  def train(pd: MyTrainingData): MyModel = {
    // calculate average value of each day
    val average = pd.temperatures
      .groupBy(_._1) // group by day
      .mapValues{ list =>
        val tempList = list.map(_._2) // get the temperature
        tempList.sum / tempList.size
      }

    // trait Map is not serializable, use concrete class HashMap
    new MyModel(HashMap[String, Double]() ++ average)
  }

  override
  def predict(model: MyModel, query: MyQuery): MyPredictedResult = {
    // look up the average temperature for the queried day
    val temp = model.temperatures(query.day)
    new MyPredictedResult(temp)
  }
}
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyAlgorithm.java</code>:

```java
package myorg;

import org.apache.predictionio.controller.java.*;

import java.util.Map;
import java.util.HashMap;

public class MyAlgorithm extends LJavaAlgorithm<
  EmptyAlgorithmParams, MyTrainingData, MyModel, MyQuery, MyPredictedResult> {

  @Override
  public MyModel train(MyTrainingData data) {
    Map<String, Double> sumMap = new HashMap<String, Double>();
    Map<String, Integer> countMap = new HashMap<String, Integer>();

    // calculate sum and count for each day
    for (MyTrainingData.DayTemperature temp : data.temperatures) {
      Double sum = sumMap.get(temp.day);
      Integer count = countMap.get(temp.day);
      if (sum == null) {
        sumMap.put(temp.day, temp.temperature);
        countMap.put(temp.day, 1);
      } else {
        sumMap.put(temp.day, sum + temp.temperature);
        countMap.put(temp.day, count + 1);
      }
    }

    // calculate the average
    Map<String, Double> averageMap = new HashMap<String, Double>();
    for (Map.Entry<String, Double> entry : sumMap.entrySet()) {
      String day = entry.getKey();
      Double average = entry.getValue() / countMap.get(day);
      averageMap.put(day, average);
    }

    return new MyModel(averageMap);
  }

  @Override
  public MyPredictedResult predict(MyModel model, MyQuery query) {
    // look up the average temperature for the queried day
    Double temp = model.temperatures.get(query.day);
    return new MyPredictedResult(temp);
  }
}
```

</div>
</div>
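
To see what the training step computes without running PredictionIO, the averaging logic can be pulled out into plain Java. This is a minimal sketch under stated assumptions: the class `DailyAverage` and the sample readings are hypothetical, and it uses `Map.merge` (Java 8 or later) in place of the explicit null checks in the tutorial code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-alone version of the averaging step; uses no PredictionIO classes. */
class DailyAverage {

  /** Returns the mean temperature per day; days.get(i) pairs with temps.get(i). */
  static Map<String, Double> train(List<String> days, List<Double> temps) {
    Map<String, Double> sums = new HashMap<>();
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i < days.size(); i++) {
      // accumulate the running sum and count for each day
      sums.merge(days.get(i), temps.get(i), Double::sum);
      counts.merge(days.get(i), 1, Integer::sum);
    }

    Map<String, Double> averages = new HashMap<>();
    for (Map.Entry<String, Double> entry : sums.entrySet()) {
      averages.put(entry.getKey(), entry.getValue() / counts.get(entry.getKey()));
    }
    return averages;
  }

  public static void main(String[] args) {
    // Made-up readings for illustration only.
    List<String> days = Arrays.asList("Mon", "Tue", "Mon");
    List<Double> temps = Arrays.asList(75.0, 80.0, 76.0);

    Map<String, Double> model = train(days, temps);
    System.out.println("Mon: " + model.get("Mon")); // (75.0 + 76.0) / 2
    System.out.println("Tue: " + model.get("Tue"));
    System.out.println("Sun: " + model.get("Sun")); // null: no readings for this day
  }
}
```

The last lookup shows a limitation of this simple model: a query for a day that never appears in the training data has no average to return.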

## 5. Implement EngineFactory

<div class="tabs">
<div data-tab="Scala" data-lang="scala">

Edit <code>src/main/scala/Engine.scala</code>:

```scala
object MyEngineFactory extends IEngineFactory {
  override
  def apply() = {
    /* SimpleEngine only requires one DataSource and one Algorithm */
    new SimpleEngine(
      classOf[MyDataSource],
      classOf[MyAlgorithm]
    )
  }
}
```

</div>
<div data-tab="Java" data-lang="java">

Create a new file <code>src/main/java/MyEngineFactory.java</code>:

```java
package myorg;

import org.apache.predictionio.controller.java.*;

public class MyEngineFactory implements IJavaEngineFactory {
  public JavaSimpleEngine<MyTrainingData, EmptyDataParams, MyQuery, MyPredictedResult,
    EmptyActualResult> apply() {

    return new JavaSimpleEngineBuilder<MyTrainingData, EmptyDataParams,
      MyQuery, MyPredictedResult, EmptyActualResult> ()
      .dataSourceClass(MyDataSource.class)
      .preparatorClass() // Use default Preparator
      .addAlgorithmClass("", MyAlgorithm.class)
      .servingClass() // Use default Serving
      .build();
  }
}
```

</div>
</div>

## 6. Define engine.json

You should see an `engine.json` created as follows:

```json
{
  "id": "helloworld",
  "version": "0.0.1-SNAPSHOT",
  "name": "helloworld",
  "engineFactory": "myorg.MyEngineFactory"
}
```

If you followed this Hello World Engine tutorial and did not change any class or package names (`myorg`), you don't need to update this file.

## 7. Define Parameters

You can safely delete the file `params/datasource.json` because this Hello World Engine doesn't take any parameters.

```
$ rm params/datasource.json
```

# Deploying the "HelloWorld" Engine Instance

After the new engine is built, it is time to deploy an engine instance of it.

## 1. Register engine:

```bash
$ $PIO_HOME/bin/pio register
```

This command will compile the engine source code and build the necessary binary.

## 2. Train:

```bash
$ $PIO_HOME/bin/pio train
```

Example output:

```
2014-09-18 15:44:57,568 INFO spark.SparkContext - Job finished: collect at Workflow.scala:677, took 0.138356 s
2014-09-18 15:44:57,757 INFO workflow.CoreWorkflow$ - Saved engine instance with ID: zdoo7SGAT2GVX8dMJFzT5w
```

This command produces an Engine Instance, which can be deployed.

## 3. Deploy:

```bash
$ $PIO_HOME/bin/pio deploy
```

You should see the following if the engine instance is deployed successfully:

```
[INFO] [10/13/2014 18:11:09.721] [pio-server-akka.actor.default-dispatcher-4] [akka://pio-server/user/IO-HTTP/listener-0] Bound to localhost/127.0.0.1:8000
[INFO] [10/13/2014 18:11:09.724] [pio-server-akka.actor.default-dispatcher-7] [akka://pio-server/user/master] Bind successful. Ready to serve.
```

Do not kill the deployed Engine Instance. You can retrieve a prediction by sending an HTTP request to the engine instance.

Open another terminal and retrieve the temperature prediction for Monday:

```bash
$ curl -H "Content-Type: application/json" -d '{ "day": "Mon" }' http://localhost:8000/queries.json
```

You should see the following output:

```json
{"temperature":75.5}
```

You can send another query to retrieve another prediction. For example, retrieve the temperature prediction for Tuesday:

```bash
$ curl -H "Content-Type: application/json" -d '{ "day": "Tue" }' http://localhost:8000/queries.json
```

You should see the following output:

```json
{"temperature":80.5}
```
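
The same queries can be sent from code instead of `curl`. The sketch below posts the JSON body to the `queries.json` endpoint used above, relying only on the JDK's `HttpURLConnection`. The class `QueryClient` and its methods are hypothetical names introduced for illustration. It assumes the engine is deployed on `localhost:8000` as in this tutorial, and the JSON escaping is minimal (quotes and backslashes only).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical client for the deployed engine; not part of PredictionIO. */
class QueryClient {

  /** Builds the request body, e.g. {"day":"Mon"}. */
  static String buildQueryJson(String day) {
    // escape backslashes and quotes so the value stays a valid JSON string
    String escaped = day.replace("\\", "\\\\").replace("\"", "\\\"");
    return "{\"day\":\"" + escaped + "\"}";
  }

  /** Sends the query as an HTTP POST and returns the raw response body. */
  static String postQuery(String endpoint, String day) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    try (OutputStream out = conn.getOutputStream()) {
      out.write(buildQueryJson(day).getBytes(StandardCharsets.UTF_8));
    }

    try (InputStream in = conn.getInputStream();
         Scanner scanner = new Scanner(in, "UTF-8")) {
      // "\\A" matches the start of input, so next() returns the whole body
      scanner.useDelimiter("\\A");
      return scanner.hasNext() ? scanner.next() : "";
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    // Requires the engine instance from step 3 to be running.
    System.out.println(postQuery("http://localhost:8000/queries.json", "Mon"));
  }
}
```

For anything beyond a quick check, a JSON library is a safer way to build and parse the request and response than string concatenation.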

# Re-training The Engine

Let's say you have collected more historical temperature data and want to re-train the Engine with the updated data. You can simply execute `pio train` and `pio deploy` again.

Another temperature data set is prepared for you. Run the following to update your data with this new data set. Replace `path/to/data.csv` with the path you used in the steps above.

```bash
$ cp $PIO_HOME/examples/data/helloworld/data2.csv path/to/data.csv
```

In another terminal, go to the `HelloWorld` engine directory. Execute `pio train` and `pio deploy` again to deploy the latest instance trained with the new data. Deploying automatically kills the old running engine instance.

```bash
$ $PIO_HOME/bin/pio train
$ $PIO_HOME/bin/pio deploy
```

Retrieve the temperature prediction for Monday again:

```bash
$ curl -H "Content-Type: application/json" -d '{ "day": "Mon" }' http://localhost:8000/queries.json
```

You should see the following output:

```json
{"temperature":76.66666666666667}
```
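
The prediction changes because the model is a per-day mean, and a mean shifts whenever new readings are added. The contents of the two data files are not shown in this tutorial, so the numbers in the sketch below are purely hypothetical. They were picked only to be consistent with the two outputs above: a Monday mean of 75.5 over two readings, plus one new reading of 79.0. `RunningMean` is an illustrative name.

```java
/** Hypothetical illustration of how one extra reading moves a mean. */
class RunningMean {

  /** Returns the mean after adding newValue to oldCount values whose mean is oldMean. */
  static double addReading(double oldMean, int oldCount, double newValue) {
    return (oldMean * oldCount + newValue) / (oldCount + 1);
  }

  public static void main(String[] args) {
    // (75.5 * 2 + 79.0) / 3 = 230.0 / 3, roughly 76.67
    System.out.println(addReading(75.5, 2, 79.0));
  }
}
```

The tutorial's algorithm does not update the mean incrementally. It recomputes every average from the full data file each time `pio train` runs, which gives the same result.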

Check out the [Java Parallel Helloworld tutorial](parallel-helloworld.html)
if you are interested in how things are done on the parallel side.