

http://github.com/velvia/ScalaStorm
ScalaStorm provides a Scala DSL for Nathan Marz's [Storm](https://github.com/nathanmarz/storm) real-time computation system. It also provides a framework for Scala and SBT development of Storm topologies.

For example, here is the SplitSentence bolt from the word count topology:

```scala
class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(sentence: String) =>
      sentence split " " foreach { word => using anchor t emit (word) }
      t ack
  }
}
```
A couple of things to note here:

* The matchSeq DSL enables Scala pattern matching on Storm tuples. Notice how it gives you a nice way to name and identify the type of each component. Now imagine the ability to match on different tuple types, like in a join, easily and elegantly!
* The emit DSL reads like English and easily takes multiple args (val1, val2, ...)
* Output fields are easily declared
* It's easy to see exactly when the emits and ack happen

Useful features for Scala developers:

* Auto-boxing of Scala primitives in tuple emit and matchSeq
* A BoltDsl trait for using the DSL from any thread/actor/class
0.2.4
=====

Added support for multiple streams in Spouts:

```scala
class MultiStreamSpout extends StormSpout(Map("city" -> List("city"), "browser" -> List("browser"))) {
}
```

* Switched to the Apache Storm distribution
* Build system updated to sbt 0.13.5
* Build system supports cross-compiling for Scala 2.9/2.10
* ShutdownFunc trait added to StormSpout
Please Read For 0.2.2 / Storm 0.8.0+ Users
==========================================

Storm 0.8.0 emits are no longer thread safe. You may see NullPointerExceptions with DisruptorQueue in the stack trace.
If you are doing emits from multiple threads or actors, you will need to synchronize your emits or have them
come from a single thread. You should synchronize on the collector instance:

```scala
_collector.synchronized { tuple emit (val1, val2) }
```
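The pattern can be sketched in plain Scala, independent of Storm: a single shared lock object (standing in for the collector instance) serializes writes from several threads. `SyncEmitSketch`, `emitSafely`, and the StringBuilder buffer are illustrative names for this sketch, not ScalaStorm API.

```scala
// Standalone sketch of the synchronization pattern recommended above.
// A non-thread-safe StringBuilder plays the role of the collector; every
// "emit" goes through collector.synchronized so writes never interleave.
object SyncEmitSketch {
  val collector = new StringBuilder // not thread safe, like the 0.8.0 collector path

  def emitSafely(word: String): Unit =
    collector.synchronized { collector.append(word) } // one emit at a time

  def main(args: Array[String]): Unit = {
    val threads = (1 to 4).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = (1 to 100).foreach(_ => emitSafely("x"))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    // All 400 appends survive because each one held the lock.
    println(collector.length)
  }
}
```

The same shape applies inside a bolt: wrap each `emit` in `_collector.synchronized { ... }`, or route all emits through one dedicated thread.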
## Functional Trident (NEW!)

There is a sample Trident topology in src/storm/scala/examples/trident. It features an
experimental new DSL for writing functional Trident topologies (see FunctionalTrident.scala). I am
currently soliciting feedback on this feature, so drop me a line if you like it.
Getting Started
===============

The latest version of scala-storm, 0.2.2, corresponds to Storm 0.8.1 and is available from Maven Central. Add this to your build.sbt:

```scala
libraryDependencies += "com.github.velvia" %% "scala-storm" % "0.2.2"
```

Version 0.2.0 is also available from Maven Central and corresponds to Storm 0.7.1:

```scala
libraryDependencies += "com.github.velvia" %% "scala-storm" % "0.2.0"
```

In both cases you will need additional repos, as Maven Central does not host the Storm/Clojure jars:

```scala
resolvers ++= Seq("clojars" at "http://clojars.org/repo/",
                  "clojure-releases" at "http://build.clojure.org/releases")
```

If you want to build from source:

* Download [sbt](https://github.com/harrah/xsbt/wiki) version 0.10.1 or above
* Clone this project
* In the root project dir, type `sbt test:run`. SBT will automatically download all dependencies, compile the code, and give you a menu of topologies to run.

To help you get started, the ExclamationTopology and WordCountTopology examples from storm-starter have been included.
Bolt DSL
========

The Scala DSL for bolts is designed to support many different bolt designs, including all 10 variants of the collector emit() and emitDirect() APIs. Getting started consists of extending the StormBolt class, passing a list of output fields, and defining the execute method:

```scala
class ExclamationBolt extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = {
    t emit (t.getString(0) + "!!!")
    t ack
  }
}
```

If you need to emit to multiple output streams, pass a Map whose keys are the stream names/IDs and whose values are the lists of fields for each stream (see the AggregationTopology example):

```scala
class Splitter extends StormBolt(Map("city" -> List("city"), "browser" -> List("browser"))) {
}
```
BoltDsl trait
-------------

If you want to use the emit DSL described below from a thread or Actor, you can use the BoltDsl trait. You just have to initialise the _collector variable.

```scala
class DataWorker(val collector: OutputCollector) extends Actor with BoltDsl {
  _collector = collector
  // ...
  def receive = {
    case msg =>
      no anchor emit (someString, someInt)
  }
}
```
matchSeq
--------

The `matchSeq` method passes the Storm tuple as a Scala Seq to the given code block with one or more case statements. The case statements need to use Seq() in order to match the tuple. If none of the cases match, then by default a handler which throws a RuntimeException is used; it is a good idea to include your own default handler.

matchSeq allows easy naming and safe typing of tuple components, and allows easy parsing of different tuple types. Suppose that a bolt takes in a data stream from one source and a clock or timing-related stream from another source. It can be handled like this:

```scala
def execute(t: Tuple) = t matchSeq {
  case Seq(username: String, followers: List[String]) => // process data
  case Seq(timestamp: Integer) => // process clock event
}
```

Unboxing will be performed automatically. Even though everything going over the wire has to be a java.lang.Object, if you match on a Scala primitive, it will automatically be unboxed for you.

By default, if none of the cases match, ScalaStorm throws a RuntimeException with the message "unhandled tuple". This can be useful for debugging in local mode to quickly discover matching errors. If you want to handle the unhandled case yourself, simply add `case _ => ...` as the last case.
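The dispatch logic can be sketched in plain Scala without the library: pattern matching on a Seq of boxed values, with a final wildcard case playing the role of a custom default handler. `MatchSeqSketch` and `dispatch` are hypothetical names for this sketch, not ScalaStorm API.

```scala
// Standalone sketch of matchSeq-style dispatch: two tuple shapes are
// recognized, and the trailing wildcard replaces the built-in
// "unhandled tuple" RuntimeException with custom handling.
object MatchSeqSketch {
  def dispatch(tuple: Seq[AnyRef]): String = tuple match {
    case Seq(sentence: String)   => "data: " + sentence   // data stream
    case Seq(timestamp: Integer) => "clock: " + timestamp // clock stream
    case _                       => "unhandled tuple"     // custom default handler
  }

  def main(args: Array[String]): Unit = {
    println(dispatch(Seq("hello world")))   // matches the String case
    println(dispatch(Seq(Int.box(42))))     // matches the Integer case
    println(dispatch(Seq(Double.box(1.5)))) // falls through to the default
  }
}
```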
emit and emitDirect
-------------------

emit takes a variable number of AnyRef arguments which make up the tuple to emit. emitDirect is the same, but the first argument is the Int taskId, followed by a variable number of AnyRefs.

To emit a tuple anchored on one tuple, where t is of type Tuple, do one of the following:

```scala
using anchor t emit (val1, val2, ...)
using anchor t emitDirect (taskId, val1, val2, ...)
anchor(t) emit (val1, val2, ...)
t emit (val1, val2, ...)
```

To emit a tuple to a particular stream:

```scala
using anchor t toStream 5 emit (val1, val2, ...)
using anchor t toStream 5 emitDirect (taskId, val1, val2, ...)
```

To emit anchored on multiple tuples (can be any Seq, not just a List):

```scala
using anchor List(t1, t2) emit (val1, val2, ...)
using anchor List(t1, t2) emitDirect (taskId, val1, val2, ...)
```

To emit unanchored:

```scala
using no anchor emit (val1, val2, ...)
using no anchor emitDirect (taskId, val1, val2, ...)
using no anchor toStream 5 emit (val1, val2, ...)
using no anchor toStream 5 emitDirect (taskId, val1, val2, ...)
```
ack
---

```scala
t ack             // Ack one tuple
List(t1, t2) ack  // Ack multiple tuples, in the order of the list
```

A note on types supported by emit (...)
---------------------------------------

Any Scala type may be passed to emit() so long as it can be autoboxed into an AnyRef (java.lang.Object). This includes Scala Ints, Longs, and other basic types.
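The boxing behaviour can be demonstrated in plain Scala. Here `emitValues` is a hypothetical stand-in for the real emit, which likewise accepts AnyRef arguments; it just reports the runtime class each value was boxed into.

```scala
// Standalone sketch: Scala primitives become java.lang wrapper objects when
// boxed to AnyRef, which is why they can travel over the wire as Objects.
object AutoboxSketch {
  // Stand-in for emit(...): accepts AnyRef* like the real collector APIs.
  def emitValues(values: AnyRef*): Seq[String] =
    values.map(_.getClass.getSimpleName)

  def main(args: Array[String]): Unit = {
    // 1 boxes to java.lang.Integer, 2L to java.lang.Long,
    // 3.5 to java.lang.Double; String is already an AnyRef.
    println(emitValues(Int.box(1), Long.box(2L), Double.box(3.5), "four"))
  }
}
```

Inside the DSL the boxing is done for you, so you can write `t emit (someInt, someLong)` directly.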
Spout DSL
=========

The Scala Spout DSL is very similar to the Bolt DSL. You extend the StormSpout class, declaring the output fields, and define the nextTuple method:

```scala
class MySpout extends StormSpout(outputFields = List("word", "author")) {
  def nextTuple = {}
}
```
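A typical nextTuple draws from per-instance state, such as an iterator cycling over a source of sentences. The following is a standalone sketch of that state management only; `SpoutSketch`, `nextSentence`, and the canned sentences are hypothetical names, not ScalaStorm API.

```scala
// Standalone sketch of the per-instance state behind a spout's nextTuple:
// each call yields the next sentence, wrapping around at the end, which is
// the value a real spout would pass to `emit (sentence)`.
object SpoutSketch {
  val sentences = Array("the cow jumped over the moon",
                        "an apple a day keeps the doctor away")
  private var index = 0

  def nextSentence(): String = {
    val s = sentences(index % sentences.length) // wrap around
    index += 1
    s
  }

  def main(args: Array[String]): Unit = {
    println(nextSentence())
    println(nextSentence())
    println(nextSentence()) // cycles back to the first sentence
  }
}
```

In a real spout this state should be initialized in open() or via the `setup` DSL described below, not in the constructor.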
Spout emit DSL
--------------

The spout emit DSL is very similar to the bolt emit DSL. Again, all variants of the SpoutOutputCollector emit and emitDirect APIs are supported. The basic forms for emitting tuples are as follows:

```scala
emit (val1, val2, ...)
emitDirect (taskId, val1, val2, ...)
```

To emit a tuple with a specific message ID:

```scala
using msgId 9876 emit (val1, val2, ...)
using msgId 9876 emitDirect (taskId, val1, val2, ...)
```

To emit a tuple to a specific stream:

```scala
toStream 6 emit (val1, val2, ...)
toStream 6 emitDirect (taskId, val1, val2, ...)
using msgId 9876 toStream 6 emit (val1, val2, ...)
using msgId 9876 toStream 6 emitDirect (taskId, val1, val2, ...)
```
Bolt and Spout Setup
====================

For all but the simplest of designs, you will probably need to initialize per-instance variables in each bolt and spout. Do not do this in the Bolt or Spout constructor, as the constructor is only called before the topology is submitted. The usual approach is to override the prepare() and open() methods and do your setup there, but ScalaStorm provides a convenient `setup` DSL that lets you perform whatever per-instance initialization is needed in a concise and consistent manner. To use it:

```scala
class MyBolt extends StormBolt(List("word")) {
  var myIterator: Iterator[Int] = _
  setup { myIterator = ... }
}
```
License
=======

Apache 2.0. Please see LICENSE.md.

All contents copyright (c) 2012, Evan Chan.