# Whole-Stage Java Code Generation

**Whole-Stage Java Code Generation** (_Whole-Stage CodeGen_) is a physical query optimization in Spark SQL that fuses multiple physical operators (as a subtree of plans that [support code generation](physical-operators/CodegenSupport.md)) together into a single Java function.

Whole-Stage Java Code Generation improves the execution performance of a query by collapsing a query tree into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.
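
The effect of fusing operators can be sketched in plain Scala without Spark. The snippet below is only a toy model: `FusionSketch`, `volcano` and `fused` are hypothetical names, and the "operators" are ordinary Scala iterators rather than Spark physical operators. It contrasts an iterator-per-operator pipeline (one virtual call per operator per row) with a hand-fused loop that keeps intermediate values in local variables.

```scala
// Toy model only: none of these names are Spark classes.
object FusionSketch {

  // Iterator-per-operator style: scan -> filter -> project -> aggregate.
  // Every row passes through hasNext/next calls of each chained iterator.
  def volcano(n: Int): Long = {
    val scan: Iterator[Int] = Iterator.range(0, n)
    val filtered: Iterator[Int] = scan.filter(_ % 2 == 0)
    val projected: Iterator[Long] = filtered.map(_.toLong * 2)
    projected.sum
  }

  // Fused style: the same pipeline collapsed into a single loop.
  // Intermediate values live in local variables, with no per-row iterator calls.
  def fused(n: Int): Long = {
    var sum = 0L
    var i = 0
    while (i < n) {
      if (i % 2 == 0) sum += i.toLong * 2
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val n = 1000
    // Both variants compute the same result; only the execution shape differs.
    assert(volcano(n) == fused(n))
    println(s"volcano=${volcano(n)} fused=${fused(n)}")
  }
}
```

The fused variant is roughly the shape of code that whole-stage code generation aims to produce for a subtree of operators, here written by hand for illustration.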

!!! note
    Whole-Stage Code Generation is used by some modern massively parallel processing (MPP) databases to achieve better query execution performance.

    See [Efficiently Compiling Efficient Query Plans for Modern Hardware (PDF)](http://www.vldb.org/pvldb/vol4/p539-neumann.pdf).

!!! note
    [Janino](https://janino-compiler.github.io/janino/) is used to compile the generated Java source code into a Java class at runtime.

## CollapseCodegenStages Physical Preparation Rule

Before a query is executed, the [CollapseCodegenStages](physical-optimizations/CollapseCodegenStages.md) physical preparation rule finds the physical operators that support code generation and collapses them together as `WholeStageCodegen` (possibly with [InputAdapter](physical-operators/InputAdapter.md) in between for physical operators with no support for Java code generation).

`CollapseCodegenStages` is part of the sequence of physical preparation rules ([QueryExecution.preparations](QueryExecution.md#preparations)) that are applied in order to the physical plan before execution.
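
The collapsing idea can be illustrated with a small, Spark-free model. This is a simplified sketch under stated assumptions, not the actual rule: `CollapseSketch`, `Plan`, `Op` and the `codegen` flag are hypothetical stand-ins, and any special cases the real `CollapseCodegenStages` handles are ignored. A stage starts at the first operator that supports codegen, keeps growing through codegen-capable children, and is cut with an `InputAdapter` when an operator without codegen support is reached.

```scala
// Toy model of the collapsing idea; not Spark's implementation.
object CollapseSketch {
  sealed trait Plan
  // `codegen` is a hypothetical flag: does the operator support code generation?
  final case class Op(name: String, codegen: Boolean, children: List[Plan] = Nil) extends Plan
  final case class WholeStage(child: Plan) extends Plan
  final case class InputAdapter(child: Plan) extends Plan

  // Outside a stage: open a new stage at the first operator that supports codegen.
  def collapse(plan: Plan): Plan = plan match {
    case op: Op if op.codegen => WholeStage(op.copy(children = op.children.map(inside)))
    case op: Op               => op.copy(children = op.children.map(collapse))
    case other                => other
  }

  // Inside a stage: keep fusing codegen-capable operators;
  // cut the stage with InputAdapter at an operator without codegen support.
  private def inside(plan: Plan): Plan = plan match {
    case op: Op if op.codegen => op.copy(children = op.children.map(inside))
    case op: Op               => InputAdapter(collapse(op))
    case other                => other
  }

  def main(args: Array[String]): Unit = {
    val plan = Op("Project", codegen = true, List(
      Op("Filter", codegen = true, List(
        Op("Exchange", codegen = false, List(
          Op("Range", codegen = true)))))))
    println(collapse(plan))
  }
}
```

In this model `Project` and `Filter` end up in one stage, `Exchange` sits behind an `InputAdapter`, and `Range` below it opens a second stage.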

## debugCodegen

The [debugCodegen](spark-sql-debugging-query-execution.md#debugCodegen) and [QueryExecution.debug.codegen](QueryExecution.md#debug) methods give access to the generated Java source code of a structured query.

As of [Spark 3.0.0](https://issues.apache.org/jira/browse/SPARK-29061), `debugCodegen` also prints Java bytecode statistics of the generated classes (as compiled by Janino).

```text
scala> import org.apache.spark.sql.execution.debug._
scala> val q = "SELECT sum(v) FROM VALUES(1) t(v)"
scala> sql(q).debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
...
== Subtree 2 / 2 (maxMethodCodeSize:139; maxConstantPoolSize:137(0.21% used); numInnerClasses:0) ==
```

## spark.sql.codegen.wholeStage

Whole-Stage Code Generation is controlled by the [spark.sql.codegen.wholeStage](spark-sql-properties.md#spark.sql.codegen.wholeStage) Spark internal property.

Whole-Stage Code Generation is on by default.

```text
assert(spark.sessionState.conf.wholeStageEnabled)
```
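
One way to observe the effect of the property is to count the whole-stage nodes in the executed plan with the property on and off. The sketch below assumes Spark SQL is on the classpath and a local master is acceptable; `WholeStageToggle` and `codegenStages` are hypothetical names, and the query is deliberately a simple one without a shuffle. With the property enabled the count is expected to be non-zero, and zero with it disabled.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.WholeStageCodegenExec

// Hypothetical helper object, not part of Spark.
object WholeStageToggle {

  // Counts whole-stage codegen nodes in the executed plan of a simple query.
  // A fresh query is planned on every call, so the current setting is picked up.
  def codegenStages(spark: SparkSession): Int = {
    val plan = spark.range(10).filter("id % 2 = 0").queryExecution.executedPlan
    plan.collect { case w: WholeStageCodegenExec => w }.size
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local execution is enough here
      .appName("whole-stage-toggle")
      .getOrCreate()
    try {
      spark.conf.set("spark.sql.codegen.wholeStage", true)
      println(s"wholeStage=true  stages=${codegenStages(spark)}")

      spark.conf.set("spark.sql.codegen.wholeStage", false)
      println(s"wholeStage=false stages=${codegenStages(spark)}")
    } finally {
      spark.stop()
    }
  }
}
```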

## Code Generation Paths

Code generation paths were coined in [this commit](https://github.com/apache/spark/commit/70221903f54eaa0514d5d189dfb6f175a62228a8).

!!! tip
    Review [SPARK-12795 Whole stage codegen](https://issues.apache.org/jira/browse/SPARK-12795) to learn about the work to support it.

### Non-Whole-Stage-Codegen Path

### Produce Path

Whole-stage-codegen "produce" path

A [physical operator](physical-operators/SparkPlan.md) with [CodegenSupport](physical-operators/CodegenSupport.md) can [generate Java source code to process the rows from input RDDs](physical-operators/CodegenSupport.md#doProduce).

### Consume Path

Whole-stage-codegen "consume" path
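
How the two paths interact can be sketched with a toy code generator. This is an illustrative model only: `ProduceConsumeSketch`, `Op`, `RangeOp`, `FilterOp` and `Sink` are hypothetical and much simpler than Spark's `CodegenSupport`. The "produce" request travels from the root down to the leaf, which emits the loop over its rows; the "consume" calls then travel back up, with each parent contributing the code that handles a single row.

```scala
// Toy code generator; a sketch of the produce/consume idea, not Spark's API.
object ProduceConsumeSketch {

  sealed trait Op {
    private var parent: Option[Op] = None

    // "Produce" path: remember who asked, then generate code that produces rows.
    final def produce(requester: Op): String = { parent = Some(requester); doProduce() }
    protected def doProduce(): String

    // "Consume" path: code that handles one row held in variable `row`.
    def doConsume(row: String): String

    // Hands a produced row over to the parent operator's doConsume.
    protected final def consume(row: String): String = parent.get.doConsume(row)
  }

  // Leaf operator: owns the loop and pushes every value to its parent.
  final case class RangeOp(n: Int) extends Op {
    protected def doProduce(): String = {
      val body = consume("i")
      s"for (int i = 0; i < $n; i++) {\n$body\n}"
    }
    def doConsume(row: String): String =
      throw new UnsupportedOperationException("a leaf operator has no child to consume from")
  }

  // Unary operator: delegates produce to the child, wraps the parent's code in a condition.
  final case class FilterOp(predicate: String => String, child: Op) extends Op {
    protected def doProduce(): String = child.produce(this)
    def doConsume(row: String): String = {
      val body = consume(row)
      s"if (${predicate(row)}) {\n$body\n}"
    }
  }

  // Root of the stage: starts the produce path and terminates the consume path.
  final case class Sink(child: Op) extends Op {
    def generate(): String = doProduce()
    protected def doProduce(): String = child.produce(this)
    def doConsume(row: String): String = s"append($row);"
  }

  def main(args: Array[String]): Unit = {
    val stage = Sink(FilterOp(row => s"$row % 2 == 0", RangeOp(10)))
    println(stage.generate())
  }
}
```

The generated text is a single loop with the filter condition and the sink's `append` call inlined into its body, which is the fused shape described at the top of this page.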

## BenchmarkWholeStageCodegen Performance Benchmark

`BenchmarkWholeStageCodegen` class provides a benchmark to measure whole-stage codegen performance.

You can execute it using the command:

```
build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'
```

!!! note
    You need to un-ignore tests in `BenchmarkWholeStageCodegen` by replacing `ignore` with `test`.

```
$ build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'
...
Running benchmark: range/limit/sum
  Running case: range/limit/sum codegen=false
22:55:23.028 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Running case: range/limit/sum codegen=true

Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz

range/limit/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/limit/sum codegen=false             376 /  433       1394.5           0.7       1.0X
range/limit/sum codegen=true              332 /  388       1581.3           0.6       1.1X

[info] - range/limit/sum (10 seconds, 74 milliseconds)
```