/site/publish/docs/r0.11.1/perf.html

  1. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  2. <html>
  3. <head>
  4. <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  5. <meta content="Apache Forrest" name="Generator">
  6. <meta name="Forrest-version" content="0.9">
  7. <meta name="Forrest-skin-name" content="pelt">
  8. <title>Performance and Efficiency</title>
  9. <link type="text/css" href="skin/basic.css" rel="stylesheet">
  10. <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
  11. <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
  12. <link type="text/css" href="skin/profile.css" rel="stylesheet">
  13. <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
  14. <link rel="shortcut icon" href="">
  15. </head>
  16. <body onload="init()">
  17. <script type="text/javascript">ndeSetTextSize();</script>
  18. <div id="top">
  19. <!--+
  20. |breadtrail
  21. +-->
  22. <div class="breadtrail">
  23. <a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/pig/">Pig</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
  24. </div>
  25. <!--+
  26. |header
  27. +-->
  28. <div class="header">
  29. <!--+
  30. |start group logo
  31. +-->
  32. <div class="grouplogo">
  33. <a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
  34. </div>
  35. <!--+
  36. |end group logo
  37. +-->
  38. <!--+
  39. |start Project Logo
  40. +-->
  41. <div class="projectlogo">
  42. <a href="http://hadoop.apache.org/pig/"><img class="logoImage" alt="Pig" src="images/pig-logo.gif" title="A platform for analyzing large datasets."></a>
  43. </div>
  44. <!--+
  45. |end Project Logo
  46. +-->
  47. <!--+
  48. |start Search
  49. +-->
  50. <div class="searchbox">
  51. <form action="http://www.google.com/search" method="get" class="roundtopsmall">
  52. <input value="" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
  53. <input name="Search" value="Search" type="submit">
  54. </form>
  55. </div>
  56. <!--+
  57. |end search
  58. +-->
  59. <!--+
  60. |start Tabs
  61. +-->
  62. <ul id="tabs">
  63. <li>
  64. <a class="unselected" href="http://hadoop.apache.org/pig/">Project</a>
  65. </li>
  66. <li>
  67. <a class="unselected" href="http://wiki.apache.org/pig/">Wiki</a>
  68. </li>
  69. <li class="current">
  70. <a class="selected" href="index.html">Pig 0.11.1 Documentation</a>
  71. </li>
  72. </ul>
  73. <!--+
  74. |end Tabs
  75. +-->
  76. </div>
  77. </div>
  78. <div id="main">
  79. <div id="publishedStrip">
  80. <!--+
  81. |start Subtabs
  82. +-->
  83. <div id="level2tabs"></div>
  84. <!--+
  85. |end Endtabs
  86. +-->
  87. <script type="text/javascript"><!--
  88. document.write("Last Published: " + document.lastModified);
  89. // --></script>
  90. </div>
  91. <!--+
  92. |breadtrail
  93. +-->
  94. <div class="breadtrail">
  95. &nbsp;
  96. </div>
  97. <!--+
  98. |start Menu, mainarea
  99. +-->
  100. <!--+
  101. |start Menu
  102. +-->
  103. <div id="menu">
  104. <div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Pig</div>
  105. <div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
  106. <div class="menuitem">
  107. <a href="index.html">Overview</a>
  108. </div>
  109. <div class="menuitem">
  110. <a href="start.html">Getting Started</a>
  111. </div>
  112. <div class="menuitem">
  113. <a href="basic.html">Pig Latin Basics</a>
  114. </div>
  115. <div class="menuitem">
  116. <a href="func.html">Built In Functions</a>
  117. </div>
  118. <div class="menuitem">
  119. <a href="udf.html">User Defined Functions</a>
  120. </div>
  121. <div class="menuitem">
  122. <a href="cont.html">Control Structures</a>
  123. </div>
  124. <div class="menuitem">
  125. <a href="cmds.html">Shell and Utility Commands</a>
  126. </div>
  127. <div class="menupage">
  128. <div class="menupagetitle">Performance and Efficiency</div>
  129. </div>
  130. <div class="menuitem">
  131. <a href="test.html">Testing and Diagnostics</a>
  132. </div>
  133. <div class="menuitem">
  134. <a href="pig-index.html">Index</a>
  135. </div>
  136. </div>
  137. <div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Miscellaneous</div>
  138. <div id="menu_1.2" class="menuitemgroup">
  139. <div class="menuitem">
  140. <a href="api/">API Docs</a>
  141. </div>
  142. <div class="menuitem">
  143. <a href="jdiff/changes.html">API Changes</a>
  144. </div>
  145. <div class="menuitem">
  146. <a href="https://cwiki.apache.org/confluence/display/PIG">Wiki</a>
  147. </div>
  148. <div class="menuitem">
  149. <a href="https://cwiki.apache.org/confluence/display/PIG/FAQ">FAQ</a>
  150. </div>
  151. <div class="menuitem">
  152. <a href="http://hadoop.apache.org/pig/releases.html">Release Notes</a>
  153. </div>
  154. </div>
  155. <div id="credit"></div>
  156. <div id="roundbottom">
  157. <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
  158. <!--+
  159. |alternative credits
  160. +-->
  161. <div id="credit2"></div>
  162. </div>
  163. <!--+
  164. |end Menu
  165. +-->
  166. <!--+
  167. |start content
  168. +-->
  169. <div id="content">
  170. <div title="Portable Document Format" class="pdflink">
  171. <a class="dida" href="perf.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
  172. PDF</a>
  173. </div>
  174. <h1>Performance and Efficiency</h1>
  175. <div id="front-matter">
  176. <div id="minitoc-area">
  177. <ul class="minitoc">
  178. <li>
  179. <a href="#profiling">Timing your UDFs</a>
  180. </li>
  181. <li>
  182. <a href="#combiner">Combiner</a>
  183. <ul class="minitoc">
  184. <li>
  185. <a href="#When+the+Combiner+is+Used">When the Combiner is Used</a>
  186. </li>
  187. <li>
  188. <a href="#When+the+Combiner+is+Not+Used">When the Combiner is Not Used</a>
  189. </li>
  190. </ul>
  191. </li>
  192. <li>
  193. <a href="#hash-based-aggregation">Hash-based Aggregation in Map Task</a>
  194. </li>
  195. <li>
  196. <a href="#memory-management">Memory Management</a>
  197. </li>
  198. <li>
  199. <a href="#reducer-estimation">Reducer Estimation</a>
  200. </li>
  201. <li>
  202. <a href="#multi-query-execution">Multi-Query Execution</a>
  203. <ul class="minitoc">
  204. <li>
  205. <a href="#Turning+it+On+or+Off">Turning it On or Off</a>
  206. </li>
  207. <li>
  208. <a href="#How+it+Works">How it Works</a>
  209. </li>
  210. <li>
  211. <a href="#store-dump">Store vs. Dump</a>
  212. </li>
  213. <li>
  214. <a href="#error-handling">Error Handling</a>
  215. </li>
  216. <li>
  217. <a href="#backward-compatibility">Backward Compatibility</a>
  218. </li>
  219. <li>
  220. <a href="#Implicit-Dependencies">Implicit Dependencies</a>
  221. </li>
  222. </ul>
  223. </li>
  224. <li>
  225. <a href="#optimization-rules">Optimization Rules</a>
  226. <ul class="minitoc">
  227. <li>
  228. <a href="#FilterLogicExpressionSimplifier">FilterLogicExpressionSimplifier</a>
  229. </li>
  230. <li>
  231. <a href="#SplitFilter">SplitFilter</a>
  232. </li>
  233. <li>
  234. <a href="#PushUpFilter">PushUpFilter</a>
  235. </li>
  236. <li>
  237. <a href="#MergeFilter">MergeFilter</a>
  238. </li>
  239. <li>
  240. <a href="#PushDownForEachFlatten">PushDownForEachFlatten</a>
  241. </li>
  242. <li>
  243. <a href="#LimitOptimizer">LimitOptimizer</a>
  244. </li>
  245. <li>
  246. <a href="#ColumnMapKeyPrune">ColumnMapKeyPrune</a>
  247. </li>
  248. <li>
  249. <a href="#AddForEach">AddForEach</a>
  250. </li>
  251. <li>
  252. <a href="#MergeForEach">MergeForEach</a>
  253. </li>
  254. <li>
  255. <a href="#GroupByConstParallelSetter">GroupByConstParallelSetter</a>
  256. </li>
  257. </ul>
  258. </li>
  259. <li>
  260. <a href="#performance-enhancers">Performance Enhancers</a>
  261. <ul class="minitoc">
  262. <li>
  263. <a href="#Use+Optimization">Use Optimization</a>
  264. </li>
  265. <li>
  266. <a href="#types">Use Types</a>
  267. </li>
  268. <li>
  269. <a href="#projection">Project Early and Often </a>
  270. </li>
  271. <li>
  272. <a href="#filter">Filter Early and Often</a>
  273. </li>
  274. <li>
  275. <a href="#pipeline">Reduce Your Operator Pipeline</a>
  276. </li>
  277. <li>
  278. <a href="#algebraic-interface">Make Your UDFs Algebraic</a>
  279. </li>
  280. <li>
  281. <a href="#accumulator-interface">Use the Accumulator Interface</a>
  282. </li>
  283. <li>
  284. <a href="#nulls">Drop Nulls Before a Join</a>
  285. </li>
  286. <li>
  287. <a href="#join-optimizations">Take Advantage of Join Optimizations</a>
  288. </li>
  289. <li>
  290. <a href="#parallel">Use the Parallel Features</a>
  291. </li>
  292. <li>
  293. <a href="#limit">Use the LIMIT Operator</a>
  294. </li>
  295. <li>
  296. <a href="#distinct">Prefer DISTINCT over GROUP BY/GENERATE</a>
  297. </li>
  298. <li>
  299. <a href="#compression">Compress the Results of Intermediate Jobs</a>
  300. </li>
  301. <li>
  302. <a href="#combine-files">Combine Small Input Files</a>
  303. </li>
  304. </ul>
  305. </li>
  306. <li>
  307. <a href="#specialized-joins">Specialized Joins</a>
  308. <ul class="minitoc">
  309. <li>
  310. <a href="#replicated-joins">Replicated Joins</a>
  311. </li>
  312. <li>
  313. <a href="#skewed-joins">Skewed Joins</a>
  314. </li>
  315. <li>
  316. <a href="#merge-joins">Merge Joins</a>
  317. </li>
  318. <li>
  319. <a href="#merge-sparse-joins">Merge-Sparse Joins</a>
  320. </li>
  321. <li>
  322. <a href="#specialized-joins-performance">Performance Considerations</a>
  323. </li>
  324. </ul>
  325. </li>
  326. </ul>
  327. </div>
  328. </div>
  329. <a name="profiling"></a>
  330. <h2 class="h3">Timing your UDFs</h2>
  331. <div class="section">
  332. <p>The first step to improving performance and efficiency is measuring where the time is going. Pig provides a lightweight method for approximately measuring how much time is spent in different user-defined functions (UDFs) and Loaders. Simply set the pig.udf.profile property to true. This will cause new counters to be tracked for all Map-Reduce jobs generated by your script: approx_microsecs measures the approximate amount of time spent in a UDF, and approx_invocations measures the approximate number of times the UDF was invoked. Note that this may produce a large number of counters (two per UDF). An excessive number of counters can lead to poor JobTracker performance, so use this feature carefully, and preferably on a test cluster.</p>
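<p>As a minimal sketch, the property can be passed with the -D option when launching a script. This assumes your Pig launcher forwards -D properties to the generated jobs; myscript.pig is a placeholder name.</p>
<pre class="code">
# Assumption: -D passes the property through to Pig; myscript.pig is a placeholder.
$ pig -Dpig.udf.profile=true myscript.pig
</pre>
<p>The approx_microsecs and approx_invocations counters should then appear alongside the other job counters.</p>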
  333. </div>
  334. <!-- ================================================================== -->
  335. <!-- COMBINER -->
  336. <a name="combiner"></a>
  337. <h2 class="h3">Combiner</h2>
  338. <div class="section">
  339. <p>The Pig combiner is an optimizer that is invoked when the statements in your scripts are arranged in certain ways. The examples below demonstrate when the combiner is used and not used. Whenever possible, make sure the combiner is used as it frequently yields an order of magnitude improvement in performance. </p>
  340. <a name="When+the+Combiner+is+Used"></a>
  341. <h3 class="h4">When the Combiner is Used</h3>
  342. <p>The combiner is generally used in the case of non-nested foreach where all projections are either expressions on the group column or expressions on algebraic UDFs (see <a href="#algebraic-interface">Make Your UDFs Algebraic</a>).</p>
  343. <p>Example:</p>
  344. <pre class="code">
  345. A = load 'studenttab10k' as (name, age, gpa);
  346. B = group A by age;
  347. C = foreach B generate ABS(SUM(A.gpa)), COUNT(org.apache.pig.builtin.Distinct(A.name)), (MIN(A.gpa) + MAX(A.gpa))/2, group.age;
  348. explain C;
  349. </pre>
  350. <p></p>
  351. <p>In the above example:</p>
  352. <ul>
  353. <li>The GROUP statement can be referred to as a whole or by accessing individual fields (as in the example). </li>
  354. <li>The GROUP statement and its elements can appear anywhere in the projection. </li>
  355. </ul>
  356. <p>In the above example, a variety of expressions can be applied to algebraic functions including:</p>
  357. <ul>
  358. <li>A column transformation function such as ABS can be applied to an algebraic function SUM.</li>
  359. <li>An algebraic function (COUNT) can be applied to another algebraic function (Distinct), but only the inner function is computed using the combiner. </li>
  360. <li>A mathematical expression can be applied to one or more algebraic functions. </li>
  361. </ul>
  362. <p></p>
  363. <p>You can check if the combiner is used for your query by running <a href="test.html#EXPLAIN">EXPLAIN</a> on the FOREACH alias as shown above. You should see the combine section in the MapReduce part of the plan:</p>
  364. <pre class="code">
  365. .....
  366. Combine Plan
  367. B: Local Rearrange[tuple]{bytearray}(false) - scope-42
  368. | |
  369. | Project[bytearray][0] - scope-43
  370. |
  371. |---C: New For Each(false,false,false)[bag] - scope-28
  372. | |
  373. | Project[bytearray][0] - scope-29
  374. | |
  375. | POUserFunc(org.apache.pig.builtin.SUM$Intermediate)[tuple] - scope-30
  376. | |
  377. | |---Project[bag][1] - scope-31
  378. | |
  379. | POUserFunc(org.apache.pig.builtin.Distinct$Intermediate)[tuple] - scope-32
  380. | |
  381. | |---Project[bag][2] - scope-33
  382. |
  383. |---POCombinerPackage[tuple]{bytearray} - scope-36--------
  384. .....
  385. </pre>
  386. <p>The combiner is also used with a nested foreach as long as the only nested operation used is DISTINCT
  387. (see <a href="basic.html#FOREACH">FOREACH</a> and <a href="basic.html#nestedblock">Example: Nested Block</a>).
  388. </p>
  389. <pre class="code">
  390. A = load 'studenttab10k' as (name, age, gpa);
  391. B = group A by age;
  392. C = foreach B { D = distinct (A.name); generate group, COUNT(D);}
  393. </pre>
  394. <p></p>
  395. <p>Finally, use of the combiner is influenced by the surrounding environment of the GROUP and FOREACH statements.</p>
  396. <a name="When+the+Combiner+is+Not+Used"></a>
  397. <h3 class="h4">When the Combiner is Not Used</h3>
  398. <p>The combiner is generally not used if there is any operator that comes between the GROUP and FOREACH statements in the execution plan. Even if the statements are next to each other in your script, the optimizer might rearrange them. In this example, the optimizer will push FILTER above FOREACH which will prevent the use of the combiner:</p>
  399. <pre class="code">
  400. A = load 'studenttab10k' as (name, age, gpa);
  401. B = group A by age;
  402. C = foreach B generate group, COUNT (A);
  403. D = filter C by group.age &lt;30;
  404. </pre>
  405. <p></p>
  406. <p>Please note that the script above can be made more efficient by performing filtering before the GROUP statement:</p>
  407. <pre class="code">
  408. A = load 'studenttab10k' as (name, age, gpa);
  409. B = filter A by age &lt;30;
  410. C = group B by age;
  411. D = foreach C generate group, COUNT (B);
  412. </pre>
  413. <p></p>
  414. <p>
  415. <strong>Note:</strong> One exception to the above rule is LIMIT. Starting with Pig 0.9, even if LIMIT comes between GROUP and FOREACH, the combiner will still be used. In this example, the optimizer will push LIMIT above FOREACH but this will not prevent the use of the combiner.</p>
  416. <pre class="code">
  417. A = load 'studenttab10k' as (name, age, gpa);
  418. B = group A by age;
  419. C = foreach B generate group, COUNT (A);
  420. D = limit C 20;
  421. </pre>
  422. <p></p>
  423. <p>The combiner is also not used in the case where multiple FOREACH statements are associated with the same GROUP:</p>
  424. <pre class="code">
  425. A = load 'studenttab10k' as (name, age, gpa);
  426. B = group A by age;
  427. C = foreach B generate group, COUNT (A);
  428. D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
  429. .....
  430. </pre>
  431. <p>Depending on your use case, it might be more efficient (improve performance) to split your script into multiple scripts.</p>
  432. </div>
  433. <!-- ================================================================== -->
  434. <!-- HASH-BASED AGGREGATION IN MAP TASK-->
  435. <a name="hash-based-aggregation"></a>
  436. <h2 class="h3">Hash-based Aggregation in Map Task</h2>
  437. <div class="section">
  438. <p> To improve performance, hash-based aggregation will aggregate records in the map task before sending them to the combiner. This optimization reduces the serializing/deserializing costs of the combiner by sending it fewer records.</p>
  439. <p>
  440. <strong>Turning On or Off</strong>
  441. </p>
  442. <p>Hash-based aggregation has been shown to improve the speed of group-by operations by up to 50%. However, since this is a very new feature, it is currently turned OFF by default. To turn it ON, set the property pig.exec.mapPartAgg to true.</p>
  443. <p>
  444. <strong>Configuring</strong>
  445. </p>
  446. <p>If the group-by keys used for grouping don't result in a sufficient reduction in the number of records, the performance might be worse with this feature turned ON. To prevent this from happening, the feature turns itself off if the reduction in records sent to combiner is not more than a configurable threshold. This threshold can be set using the property pig.exec.mapPartAgg.minReduction. It is set to a default value of 10, which means that the number of records that get sent to the combiner should be reduced by a factor of 10 or more.</p>
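<p>As an illustration, both properties described above can be supplied when launching a script. This is a sketch that assumes -D properties are forwarded to the job; myscript.pig is a placeholder, and the threshold shown is simply the default value.</p>
<pre class="code">
# Assumption: -D passes properties through to Pig; myscript.pig is a placeholder.
# Turn on hash-based aggregation and state the default reduction threshold explicitly.
$ pig -Dpig.exec.mapPartAgg=true -Dpig.exec.mapPartAgg.minReduction=10 myscript.pig
</pre>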
  447. </div>
  448. <!-- ================================================================== -->
  449. <!-- MEMORY MANAGEMENT -->
  450. <a name="memory-management"></a>
  451. <h2 class="h3">Memory Management</h2>
  452. <div class="section">
  453. <p>Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
  454. <a name="memory-bags"></a>
  455. <p id="memory-bags">The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 20% (0.2) of available memory. Note that this memory is shared across all large bags used by the application.</p>
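<p>For example, the share of memory given to bags could be lowered from the default as sketched below. The value 0.1 is purely illustrative, myscript.pig is a placeholder, and passing the property with -D is an assumption about how your launcher is configured.</p>
<pre class="code">
# Illustrative value only: limit bags to 10% of available memory instead of the default 20%.
$ pig -Dpig.cachedbag.memusage=0.1 myscript.pig
</pre>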
  456. </div>
  457. <!-- ================================================================== -->
  458. <!-- REDUCER ESTIMATION -->
  459. <a name="reducer-estimation"></a>
  460. <h2 class="h3">Reducer Estimation</h2>
  461. <div class="section">
  462. <p>
  463. By default Pig determines the number of reducers to use for a given job based on the size of the
  464. input to the map phase. The input data size is divided by the
  465. pig.exec.reducers.bytes.per.reducer parameter value (default 1GB) to determine the number of
  466. reducers. The maximum number of reducers for a job is limited by the pig.exec.reducers.max parameter
  467. (default 999).
  468. </p>
  469. <p>
  470. The default reducer estimation algorithm described above can be overridden by setting the
  471. pig.exec.reducer.estimator parameter to the fully qualified class name of an implementation of
  472. <a href="http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigReducerEstimator.java">org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigReducerEstimator</a>.
  473. The class must exist on the classpath of the process submitting the Pig job. If the
  474. pig.exec.reducer.estimator.arg parameter is set, the value will be passed to a constructor
  475. of the implementing class that takes a single String.
  476. </p>
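<p>The parameters above might be set at launch time as in the following sketch. The byte and reducer counts are illustrative values, com.example.MyReducerEstimator is a hypothetical class name standing in for your own PigReducerEstimator implementation, and myarg is a hypothetical constructor argument.</p>
<pre class="code">
# Illustrative values: about 500 MB of input per reducer, at most 200 reducers.
$ pig -Dpig.exec.reducers.bytes.per.reducer=500000000 -Dpig.exec.reducers.max=200 myscript.pig

# Hypothetical estimator class and argument; the class must be on the submitting process's classpath.
$ pig -Dpig.exec.reducer.estimator=com.example.MyReducerEstimator -Dpig.exec.reducer.estimator.arg=myarg myscript.pig
</pre>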
  477. <p>
  478. </p>
  479. </div>
  480. <!-- ==================================================================== -->
  481. <!-- MULTI-QUERY EXECUTION-->
  482. <a name="multi-query-execution"></a>
  483. <h2 class="h3">Multi-Query Execution</h2>
  484. <div class="section">
  485. <p>With multi-query execution Pig processes an entire script or a batch of statements at once.</p>
  486. <a name="Turning+it+On+or+Off"></a>
  487. <h3 class="h4">Turning it On or Off</h3>
  488. <p>Multi-query execution is turned on by default.
  489. To turn it off and revert to Pig's "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
  490. <p>To run script "myscript.pig" without the optimization, execute Pig as follows: </p>
  491. <pre class="code">
  492. $ pig -M myscript.pig
  493. or
  494. $ pig -no_multiquery myscript.pig
  495. </pre>
  496. <a name="How+it+Works"></a>
  497. <h3 class="h4">How it Works</h3>
  498. <p>Multi-query execution introduces some changes:</p>
  499. <ul>
  500. <li>
  501. <p>For batch mode execution, the entire script is first parsed to determine if intermediate tasks
  502. can be combined to reduce the overall amount of work that needs to be done; execution starts only after the parsing is completed
  503. (see the <a href="test.html#EXPLAIN">EXPLAIN</a> operator and the <a href="cmds.html#run">run</a> and <a href="cmds.html#exec">exec</a> commands). </p>
  504. </li>
  505. <li>
  506. <p>Two run scenarios are optimized, as explained below: explicit and implicit splits, and storing intermediate results.</p>
  507. </li>
  508. </ul>
  509. <a name="splits"></a>
  510. <h4>Explicit and Implicit Splits</h4>
  511. <p>There might be cases in which you want different processing on separate parts of the same data stream.</p>
  512. <p>Example 1:</p>
  513. <pre class="code">
  514. A = LOAD ...
  515. ...
  516. SPLIT A' INTO B IF ..., C IF ...
  517. ...
  518. STORE B' ...
  519. STORE C' ...
  520. </pre>
  521. <p>Example 2:</p>
  522. <pre class="code">
  523. A = LOAD ...
  524. ...
  525. B = FILTER A' ...
  526. C = FILTER A' ...
  527. ...
  528. STORE B' ...
  529. STORE C' ...
  530. </pre>
  531. <p>In prior Pig releases, Example 1 would dump A' to disk and then start jobs for B' and C'.
  532. Example 2 would execute all the dependencies of B' and store it, and then execute all the dependencies of C' and store it.
  533. Both produce equivalent results, but their performance differs. </p>
  534. <p>Here's what the multi-query execution does to increase the performance: </p>
  535. <ul>
  536. <li>
  537. <p>For Example 2, adds an implicit split to transform the query to Example 1.
  538. This eliminates the processing of A' multiple times.</p>
  539. </li>
  540. <li>
  541. <p>Makes the split non-blocking and allows processing to continue.
  542. This helps reduce the amount of data that has to be stored right at the split. </p>
  543. </li>
  544. <li>
  545. <p>Allows multiple outputs from a job. This way some results can be stored as a side-effect of the main job.
  546. This is also necessary to make the previous item work. </p>
  547. </li>
  548. <li>
  549. <p>Allows multiple split branches to be carried on to the combiner/reducer.
  550. This reduces the amount of IO again in the case where multiple branches in the split can benefit from a combiner run. </p>
  551. </li>
  552. </ul>
  553. <a name="data-store-performance"></a>
  554. <h4>Storing Intermediate Results</h4>
  555. <p>Sometimes it is necessary to store intermediate results. </p>
  556. <pre class="code">
  557. A = LOAD ...
  558. ...
  559. STORE A'
  560. ...
  561. STORE A''
  562. </pre>
  563. <p>If the script doesn't re-load A' for the processing of A, the steps above A' will be duplicated.
  564. This is a special case of Example 2 above, so the same steps are recommended.
  565. With multi-query execution, the script will process A and dump A' as a side-effect.</p>
  566. <a name="store-dump"></a>
  567. <h3 class="h4">Store vs. Dump</h3>
  568. <p>With multi-query execution, you want to use <a href="basic.html#STORE">STORE</a> to save (persist) your results.
  569. You do not want to use <a href="test.html#DUMP">DUMP</a> as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.) </p>
  570. <p>DUMP Example: In this script, because the DUMP command is interactive, the multi-query execution will be disabled and two separate jobs will be created to execute this script. The first job will execute A &gt; B &gt; DUMP while the second job will execute A &gt; B &gt; C &gt; STORE.</p>
  571. <pre class="code">
  572. A = LOAD 'input' AS (x, y, z);
  573. B = FILTER A BY x &gt; 5;
  574. DUMP B;
  575. C = FOREACH B GENERATE y, z;
  576. STORE C INTO 'output';
  577. </pre>
  578. <p>STORE Example: In this script, multi-query optimization will kick in allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
  579. <pre class="code">
  580. A = LOAD 'input' AS (x, y, z);
  581. B = FILTER A BY x &gt; 5;
  582. STORE B INTO 'output1';
  583. C = FOREACH B GENERATE y, z;
  584. STORE C INTO 'output2';
  585. </pre>
  586. <a name="error-handling"></a>
  587. <h3 class="h4">Error Handling</h3>
  588. <p>With multi-query execution Pig processes an entire script or a batch of statements at once.
  589. By default Pig tries to run all the jobs that result from that, regardless of whether some jobs fail during execution.
  590. To check which jobs have succeeded or failed use one of these options. </p>
  591. <p>First, Pig logs all successful and failed store commands. Store commands are identified by output path.
  592. At the end of execution a summary line indicates success, partial failure or failure of all store commands. </p>
  593. <p>Second, Pig returns a different return code upon completion for each of these scenarios:</p>
  594. <ul>
  595. <li>
  596. <p>Return code 0: All jobs succeeded</p>
  597. </li>
  598. <li>
  599. <p>Return code 1: <em>Used for retrievable errors</em>
  600. </p>
  601. </li>
  602. <li>
  603. <p>Return code 2: All jobs have failed </p>
  604. </li>
  605. <li>
  606. <p>Return code 3: Some jobs have failed </p>
  607. </li>
  608. </ul>
  609. <p></p>
  610. <p>In some cases it might be desirable to fail the entire script upon detecting the first failed job.
  611. This can be achieved with the "-F" or "-stop_on_failure" command line flag.
  612. If used, Pig will stop execution when the first failed job is detected and discontinue further processing.
  613. This also means that file commands that come after a failed store in the script will not be executed (this can be used to create "done" files). </p>
  614. <p>This is how the flag is used: </p>
  615. <pre class="code">
  616. $ pig -F myscript.pig
  617. or
  618. $ pig -stop_on_failure myscript.pig
  619. </pre>
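<p>A wrapper script can branch on the return codes listed above. The helper below is a hypothetical sketch, not part of Pig: it only maps a return code to the meaning given in this section.</p>
<pre class="code">
# Hypothetical helper: translate a Pig return code into the meaning listed above.
describe_pig_status() {
  case "$1" in
    0) echo "all jobs succeeded" ;;
    1) echo "retrievable error" ;;
    2) echo "all jobs failed" ;;
    3) echo "some jobs failed" ;;
    *) echo "unexpected return code: $1" ;;
  esac
}
# Usage (requires a Pig installation; myscript.pig is a placeholder):
#   pig myscript.pig; describe_pig_status $?
</pre>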
  620. <a name="backward-compatibility"></a>
  621. <h3 class="h4">Backward Compatibility</h3>
  622. <p>Most existing Pig scripts will produce the same result with or without the multi-query execution.
  623. There are cases though where this is not true. Path names and schemes are discussed here.</p>
  624. <p>Any script is parsed in its entirety before it is sent to execution. Since the current directory can change
  625. throughout the script, any path used in a LOAD or STORE statement is translated to a fully qualified and absolute path.</p>
  626. <p>In map-reduce mode, the following script will load from "hdfs://&lt;host&gt;:&lt;port&gt;/data1" and store into "hdfs://&lt;host&gt;:&lt;port&gt;/tmp/out1". </p>
  627. <pre class="code">
  628. cd /;
  629. A = LOAD 'data1';
  630. cd tmp;
  631. STORE A INTO 'out1';
  632. </pre>
  633. <p>These expanded paths will be passed to any LoadFunc or Slicer implementation.
  634. In some cases this can cause problems, especially when a LoadFunc/Slicer is not used to read from a dfs file or path
  635. (for example, loading from an SQL database). </p>
  636. <p>Solutions are to either: </p>
  637. <ul>
  638. <li>
  639. <p>Specify "-M" or "-no_multiquery" to revert to the old names</p>
  640. </li>
  641. <li>
  642. <p>Specify a custom scheme for the LoadFunc/Slicer </p>
  643. </li>
  644. </ul>
  645. <p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded and will be passed to the LoadFunc/Slicer unchanged.</p>
  646. <p>In the SQL case, the SQLLoader function is invoked with 'sql://mytable'. </p>
  647. <pre class="code">
  648. A = LOAD 'sql://mytable' USING SQLLoader();
  649. </pre>
  650. <a name="Implicit-Dependencies"></a>
  651. <h3 class="h4">Implicit Dependencies</h3>
  652. <p>If a script has dependencies on the execution order outside of what Pig knows about, execution may fail. </p>
  653. <a name="Example"></a>
  654. <h4>Example</h4>
  655. <p>In this script, MYUDF might try to read from out1, a file that A was just stored into.
  656. However, Pig does not know that MYUDF depends on the out1 file and might submit the jobs
  657. producing the out2 and out1 files at the same time.</p>
  658. <pre class="code">
  659. ...
  660. STORE A INTO 'out1';
  661. B = LOAD 'data2';
  662. C = FOREACH B GENERATE MYUDF($0,'out1');
  663. STORE C INTO 'out2';
  664. </pre>
  665. <p>To make the script work (to ensure that the right execution order is enforced) add the exec statement.
  666. The exec statement will trigger the execution of the statements that produce the out1 file. </p>
  667. <pre class="code">
  668. ...
  669. STORE A INTO 'out1';
  670. EXEC;
  671. B = LOAD 'data2';
  672. C = FOREACH B GENERATE MYUDF($0,'out1');
  673. STORE C INTO 'out2';
  674. </pre>
  675. <a name="Example-N10217"></a>
  676. <h4>Example</h4>
  677. <p>In this script, the STORE/LOAD operators have different file paths; however, the LOAD operator depends on the STORE operator.</p>
  678. <pre class="code">
  679. A = LOAD '/user/xxx/firstinput' USING PigStorage();
  680. B = group ....
  681. C = .... aggregation function
  682. STORE C INTO '/user/vxj/firstinputtempresult/days1';
  683. ..
  684. Atab = LOAD '/user/xxx/secondinput' USING PigStorage();
  685. Btab = group ....
  686. Ctab = .... aggregation function
  687. STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
  688. ..
  689. E = LOAD '/user/vxj/firstinputtempresult/' USING PigStorage();
  690. F = group ....
  691. G = .... aggregation function
  692. STORE G INTO '/user/vxj/finalresult1';
  693. Etab =LOAD '/user/vxj/secondinputtempresult/' USING PigStorage();
  694. Ftab = group ....
  695. Gtab = .... aggregation function
  696. STORE Gtab INTO '/user/vxj/finalresult2';
  697. </pre>
  698. <p>To make the script work, add the exec statement. </p>
  699. <pre class="code">
  700. A = LOAD '/user/xxx/firstinput' USING PigStorage();
  701. B = group ....
  702. C = .... aggregation function
  703. STORE C INTO '/user/vxj/firstinputtempresult/days1';
  704. ..
  705. Atab = LOAD '/user/xxx/secondinput' USING PigStorage();
  706. Btab = group ....
  707. Ctab = .... aggregation function
  708. STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
  709. EXEC;
  710. E = LOAD '/user/vxj/firstinputtempresult/' USING PigStorage();
  711. F = group ....
  712. G = .... aggregation function
  713. STORE G INTO '/user/vxj/finalresult1';
  714. ..
  715. Etab = LOAD '/user/vxj/secondinputtempresult/' USING PigStorage();
  716. Ftab = group ....
  717. Gtab = .... aggregation function
  718. STORE Gtab INTO '/user/vxj/finalresult2';
  719. </pre>
  720. </div>
  721. <!-- ==================================================================== -->
  722. <!-- OPTIMIZATION RULES -->
  723. <a name="optimization-rules"></a>
  724. <h2 class="h3">Optimization Rules</h2>
  725. <div class="section">
  726. <p>Pig supports various optimization rules. By default, optimization and all optimization rules are turned on.
  727. To turn off optimization, use:</p>
  728. <pre class="code">
  729. pig -optimizer_off [opt_rule | all ]
  730. </pre>
  731. <p>Note that some rules are mandatory and cannot be turned off.</p>
  732. <a name="FilterLogicExpressionSimplifier"></a>
  733. <h3 class="h4">FilterLogicExpressionSimplifier</h3>
  734. <p>This rule simplifies the expression in a filter statement.</p>
  735. <pre class="code">
  736. 1) Constant pre-calculation
  737. B = FILTER A BY a0 &gt; 5+7;
  738. is simplified to
  739. B = FILTER A BY a0 &gt; 12;
  740. 2) Elimination of negations
  741. B = FILTER A BY NOT (NOT(a0 &gt; 5) OR a &gt; 10);
  742. is simplified to
  743. B = FILTER A BY a0 &gt; 5 AND a &lt;= 10;
  744. 3) Elimination of logical implied expression in AND
  745. B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 7);
  746. is simplified to
  747. B = FILTER A BY a0 &gt; 7;
  748. 4) Elimination of logical implied expression in OR
  749. B = FILTER A BY ((a0 &gt; 5) OR (a0 &gt; 6 AND a1 &gt; 15));
  750. is simplified to
  751. B = FILTER A BY a0 &gt; 5;
  752. 5) Equivalence elimination
  753. B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 5);
  754. is simplified to
  755. B = FILTER A BY a0 &gt; 5;
  756. 6) Elimination of complementary expressions in OR
  757. B = FILTER A BY (a0 &gt; 5 OR a0 &lt;= 5);
  758. is simplified to non-filtering
  759. 7) Elimination of naive TRUE expression
  760. B = FILTER A BY 1==1;
  761. is simplified to non-filtering
  762. </pre>
  763. <a name="SplitFilter"></a>
  764. <h3 class="h4">SplitFilter</h3>
  765. <p>Splits filter conditions so that each condition can be pushed up more aggressively.</p>
  766. <pre class="code">
  767. A = LOAD 'input1' as (a0, a1);
  768. B = LOAD 'input2' as (b0, b1);
  769. C = JOIN A by a0, B by b0;
  770. D = FILTER C BY a1&gt;0 and b1&gt;0;
  771. </pre>
  772. <p>Here D will be split into:</p>
  773. <pre class="code">
  774. X = FILTER C BY a1&gt;0;
  775. D = FILTER X BY b1&gt;0;
  776. </pre>
  777. <p>So "a1&gt;0" and "b1&gt;0" can be pushed up individually.</p>
  778. <a name="PushUpFilter"></a>
  779. <h3 class="h4">PushUpFilter</h3>
  780. <p>The objective of this rule is to push the FILTER operators up the data flow graph. As a result, the number of records that flow through the pipeline is reduced. </p>
  781. <pre class="code">
  782. A = LOAD 'input';
  783. B = GROUP A BY $0;
  784. C = FILTER B BY $0 &lt; 10;
  785. </pre>
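<p>As a rough sketch of the effect (not the optimizer's literal output), the script above behaves as if the filter had been written before the GROUP. This assumes the filter condition refers only to the group key, so that it can be evaluated on the input records; the alias <span class="codefrag">A1</span> is hypothetical.</p>
<pre class="code">
A = LOAD 'input';
A1 = FILTER A BY $0 &lt; 10;  -- filter applied before grouping
C = GROUP A1 BY $0;         -- only the surviving records are grouped
</pre>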
  786. <a name="MergeFilter"></a>
  787. <h3 class="h4">MergeFilter</h3>
  788. <p>Merges filter conditions after the PushUpFilter rule has run, to decrease the number of filter statements.</p>
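<p>A minimal sketch of the idea, assuming two consecutive filters over a hypothetical input with fields <span class="codefrag">a0</span> and <span class="codefrag">a1</span>; the merged form is an illustration of the equivalent script, not the optimizer's literal output.</p>
<pre class="code">
-- Original code:
A = LOAD 'input' AS (a0, a1);
B = FILTER A BY a0 &gt; 0;
C = FILTER B BY a1 &gt; 0;
-- Equivalent merged form:
A = LOAD 'input' AS (a0, a1);
C = FILTER A BY (a0 &gt; 0) AND (a1 &gt; 0);
</pre>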
  789. <a name="PushDownForEachFlatten"></a>
  790. <h3 class="h4">PushDownForEachFlatten</h3>
  791. <p>The objective of this rule is to reduce the number of records that flow through the pipeline by moving FOREACH operators with a FLATTEN down the data flow graph. In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation.</p>
  792. <pre class="code">
  793. A = LOAD 'input' AS (a, b, c);
  794. B = LOAD 'input2' AS (x, y, z);
  795. C = FOREACH A GENERATE FLATTEN($0), b, c;
  796. D = JOIN C BY $1, B BY $1;
  797. </pre>
  798. <a name="LimitOptimizer"></a>
  799. <h3 class="h4">LimitOptimizer</h3>
  800. <p>The objective of this rule is to push the LIMIT operator up the data flow graph (or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY.</p>
  801. <pre class="code">
  802. A = LOAD 'input';
  803. B = ORDER A BY $0;
  804. C = LIMIT B 10;
  805. </pre>
  806. <a name="ColumnMapKeyPrune"></a>
  807. <h3 class="h4">ColumnMapKeyPrune</h3>
  808. <p>Prunes the loader so that it loads only the necessary columns. The performance gain is more significant if the corresponding loader supports column pruning and loads only the necessary columns (see LoadPushDown.pushProjection). Otherwise, ColumnMapKeyPrune will insert a ForEach statement right after the loader.</p>
  809. <pre class="code">
  810. A = load 'input' as (a0, a1, a2);
  811. B = ORDER A by a0;
  812. C = FOREACH B GENERATE a0, a1;
  813. </pre>
  814. <p>a2 is irrelevant in this query, so we can prune it earlier. The loader in this query is PigStorage and it supports column pruning. So we only load a0 and a1 from the input file.</p>
  815. <p>ColumnMapKeyPrune also prunes unused map keys:</p>
  816. <pre class="code">
  817. A = load 'input' as (a0:map[]);
  818. B = FOREACH A generate a0#'key1';
  819. </pre>
  820. <a name="AddForEach"></a>
  821. <h3 class="h4">AddForEach</h3>
  822. <p>Prunes unused columns as soon as possible. In addition to pruning the loader in ColumnMapKeyPrune, we can prune a column as soon as it is no longer used in the rest of the script.</p>
  823. <pre class="code">
  824. -- Original code:
  825. A = LOAD 'input' AS (a0, a1, a2);
  826. B = ORDER A BY a0;
  827. C = FILTER B BY a1&gt;0;
  828. </pre>
  829. <p>We can only prune a2 from the loader. However, a0 is never used after the ORDER BY statement, so we can drop a0 right after it.</p>
  830. <pre class="code">
  831. -- Optimized code:
  832. A = LOAD 'input' AS (a0, a1, a2);
  833. B = ORDER A BY a0;
  834. B1 = FOREACH B GENERATE a1; -- drop a0
  835. C = FILTER B1 BY a1&gt;0;
  836. </pre>
  837. <a name="MergeForEach"></a>
  838. <h3 class="h4">MergeForEach</h3>
  839. <p>The objective of this rule is to merge two foreach statements, if these preconditions are met:</p>
  840. <ul>
  841. <li>The foreach statements are consecutive.</li>
  842. <li>The first foreach statement does not contain flatten.</li>
  843. <li>The second foreach is not nested.</li>
  844. </ul>
  845. <pre class="code">
  846. -- Original code:
  847. A = LOAD 'file.txt' AS (a, b, c);
  848. B = FOREACH A GENERATE a+b AS u, c-b AS v;
  849. C = FOREACH B GENERATE $0+5, v;
  850. -- Optimized code:
  851. A = LOAD 'file.txt' AS (a, b, c);
  852. C = FOREACH A GENERATE a+b+5, c-b;
  853. </pre>
  854. <a name="GroupByConstParallelSetter"></a>
  855. <h3 class="h4">GroupByConstParallelSetter</h3>
  856. <p>Forces parallel "1" for a "group all" statement. Even if parallel is set to N, only one reducer is used in this case and all other reducers produce empty results.</p>
  857. <pre class="code">
  858. A = LOAD 'input';
  859. B = GROUP A all PARALLEL 10;
  860. </pre>
  861. </div>
  862. <!-- ==================================================================== -->
  863. <!-- PERFORMANCE ENHANCERS-->
  864. <a name="performance-enhancers"></a>
  865. <h2 class="h3">Performance Enhancers</h2>
  866. <div class="section">
  867. <a name="Use+Optimization"></a>
  868. <h3 class="h4">Use Optimization</h3>
  869. <p>Pig supports various <a href="perf.html#optimization-rules">optimization rules</a>, which are turned on by default.
  870. Become familiar with these rules.</p>
  871. <a name="types"></a>
  872. <h3 class="h4">Use Types</h3>
  873. <p>If types are not specified in the load statement, Pig assumes the type <span class="codefrag">double</span> for numeric computations.
  874. Much of the time your data fits a smaller type, such as integer or long. Specifying the real type speeds up
  875. arithmetic computation and has the additional advantage of early error detection. </p>
  876. <pre class="code">
  877. --Query 1
  878. A = load 'myfile' as (t, u, v);
  879. B = foreach A generate t + u;
  880. --Query 2
  881. A = load 'myfile' as (t: int, u: int, v);
  882. B = foreach A generate t + u;
  883. </pre>
  884. <p>The second query will run more efficiently than the first. In some of our queries we see a 2x speedup. </p>
  885. <a name="projection"></a>
  886. <h3 class="h4">Project Early and Often </h3>
  887. <p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p>
  888. <pre class="code">
  889. A = load 'myfile' as (t, u, v);
  890. B = load 'myotherfile' as (x, y, z);
  891. C = join A by t, B by x;
  892. D = group C by u;
  893. E = foreach D generate group, COUNT($1);
  894. </pre>
  895. <p>There is no need for v, y, or z to participate in this query. There is also no need to carry both t and x past the join; one of them will suffice. Changing the query above to the query below will greatly reduce the amount of data that Pig carries through the map and reduce phases. </p>
  896. <pre class="code">
  897. A = load 'myfile' as (t, u, v);
  898. A1 = foreach A generate t, u;
  899. B = load 'myotherfile' as (x, y, z);
  900. B1 = foreach B generate x;
  901. C = join A1 by t, B1 by x;
  902. C1 = foreach C generate t, u;
  903. D = group C1 by u;
  904. E = foreach D generate group, COUNT($1);
  905. </pre>
  906. <p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p>
  907. <a name="filter"></a>
  908. <h3 class="h4">Filter Early and Often</h3>
  909. <p>As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. </p>
  910. <pre class="code">
  911. -- Query 1
  912. A = load 'myfile' as (t, u, v);
  913. B = load 'myotherfile' as (x, y, z);
  914. C = filter A by t == 1;
  915. D = join C by t, B by x;
  916. E = group D by u;
  917. F = foreach E generate group, COUNT($1);
  918. -- Query 2
  919. A = load 'myfile' as (t, u, v);
  920. B = load 'myotherfile' as (x, y, z);
  921. C = join A by t, B by x;
  922. D = filter C by t == 1;
  923. E = group D by u;
  924. F = foreach E generate group, COUNT($1);
  925. </pre>
  926. <p>The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. </p>
  927. <p>One case where pushing filters up might not be a good idea is if the cost of applying filter is very high and only a small amount of data is filtered out. </p>
  928. <a name="pipeline"></a>
  929. <h3 class="h4">Reduce Your Operator Pipeline</h3>
  930. <p>For clarity of your script, you might choose to split your projections into several steps, for instance: </p>
  931. <pre class="code">
  932. A = load 'data' as (in: map[]);
  933. -- get key out of the map
  934. B = foreach A generate in#'k1' as k1, in#'k2' as k2;
  935. -- concatenate the keys
  936. C = foreach B generate CONCAT(k1, k2);
  937. .......
  938. </pre>
  939. <p>While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance: </p>
  940. <pre class="code">
  941. A = load 'data' as (in: map[]);
  942. -- concatenate the keys from the map
  943. B = foreach A generate CONCAT(in#'k1', in#'k2');
  944. ....
  945. </pre>
  946. <p>The same goes for filters. </p>
  947. <a name="algebraic-interface"></a>
  948. <h3 class="h4">Make Your UDFs Algebraic</h3>
  949. <p>Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see <a href="udf.html#algebraic-interface">Algebraic Interface</a>.</p>
  950. <pre class="code">
  951. A = load 'data' as (x, y, z);
  952. B = group A by x;
  953. C = foreach B generate group, MyUDF(A);
  954. ....
  955. </pre>
  956. <p>If <span class="codefrag">MyUDF</span> is algebraic, the query will use the combiner and run much faster. You can run the <span class="codefrag">explain</span> command on your query to make sure that the combiner is used. </p>
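<p>One way to check is sketched below, using the built-in <span class="codefrag">COUNT</span> function (which is algebraic) in place of the hypothetical <span class="codefrag">MyUDF</span>. If the combiner is used, the MapReduce plan printed by <span class="codefrag">explain</span> should include a combine plan for the job; the exact layout of the output depends on the Pig version.</p>
<pre class="code">
A = load 'data' as (x, y, z);
B = group A by x;
C = foreach B generate group, COUNT(A);
-- print the logical, physical, and MapReduce plans for C
explain C;
</pre>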
  957. <a name="accumulator-interface"></a>
  958. <h3 class="h4">Use the Accumulator Interface</h3>
  959. <p>
  960. If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Accumulator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see <a href="udf.html#Accumulator-Interface">Accumulator Interface</a>.</p>
  961. <p>
  962. <strong>Note:</strong> Pig automatically chooses the interface that it expects to provide the best performance: Algebraic &gt; Accumulator &gt; Default. </p>
  963. <a name="nulls"></a>
  964. <h3 class="h4">Drop Nulls Before a Join</h3>
  965. <p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row (and no output), in a standard join the rows with a null key will always be dropped. </p>
  966. <p>This join</p>
  967. <pre class="code">
  968. A = load 'myfile' as (t, u, v);
  969. B = load 'myotherfile' as (x, y, z);
  970. C = join A by t, B by x;
  971. </pre>
  972. <p>is rewritten by Pig to </p>
  973. <pre class="code">
  974. A = load 'myfile' as (t, u, v);
  975. B = load 'myotherfile' as (x, y, z);
  976. C1 = cogroup A by t INNER, B by x INNER;
  977. C = foreach C1 generate flatten(A), flatten(B);
  978. </pre>
  979. <p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. </p>
  980. <p>If the query is rewritten to </p>
  981. <pre class="code">
  982. A = load 'myfile' as (t, u, v);
  983. B = load 'myotherfile' as (x, y, z);
  984. A1 = filter A by t is not null;
  985. B1 = filter B by x is not null;
  986. C = join A1 by t, B1 by x;
  987. </pre>
  988. <p>then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early filters. </p>
  989. <a name="join-optimizations"></a>
  990. <h3 class="h4">Take Advantage of Join Optimizations</h3>
  991. <p>
  992. <strong>Regular Join Optimizations</strong>
  993. </p>
  994. <p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used which means you can avoid spilling the data and also should be able to scale your query to larger data volumes. </p>
  995. <p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query.
  996. In some of our tests we saw 10x performance improvement as the result of this optimization.</p>
  997. <pre class="code">
  998. small = load 'small_file' as (t, u, v);
  999. large = load 'large_file' as (x, y, z);
  1000. C = join small by t, large by x;
  1001. </pre>
  1002. <p>
  1003. <strong>Specialized Join Optimizations</strong>
  1004. </p>
  1005. <p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins.
  1006. For more information see <a href="perf.html#specialized-joins">Specialized Joins</a>.</p>
  1007. <a name="parallel"></a>
  1008. <h3 class="h4">Use the Parallel Features</h3>
  1009. <p>You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features.
  1010. (The parallel features only affect the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.)</p>
  1011. <p>
  1012. <strong>You Set the Number of Reducers</strong>
  1013. </p>
  1014. <p>Use the <a href="cmds.html#set">set default_parallel</a> command to set the number of reducers at the script level.</p>
  1015. <p>Alternatively, use the PARALLEL clause to set the number of reducers at the operator level.
  1016. (In a script, the value set via the PARALLEL clause will override any value set via "set default_parallel".)
  1017. You can include the PARALLEL clause with any operator that starts a reduce phase:
  1018. <a href="basic.html#COGROUP">COGROUP</a>,
  1019. <a href="basic.html#CROSS">CROSS</a>,
  1020. <a href="basic.html#DISTINCT">DISTINCT</a>,
  1021. <a href="basic.html#GROUP">GROUP</a>,
  1022. <a href="basic.html#JOIN-inner">JOIN (inner)</a>,
  1023. <a href="basic.html#JOIN-outer">JOIN (outer)</a>, and
  1024. <a href="basic.html#ORDER-BY">ORDER BY</a>.
  1025. </p>
  1026. <p>The number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 1 GB of data behaves efficiently.</p>
  1027. <p>
  1028. <strong>Let Pig Set the Number of Reducers</strong>
  1029. </p>
  1030. <p>If neither "set default_parallel" nor the PARALLEL clause is used, Pig sets the number of reducers using a heuristic based on the size of the input data. You can set the values for these properties:</p>
  1031. <ul>
  1032. <li>pig.exec.reducers.bytes.per.reducer - Defines the number of input bytes per reducer; default value is 1000*1000*1000 (1GB).</li>
  1033. <li>pig.exec.reducers.max - Defines the upper bound on the number of reducers; default is 999. </li>
  1034. </ul>
  1035. <p></p>
  1036. <p>The formula, shown below, is very simple and will improve over time. The computation takes all inputs within the script into account, and the resulting value is applied to all the jobs within the Pig script.</p>
  1037. <p>
  1038. <span class="codefrag">#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer) </span>
  1039. </p>
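<p>As a worked illustration of the formula, consider hypothetical scripts with 50 GB and 2 TB of total input. The input sizes are made up, and using the <span class="codefrag">set</span> command for this property assumes that your Pig version accepts Pig properties there.</p>
<pre class="code">
-- Defaults: bytes per reducer = 1,000,000,000; pig.exec.reducers.max = 999
-- 50 GB input: MIN(999, 50,000,000,000 / 1,000,000,000) = MIN(999, 50) = 50 reducers
-- 2 TB input:  MIN(999, 2,000,000,000,000 / 1,000,000,000) = MIN(999, 2000) = 999 reducers

-- Halving the bytes per reducer doubles the estimate for the 50 GB input:
set pig.exec.reducers.bytes.per.reducer 500000000;
-- MIN(999, 50,000,000,000 / 500,000,000) = MIN(999, 100) = 100 reducers
</pre>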
  1040. <p>
  1041. <strong>Examples</strong>
  1042. </p>
  1043. <p>In this example PARALLEL is used with the GROUP operator. </p>
  1044. <pre class="code">
  1045. A = LOAD 'myfile' AS (t, u, v);
  1046. B = GROUP A BY t PARALLEL 18;
  1047. ...
  1048. </pre>
  1049. <p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
  1050. <pre class="code">
  1051. SET default_parallel 20;
  1052. A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
  1053. B = GROUP A BY t;
  1054. C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
  1055. D = ORDER C BY mycount;
  1056. STORE D INTO 'mysortedcount' USING PigStorage();
  1057. </pre>
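<p>The two features can also be combined. The sketch below, with made-up reducer counts, illustrates the override behavior described earlier: the operator with a PARALLEL clause uses its own value, while the remaining reduce-side operators fall back to the script-level default.</p>
<pre class="code">
SET default_parallel 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t PARALLEL 5;   -- PARALLEL overrides the default: 5 reducers
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;        -- no PARALLEL clause: default of 20 reducers
STORE D INTO 'mysortedcount' USING PigStorage();
</pre>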
  1058. <a name="limit"></a>
  1059. <h3 class="h4">Use the LIMIT Operator</h3>
  1060. <p>Often you are not interested in the entire output but rather in a sample or the top results. In such cases, using LIMIT can yield much better performance because Pig pushes the limit as high as possible to minimize the amount of data travelling through the pipeline. </p>
  1061. <p>Sample:
  1062. </p>
  1063. <pre class="code">
  1064. A = load 'myfile' as (t, u, v);
  1065. B = limit A 500;
  1066. </pre>
  1067. <p>Top results: </p>
  1068. <pre class="code">
  1069. A = load 'myfile' as (t, u, v);
  1070. B = order A by t;
  1071. C = limit B 500;
  1072. </pre>
  1073. <a name="distinct"></a>
  1074. <h3 class="h4">Prefer DISTINCT over GROUP BY/GENERATE</h3>
  1075. <p>To extract unique values from a column in a relation you can use DISTINCT or GROUP BY/GENERATE. DISTINCT is the preferred method; it is faster and more efficient.</p>
  1076. <p>Example using GROUP BY - GENERATE:</p>
  1077. <pre class="code">
  1078. A = load 'myfile' as (t, u, v);
  1079. B = foreach A generate u;
  1080. C = group B by u;
  1081. D = foreach C generate group as uniquekey;
  1082. dump D;
  1083. </pre>
  1084. <p>Example using DISTINCT:</p>
  1085. <pre class="code">
  1086. A = load 'myfile' as (t, u, v);
  1087. B = foreach A generate u;
  1088. C = distinct B;
  1089. dump C;
  1090. </pre>
  1091. <a name="compression"></a>
  1092. <h3 class="h4">Compress the Results of Intermediate Jobs</h3>
  1093. <p>If your Pig script generates a sequence of MapReduce jobs, you can compress the output of the intermediate jobs using LZO compression. (Use the <a href="test.html#EXPLAIN">EXPLAIN</a> operator to determine if your script produces multiple MapReduce Jobs.)</p>
  1094. <p>By doing this, you will save the HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data is generated, the greater the benefits in storage and speed.</p>
  1095. <p>You can set the value for these properties:</p>
  1096. <ul>
  1097. <li>pig.tmpfilecompression - Determines if the temporary files should be compressed or not (set to false by default).</li>
  1098. <li>pig.tmpfilecompression.codec - Specifies which compression codec to use. Currently, Pig accepts "gz" and "lzo" as possible values. However, because LZO is under GPL license (and disabled by default) you will need to configure your cluster to use the LZO codec to take advantage of this feature. For details, see http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ.</li>
  1099. </ul>
  1100. <p></p>
  1101. <p>On non-trivial queries (ones that ran longer than a couple of minutes) we saw significant improvements in both query latency and space usage. For some queries we saw up to 96% disk savings and up to a 4x query speedup. Of course, the performance characteristics are very much query and data dependent, and testing is needed to determine the gains. We did not see any slowdown in the tests we performed, which means that you are at least saving space when using compression.</p>
  1102. <p>With gzip we saw better compression (96-99%) but at the cost of a 4% slowdown. Thus, we don't recommend using gzip. </p>
  1103. <p>
  1104. <strong>Example</strong>
  1105. </p>
  1106. <pre class="code">
  1107. -- launch Pig script using lzo compression
  1108. java -cp $PIG_HOME/pig.jar
  1109. -Djava.library.path=&lt;path to the lzo library&gt;
  1110. -Dpig.tmpfilecompression=true
  1111. -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main myscript.pig
  1112. </pre>
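<p>As an alternative sketch, the same two properties could be set at the top of the script itself rather than on the command line. This assumes that your Pig version accepts Pig properties through the <span class="codefrag">set</span> command and that the LZO codec and its native library are already configured on the cluster.</p>
<pre class="code">
-- enable compression of temporary files between MapReduce jobs
set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;
</pre>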
  1113. <a name="combine-files"></a>
  1114. <h3 class="h4">Combine Small Input Files</h3>
  1115. <p>Processing input (either user input or intermediate input) from multiple small files can be inefficient because a separate map has to be created for each file. Pig can now combine small files so that they are processed as a single map.</p>
  1116. <p>You can set the values for these properties:</p>
  1117. <ul>
  1118. <li>pig.maxCombinedSplitSize &ndash; Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached. </li>
  1119. <li>pig.splitCombination &ndash; Turns split combination on or off (set to &ldquo;true&rdquo; by default).</li>
  1120. </ul>
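<p>For illustration, the properties above might be set at the top of a script as sketched here. The 128 MB value is only an example, and using <span class="codefrag">set</span> for these properties assumes that your Pig version accepts Pig properties there.</p>
<pre class="code">
-- combine small input files until each map processes about 128 MB (134217728 bytes)
set pig.maxCombinedSplitSize 134217728;
-- to turn the feature off instead, uncomment the next line
-- set pig.splitCombination false;
</pre>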
  1121. <p></p>
  1122. <p>This feature works with <a href="func.html#PigStorage">PigStorage</a>. However, if you are using a custom loader, please note the following:</p>
  1123. <ul>
  1124. <li>If your loader implementation makes use of the PigSplit object passed through the prepareToRead method, then you may need to rebuild the loader since the definition of PigSplit has been modified. </li>
  1125. <li>The loader must be stateless across the invocations to the prepareToRead method. That is, the method should reset any internal states that are not affected by the RecordReader argument.</li>
  1126. <li>If a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.</li>
  1127. </ul>
  1128. <p></p>
  1129. </div>
  1130. <!-- ==================================================================== -->
  1131. <!-- SPECIALIZED JOINS-->
  1132. <a name="specialized-joins"></a>
  1133. <h2 class="h3">Specialized Joins</h2>
  1134. <div class="section">
  1135. <a name="replicated-joins"></a>
  1136. <h3 class="h4">Replicated Joins</h3>
  1137. <p>Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory.
  1138. In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side. In this type of join the
  1139. large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they
  1140. don't, …
