/site/publish/docs/r0.9.2/perf.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.8">
<meta name="Forrest-skin-name" content="pelt">
<title>Performance and Efficiency</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<!--+
|breadtrail
+-->
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/pig/">Pig</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<!--+
|header
+-->
<div class="header">
<!--+
|start group logo
+-->
<div class="grouplogo">
<a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
</div>
<!--+
|end group logo
+-->
<!--+
|start Project Logo
+-->
<div class="projectlogo">
<a href="http://hadoop.apache.org/pig/"><img class="logoImage" alt="Pig" src="images/pig-logo.gif" title="A platform for analyzing large datasets."></a>
</div>
<!--+
|end Project Logo
+-->
<!--+
|start Search
+-->
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input name="Search" value="Search" type="submit">
</form>
</div>
<!--+
|end search
+-->
<!--+
|start Tabs
+-->
<ul id="tabs">
<li>
<a class="unselected" href="http://hadoop.apache.org/pig/">Project</a>
</li>
<li>
<a class="unselected" href="http://wiki.apache.org/pig/">Wiki</a>
</li>
<li class="current">
<a class="selected" href="index.html">Pig 0.9.2 Documentation</a>
</li>
</ul>
<!--+
|end Tabs
+-->
</div>
</div>
<div id="main">
<div id="publishedStrip">
<!--+
|start Subtabs
+-->
<div id="level2tabs"></div>
<!--+
|end Endtabs
+-->
<script type="text/javascript"><!--
document.write("Last Published: " + document.lastModified);
// --></script>
</div>
<!--+
|breadtrail
+-->
<div class="breadtrail">
&nbsp;
</div>
<!--+
|start Menu, mainarea
+-->
<!--+
|start Menu
+-->
<div id="menu">
<div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Pig</div>
<div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="index.html">Overview</a>
</div>
<div class="menuitem">
<a href="start.html">Getting Started</a>
</div>
<div class="menuitem">
<a href="basic.html">Pig Latin Basics</a>
</div>
<div class="menuitem">
<a href="func.html">Built In Functions</a>
</div>
<div class="menuitem">
<a href="udf.html">User Defined Functions</a>
</div>
<div class="menuitem">
<a href="cont.html">Control Structures</a>
</div>
<div class="menuitem">
<a href="cmds.html">Shell and Utility Commands</a>
</div>
<div class="menupage">
<div class="menupagetitle">Performance and Efficiency</div>
</div>
<div class="menuitem">
<a href="test.html">Testing and Diagnostics</a>
</div>
<div class="menuitem">
<a href="pig-index.html">Index</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Zebra</div>
<div id="menu_1.2" class="menuitemgroup">
<div class="menuitem">
<a href="zebra_overview.html">Zebra Overview </a>
</div>
<div class="menuitem">
<a href="zebra_users.html">Zebra Users </a>
</div>
<div class="menuitem">
<a href="zebra_reference.html">Zebra Reference </a>
</div>
<div class="menuitem">
<a href="zebra_mapreduce.html">Zebra MapReduce </a>
</div>
<div class="menuitem">
<a href="zebra_pig.html">Zebra Pig </a>
</div>
<div class="menuitem">
<a href="zebra_stream.html">Zebra Streaming </a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Miscellaneous</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="api/">API Docs</a>
</div>
<div class="menuitem">
<a href="jdiff/changes.html">API Changes</a>
</div>
<div class="menuitem">
<a href="https://cwiki.apache.org/confluence/display/PIG">Wiki</a>
</div>
<div class="menuitem">
<a href="https://cwiki.apache.org/confluence/display/PIG/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://hadoop.apache.org/pig/releases.html">Release Notes</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<!--+
|alternative credits
+-->
<div id="credit2"></div>
</div>
<!--+
|end Menu
+-->
<!--+
|start content
+-->
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="perf.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Performance and Efficiency</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#combiner">Combiner</a>
<ul class="minitoc">
<li>
<a href="#When+the+Combiner+is+Used">When the Combiner is Used</a>
</li>
<li>
<a href="#When+the+Combiner+is+Not+Used">When the Combiner is Not Used</a>
</li>
</ul>
</li>
<li>
<a href="#memory-management">Memory Management</a>
</li>
<li>
<a href="#multi-query-execution">Multi-Query Execution</a>
<ul class="minitoc">
<li>
<a href="#Turning+it+On+or+Off">Turning it On or Off</a>
</li>
<li>
<a href="#How+it+Works">How it Works</a>
</li>
<li>
<a href="#store-dump">Store vs. Dump</a>
</li>
<li>
<a href="#error-handling">Error Handling</a>
</li>
<li>
<a href="#backward-compatibility">Backward Compatibility</a>
</li>
<li>
<a href="#Implicit-Dependencies">Implicit Dependencies</a>
</li>
</ul>
</li>
<li>
<a href="#optimization-rules">Optimization Rules</a>
<ul class="minitoc">
<li>
<a href="#FilterLogicExpressionSimplifier">FilterLogicExpressionSimplifier</a>
</li>
<li>
<a href="#SplitFilter">SplitFilter</a>
</li>
<li>
<a href="#PushUpFilter">PushUpFilter</a>
</li>
<li>
<a href="#MergeFilter">MergeFilter</a>
</li>
<li>
<a href="#PushDownForEachFlatten">PushDownForEachFlatten</a>
</li>
<li>
<a href="#LimitOptimizer">LimitOptimizer</a>
</li>
<li>
<a href="#ColumnMapKeyPrune">ColumnMapKeyPrune</a>
</li>
<li>
<a href="#AddForEach">AddForEach</a>
</li>
<li>
<a href="#MergeForEach">MergeForEach</a>
</li>
<li>
<a href="#GroupByConstParallelSetter">GroupByConstParallelSetter</a>
</li>
</ul>
</li>
<li>
<a href="#performance-enhancers">Performance Enhancers</a>
<ul class="minitoc">
<li>
<a href="#Use+Optimization">Use Optimization</a>
</li>
<li>
<a href="#types">Use Types</a>
</li>
<li>
<a href="#projection">Project Early and Often </a>
</li>
<li>
<a href="#filter">Filter Early and Often</a>
</li>
<li>
<a href="#pipeline">Reduce Your Operator Pipeline</a>
</li>
<li>
<a href="#algebraic-interface">Make Your UDFs Algebraic</a>
</li>
<li>
<a href="#accumulator-interface">Use the Accumulator Interface</a>
</li>
<li>
<a href="#nulls">Drop Nulls Before a Join</a>
</li>
<li>
<a href="#join-optimizations">Take Advantage of Join Optimizations</a>
</li>
<li>
<a href="#parallel">Use the Parallel Features</a>
</li>
<li>
<a href="#limit">Use the LIMIT Operator</a>
</li>
<li>
<a href="#distinct">Prefer DISTINCT over GROUP BY/GENERATE</a>
</li>
<li>
<a href="#compression">Compress the Results of Intermediate Jobs</a>
</li>
<li>
<a href="#combine-files">Combine Small Input Files</a>
</li>
</ul>
</li>
<li>
<a href="#specialized-joins">Specialized Joins</a>
<ul class="minitoc">
<li>
<a href="#replicated-joins">Replicated Joins</a>
</li>
<li>
<a href="#skewed-joins">Skewed Joins</a>
</li>
<li>
<a href="#merge-joins">Merge Joins</a>
</li>
<li>
<a href="#specialized-joins-performance">Performance Considerations</a>
</li>
</ul>
</li>
</ul>
</div>
<!-- ================================================================== -->
<!-- COMBINER -->
<a name="N10011"></a><a name="combiner"></a>
<h2 class="h3">Combiner</h2>
<div class="section">
<p>The Pig combiner is an optimizer that is invoked when the statements in your scripts are arranged in certain ways. The examples below demonstrate when the combiner is and is not used. Whenever possible, make sure the combiner is used, as it frequently yields an order-of-magnitude improvement in performance. </p>
<a name="N1001A"></a><a name="When+the+Combiner+is+Used"></a>
<h3 class="h4">When the Combiner is Used</h3>
<p>The combiner is generally used in the case of a non-nested foreach where all projections are either expressions on the group column or expressions on algebraic UDFs (see <a href="#algebraic-interface">Make Your UDFs Algebraic</a>).</p>
<p>Example:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)), COUNT(org.apache.pig.builtin.Distinct(A.name)), (MIN(A.gpa) + MAX(A.gpa))/2, group.age;
explain C;
</pre>
<p></p>
<p>In the above example:</p>
<ul>
<li>The GROUP statement can be referred to as a whole or by accessing individual fields (as in the example). </li>
<li>The GROUP statement and its elements can appear anywhere in the projection. </li>
</ul>
<p>In the above example, a variety of expressions can be applied to algebraic functions:</p>
<ul>
<li>A column transformation function such as ABS can be applied to an algebraic function such as SUM.</li>
<li>An algebraic function (COUNT) can be applied to another algebraic function (Distinct), but only the inner function is computed using the combiner. </li>
<li>A mathematical expression can be applied to one or more algebraic functions. </li>
</ul>
<p></p>
<p>You can check whether the combiner is used for your query by running <a href="test.html#EXPLAIN">EXPLAIN</a> on the FOREACH alias as shown above. You should see the combine section in the MapReduce part of the plan:</p>
<pre class="code">
.....
Combine Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
| |
| Project[bytearray][0] - scope-43
|
|---C: New For Each(false,false,false)[bag] - scope-28
| |
| Project[bytearray][0] - scope-29
| |
| POUserFunc(org.apache.pig.builtin.SUM$Intermediate)[tuple] - scope-30
| |
| |---Project[bag][1] - scope-31
| |
| POUserFunc(org.apache.pig.builtin.Distinct$Intermediate)[tuple] - scope-32
| |
| |---Project[bag][2] - scope-33
|
|---POCombinerPackage[tuple]{bytearray} - scope-36--------
.....
</pre>
<p>The combiner is also used with a nested foreach as long as the only nested operation used is DISTINCT
(see <a href="basic.html#FOREACH">FOREACH</a> and <a href="basic.html#nestedblock">Example: Nested Block</a>).
</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B { D = distinct (A.name); generate group, COUNT(D);}
</pre>
<p></p>
<p>Finally, use of the combiner is influenced by the surrounding environment of the GROUP and FOREACH statements.</p>
<a name="N1006D"></a><a name="When+the+Combiner+is+Not+Used"></a>
<h3 class="h4">When the Combiner is Not Used</h3>
<p>The combiner is generally not used if any operator comes between the GROUP and FOREACH statements in the execution plan. Even if the statements are next to each other in your script, the optimizer might rearrange them. In this example, the optimizer will push FILTER above FOREACH, which will prevent the use of the combiner:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = filter C by group.age &lt;30;
</pre>
<p></p>
<p>Note that the script above can be made more efficient by filtering before the GROUP statement:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = filter A by age &lt;30;
C = group B by age;
D = foreach C generate group, COUNT (B);
</pre>
<p></p>
<p>
<strong>Note:</strong> One exception to the above rule is LIMIT. Starting with Pig 0.9, even if LIMIT comes between GROUP and FOREACH, the combiner is still used. In this example, the optimizer will push LIMIT above FOREACH, but this will not prevent the use of the combiner.</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = limit C 20;
</pre>
<p></p>
<p>The combiner is also not used in the case where multiple FOREACH statements are associated with the same GROUP:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
.....
</pre>
<p>Depending on your use case, it might be more efficient to split your script into multiple scripts so that the combiner can be used.</p>
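<p>For instance, the two-FOREACH script above could be split along these lines (a sketch; the output paths are illustrative), so that each script contains a single FOREACH per GROUP and remains eligible for the combiner:</p>
<pre class="code">
-- script1.pig
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
store C into 'counts_by_age';

-- script2.pig
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
store D into 'gpa_range_by_age';
</pre>
<p>Whether the extra load of the input outweighs the combiner savings depends on your data, so measure both variants.</p>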
</div>
<!-- ================================================================== -->
<!-- MEMORY MANAGEMENT -->
<a name="N100A0"></a><a name="memory-management"></a>
<h2 class="h3">Memory Management</h2>
<div class="section">
<p>Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
<a name="memory-bags"></a>
<p id="memory-bags">The amount of memory allocated to bags is determined by the pig.cachedbag.memusage property; the default is 20% (0.2) of available memory. Note that this memory is shared across all large bags used by the application.</p>
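<p>For example, to lower the bag memory threshold to 10%, the property could be set in your pig.properties file (the value shown is illustrative, not a recommendation):</p>
<pre class="code">
# pig.properties
pig.cachedbag.memusage=0.1
</pre>
<p>A lower value makes bags spill to disk sooner, trading IO for a smaller memory footprint.</p>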
</div>
<!-- ==================================================================== -->
<!-- MULTI-QUERY EXECUTION-->
<a name="N100B2"></a><a name="multi-query-execution"></a>
<h2 class="h3">Multi-Query Execution</h2>
<div class="section">
<p>With multi-query execution Pig processes an entire script or a batch of statements at once.</p>
<a name="N100BB"></a><a name="Turning+it+On+or+Off"></a>
<h3 class="h4">Turning it On or Off</h3>
<p>Multi-query execution is turned on by default.
To turn it off and revert to Pig's "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" option. </p>
<p>To run script "myscript.pig" without the optimization, execute Pig as follows: </p>
<pre class="code">
$ pig -M myscript.pig
or
$ pig -no_multiquery myscript.pig
</pre>
<a name="N100CC"></a><a name="How+it+Works"></a>
<h3 class="h4">How it Works</h3>
<p>Multi-query execution introduces some changes:</p>
<ul>
<li>
<p>For batch mode execution, the entire script is first parsed to determine if intermediate tasks
can be combined to reduce the overall amount of work that needs to be done; execution starts only after the parsing is completed
(see the <a href="test.html#EXPLAIN">EXPLAIN</a> operator and the <a href="cmds.html#run">run</a> and <a href="cmds.html#exec">exec</a> commands). </p>
</li>
<li>
<p>Two run scenarios are optimized, as explained below: explicit and implicit splits, and storing intermediate results.</p>
</li>
</ul>
<a name="N100F0"></a><a name="splits"></a>
<h4>Explicit and Implicit Splits</h4>
<p>There might be cases in which you want different processing on separate parts of the same data stream.</p>
<p>Example 1:</p>
<pre class="code">
A = LOAD ...
...
SPLIT A' INTO B IF ..., C IF ...
...
STORE B' ...
STORE C' ...
</pre>
<p>Example 2:</p>
<pre class="code">
A = LOAD ...
...
B = FILTER A' ...
C = FILTER A' ...
...
STORE B' ...
STORE C' ...
</pre>
<p>In prior Pig releases, Example 1 would dump A' to disk and then start jobs for B' and C'.
Example 2 would execute all the dependencies of B' and store it, and then execute all the dependencies of C' and store it.
Both are equivalent, but their performance differs. </p>
<p>Here's what multi-query execution does to increase performance: </p>
<ul>
<li>
<p>For Example 2, it adds an implicit split to transform the query into Example 1.
This eliminates the processing of A' multiple times.</p>
</li>
<li>
<p>It makes the split non-blocking and allows processing to continue.
This helps reduce the amount of data that has to be stored right at the split. </p>
</li>
<li>
<p>It allows multiple outputs from a job, so that some results can be stored as a side effect of the main job.
This is also necessary to make the previous item work. </p>
</li>
<li>
<p>It allows multiple split branches to be carried on to the combiner/reducer.
This further reduces the amount of IO in the case where multiple branches in the split can benefit from a combiner run. </p>
</li>
</ul>
<a name="N10121"></a><a name="data-store-performance"></a>
<h4>Storing Intermediate Results</h4>
<p>Sometimes it is necessary to store intermediate results. </p>
<pre class="code">
A = LOAD ...
...
STORE A'
...
STORE A''
</pre>
<p>If the script doesn't re-load A' for the processing of A'', the steps above A' will be duplicated.
This is a special case of Example 2 above, so the same steps are recommended.
With multi-query execution, the script will process A and dump A' as a side effect.</p>
<a name="N10135"></a><a name="store-dump"></a>
<h3 class="h4">Store vs. Dump</h3>
<p>With multi-query execution, you want to use <a href="basic.html#STORE">STORE</a> to save (persist) your results.
You do not want to use <a href="test.html#DUMP">DUMP</a> as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.) </p>
<p>DUMP Example: In this script, because the DUMP command is interactive, multi-query execution will be disabled and two separate jobs will be created to execute the script. The first job will execute A &gt; B &gt; DUMP while the second job will execute A &gt; B &gt; C &gt; STORE.</p>
<pre class="code">
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x &gt; 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
</pre>
<p>STORE Example: In this script, multi-query optimization will kick in, allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
<pre class="code">
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x &gt; 5;
STORE B INTO 'output1';
C = FOREACH B GENERATE y, z;
STORE C INTO 'output2';
</pre>
<a name="N10157"></a><a name="error-handling"></a>
<h3 class="h4">Error Handling</h3>
<p>With multi-query execution Pig processes an entire script or a batch of statements at once.
By default Pig tries to run all the jobs that result from that, regardless of whether some jobs fail during execution.
To check which jobs have succeeded or failed, use one of these options. </p>
<p>First, Pig logs all successful and failed store commands. Store commands are identified by output path.
At the end of execution a summary line indicates success, partial failure or failure of all store commands. </p>
<p>Second, Pig returns a different return code upon completion for these scenarios:</p>
<ul>
<li>
<p>Return code 0: All jobs succeeded</p>
</li>
<li>
<p>Return code 1: <em>Used for retrievable errors</em>
</p>
</li>
<li>
<p>Return code 2: All jobs have failed </p>
</li>
<li>
<p>Return code 3: Some jobs have failed </p>
</li>
</ul>
<p></p>
<p>In some cases it might be desirable to fail the entire script upon detecting the first failed job.
This can be achieved with the "-F" or "-stop_on_failure" command line flag.
If used, Pig will stop execution when the first failed job is detected and discontinue further processing.
This also means that file commands that come after a failed store in the script will not be executed (this can be used to create "done" files). </p>
<p>This is how the flag is used: </p>
<pre class="code">
$ pig -F myscript.pig
or
$ pig -stop_on_failure myscript.pig
</pre>
<a name="N1018B"></a><a name="backward-compatibility"></a>
<h3 class="h4">Backward Compatibility</h3>
<p>Most existing Pig scripts will produce the same result with or without multi-query execution.
There are cases, though, where this is not true. Path names and schemes are discussed here.</p>
<p>Any script is parsed in its entirety before it is sent to execution. Since the current directory can change
throughout the script, any path used in a LOAD or STORE statement is translated to a fully qualified and absolute path.</p>
<p>In map-reduce mode, the following script will load from "hdfs://&lt;host&gt;:&lt;port&gt;/data1" and store into "hdfs://&lt;host&gt;:&lt;port&gt;/tmp/out1". </p>
<pre class="code">
cd /;
A = LOAD 'data1';
cd tmp;
STORE A INTO 'out1';
</pre>
<p>These expanded paths will be passed to any LoadFunc or Slicer implementation.
In some cases this can cause problems, especially when a LoadFunc/Slicer is not used to read from a dfs file or path
(for example, loading from an SQL database). </p>
<p>Solutions are to either: </p>
<ul>
<li>
<p>Specify "-M" or "-no_multiquery" to revert to the old names</p>
</li>
<li>
<p>Specify a custom scheme for the LoadFunc/Slicer </p>
</li>
</ul>
<p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded; they are passed to the LoadFunc/Slicer unchanged.</p>
<p>In the SQL case, the SQLLoader function is invoked with 'sql://mytable'. </p>
<pre class="code">
A = LOAD 'sql://mytable' USING SQLLoader();
</pre>
<a name="N101BA"></a><a name="Implicit-Dependencies"></a>
<h3 class="h4">Implicit Dependencies</h3>
<p>If a script has dependencies on the execution order outside of what Pig knows about, execution may fail. </p>
<a name="N101C3"></a><a name="Example"></a>
<h4>Example</h4>
<p>In this script, MYUDF might try to read from out1, a file that A was just stored into.
However, Pig does not know that MYUDF depends on the out1 file and might submit the jobs
producing the out2 and out1 files at the same time.</p>
<pre class="code">
...
STORE A INTO 'out1';
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0,'out1');
STORE C INTO 'out2';
</pre>
<p>To make the script work (to ensure that the right execution order is enforced), add the EXEC statement.
The EXEC statement will trigger the execution of the statements that produce the out1 file. </p>
<pre class="code">
...
STORE A INTO 'out1';
EXEC;
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0,'out1');
STORE C INTO 'out2';
</pre>
<a name="N101D8"></a><a name="Example-N101D8"></a>
<h4>Example</h4>
<p>In this script, the STORE/LOAD operators have different file paths; however, the LOAD operator depends on the STORE operator.</p>
<pre class="code">
A = LOAD '/user/xxx/firstinput' USING PigStorage();
B = group ....
C = .... aggregation function
STORE C INTO '/user/vxj/firstinputtempresult/days1';
..
Atab = LOAD '/user/xxx/secondinput' USING PigStorage();
Btab = group ....
Ctab = .... aggregation function
STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
..
E = LOAD '/user/vxj/firstinputtempresult/' USING PigStorage();
F = group ....
G = .... aggregation function
STORE G INTO '/user/vxj/finalresult1';
Etab = LOAD '/user/vxj/secondinputtempresult/' USING PigStorage();
Ftab = group ....
Gtab = .... aggregation function
STORE Gtab INTO '/user/vxj/finalresult2';
</pre>
<p>To make the script work, add the EXEC statement. </p>
<pre class="code">
A = LOAD '/user/xxx/firstinput' USING PigStorage();
B = group ....
C = .... aggregation function
STORE C INTO '/user/vxj/firstinputtempresult/days1';
..
Atab = LOAD '/user/xxx/secondinput' USING PigStorage();
Btab = group ....
Ctab = .... aggregation function
STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
EXEC;
E = LOAD '/user/vxj/firstinputtempresult/' USING PigStorage();
F = group ....
G = .... aggregation function
STORE G INTO '/user/vxj/finalresult1';
..
Etab = LOAD '/user/vxj/secondinputtempresult/' USING PigStorage();
Ftab = group ....
Gtab = .... aggregation function
STORE Gtab INTO '/user/vxj/finalresult2';
</pre>
</div>
<!-- ==================================================================== -->
<!-- OPTIMIZATION RULES -->
<a name="N101F3"></a><a name="optimization-rules"></a>
<h2 class="h3">Optimization Rules</h2>
<div class="section">
<p>Pig supports various optimization rules. By default, optimization and all optimization rules are turned on.
To turn off optimization, use:</p>
<pre class="code">
pig -optimizer_off [opt_rule | all ]
</pre>
<p>Note that some rules are mandatory and cannot be turned off.</p>
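<p>For example, to run a script with a single rule disabled, or with all optional rules disabled, the flag can be used like this (the rule names are those listed in the sections below; myscript.pig is a placeholder):</p>
<pre class="code">
$ pig -optimizer_off SplitFilter myscript.pig
$ pig -optimizer_off all myscript.pig
</pre>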
<a name="N10205"></a><a name="FilterLogicExpressionSimplifier"></a>
<h3 class="h4">FilterLogicExpressionSimplifier</h3>
<p>This rule simplifies the expression in a FILTER statement.</p>
<pre class="code">
1) Constant pre-calculation
B = FILTER A BY a0 &gt; 5+7;
is simplified to
B = FILTER A BY a0 &gt; 12;
2) Elimination of negations
B = FILTER A BY NOT (NOT(a0 &gt; 5) OR a &gt; 10);
is simplified to
B = FILTER A BY a0 &gt; 5 AND a &lt;= 10;
3) Elimination of logical implied expression in AND
B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 7);
is simplified to
B = FILTER A BY a0 &gt; 7;
4) Elimination of logical implied expression in OR
B = FILTER A BY ((a0 &gt; 5) OR (a0 &gt; 6 AND a1 &gt; 15));
is simplified to
B = FILTER A BY a0 &gt; 5;
5) Equivalence elimination
B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 5);
is simplified to
B = FILTER A BY a0 &gt; 5;
6) Elimination of complementary expressions in OR
B = FILTER A BY (a0 &gt; 5 OR a0 &lt;= 5);
is simplified to non-filtering
7) Elimination of naive TRUE expression
B = FILTER A BY 1==1;
is simplified to non-filtering
</pre>
  727. <a name="N10215"></a><a name="SplitFilter"></a>
  728. <h3 class="h4">SplitFilter</h3>
  729. <p>Split filter conditions so that we can push filter more aggressively.</p>
  730. <pre class="code">
  731. A = LOAD 'input1' as (a0, a1);
  732. B = LOAD 'input2' as (b0, b1);
  733. C = JOIN A by a0, B by b0;
  734. D = FILTER C BY a1&gt;0 and b1&gt;0;
  735. </pre>
  736. <p>Here D will be splitted into:</p>
  737. <pre class="code">
  738. X = FILTER C BY a1&gt;0;
  739. D = FILTER X BY b1&gt;0;
  740. </pre>
  741. <p>So "a1&gt;0" and "b1&gt;0" can be pushed up individually.</p>
  742. <a name="N1022F"></a><a name="PushUpFilter"></a>
  743. <h3 class="h4">PushUpFilter</h3>
  744. <p>The objective of this rule is to push the FILTER operators up the data flow graph. As a result, the number of records that flow through the pipeline is reduced. </p>
  745. <pre class="code">
  746. A = LOAD 'input';
  747. B = GROUP A BY $0;
  748. C = FILTER B BY $0 &lt; 10;
  749. </pre>
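<p>After this rule fires, the filter is applied before the more expensive group. A conceptual sketch of the rewritten plan (the optimizer performs this rewrite internally):</p>
<pre class="code">
A = LOAD 'input';
A1 = FILTER A BY $0 &lt; 10;
B = GROUP A1 BY $0;
</pre>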
  750. <a name="N1023F"></a><a name="MergeFilter"></a>
  751. <h3 class="h4">MergeFilter</h3>
  752. <p>Merge filter conditions after PushUpFilter rule to decrease the number of filter statements.</p>
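<p>For example, two adjacent filters left behind by SplitFilter and PushUpFilter are recombined into one (a conceptual sketch):</p>
<pre class="code">
-- before MergeFilter
B = FILTER A BY a0 &gt; 0;
C = FILTER B BY a1 &gt; 0;
-- after MergeFilter
C = FILTER A BY (a0 &gt; 0) AND (a1 &gt; 0);
</pre>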
  753. <a name="N1024B"></a><a name="PushDownForEachFlatten"></a>
  754. <h3 class="h4">PushDownForEachFlatten</h3>
  755. <p>The objective of this rule is to reduce the number of records that flow through the pipeline by moving FOREACH operators with a FLATTEN down the data flow graph. In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation.</p>
  756. <pre class="code">
A = LOAD 'input' AS (a, b, c);
B = LOAD 'input2' AS (x, y, z);
C = FOREACH A GENERATE FLATTEN($0), b, c;
D = JOIN C BY $1, B BY $1;
  761. </pre>
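<p>Conceptually, the optimizer rewrites the script so that the join runs before the flatten (a rough sketch only; the actual rewrite is performed internally and is subject to the rule's preconditions):</p>
<pre class="code">
C1 = JOIN A BY b, B BY y;
D = FOREACH C1 GENERATE FLATTEN(A::a), A::b, A::c, B::x, B::y, B::z;
</pre>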
  762. <a name="N1025B"></a><a name="LimitOptimizer"></a>
  763. <h3 class="h4">LimitOptimizer</h3>
  764. <p>The objective of this rule is to push the LIMIT operator up the data flow graph (or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY.</p>
  765. <pre class="code">
  766. A = LOAD 'input';
  767. B = ORDER A BY $0;
  768. C = LIMIT B 10;
  769. </pre>
  770. <a name="N1026B"></a><a name="ColumnMapKeyPrune"></a>
  771. <h3 class="h4">ColumnMapKeyPrune</h3>
<p>Prune the load so that only the necessary columns are loaded. The performance gain is greater if the corresponding loader supports column pruning and loads only those columns (see LoadPushDown.pushProjection). Otherwise, ColumnMapKeyPrune inserts a FOREACH statement right after the loader.</p>
  773. <pre class="code">
  774. A = load 'input' as (a0, a1, a2);
  775. B = ORDER A by a0;
  776. C = FOREACH B GENERATE a0, a1;
  777. </pre>
  778. <p>a2 is irrelevant in this query, so we can prune it earlier. The loader in this query is PigStorage and it supports column pruning. So we only load a0 and a1 from the input file.</p>
  779. <p>ColumnMapKeyPrune also prunes unused map keys:</p>
  780. <pre class="code">
  781. A = load 'input' as (a0:map[]);
  782. B = FOREACH A generate a0#'key1';
  783. </pre>
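<p>If the loader does not support column pruning itself, the rule instead inserts a FOREACH right after the load. A conceptual sketch for the first example above:</p>
<pre class="code">
A = LOAD 'input' AS (a0, a1, a2);
A1 = FOREACH A GENERATE a0, a1; -- inserted by ColumnMapKeyPrune
B = ORDER A1 BY a0;
C = FOREACH B GENERATE a0, a1;
</pre>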
  784. <a name="N10285"></a><a name="AddForEach"></a>
  785. <h3 class="h4">AddForEach</h3>
<p>Prune unused columns as soon as possible. In addition to pruning columns at the loader (as in ColumnMapKeyPrune), a column can be dropped as soon as it is no longer used in the rest of the script.</p>
  787. <pre class="code">
  788. -- Original code:
  789. A = LOAD 'input' AS (a0, a1, a2);
  790. B = ORDER A BY a0;
  791. C = FILTER B BY a1&gt;0;
  792. </pre>
<p>We can only prune a2 from the loader. However, a0 is never used after the "ORDER BY" statement, so we can drop it right after "ORDER BY".</p>
  794. <pre class="code">
  795. -- Optimized code:
  796. A = LOAD 'input' AS (a0, a1, a2);
  797. B = ORDER A BY a0;
  798. B1 = FOREACH B GENERATE a1; -- drop a0
  799. C = FILTER B1 BY a1&gt;0;
  800. </pre>
  801. <a name="N1029C"></a><a name="MergeForEach"></a>
  802. <h3 class="h4">MergeForEach</h3>
<p>The objective of this rule is to merge together two foreach statements, if these preconditions are met:</p>
  804. <ul>
  805. <li>The foreach statements are consecutive.</li>
  806. <li>The first foreach statement does not contain flatten.</li>
  807. <li>The second foreach is not nested.</li>
  808. </ul>
  809. <pre class="code">
  810. -- Original code:
  811. A = LOAD 'file.txt' AS (a, b, c);
  812. B = FOREACH A GENERATE a+b AS u, c-b AS v;
  813. C = FOREACH B GENERATE $0+5, v;
  814. -- Optimized code:
  815. A = LOAD 'file.txt' AS (a, b, c);
  816. C = FOREACH A GENERATE a+b+5, c-b;
  817. </pre>
  818. <a name="N102B8"></a><a name="GroupByConstParallelSetter"></a>
  819. <h3 class="h4">GroupByConstParallelSetter</h3>
<p>Force parallel "1" for a "group all" statement. Even if we set parallel to N, only 1 reducer will be used in this case and all other reducers produce empty results.</p>
  821. <pre class="code">
  822. A = LOAD 'input';
  823. B = GROUP A all PARALLEL 10;
  824. </pre>
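<p>The optimizer overrides the requested parallelism, so the statement effectively becomes:</p>
<pre class="code">
B = GROUP A all PARALLEL 1;
</pre>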
  825. </div>
  826. <!-- ==================================================================== -->
  827. <!-- PERFORMANCE ENHANCERS-->
  828. <a name="N102CB"></a><a name="performance-enhancers"></a>
  829. <h2 class="h3">Performance Enhancers</h2>
  830. <div class="section">
  831. <a name="N102D1"></a><a name="Use+Optimization"></a>
  832. <h3 class="h4">Use Optimization</h3>
  833. <p>Pig supports various <a href="perf.html#Optimization-Rules">optimization rules</a> which are turned on by default.
  834. Become familiar with these rules.</p>
  835. <a name="N102E1"></a><a name="types"></a>
  836. <h3 class="h4">Use Types</h3>
<p>If types are not specified in the load statement, Pig assumes the type <span class="codefrag">double</span> for numeric computations.
Much of the time your data would fit in a smaller type, such as int or long. Specifying the real type helps with the
speed of arithmetic computation. It has the additional advantage of early error detection. </p>
  840. <pre class="code">
  841. --Query 1
  842. A = load 'myfile' as (t, u, v);
  843. B = foreach A generate t + u;
  844. --Query 2
  845. A = load 'myfile' as (t: int, u: int, v);
  846. B = foreach A generate t + u;
  847. </pre>
<p>The second query will run more efficiently than the first. In some of our queries we have seen a 2x speedup. </p>
  849. <a name="N102F4"></a><a name="projection"></a>
  850. <h3 class="h4">Project Early and Often </h3>
  851. <p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p>
  852. <pre class="code">
  853. A = load 'myfile' as (t, u, v);
  854. B = load 'myotherfile' as (x, y, z);
  855. C = join A by t, B by x;
  856. D = group C by u;
  857. E = foreach D generate group, COUNT($1);
  858. </pre>
  859. <p>There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by pig. </p>
  860. <pre class="code">
  861. A = load 'myfile' as (t, u, v);
  862. A1 = foreach A generate t, u;
  863. B = load 'myotherfile' as (x, y, z);
  864. B1 = foreach B generate x;
  865. C = join A1 by t, B1 by x;
  866. C1 = foreach C generate t, u;
  867. D = group C1 by u;
  868. E = foreach D generate group, COUNT($1);
  869. </pre>
  870. <p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p>
  871. <a name="N1030E"></a><a name="filter"></a>
  872. <h3 class="h4">Filter Early and Often</h3>
  873. <p>As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. </p>
  874. <pre class="code">
  875. -- Query 1
  876. A = load 'myfile' as (t, u, v);
  877. B = load 'myotherfile' as (x, y, z);
  878. C = filter A by t == 1;
  879. D = join C by t, B by x;
  880. E = group D by u;
  881. F = foreach E generate group, COUNT($1);
  882. -- Query 2
  883. A = load 'myfile' as (t, u, v);
  884. B = load 'myotherfile' as (x, y, z);
  885. C = join A by t, B by x;
  886. D = group C by u;
  887. E = foreach D generate group, COUNT($1);
  888. F = filter E by C.t == 1;
  889. </pre>
  890. <p>The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. </p>
  891. <p>One case where pushing filters up might not be a good idea is if the cost of applying filter is very high and only a small amount of data is filtered out. </p>
  892. <a name="N10324"></a><a name="pipeline"></a>
  893. <h3 class="h4">Reduce Your Operator Pipeline</h3>
<p>For clarity of your script, you might choose to split your projection into several steps, for instance: </p>
  895. <pre class="code">
  896. A = load 'data' as (in: map[]);
  897. -- get key out of the map
  898. B = foreach A generate in#'k1' as k1, in#'k2' as k2;
  899. -- concatenate the keys
  900. C = foreach B generate CONCAT(k1, k2);
  901. .......
  902. </pre>
  903. <p>While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance: </p>
  904. <pre class="code">
  905. A = load 'data' as (in: map[]);
  906. -- concatenate the keys from the map
  907. B = foreach A generate CONCAT(in#'k1', in#'k2');
  908. ....
  909. </pre>
  910. <p>The same goes for filters. </p>
  911. <a name="N1033E"></a><a name="algebraic-interface"></a>
  912. <h3 class="h4">Make Your UDFs Algebraic</h3>
<p>Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see <a href="udf.html#algebraic-interface">Algebraic Interface</a>.</p>
  914. <pre class="code">
A = load 'data' as (x, y, z);
  916. B = group A by x;
  917. C = foreach B generate group, MyUDF(A);
  918. ....
  919. </pre>
  920. <p>If <span class="codefrag">MyUDF</span> is algebraic, the query will use combiner and run much faster. You can run <span class="codefrag">explain</span> command on your query to make sure that combiner is used. </p>
  921. <a name="N1035B"></a><a name="accumulator-interface"></a>
  922. <h3 class="h4">Use the Accumulator Interface</h3>
  923. <p>
If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Accumulator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see <a href="udf.html#Accumulator-Interface">Accumulator Interface</a>.</p>
  925. <p>
  926. <strong>Note:</strong> Pig automatically chooses the interface that it expects to provide the best performance: Algebraic &gt; Accumulator &gt; Default. </p>
  927. <a name="N10373"></a><a name="nulls"></a>
  928. <h3 class="h4">Drop Nulls Before a Join</h3>
  929. <p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row (and no output), in a standard join the rows with a null key will always be dropped. </p>
  930. <p>This join</p>
  931. <pre class="code">
  932. A = load 'myfile' as (t, u, v);
  933. B = load 'myotherfile' as (x, y, z);
  934. C = join A by t, B by x;
  935. </pre>
  936. <p>is rewritten by Pig to </p>
  937. <pre class="code">
  938. A = load 'myfile' as (t, u, v);
  939. B = load 'myotherfile' as (x, y, z);
  940. C1 = cogroup A by t INNER, B by x INNER;
  941. C = foreach C1 generate flatten(A), flatten(B);
  942. </pre>
  943. <p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. </p>
  944. <p>If the query is rewritten to </p>
  945. <pre class="code">
  946. A = load 'myfile' as (t, u, v);
  947. B = load 'myotherfile' as (x, y, z);
  948. A1 = filter A by t is not null;
  949. B1 = filter B by x is not null;
  950. C = join A1 by t, B1 by x;
  951. </pre>
<p>then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early filters. </p>
  953. <a name="N1039A"></a><a name="join-optimizations"></a>
  954. <h3 class="h4">Take Advantage of Join Optimizations</h3>
  955. <p>
  956. <strong>Regular Join Optimizations</strong>
  957. </p>
  958. <p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used which means you can avoid spilling the data and also should be able to scale your query to larger data volumes. </p>
  959. <p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query.
  960. In some of our tests we saw 10x performance improvement as the result of this optimization.</p>
  961. <pre class="code">
  962. small = load 'small_file' as (t, u, v);
  963. large = load 'large_file' as (x, y, z);
  964. C = join small by t, large by x;
  965. </pre>
  966. <p>
  967. <strong>Specialized Join Optimizations</strong>
  968. </p>
  969. <p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins.
  970. For more information see <a href="perf.html#Specialized-Joins">Specialized Joins</a>.</p>
  971. <a name="N103BC"></a><a name="parallel"></a>
  972. <h3 class="h4">Use the Parallel Features</h3>
  973. <p>You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features.
  974. (The parallel features only affect the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.)</p>
  975. <p>
  976. <strong>You Set the Number of Reducers</strong>
  977. </p>
  978. <p>Use the <a href="cmds.html#set">set default parallel</a> command to set the number of reducers at the script level.</p>
  979. <p>Alternatively, use the PARALLEL clause to set the number of reducers at the operator level.
  980. (In a script, the value set via the PARALLEL clause will override any value set via "set default parallel.")
  981. You can include the PARALLEL clause with any operator that starts a reduce phase:
  982. <a href="basic.html#COGROUP">COGROUP</a>,
  983. <a href="basic.html#CROSS">CROSS</a>,
  984. <a href="basic.html#DISTINCT">DISTINCT</a>,
  985. <a href="basic.html#GROUP">GROUP</a>,
  986. <a href="basic.html#JOIN-inner">JOIN (inner)</a>,
  987. <a href="basic.html#JOIN-outer">JOIN (outer)</a>, and
  988. <a href="basic.html#ORDER-BY">ORDER BY</a>.
  989. </p>
  990. <p>The number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 1 GB of data behaves efficiently.</p>
  991. <p>
  992. <strong>Let Pig Set the Number of Reducers</strong>
  993. </p>
  994. <p>If neither "set default parallel" nor the PARALLEL clause are used, Pig sets the number of reducers using a heuristic based on the size of the input data. You can set the values for these properties:</p>
  995. <ul>
  996. <li>pig.exec.reducers.bytes.per.reducer - Defines the number of input bytes per reduce; default value is 1000*1000*1000 (1GB).</li>
  997. <li>pig.exec.reducers.max - Defines the upper bound on the number of reducers; default is 999. </li>
  998. </ul>
  999. <p></p>
  1000. <p>The formula, shown below, is very simple and will improve over time. The computed value takes all inputs within the script into account and applies the computed value to all the jobs within Pig script.</p>
  1001. <p>
  1002. <span class="codefrag">#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer) </span>
  1003. </p>
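<p>As a worked example (hypothetical input size, default property values), a script whose inputs total 400 GB would get:</p>
<pre class="code">
#reducers = MIN(999, 400000000000 / 1000000000)
          = MIN(999, 400)
          = 400
</pre>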
  1004. <p>
  1005. <strong>Examples</strong>
  1006. </p>
  1007. <p>In this example PARALLEL is used with the GROUP operator. </p>
  1008. <pre class="code">
  1009. A = LOAD 'myfile' AS (t, u, v);
  1010. B = GROUP A BY t PARALLEL 18;
  1011. ...
  1012. </pre>
  1013. <p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
  1014. <pre class="code">
  1015. SET default_parallel 20;
  1016. A = LOAD &lsquo;myfile.txt&rsquo; USING PigStorage() AS (t, u, v);
  1017. B = GROUP A BY t;
  1018. C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
  1019. D = ORDER C BY mycount;
  1020. STORE D INTO &lsquo;mysortedcount&rsquo; USING PigStorage();
  1021. </pre>
  1022. <a name="N10420"></a><a name="limit"></a>
  1023. <h3 class="h4">Use the LIMIT Operator</h3>
<p>Often you are not interested in the entire output but rather a sample or top results. In such cases, using LIMIT can yield much better performance as we push the limit as high as possible to minimize the amount of data travelling through the pipeline. </p>
  1025. <p>Sample:
  1026. </p>
  1027. <pre class="code">
  1028. A = load 'myfile' as (t, u, v);
  1029. B = limit A 500;
  1030. </pre>
  1031. <p>Top results: </p>
  1032. <pre class="code">
  1033. A = load 'myfile' as (t, u, v);
  1034. B = order A by t;
  1035. C = limit B 500;
  1036. </pre>
  1037. <a name="N1043A"></a><a name="distinct"></a>
  1038. <h3 class="h4">Prefer DISTINCT over GROUP BY/GENERATE</h3>
  1039. <p>To extract unique values from a column in a relation you can use DISTINCT or GROUP BY/GENERATE. DISTINCT is the preferred method; it is faster and more efficient.</p>
  1040. <p>Example using GROUP BY - GENERATE:</p>
  1041. <pre class="code">
  1042. A = load 'myfile' as (t, u, v);
  1043. B = foreach A generate u;
  1044. C = group B by u;
  1045. D = foreach C generate group as uniquekey;
  1046. dump D;
  1047. </pre>
  1048. <p>Example using DISTINCT:</p>
  1049. <pre class="code">
  1050. A = load 'myfile' as (t, u, v);
  1051. B = foreach A generate u;
  1052. C = distinct B;
  1053. dump C;
  1054. </pre>
  1055. <a name="N10454"></a><a name="compression"></a>
  1056. <h3 class="h4">Compress the Results of Intermediate Jobs</h3>
  1057. <p>If your Pig script generates a sequence of MapReduce jobs, you can compress the output of the intermediate jobs using LZO compression. (Use the <a href="test.html#EXPLAIN">EXPLAIN</a> operator to determine if your script produces multiple MapReduce Jobs.)</p>
<p>By doing this, you will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data that is generated, the greater the benefits in storage and speed.</p>
  1059. <p>You can set the value for these properties:</p>
  1060. <ul>
  1061. <li>pig.tmpfilecompression - Determines if the temporary files should be compressed or not (set to false by default).</li>
  1062. <li>pig.tmpfilecompression.codec - Specifies which compression codec to use. Currently, Pig accepts "gz" and "lzo" as possible values. However, because LZO is under GPL license (and disabled by default) you will need to configure your cluster to use the LZO codec to take advantage of this feature. For details, see http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ.</li>
  1063. </ul>
  1064. <p></p>
<p>On non-trivial queries (ones that ran longer than a couple of minutes) we saw significant improvements both in terms of query latency and space usage. For some queries we saw up to 96% disk savings and up to a 4x query speedup. Of course, the performance characteristics are very much query and data dependent, and testing needs to be done to determine gains. We did not see any slowdown in the tests we performed, which means that you are at least saving on space while using compression.</p>
<p>With gzip we saw better compression (96-99%) but at the cost of a 4% slowdown. Thus, we don't recommend using gzip. </p>
  1067. <p>
  1068. <strong>Example</strong>
  1069. </p>
  1070. <pre class="code">
  1071. -- launch Pig script using lzo compression
  1072. java -cp $PIG_HOME/pig.jar
  1073. -Djava.library.path=&lt;path to the lzo library&gt;
  1074. -Dpig.tmpfilecompression=true
  1075. -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main myscript.pig
  1076. </pre>
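<p>The same properties can also be set from within a Pig script using the <span class="codefrag">set</span> command (a sketch; properties set this way apply to the whole script):</p>
<pre class="code">
set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;
</pre>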
  1077. <a name="N10483"></a><a name="combine-files"></a>
  1078. <h3 class="h4">Combine Small Input Files</h3>
<p>Processing input (either user input or intermediate input) from multiple small files can be inefficient because a separate map has to be created for each file. Pig can now combine small files so that they are processed as a single map.</p>
  1080. <p>You can set the values for these properties:</p>
  1081. <ul>
<li>pig.maxCombinedSplitSize &ndash; Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached. </li>
  1083. <li>pig.splitCombination &ndash; Turns combine split files on or off (set to &ldquo;true&rdquo; by default).</li>
  1084. </ul>
  1085. <p></p>
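<p>For example, to combine small inputs into splits of roughly 128 MB each (a hypothetical value; tune it to your HDFS block size):</p>
<pre class="code">
set pig.splitCombination true;
set pig.maxCombinedSplitSize 134217728; -- 128 MB in bytes
</pre>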
  1086. <p>This feature works with <a href="func.html#PigStorage">PigStorage</a>. However, if you are using a custom loader, please note the following:</p>
  1087. <ul>
  1088. <li>If your loader implementation makes use of the PigSplit object passed through the prepareToRead method, then you may need to rebuild the loader since the definition of PigSplit has been modified. </li>
<li>The loader must be stateless across invocations of the prepareToRead method. That is, the method should reset any internal state that is not affected by the RecordReader argument.</li>
  1090. <li>If a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.</li>
  1091. </ul>
  1092. <p></p>
  1093. </div>
  1094. <!-- ==================================================================== -->
  1095. <!-- SPECIALIZED JOINS-->
  1096. <a name="N104B5"></a><a name="specialized-joins"></a>
  1097. <h2 class="h3">Specialized Joins</h2>
  1098. <div class="section">
  1099. <a name="N104BF"></a><a name="replicated-joins"></a>
  1100. <h3 class="h4">Replicated Joins</h3>
  1101. <p>Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory.
  1102. In such cases, Pig can perform a very efficient join because all of the hadoop work is done on the map side. In this type of join the
  1103. large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they
  1104. don't, the process fails and an error is generated.</p>
  1105. <a name="N104C8"></a><a name="Usage"></a>
  1106. <h4>Usage</h4>
  1107. <p>Perform a replicated join with the USING clause (see <a href="basic.html#JOIN-inner">JOIN (inner)</a> and <a href="basic.html#JOIN-outer">JOIN (outer)</a>).
  1108. In this example, a large relation is joined with two smaller relations. Note that the large relation comes first followed by the smaller relations;
  1109. and, all small relations together must fit into main memory, otherwise an error is generated. </p>
  1110. <pre class="code">
  1111. big = LOAD 'big_data' AS (b1,b2,b3);
  1112. tiny = LOAD 'tiny_data' AS (t1,t2,t3);
  1113. mini = LOAD 'mini_data' AS (m1,m2,m3);
  1114. C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
  1115. </pre>
  1116. <a name="N104DE"></a><a name="Conditions"></a>
  1117. <h4>Conditions</h4>
<p>Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit
into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 MB can be used if the process overall
gets 1 GB of memory. Please share your observations and experience with us.</p>
  1121. <a name="N104EF"></a><a name="skewed-joins"></a>
  1122. <h3 class="h4">Skewed Joins</h3>
  1123. <p>
  1124. Parallel joins are vulnerable to the presence of skew in the underlying data.
  1125. If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains.
  1126. In order to counteract this problem, skewed join computes a histogram of the key space and uses this
  1127. data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys.
  1128. It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is
  1129. sampled to create the histogram.
  1130. </p>
  1131. <p>
  1132. Skewed join can be used when the underlying data is sufficiently skewed and you need a finer
  1133. control over the allocation of reducers to counteract the skew. It should also be used when the data
  1134. associated with a given key is too large to fit in memory.
  1135. </p>
  1136. <a name="N104FB"></a><a name="Usage-N104FB"></a>
  1137. <h4>Usage</h4>
  1138. <p>Perform a skewed join with the USING clause (see <a href="basic.html#JOIN-inner">JOIN (inner)</a> and <a href="basic.h…
