<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.8">
<meta name="Forrest-skin-name" content="pelt">
<title>Performance and Efficiency</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<!--+
    |breadtrail
    +-->
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/pig/">Pig</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<!--+
    |header
    +-->
<div class="header">
<!--+
    |start group logo
    +-->
<div class="grouplogo">
<a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
</div>
<!--+
    |end group logo
    +-->
<!--+
    |start Project Logo
    +-->
<div class="projectlogo">
<a href="http://hadoop.apache.org/pig/"><img class="logoImage" alt="Pig" src="images/pig-logo.gif" title="A platform for analyzing large datasets."></a>
</div>
<!--+
    |end Project Logo
    +-->
<!--+
    |start Search
    +-->
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp; 
                    <input name="Search" value="Search" type="submit">
</form>
</div>
<!--+
    |end search
    +-->
<!--+
    |start Tabs
    +-->
<ul id="tabs">
<li>
<a class="unselected" href="http://hadoop.apache.org/pig/">Project</a>
</li>
<li>
<a class="unselected" href="http://wiki.apache.org/pig/">Wiki</a>
</li>
<li class="current">
<a class="selected" href="index.html">Pig 0.9.2 Documentation</a>
</li>
</ul>
<!--+
    |end Tabs
    +-->
</div>
</div>
<div id="main">
<div id="publishedStrip">
<!--+
    |start Subtabs
    +-->
<div id="level2tabs"></div>
<!--+
    |end Endtabs
    +-->
<script type="text/javascript"><!--
document.write("Last Published: " + document.lastModified);
//  --></script>
</div>
<!--+
    |breadtrail
    +-->
<div class="breadtrail">

             &nbsp;
           </div>
<!--+
    |start Menu, mainarea
    +-->
<!--+
    |start Menu
    +-->
<div id="menu">
<div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Pig</div>
<div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="index.html">Overview</a>
</div>
<div class="menuitem">
<a href="start.html">Getting Started</a>
</div>
<div class="menuitem">
<a href="basic.html">Pig Latin Basics</a>
</div>
<div class="menuitem">
<a href="func.html">Built In Functions</a>
</div>
<div class="menuitem">
<a href="udf.html">User Defined Functions</a>
</div>
<div class="menuitem">
<a href="cont.html">Control Structures</a>
</div>
<div class="menuitem">
<a href="cmds.html">Shell and Utility Commands</a>
</div>
<div class="menupage">
<div class="menupagetitle">Performance and Efficiency</div>
</div>
<div class="menuitem">
<a href="test.html">Testing and Diagnostics</a>
</div>
<div class="menuitem">
<a href="pig-index.html">Index</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Zebra</div>
<div id="menu_1.2" class="menuitemgroup">
<div class="menuitem">
<a href="zebra_overview.html">Zebra Overview </a>
</div>
<div class="menuitem">
<a href="zebra_users.html">Zebra Users </a>
</div>
<div class="menuitem">
<a href="zebra_reference.html">Zebra Reference </a>
</div>
<div class="menuitem">
<a href="zebra_mapreduce.html">Zebra MapReduce </a>
</div>
<div class="menuitem">
<a href="zebra_pig.html">Zebra Pig </a>
</div>
<div class="menuitem">
<a href="zebra_stream.html">Zebra Streaming </a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Miscellaneous</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="api/">API Docs</a>
</div>
<div class="menuitem">
<a href="jdiff/changes.html">API Changes</a>
</div>
<div class="menuitem">
<a href="https://cwiki.apache.org/confluence/display/PIG">Wiki</a>
</div>
<div class="menuitem">
<a href="https://cwiki.apache.org/confluence/display/PIG/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://hadoop.apache.org/pig/releases.html">Release Notes</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<!--+
  |alternative credits
  +-->
<div id="credit2"></div>
</div>
<!--+
    |end Menu
    +-->
<!--+
    |start content
    +-->
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="perf.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
        PDF</a>
</div>
<h1>Performance and Efficiency</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#combiner">Combiner</a>
<ul class="minitoc">
<li>
<a href="#When+the+Combiner+is+Used">When the Combiner is Used</a>
</li>
<li>
<a href="#When+the+Combiner+is+Not+Used">When the Combiner is Not Used</a>
</li>
</ul>
</li>
<li>
<a href="#memory-management">Memory Management</a>
</li>
<li>
<a href="#multi-query-execution">Multi-Query Execution</a>
<ul class="minitoc">
<li>
<a href="#Turning+it+On+or+Off">Turning it On or Off</a>
</li>
<li>
<a href="#How+it+Works">How it Works</a>
</li>
<li>
<a href="#store-dump">Store vs. Dump</a>
</li>
<li>
<a href="#error-handling">Error Handling</a>
</li>
<li>
<a href="#backward-compatibility">Backward Compatibility</a>
</li>
<li>
<a href="#Implicit-Dependencies">Implicit Dependencies</a>
</li>
</ul>
</li>
<li>
<a href="#optimization-rules">Optimization Rules</a>
<ul class="minitoc">
<li>
<a href="#FilterLogicExpressionSimplifier">FilterLogicExpressionSimplifier</a>
</li>
<li>
<a href="#SplitFilter">SplitFilter</a>
</li>
<li>
<a href="#PushUpFilter">PushUpFilter</a>
</li>
<li>
<a href="#MergeFilter">MergeFilter</a>
</li>
<li>
<a href="#PushDownForEachFlatten">PushDownForEachFlatten</a>
</li>
<li>
<a href="#LimitOptimizer">LimitOptimizer</a>
</li>
<li>
<a href="#ColumnMapKeyPrune">ColumnMapKeyPrune</a>
</li>
<li>
<a href="#AddForEach">AddForEach</a>
</li>
<li>
<a href="#MergeForEach">MergeForEach</a>
</li>
<li>
<a href="#GroupByConstParallelSetter">GroupByConstParallelSetter</a>
</li>
</ul>
</li>
<li>
<a href="#performance-enhancers">Performance Enhancers</a>
<ul class="minitoc">
<li>
<a href="#Use+Optimization">Use Optimization</a>
</li>
<li>
<a href="#types">Use Types</a>
</li>
<li>
<a href="#projection">Project Early and Often </a>
</li>
<li>
<a href="#filter">Filter Early and Often</a>
</li>
<li>
<a href="#pipeline">Reduce Your Operator Pipeline</a>
</li>
<li>
<a href="#algebraic-interface">Make Your UDFs Algebraic</a>
</li>
<li>
<a href="#accumulator-interface">Use the Accumulator Interface</a>
</li>
<li>
<a href="#nulls">Drop Nulls Before a Join</a>
</li>
<li>
<a href="#join-optimizations">Take Advantage of Join Optimizations</a>
</li>
<li>
<a href="#parallel">Use the Parallel Features</a>
</li>
<li>
<a href="#limit">Use the LIMIT Operator</a>
</li>
<li>
<a href="#distinct">Prefer DISTINCT over GROUP BY/GENERATE</a>
</li>
<li>
<a href="#compression">Compress the Results of Intermediate Jobs</a>
</li>
<li>
<a href="#combine-files">Combine Small Input Files</a>
</li>
</ul>
</li>
<li>
<a href="#specialized-joins">Specialized Joins</a>
<ul class="minitoc">
<li>
<a href="#replicated-joins">Replicated Joins</a>
</li>
<li>
<a href="#skewed-joins">Skewed Joins</a>
</li>
<li>
<a href="#merge-joins">Merge Joins</a>
</li>
<li>
<a href="#specialized-joins-performance">Performance Considerations</a>
</li>
</ul>
</li>
</ul>
</div> 

<!-- ================================================================== -->
<!-- COMBINER -->

<a name="N10011"></a><a name="combiner"></a>
<h2 class="h3">Combiner</h2>
<div class="section">
<p>The Pig combiner is an optimizer that is invoked when the statements in your scripts are arranged in certain ways. The examples below demonstrate when the combiner is used and not used. Whenever possible, make sure the combiner is used as it frequently yields an order of magnitude improvement in performance. </p>
<a name="N1001A"></a><a name="When+the+Combiner+is+Used"></a>
<h3 class="h4">When the Combiner is Used</h3>
<p>The combiner is generally used in the case of non-nested foreach where all projections are either expressions on the group column or expressions on algebraic UDFs (see <a href="#algebraic-interface">Make Your UDFs Algebraic</a>).</p>
<p>Example:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)), COUNT(org.apache.pig.builtin.Distinct(A.name)), (MIN(A.gpa) + MAX(A.gpa))/2, group.age;
explain C;
</pre>
<p></p>
<p>In the above example:</p>
<ul>

<li>The GROUP statement can be referred to as a whole or by accessing individual fields (as in the example). </li>

<li>The GROUP statement and its elements can appear anywhere in the projection. </li>

</ul>
<p>In the above example, a variety of expressions can be applied to algebraic functions including:</p>
<ul>

<li>A column transformation function such as ABS can be applied to an algebraic function SUM.</li>

<li>An algebraic function (COUNT) can be applied to another algebraic function (Distinct), but only the inner function is computed using the combiner. </li>

<li>A mathematical expression can be applied to one or more algebraic functions. </li>

</ul>
<p></p>
<p>You can check if the combiner is used for your query by running <a href="test.html#EXPLAIN">EXPLAIN</a> on the FOREACH alias as shown above. You should see the combine section in the MapReduce part of the plan:</p>
<pre class="code">
.....
Combine Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
| |
| Project[bytearray][0] - scope-43
|
|---C: New For Each(false,false,false)[bag] - scope-28
| |
| Project[bytearray][0] - scope-29
| |
| POUserFunc(org.apache.pig.builtin.SUM$Intermediate)[tuple] - scope-30
| |
| |---Project[bag][1] - scope-31
| |
| POUserFunc(org.apache.pig.builtin.Distinct$Intermediate)[tuple] - scope-32
| |
| |---Project[bag][2] - scope-33
|
|---POCombinerPackage[tuple]{bytearray} - scope-36--------
.....
</pre>
<p>The combiner is also used with a nested foreach as long as the only nested operation used is DISTINCT
(see <a href="basic.html#FOREACH">FOREACH</a> and <a href="basic.html#nestedblock">Example: Nested Block</a>).
</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B { D = distinct (A.name); generate group, COUNT(D);}
</pre>
<p></p>
<p>Finally, use of the combiner is influenced by the surrounding environment of the GROUP and FOREACH statements.</p>
<a name="N1006D"></a><a name="When+the+Combiner+is+Not+Used"></a>
<h3 class="h4">When the Combiner is Not Used</h3>
<p>The combiner is generally not used if there is any operator that comes between the GROUP and FOREACH statements in the execution plan. Even if the statements are next to each other in your script, the optimizer might rearrange them. In this example, the optimizer will push FILTER above FOREACH, which will prevent the use of the combiner:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = filter C by group.age &lt;30;
</pre>
<p></p>
<p>Please note that the script above can be made more efficient by performing filtering before the GROUP statement:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = filter A by age &lt;30;
C = group B by age;
D = foreach C generate group, COUNT (B);
</pre>
<p></p>
<p>
<strong>Note:</strong> One exception to the above rule is LIMIT. Starting with Pig 0.9, even if LIMIT comes between GROUP and FOREACH, the combiner will still be used. In this example, the optimizer will push LIMIT above FOREACH but this will not prevent the use of the combiner.</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = limit C 20;
</pre>
<p></p>
<p>The combiner is also not used in the case where multiple FOREACH statements are associated with the same GROUP:</p>
<pre class="code">
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = foreach B generate group, MIN(A.gpa), MAX(A.gpa);
.....
</pre>
<p>Depending on your use case, it might be more efficient (improve performance) to split your script into multiple scripts.</p>
</div> 


<!-- ================================================================== -->
<!-- MEMORY MANAGEMENT -->

<a name="N100A0"></a><a name="memory-management"></a>
<h2 class="h3">Memory Management</h2>
<div class="section">
<p>Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
<a name="memory-bags"></a>
<p id="memory-bags">The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 20% (0.2) of available memory. Note that this memory is shared across all large bags used by the application.</p>
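<p>For example, one way to lower the bag memory share to 10% is to pass the property to Pig at launch time (the 0.1 value below is purely illustrative):</p>
<pre class="code">
$ pig -Dpig.cachedbag.memusage=0.1 myscript.pig
</pre>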
</div> 


<!-- ==================================================================== -->
<!-- MULTI-QUERY EXECUTION-->

<a name="N100B2"></a><a name="multi-query-execution"></a>
<h2 class="h3">Multi-Query Execution</h2>
<div class="section">
<p>With multi-query execution Pig processes an entire script or a batch of statements at once.</p>
<a name="N100BB"></a><a name="Turning+it+On+or+Off"></a>
<h3 class="h4">Turning it On or Off</h3>
<p>Multi-query execution is turned on by default. 
	To turn it off and revert to Pig's "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
<p>To run script "myscript.pig" without the optimization, execute Pig as follows: </p>
<pre class="code">
$ pig -M myscript.pig
or
$ pig -no_multiquery myscript.pig
</pre>
<a name="N100CC"></a><a name="How+it+Works"></a>
<h3 class="h4">How it Works</h3>
<p>Multi-query execution introduces some changes:</p>
<ul>

<li>

<p>For batch mode execution, the entire script is first parsed to determine if intermediate tasks 
can be combined to reduce the overall amount of work that needs to be done; execution starts only after the parsing is completed 
(see the <a href="test.html#EXPLAIN">EXPLAIN</a> operator and the <a href="cmds.html#run">run</a> and <a href="cmds.html#exec">exec</a> commands). </p>


</li>

<li>

<p>Two run scenarios are optimized, as explained below: explicit and implicit splits, and storing intermediate results.</p>

</li>

</ul>
<a name="N100F0"></a><a name="splits"></a>
<h4>Explicit and Implicit Splits</h4>
<p>There might be cases in which you want different processing on separate parts of the same data stream.</p>
<p>Example 1:</p>
<pre class="code">
A = LOAD ...
...
SPLIT A' INTO B IF ..., C IF ...
...
STORE B' ...
STORE C' ...
</pre>
<p>Example 2:</p>
<pre class="code">
A = LOAD ...
...
B = FILTER A' ...
C = FILTER A' ...
...
STORE B' ...
STORE C' ...
</pre>
<p>In prior Pig releases, Example 1 would dump A' to disk and then start jobs for B' and C'. 
Example 2 would execute all the dependencies of B' and store it, and then execute all the dependencies of C' and store it. 
Both are equivalent, but the performance will be different. </p>
<p>Here's what the multi-query execution does to increase the performance: </p>
<ul>

<li>
<p>For Example 2, adds an implicit split to transform the query to Example 1. 
		This eliminates the processing of A' multiple times.</p>
</li>

<li>
<p>Makes the split non-blocking and allows processing to continue. 
		This helps reduce the amount of data that has to be stored right at the split.  </p>
</li>

<li>
<p>Allows multiple outputs from a job. This way some results can be stored as a side-effect of the main job. 
		This is also necessary to make the previous item work.  </p>
</li>

<li>
<p>Allows multiple split branches to be carried on to the combiner/reducer. 
		This reduces the amount of IO again in the case where multiple branches in the split can benefit from a combiner run. </p>
</li>

</ul>
<a name="N10121"></a><a name="data-store-performance"></a>
<h4>Storing Intermediate Results</h4>
<p>Sometimes it is necessary to store intermediate results. </p>
<pre class="code">
A = LOAD ...
...
STORE A'
...
STORE A''
</pre>
<p>If the script doesn't re-load A' for the processing of A'', the steps above A' will be duplicated. 
This is a special case of Example 2 above, so the same steps are recommended. 
With multi-query execution, the script will process A and store A' as a side-effect.</p>
<a name="N10135"></a><a name="store-dump"></a>
<h3 class="h4">Store vs. Dump</h3>
<p>With multi-query execution, you want to use <a href="basic.html#STORE">STORE</a> to save (persist) your results. 
	You do not want to use <a href="test.html#DUMP">DUMP</a> as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.) </p>
<p>DUMP Example: In this script, because the DUMP command is interactive, the multi-query execution will be disabled and two separate jobs will be created to execute this script. The first job will execute A &gt; B &gt; DUMP while the second job will execute A &gt; B &gt; C &gt; STORE.</p>
<pre class="code">
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x &gt; 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
</pre>
<p>STORE Example: In this script, multi-query optimization will kick in allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
<pre class="code">
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x &gt; 5;
STORE B INTO 'output1';
C = FOREACH B GENERATE y, z;
STORE C INTO 'output2';
</pre>
<a name="N10157"></a><a name="error-handling"></a>
<h3 class="h4">Error Handling</h3>
<p>With multi-query execution Pig processes an entire script or a batch of statements at once. 
	By default Pig tries to run all the jobs that result from that, regardless of whether some jobs fail during execution. 
	To check which jobs have succeeded or failed use one of these options. </p>
<p>First, Pig logs all successful and failed store commands. Store commands are identified by output path. 
	At the end of execution a summary line indicates success, partial failure or failure of all store commands. </p>
<p>Second, Pig returns a different return code upon completion for each of these scenarios:</p>
<ul>

<li>
<p>Return code 0: All jobs succeeded</p>
</li>

<li>
<p>Return code 1: <em>Used for retriable errors</em> 
</p>
</li>

<li>
<p>Return code 2: All jobs have failed </p>
</li>

<li>
<p>Return code 3: Some jobs have failed  </p>
</li>

</ul>
<p></p>
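<p>For example, a driver shell script could branch on the return code; this is only a sketch:</p>
<pre class="code">
$ pig myscript.pig
$ if [ $? -ne 0 ]; then echo "some or all store commands failed"; fi
</pre>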
<p>In some cases it might be desirable to fail the entire script upon detecting the first failed job. 
	This can be achieved with the "-F" or "-stop_on_failure" command line flag. 
	If used, Pig will stop execution when the first failed job is detected and discontinue further processing. 
	This also means that file commands that come after a failed store in the script will not be executed (this can be used to create "done" files). </p>
<p>This is how the flag is used: </p>
<pre class="code">
$ pig -F myscript.pig
or
$ pig -stop_on_failure myscript.pig
</pre>
<a name="N1018B"></a><a name="backward-compatibility"></a>
<h3 class="h4">Backward Compatibility</h3>
<p>Most existing Pig scripts will produce the same result with or without the multi-query execution. 
	There are cases though where this is not true. Path names and schemes are discussed here.</p>
<p>Any script is parsed in its entirety before it is sent to execution. Since the current directory can change 
	throughout the script, any path used in a LOAD or STORE statement is translated to a fully qualified and absolute path.</p>
<p>In map-reduce mode, the following script will load from "hdfs://&lt;host&gt;:&lt;port&gt;/data1" and store into "hdfs://&lt;host&gt;:&lt;port&gt;/tmp/out1". </p>
<pre class="code">
cd /;
A = LOAD 'data1';
cd tmp;
STORE A INTO 'out1';
</pre>
<p>These expanded paths will be passed to any LoadFunc or Slicer implementation. 
	In some cases this can cause problems, especially when a LoadFunc/Slicer is not used to read from a dfs file or path 
	(for example, loading from an SQL database). </p>
<p>Solutions are to either: </p>
<ul>

<li>
<p>Specify "-M" or "-no_multiquery" to revert to the old names</p>
</li>

<li>
<p>Specify a custom scheme for the LoadFunc/Slicer </p>
</li>

</ul>
<p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded; they are passed to the LoadFunc/Slicer unchanged.</p>
<p>In the SQL case, the SQLLoader function is invoked with 'sql://mytable'. </p>
<pre class="code">
A = LOAD 'sql://mytable' USING SQLLoader();
</pre>
<a name="N101BA"></a><a name="Implicit-Dependencies"></a>
<h3 class="h4">Implicit Dependencies</h3>
<p>If a script has dependencies on the execution order outside of what Pig knows about, execution may fail. </p>
<a name="N101C3"></a><a name="Example"></a>
<h4>Example</h4>
<p>In this script, MYUDF might try to read from out1, a file that A was just stored into. 
However, Pig does not know that MYUDF depends on the out1 file and might submit the jobs 
producing the out2 and out1 files at the same time.</p>
<pre class="code">
...
STORE A INTO 'out1';
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0,'out1');
STORE C INTO 'out2';
</pre>
<p>To make the script work (to ensure that the right execution order is enforced) add the exec statement. 
The exec statement will trigger the execution of the statements that produce the out1 file. </p>
<pre class="code">
...
STORE A INTO 'out1';
EXEC;
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0,'out1');
STORE C INTO 'out2';
</pre>
<a name="N101D8"></a><a name="Example-N101D8"></a>
<h4>Example</h4>
<p>In this script, the STORE/LOAD operators have different file paths; however, the LOAD operator depends on the STORE operator.</p>
<pre class="code">
A = LOAD '/user/xxx/firstinput' USING PigStorage();
B = group ....
C = .... aggregation function
STORE C INTO '/user/vxj/firstinputtempresult/days1';
..
Atab = LOAD '/user/xxx/secondinput' USING  PigStorage();
Btab = group ....
Ctab = .... aggregation function
STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
..
E = LOAD '/user/vxj/firstinputtempresult/' USING  PigStorage();
F = group ....
G = .... aggregation function
STORE G INTO '/user/vxj/finalresult1';

Etab = LOAD '/user/vxj/secondinputtempresult/' USING  PigStorage();
Ftab = group ....
Gtab = .... aggregation function
STORE Gtab INTO '/user/vxj/finalresult2';
</pre>
<p>To make the script work, add the exec statement.  </p>
<pre class="code">
A = LOAD '/user/xxx/firstinput' USING PigStorage();
B = group ....
C = .... aggregation function
STORE C INTO '/user/vxj/firstinputtempresult/days1';
..
Atab = LOAD '/user/xxx/secondinput' USING  PigStorage();
Btab = group ....
Ctab = .... aggregation function
STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';

EXEC;

E = LOAD '/user/vxj/firstinputtempresult/' USING  PigStorage();
F = group ....
G = .... aggregation function
STORE G INTO '/user/vxj/finalresult1';
..
Etab = LOAD '/user/vxj/secondinputtempresult/' USING  PigStorage();
Ftab = group ....
Gtab = .... aggregation function
STORE Gtab INTO '/user/vxj/finalresult2';
</pre>
</div>


<!-- ==================================================================== -->
 <!-- OPTIMIZATION RULES -->

<a name="N101F3"></a><a name="optimization-rules"></a>
<h2 class="h3">Optimization Rules</h2>
<div class="section">
<p>Pig supports various optimization rules. By default optimization, and all optimization rules, are turned on. 
To turn off optimization, use:</p>
<pre class="code">
pig -optimizer_off [opt_rule | all ]
</pre>
<p>Note that some rules are mandatory and cannot be turned off.</p>
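<p>For example, to run a script with only the SplitFilter rule disabled, or with all non-mandatory rules disabled (a sketch based on the syntax above):</p>
<pre class="code">
$ pig -optimizer_off SplitFilter myscript.pig
$ pig -optimizer_off all myscript.pig
</pre>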
<a name="N10205"></a><a name="FilterLogicExpressionSimplifier"></a>
<h3 class="h4">FilterLogicExpressionSimplifier</h3>
<p>This rule simplifies the expressions in FILTER statements.</p>
<pre class="code">
1) Constant pre-calculation 

B = FILTER A BY a0 &gt; 5+7; 
is simplified to 
B = FILTER A BY a0 &gt; 12; 

2) Elimination of negations 

B = FILTER A BY NOT (NOT(a0 &gt; 5) OR a &gt; 10); 
is simplified to 
B = FILTER A BY a0 &gt; 5 AND a &lt;= 10; 

3) Elimination of logical implied expression in AND 

B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 7); 
is simplified to 
B = FILTER A BY a0 &gt; 7; 

4) Elimination of logical implied expression in OR 

B = FILTER A BY ((a0 &gt; 5) OR (a0 &gt; 6 AND a1 &gt; 15)); 
is simplified to 
B = FILTER A BY a0 &gt; 5; 

5) Equivalence elimination 

B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 5); 
is simplified to 
B = FILTER A BY a0 &gt; 5; 

6) Elimination of complementary expressions in OR 

B = FILTER A BY (a0 &gt; 5 OR a0 &lt;= 5); 
is simplified to non-filtering 

7) Elimination of naive TRUE expression 

B = FILTER A BY 1==1; 
is simplified to non-filtering 
</pre>
<a name="N10215"></a><a name="SplitFilter"></a>
<h3 class="h4">SplitFilter</h3>
<p>Split filter conditions so that the individual conditions can be pushed up more aggressively.</p>
<pre class="code">
A = LOAD 'input1' as (a0, a1);
B = LOAD 'input2' as (b0, b1);
C = JOIN A by a0, B by b0;
D = FILTER C BY a1&gt;0 and b1&gt;0;
</pre>
<p>Here D will be split into:</p>
<pre class="code">
X = FILTER C BY a1&gt;0;
D = FILTER X BY b1&gt;0;
</pre>
<p>So "a1&gt;0" and "b1&gt;0" can be pushed up individually.</p>
<a name="N1022F"></a><a name="PushUpFilter"></a>
<h3 class="h4">PushUpFilter</h3>
<p>The objective of this rule is to push the FILTER operators up the data flow graph. As a result, the number of records that flow through the pipeline is reduced. </p>
<pre class="code">
A = LOAD 'input';
B = GROUP A BY $0;
C = FILTER B BY $0 &lt; 10;
</pre>
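<p>Because the filter in this example references only the group key, it can be applied before the GROUP without changing the result; conceptually, the rewritten plan is equivalent to this sketch:</p>
<pre class="code">
A = LOAD 'input';
A1 = FILTER A BY $0 &lt; 10;
B = GROUP A1 BY $0;
</pre>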
<a name="N1023F"></a><a name="MergeFilter"></a>
<h3 class="h4">MergeFilter</h3>
<p>Merge filter conditions after the PushUpFilter rule has been applied to decrease the number of filter statements.</p>
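<p>For example, once the two filters produced by SplitFilter can be pushed no further, they can be recombined; a sketch:</p>
<pre class="code">
X = FILTER C BY a1&gt;0;
D = FILTER X BY b1&gt;0;
-- is merged into
D = FILTER C BY a1&gt;0 and b1&gt;0;
</pre>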
<a name="N1024B"></a><a name="PushDownForEachFlatten"></a>
<h3 class="h4">PushDownForEachFlatten</h3>
<p>The objective of this rule is to reduce the number of records that flow through the pipeline by moving FOREACH operators with a FLATTEN down the data flow graph. In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation.</p>
<pre class="code">
A = LOAD 'input' AS (a, b, c);
B = LOAD 'input2' AS (x, y, z);
C = FOREACH A GENERATE FLATTEN($0), b, c;
D = JOIN C BY $1, B BY $1;
</pre>
<a name="N1025B"></a><a name="LimitOptimizer"></a>
<h3 class="h4">LimitOptimizer</h3>
<p>The objective of this rule is to push the LIMIT operator up the data flow graph (or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY.</p>
<pre class="code">
A = LOAD 'input';
B = ORDER A BY $0;
C = LIMIT B 10;
</pre>
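<p>In the plan above, pushing the LIMIT into the ORDER BY means the sort only has to retain the top 10 records. For a LIMIT that follows a row-preserving operator such as a simple FOREACH, the push-up can be sketched as:</p>
<pre class="code">
B = FOREACH A GENERATE $0;
C = LIMIT B 10;
-- is rewritten so that the limit is applied first:
B1 = LIMIT A 10;
C = FOREACH B1 GENERATE $0;
</pre>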
<a name="N1026B"></a><a name="ColumnMapKeyPrune"></a>
<h3 class="h4">ColumnMapKeyPrune</h3>
<p>Prune the loader to load only the necessary columns. The performance gain is more significant if the corresponding loader supports column pruning and loads only the necessary columns (see LoadPushDown.pushProjection). Otherwise, ColumnMapKeyPrune will insert a FOREACH statement right after the loader.</p>
<pre class="code">
A = load 'input' as (a0, a1, a2);
B = ORDER A by a0;
C = FOREACH B GENERATE a0, a1;
</pre>
<p>a2 is irrelevant in this query, so we can prune it early. The loader in this query is PigStorage, which supports column pruning, so only a0 and a1 are loaded from the input file.</p>
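<p>When the loader cannot prune columns itself, the effect of the FOREACH inserted by this rule can be sketched as:</p>
<pre class="code">
A = load 'input' as (a0, a1, a2);
A1 = FOREACH A GENERATE a0, a1;  -- inserted by ColumnMapKeyPrune
B = ORDER A1 by a0;
C = FOREACH B GENERATE a0, a1;
</pre>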
<p>ColumnMapKeyPrune also prunes unused map keys:</p>
<pre class="code">
A = load 'input' as (a0:map[]);
B = FOREACH A generate a0#'key1';
</pre>
<a name="N10285"></a><a name="AddForEach"></a>
<h3 class="h4">AddForEach</h3>
<p>Prune unused columns as soon as possible. In addition to pruning columns at the loader (see ColumnMapKeyPrune), a column can be pruned as soon as it is no longer used in the rest of the script.</p>
<pre class="code">
-- Original code: 

A = LOAD 'input' AS (a0, a1, a2); 
B = ORDER A BY a0;
C = FILTER B BY a1&gt;0;
</pre>
<p>At load time we can only prune a2. However, a0 is never used after the "ORDER BY", so we can drop it right after the "ORDER BY" statement.</p>
<pre class="code">
-- Optimized code: 

A = LOAD 'input' AS (a0, a1, a2); 
B = ORDER A BY a0;
B1 = FOREACH B GENERATE a1;  -- drop a0
C = FILTER B1 BY a1&gt;0;
</pre>
<a name="N1029C"></a><a name="MergeForEach"></a>
<h3 class="h4">MergeForEach</h3>
<p>The objective of this rule is to merge together two foreach statements, if these preconditions are met:</p>
<ul>

<li>The foreach statements are consecutive.</li>

<li>The first foreach statement does not contain flatten.</li>

<li>The second foreach is not nested.</li>

</ul>
<pre class="code">
-- Original code: 

A = LOAD 'file.txt' AS (a, b, c); 
B = FOREACH A GENERATE a+b AS u, c-b AS v; 
C = FOREACH B GENERATE $0+5, v; 

-- Optimized code: 

A = LOAD 'file.txt' AS (a, b, c); 
C = FOREACH A GENERATE a+b+5, c-b;
</pre>
<a name="N102B8"></a><a name="GroupByConstParallelSetter"></a>
<h3 class="h4">GroupByConstParallelSetter</h3>
<p>Force parallel "1" for a "group all" statement. Even if parallelism is set to N, only one reducer is actually used in this case and all the other reducers would produce empty results.</p>
<pre class="code">
A = LOAD 'input';
B = GROUP A all PARALLEL 10;
</pre>
</div>


<!-- ==================================================================== -->
<!-- PERFORMANCE ENHANCERS-->

<a name="N102CB"></a><a name="performance-enhancers"></a>
<h2 class="h3">Performance Enhancers</h2>
<div class="section">
<a name="N102D1"></a><a name="Use+Optimization"></a>
<h3 class="h4">Use Optimization</h3>
<p>Pig supports various <a href="perf.html#optimization-rules">optimization rules</a> which are turned on by default. 
Become familiar with these rules.</p>
<a name="N102E1"></a><a name="types"></a>
<h3 class="h4">Use Types</h3>
<p>If types are not specified in the load statement, Pig assumes the type <span class="codefrag">double</span> for numeric computations. 
Much of the time your data would be much smaller, maybe integer or long. Specifying the real type will help with the 
speed of arithmetic computation. It has an additional advantage of early error detection. </p>
<pre class="code">
--Query 1
A = load 'myfile' as (t, u, v);
B = foreach A generate t + u;

--Query 2
A = load 'myfile' as (t: int, u: int, v);
B = foreach A generate t + u;
</pre>
<p>The second query will run more efficiently than the first. In some of our queries we have seen a 2x speedup. </p>
<a name="N102F4"></a><a name="projection"></a>
<h3 class="h4">Project Early and Often </h3>
<p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
D = group C by u;
E = foreach D generate group, COUNT($1);
</pre>
<p>There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by Pig. </p>
<pre class="code">
A = load 'myfile' as (t, u, v);
A1 = foreach A generate t, u;
B = load 'myotherfile' as (x, y, z);
B1 = foreach B generate x;
C = join A1 by t, B1 by x;
C1 = foreach C generate t, u;
D = group C1 by u;
E = foreach D generate group, COUNT($1);
</pre>
<p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p>
<a name="N1030E"></a><a name="filter"></a>
<h3 class="h4">Filter Early and Often</h3>
<p>As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. </p>
<pre class="code">
-- Query 1
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = filter A by t == 1;
D = join C by t, B by x;
E = group D by u;
F = foreach E generate group, COUNT($1);

-- Query 2
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
D = group C by u;
E = foreach D generate group, COUNT($1);
F = filter E by C.t == 1;
</pre>
<p>The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. </p>
<p>One case where pushing filters up might not be a good idea is if the cost of applying filter is very high and only a small amount of data is filtered out. </p>
<a name="N10324"></a><a name="pipeline"></a>
<h3 class="h4">Reduce Your Operator Pipeline</h3>
<p>For clarity of your script, you might choose to split your projections into several steps, for instance: </p>
<pre class="code">
A = load 'data' as (in: map[]);
-- get key out of the map
B = foreach A generate in#'k1' as k1, in#'k2' as k2;
-- concatenate the keys
C = foreach B generate CONCAT(k1, k2);
.......
</pre>
<p>While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance: </p>
<pre class="code">
A = load 'data' as (in: map[]);
-- concatenate the keys from the map
B = foreach A generate CONCAT(in#'k1', in#'k2');
....
</pre>
<p>The same goes for filters. </p>
<a name="N1033E"></a><a name="algebraic-interface"></a>
<h3 class="h4">Make Your UDFs Algebraic</h3>
<p>Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than the versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see <a href="udf.html#algebraic-interface">Algebraic Interface</a>.</p>
<pre class="code">
A = load 'data' as (x, y, z);
B = group A by x;
C = foreach B generate group, MyUDF(A);
....
</pre>
<p>If <span class="codefrag">MyUDF</span> is algebraic, the query will use the combiner and run much faster. You can run the <span class="codefrag">explain</span> command on your query to make sure that the combiner is used. </p>
<a name="N1035B"></a><a name="accumulator-interface"></a>
<h3 class="h4">Use the Accumulator Interface</h3>
<p>
If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Accumulator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see <a href="udf.html#Accumulator-Interface">Accumulator Interface</a>.</p>
<p>
<strong>Note:</strong> Pig automatically chooses the interface that it expects to provide the best performance: Algebraic &gt; Accumulator &gt; Default. </p>
<a name="N10373"></a><a name="nulls"></a>
<h3 class="h4">Drop Nulls Before a Join</h3>
<p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row (and no output), in a standard join the rows with a null key will always be dropped. </p>
<p>This join</p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
</pre>
<p>is rewritten by Pig to </p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C1 = cogroup A by t INNER, B by x INNER;
C = foreach C1 generate flatten(A), flatten(B);
</pre>
<p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. </p>
<p>If the query is rewritten to </p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
A1 = filter A by t is not null;
B1 = filter B by x is not null;
C = join A1 by t, B1 by x;
</pre>
<p>then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early filters. </p>
<a name="N1039A"></a><a name="join-optimizations"></a>
<h3 class="h4">Take Advantage of Join Optimizations</h3>
<p>
<strong>Regular Join Optimizations</strong>
</p>
<p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used which means you can avoid spilling the data and also should be able to scale your query to larger data volumes. </p>
<p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query. 
In some of our tests we saw 10x performance improvement as the result of this optimization.</p>
<pre class="code">
small = load 'small_file' as (t, u, v);
large = load 'large_file' as (x, y, z);
C = join small by t, large by x;
</pre>
<p>
<strong>Specialized Join Optimizations</strong>
</p>
<p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins. 
For more information see <a href="perf.html#specialized-joins">Specialized Joins</a>.</p>
<a name="N103BC"></a><a name="parallel"></a>
<h3 class="h4">Use the Parallel Features</h3>
<p>You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features. 
(The parallel features only affect the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.)</p>
<p>
<strong>You Set the Number of Reducers</strong>
</p>
<p>Use the <a href="cmds.html#set">set default parallel</a> command to set the number of reducers at the script level.</p>
<p>Alternatively, use the PARALLEL clause to set the number of reducers at the operator level. 
(In a script, the value set via the PARALLEL clause will override any value set via "set default parallel.")
You can include the PARALLEL clause with any operator that starts a reduce phase:  
<a href="basic.html#COGROUP">COGROUP</a>, 
<a href="basic.html#CROSS">CROSS</a>, 
<a href="basic.html#DISTINCT">DISTINCT</a>, 
<a href="basic.html#GROUP">GROUP</a>, 
<a href="basic.html#JOIN-inner">JOIN (inner)</a>, 
<a href="basic.html#JOIN-outer">JOIN (outer)</a>, and
<a href="basic.html#ORDER-BY">ORDER BY</a>.
</p>
<p>The number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 1 GB of data behaves efficiently.</p>
<p>
<strong>Let Pig Set the Number of Reducers</strong>
</p>
<p>If neither "set default parallel" nor the PARALLEL clause are used, Pig sets the number of reducers using a heuristic based on the size of the input data. You can set the values for these properties:</p>
<ul>

<li>pig.exec.reducers.bytes.per.reducer - Defines the number of input bytes per reduce; default value is 1000*1000*1000 (1GB).</li>

<li>pig.exec.reducers.max - Defines the upper bound on the number of reducers; default is 999. </li>

</ul>
<p></p>
<p>The formula, shown below, is very simple and will improve over time. The computed value takes all inputs within the script into account and applies the computed value to all the jobs within the Pig script.</p>
<p>
<span class="codefrag">#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer) </span>
</p>
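<p>For example, with the default settings, a script whose inputs total 5 GB would get MIN(999, 5,000,000,000 / 1,000,000,000) = 5 reducers.</p>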
<p>
<strong>Examples</strong>
</p>
<p>In this example PARALLEL is used with the GROUP operator. </p>
<pre class="code">
A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t PARALLEL 18;
...
</pre>
<p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
<pre class="code">
SET default_parallel 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();
</pre>
<a name="N10420"></a><a name="limit"></a>
<h3 class="h4">Use the LIMIT Operator</h3>
<p>Often you are not interested in the entire output but rather a sample or top results. In such cases, using LIMIT can yield a much better performance as we push the limit as high as possible to minimize the amount of data travelling through the pipeline. </p>
<p>Sample: 
</p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = limit A 500;
</pre>
<p>Top results: </p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = order A by t;
C = limit B 500;
</pre>
<a name="N1043A"></a><a name="distinct"></a>
<h3 class="h4">Prefer DISTINCT over GROUP BY/GENERATE</h3>
<p>To extract unique values from a column in a relation you can use DISTINCT or GROUP BY/GENERATE. DISTINCT is the preferred method; it is faster and more efficient.</p>
<p>Example using GROUP BY - GENERATE:</p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = foreach A generate u;
C = group B by u;
D = foreach C generate group as uniquekey;
dump D; 
</pre>
<p>Example using DISTINCT:</p>
<pre class="code">
A = load 'myfile' as (t, u, v);
B = foreach A generate u;
C = distinct B;
dump C; 
</pre>
<a name="N10454"></a><a name="compression"></a>
<h3 class="h4">Compress the Results of Intermediate Jobs</h3>
<p>If your Pig script generates a sequence of MapReduce jobs, you can compress the output of the intermediate jobs using LZO compression. (Use the <a href="test.html#EXPLAIN">EXPLAIN</a> operator to determine if your script produces multiple MapReduce Jobs.)</p>
<p>By doing this, you will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data that is generated, the greater the benefits in storage and speed.</p>
<p>You can set the value for these properties:</p>
<ul>

<li>pig.tmpfilecompression - Determines if the temporary files should be compressed or not (set to false by default).</li>

<li>pig.tmpfilecompression.codec - Specifies which compression codec to use. Currently, Pig accepts "gz" and "lzo" as possible values. However, because LZO is under GPL license (and disabled by default) you will need to configure your cluster to use the LZO codec to take advantage of this feature. For details, see http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ.</li>

</ul>
<p></p>
<p>On non-trivial queries (ones that ran longer than a couple of minutes) we saw significant improvements both in terms of query latency and space usage. For some queries we saw up to 96% disk savings and up to 4x query speedup. Of course, the performance characteristics are very much query and data dependent, and testing needs to be done to determine gains. We did not see any slowdown in the tests we performed, which means that you are at least saving on space while using compression.</p>
<p>With gzip we saw better compression (96-99%) but at the cost of a 4% slowdown. Thus, we don't recommend using gzip. </p>
<p>
<strong>Example</strong>
</p>
<pre class="code">
-- launch Pig script using lzo compression 

java -cp $PIG_HOME/pig.jar 
-Djava.library.path=&lt;path to the lzo library&gt; 
-Dpig.tmpfilecompression=true 
-Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main  myscript.pig 
</pre>
<a name="N10483"></a><a name="combine-files"></a>
<h3 class="h4">Combine Small Input Files</h3>
<p>Processing input (either user input or intermediate input) from multiple small files can be inefficient because a separate map has to be created for each file. Pig can now combine small files so that they are processed as a single map.</p>
<p>You can set the values for these properties:</p>
<ul>

<li>pig.maxCombinedSplitSize &ndash; Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached. </li>

<li>pig.splitCombination &ndash; Turns combine split files on or off (set to &ldquo;true&rdquo; by default).</li>

</ul>
<p></p>
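<p>For example, to pack small files into splits of roughly 128 MB each, both properties could be passed when launching Pig (the split size shown is only an illustration):</p>
<pre class="code">
$ pig -Dpig.splitCombination=true -Dpig.maxCombinedSplitSize=134217728 myscript.pig
</pre>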
<p>This feature works with <a href="func.html#PigStorage">PigStorage</a>. However, if you are using a custom loader, please note the following:</p>
<ul>

<li>If your loader implementation makes use of the PigSplit object passed through the prepareToRead method, then you may need to rebuild the loader since the definition of PigSplit has been modified. </li>

<li>The loader must be stateless across the invocations to the prepareToRead method. That is, the method should reset any internal states that are not affected by the RecordReader argument.</li>

<li>If a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.</li>

</ul>
<p></p>
</div>

<!-- ==================================================================== -->
<!-- SPECIALIZED JOINS-->

<a name="N104B5"></a><a name="specialized-joins"></a>
<h2 class="h3">Specialized Joins</h2>
<div class="section">
<a name="N104BF"></a><a name="replicated-joins"></a>
<h3 class="h4">Replicated Joins</h3>
<p>Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. 
In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side. In this type of join the 
large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they 
don't, the process fails and an error is generated.</p>
<a name="N104C8"></a><a name="Usage"></a>
<h4>Usage</h4>
<p>Perform a replicated join with the USING clause (see <a href="basic.html#JOIN-inner">JOIN (inner)</a> and <a href="basic.html#JOIN-outer">JOIN (outer)</a>).
In this example, a large relation is joined with two smaller relations. Note that the large relation comes first followed by the smaller relations; 
and, all small relations together must fit into main memory, otherwise an error is generated. </p>
<pre class="code">
big = LOAD 'big_data' AS (b1,b2,b3);

tiny = LOAD 'tiny_data' AS (t1,t2,t3);

mini = LOAD 'mini_data' AS (m1,m2,m3);

C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
</pre>
<a name="N104DE"></a><a name="Conditions"></a>
<h4>Conditions</h4>
<p>Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit 
into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 MB can be used if the process overall 
gets 1 GB of memory. Please share your observations and experience with us.</p>
<a name="N104EF"></a><a name="skewed-joins"></a>
<h3 class="h4">Skewed Joins</h3>
<p>
Parallel joins are vulnerable to the presence of skew in the underlying data. 
If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains. 
In order to counteract this problem, skewed join computes a histogram of the key space and uses this 
data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. 
It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is 
sampled to create the histogram.
</p>
<p>
Skewed join can be used when the underlying data is sufficiently skewed and you need a finer 
control over the allocation of reducers to counteract the skew. It should also be used when the data 
associated with a given key is too large to fit in memory.
</p>
<a name="N104FB"></a><a name="Usage-N104FB"></a>
<h4>Usage</h4>
<p>Perform a skewed join with the USING clause (see <a href="basic.html#JOIN-inner">JOIN (inner)</a> and <a href="basic.html#JOIN-outer">JOIN (outer)</a>).</p>