PageRenderTime 17ms CodeModel.GetById 12ms app.highlight 2ms RepoModel.GetById 1ms app.codeStats 0ms

/tags/release-0.0.0-rc0/hive/external/docs/xdocs/language_manual/joins.xml

#
XML | 212 lines | 119 code | 75 blank | 18 comment | 0 complexity | df93ff5450d454b251e911ea58d5c8f8 MD5 | raw file
  1<?xml version="1.0" encoding="UTF-8"?>
  2<!--
  3 Licensed to the Apache Software Foundation (ASF) under one
  4 or more contributor license agreements.  See the NOTICE file
  5 distributed with this work for additional information
  6 regarding copyright ownership.  The ASF licenses this file
  7 to you under the Apache License, Version 2.0 (the
  8 "License"); you may not use this file except in compliance
  9 with the License.  You may obtain a copy of the License at
 10
 11   http://www.apache.org/licenses/LICENSE-2.0
 12
 13 Unless required by applicable law or agreed to in writing,
 14 software distributed under the License is distributed on an
 15 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 16 KIND, either express or implied.  See the License for the
 17 specific language governing permissions and limitations
 18 under the License.    
 19-->
 20
 21<document>
 22
 23  <properties>
 24    <title>Hadoop Hive- Joins</title>
 25    <author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author>
 26  </properties>
 27
 28  <body>
 29
 30    <section name="Join Syntax" href="join_syntax">
 31
 32    <source><![CDATA[join_table:
 33    table_reference JOIN table_factor [join_condition]
 34  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
 35  | table_reference LEFT SEMI JOIN table_reference join_condition
 36
 37table_reference:
 38    table_factor
 39  | join_table
 40
 41table_factor:
 42    tbl_name [alias]
 43  | table_subquery alias
 44  | ( table_references )
 45
 46join_condition:
 47    ON equality_expression ( AND equality_expression )*
 48
 49equality_expression: 
 50    expression = expression
 51]]></source>
 52
 53<p>
 54Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive. 
 55</p>
 56
 57<b>Allowed Equality Joins</b>
 58
 59<source><![CDATA[SELECT a.* FROM a JOIN b ON (a.id = b.id) 
 60]]></source>
 61
 62<source><![CDATA[SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
 63]]></source>
 64
 65<b>Disallowed Joins</b>
 66
 67<source><![CDATA[SELECT a.* FROM a JOIN b ON (a.id <> b.id)
 68]]></source>
 69
 70<p>Multiple Tables can be joined in the same query</p>
 71
 72<source><![CDATA[SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
 73]]></source>
 74
 75
 76
 77<source><![CDATA[
 78]]></source>
 79 
 80</section>
 81
 82<section name="Join implementation with Map Reduce" href="join_map_reduce">
 83
 84<p>Hive converts joins over multiple tables into a single map/reduce job if for every table the same column is used in the join clauses. The query below is
 85converted into a single map/reduce job as only key1 column for b is involved in the join.</p>
 86
 87<source><![CDATA[SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)]]></source>
 88<i>It is very interesting to note that any number of tables can be joined in single map/reduce process as long as they fit the above criteria.</i>
 89
 90<p>However if the join colums are not the same for all tables the is converted into multiple map/reduce jobs</p>
 91
 92<source><![CDATA[SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
 93]]></source>
 94
 95<p>In this case the first map/reduce job joins a with b and the results are then joined with c in the second map/reduce job. </p>
 96</section>
 97
 98<section name="Largest Table LAST" href="lagest_table_last">
 99
100<p>In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers where as the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in</p>
101
102<source><![CDATA[SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)]]></source>
103
104<p>all the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables a and b are buffered in the memory in the reducers. Then for each row retrieved from c, the join is computed with the buffered rows.</p>
105
106<p>For the query:</p>
107
108<source><![CDATA[SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)]]></source>
109
110<p>    * there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second of one of these jobs buffers the results of the first join while streaming the values of c through the reducers. </p>
111
112</section>
113
114<section name="Streamtable hint" href="stream_table_hint">
115
116<p>In every map/reduce stage of the join, the table to be streamed can be specified via a hint:</p>
117
118<source><![CDATA[SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val 
119FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)]]></source>
120
121<p>All the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables b and c are buffered in the memory in the reducers. Then for each row retrieved from a, the join is computed with the buffered rows.
122</p>
123
124</section>
125
126<section name="Outer Joins" href="outer_joins">
127
128<p>LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control over ON clauses for which there is no match. For example:</p>
129
130<source><![CDATA[SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
131]]></source>
132
133<p>The above query will return a row for every row in a. This output row will be a.val,b.val when there is a b.key that equals a.key, and the output row will be a.val,NULL when there is no corresponding b.key. Rows from b which have no corresponding a.key will be dropped. The syntax "FROM a LEFT OUTER JOIN b" must be written on one line in order to understand how it works--a is to the LEFT of b in this query, and so all rows from a are kept; a RIGHT OUTER JOIN will keep all rows from b, and a FULL OUTER JOIN will keep all rows from a and all rows from b. OUTER JOIN semantics should conform to standard SQL specs.
134</p>
135
136<p>Joins occur BEFORE WHERE CLAUSES. So, if you want to restrict the OUTPUT of a join, a requirement should be in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for this issue is partitioned tables</p>
137
138<source><![CDATA[SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
139  WHERE a.ds='2009-07-07' AND b.ds='2009-07-07']]></source>
140
141<p>will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other columns of a and b that are in the output of the join, and then filter them out. However, whenever a row from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL, including the ds column. This is to say, you will filter out all rows of join output for which there was no valid b.key, and thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing, use this syntax:</p>
142
143<source><![CDATA[SELECT a.val, b.val FROM a LEFT OUTER JOIN b
144  ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')]]></source>
145
146<p>Joins are NOT commutative! Joins are left-associative regardless of whether they are LEFT or RIGHT joins. </p>
147
148<source><![CDATA[SELECT a.val1, a.val2, b.val, c.val
149FROM a
150JOIN b ON (a.key = b.key)
151LEFT OUTER JOIN c ON (a.key = c.key)]]></source>
152
153<p>The above query first joins a on b, throwing away everything in a or b that does not have a corresponding key in the other table. The reduced table is then joined on c. This provides unintuitive results if there is a key that exists in both a and c, but not b: The whole row (including a.val1, a.val2, and a.key) is dropped in the "a JOIN b" step, so when the result of that is joined with c, any row with a c.key that had a corresponding a.key or b.key (but not both) will show up as NULL, NULL, NULL, c.val.</p>
154</section>
155
156<section name="Left Semi Join" href="left_semi_join">
157
158<p>LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.</p>
159
160<p>This type of query</p>
161<source><![CDATA[SELECT a.key, a.value
162FROM a 
163WHERE a.key in 
164(SELECT b.key
165FROM B);]]></source>
166
167<p>Can be written as:</p>
168
169<source><![CDATA[SELECT a.key, a.val
170FROM a LEFT SEMI JOIN b on (a.key = b.key)]]></source>
171
172</section>
173
174<section name="Map Side Join" href="map_side_join">
175
176<p>If all but one of the tables being joined are small, the join can be performed as a map only job. The query
177does not need a reducer. For every mapper a,b is read completely. A restriction is that a <b>FULL/RIGHT OUTER JOIN b</b> cannot be performed. </p>
178
179<source><![CDATA[SELECT /*+ MAPJOIN(b) */ a.key, a.value
180FROM a join b on a.key = b.key]]></source>
181
182</section>
183
184<section name="Bucketed Map Join" href="bucket_map_join">
185
186<p>If the tables being joined are bucketized, and the buckets are a multiple of each other, the buckets can be joined with each other. If table A has 8 buckets are table B has 4 buckets, the following join:</p>
187
188<source><![CDATA[SELECT /*+ MAPJOIN(b) */ a.key, a.value
189FROM a join b on a.key = b.key]]></source>
190
191<p>can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter </p>
192
193<i>set hive.optimize.bucketmapjoin = true</i>
194
195<p>If the tables being joined are sorted and bucketized, and the number of buckets are same, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets</p>
196
197<source><![CDATA[ SELECT /*+ MAPJOIN(b) */ a.key, a.value
198FROM A a join B b on a.key = b.key]]></source>
199
200<p>can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:</p>
201
202<source><![CDATA[set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
203set hive.optimize.bucketmapjoin = true;
204set hive.optimize.bucketmapjoin.sortedmerge = true;]]></source>
205
206</section>
207
208</body>
209
210
211
212</document>