/tags/release-0.1-rc2/hive/external/docs/xdocs/language_manual/working_with_bucketed_tables.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<document>
<properties>
<title>Hadoop Hive - Working with Bucketed Tables</title>
<author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author>
</properties>
<body>
<section name="Defining Bucketed Tables" href="defining_bucketed_tables?">
<p>
This is a brief example of creating and populating bucketed tables. Bucketed tables
are fantastic in that they allow much more efficient sampling than do non-bucketed
tables, and they may later allow for time-saving operations such as map-side joins.
However, the bucketing specified at table creation is not enforced when the table
is written to, and so it is possible for the table's metadata to advertise
properties which are not upheld by the table's actual layout. This should obviously
be avoided. Here's how to do it right.
</p>
<p>First there's table creation:</p>
<source><![CDATA[CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;
]]></source>
<p>Notice that we define user_id as the bucketing column.</p>
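<p>This layout is what makes efficient sampling possible. For instance, once the table is
populated (see below), a query along these lines samples on the clustering column, so Hive
can read just the matching bucket file rather than scanning the whole partition; this is a
sketch, not part of the example workflow itself:</p>
<source><![CDATA[-- Sample roughly 1/256th of the partition. Because we sample ON user_id,
-- the column the table is clustered by, only one bucket file needs to be read.
SELECT *
FROM user_info_bucketed TABLESAMPLE(BUCKET 1 OUT OF 256 ON user_id)
WHERE ds='2009-02-25';
]]></source>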
</section>
<section name="Populating Bucketed Tables" href="populating_bucketed_tables?">
<source><![CDATA[set hive.enforce.bucketing = true;
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT user_id, firstname, lastname WHERE ds='2009-02-25';
]]></source>
<p>The command <strong>set hive.enforce.bucketing = true;</strong> lets Hive automatically
pick the correct number of reducers and the CLUSTER BY column based on the table
definition. Otherwise, you would need to set the number of reducers to match the number
of buckets yourself with <strong>set mapred.reduce.tasks = 256;</strong> and add a
<strong>CLUSTER BY ...</strong> clause to the SELECT, for example as sketched below.</p>
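<p>For reference, the manual version would look roughly like this (a sketch only; it
relies on you keeping the reducer count and the CLUSTER BY column in sync with the table
definition yourself):</p>
<source><![CDATA[set mapred.reduce.tasks = 256;
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT user_id, firstname, lastname WHERE ds='2009-02-25'
CLUSTER BY user_id;
]]></source>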
</section>
<section name="Bucketing Explained" href="bucketing_explained?">
<p>
How does Hive distribute the rows across the buckets? In general, the bucket number is
determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a
0x7FFFFFFF mask in there too, but that's not that important.) The hash_function depends on
the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if
user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0
to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other
datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the
BIGINT, and the hash of a string or a complex datatype will be some number derived from the
value, but not anything humanly recognizable. For example, if user_id were a STRING, then
the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based
on the hash will give you an even distribution in the buckets.
</p>
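<p>If you want to see the formula in action, Hive's built-in hash() UDF can be used to
compute the expected bucket for each row. The following is only a sanity-check sketch, and
it assumes that hash() agrees with the hash Hive applies internally when bucketing, which
is worth verifying on your version:</p>
<source><![CDATA[-- 2147483647 is 0x7FFFFFFF; this mirrors hash_function(bucketing_column) mod num_buckets
SELECT user_id, (hash(user_id) & 2147483647) % 256 AS expected_bucket
FROM user_info_bucketed
WHERE ds='2009-02-25'
LIMIT 10;
]]></source>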
</section>
- <section name="What can go wrong?" href="bucketing_gone_wrong?">
- <p>
- So, what can go wrong? As long as you
- <strong>set hive.enforce.bucketing = true</strong>, and use the syntax above,
- the tables should be populated properly. Things can go wrong if the bucketing
- column type is different during the insert and on read, or if you manually
- cluster by a value that's different from the table definition.
- </p>
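<p>A quick way to spot a badly bucketed partition is to count the files it contains: a
correctly written partition should have exactly as many files as the table has buckets.
Something like the following works from the Hive CLI (the path shown is the default
warehouse location and may differ on your installation):</p>
<source><![CDATA[-- Expect 256 files here if bucketing was enforced correctly
dfs -ls /user/hive/warehouse/user_info_bucketed/ds=2009-02-25;
]]></source>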
</section>
</body>
</document>