
Understanding Bucket Sampling Queries in Hive

This article explains Hive's bucket sampling syntax, demonstrates how to use the TABLESAMPLE clause with various bucket parameters, provides concrete SQL examples, and clarifies the underlying hash‑based mechanism that determines which rows are returned.

Big Data Technology & Architecture

The article introduces the syntax and usage of bucket sampling queries in Hive, which take the form:

select * from bucket_table tablesample(bucket x out of y on bucket_column);

It then explains the parameters. x is the number of the bucket to start from, counted from 1. y sets the sampling granularity, and the article says it must be a factor or a multiple of the table's total bucket count z. Hive reads z/y buckets in total: the x-th bucket, then every y-th bucket after it (x, x+y, x+2y, ...). When y is larger than z, z/y is less than one, so only that fraction of a single bucket is returned.
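The parameter rule above can be sketched as a small Python helper. This is only an illustration of the arithmetic, not Hive code: the function name `sampled_buckets` is hypothetical, bucket indices are reported zero-based, and the factor-or-multiple check mirrors the article's rule rather than any error Hive itself is known to raise.

```python
from fractions import Fraction


def sampled_buckets(x: int, y: int, z: int):
    """Return (share of each selected bucket, zero-based bucket indices)
    for TABLESAMPLE(BUCKET x OUT OF y) on a table stored in z buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if z % y != 0 and y % z != 0:
        # Mirrors the article's rule; this is an assumption of the sketch.
        raise ValueError("y should be a factor or a multiple of z")
    if y <= z:
        # Whole buckets: the x-th, (x+y)-th, ... counted from 1, z/y in total.
        return Fraction(1), list(range(x - 1, z, y))
    # y > z: only a z/y share of one bucket is read.
    return Fraction(z, y), [(x - 1) % z]


if __name__ == "__main__":
    for x, y in [(1, 2), (1, 1), (2, 8)]:
        share, buckets = sampled_buckets(x, y, z=4)
        print(f"bucket {x} out of {y}: share {share} of buckets {buckets}")
```

For a four-bucket table, the three calls correspond to the three examples that follow: two whole buckets, all four buckets, and half of one bucket.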

Three concrete examples illustrate how the clause works:

Example 1 samples bucket 1 out of 2 on the id column:

select * from stu_buck2 tablesample(bucket 1 out of 2 on id);

With four buckets, this reads 4/2 = 2 buckets: the first and the third, i.e. buckets 0 and 2 in zero-based numbering.

Example 2 samples bucket 1 out of 1 on id:

select * from stu_buck2 tablesample(bucket 1 out of 1 on id);

This returns all four buckets (0, 1, 2, 3).

Example 3 samples bucket 2 out of 8 on id:

select * from stu_buck2 tablesample(bucket 2 out of 8 on id);

With four buckets, this reads 4/8 = 1/2 of a bucket: roughly half of the rows in the second bucket (bucket 1 in zero-based numbering).
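To see at the row level why the third example yields about half of one bucket, the following Python sketch simulates the selection. It assumes that for an INT column Hive's hash is the integer value itself and that the bucket is `(hash & Integer.MAX_VALUE) % n`; the ids 1 to 40 are hypothetical test data, not the article's.

```python
def bucket_of(id_value: int, n: int) -> int:
    # Assumption: Hive hashes an INT to its own value, then masks the sign bit.
    return (id_value & 0x7FFFFFFF) % n


ids = list(range(1, 41))  # hypothetical ids 1..40

# Rows stored in the second bucket file of a 4-bucket table (zero-based index 1).
second_bucket = [i for i in ids if bucket_of(i, 4) == 1]

# Rows matched by "bucket 2 out of 8": hash % 8 == 2 - 1.
sample = [i for i in ids if bucket_of(i, 8) == 2 - 1]

print(second_bucket)  # [1, 5, 9, 13, 17, 21, 25, 29, 33, 37]
print(sample)         # [1, 9, 17, 25, 33]
```

Every sampled id also appears in the second bucket, and the sample holds 5 of its 10 rows, which matches the "half of one bucket" reading.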

The article also provides the DDL used for testing:

-- Create the bucketed table
create table people (id int, name string)
clustered by (id)
sorted by (name desc) into 4 buckets
row format delimited fields terminated by '\t';

-- Create the temporary staging table
create table tmp (id int, name string)
row format delimited fields terminated by '\t';

-- Load the data file into the staging table
load data local inpath '/home/guigu/data.txt' into table tmp;

-- Load the data into the bucketed table
insert overwrite table people
select * from tmp;
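The staging table expects a tab-delimited file with an integer id and a string name on each line. As a minimal sketch, a larger test file in that layout could be generated with Python as shown here. The function name `write_rows`, the `name_<id>` values, the row count, and the output location in the system temp directory are all assumptions for illustration; the article does not show the contents of its data.txt.

```python
import csv
import os
import tempfile


def write_rows(path: str, n_rows: int) -> None:
    """Write n_rows lines of 'id<TAB>name' matching (id int, name string)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for i in range(1, n_rows + 1):
            writer.writerow([i, f"name_{i}"])  # hypothetical name values


# Hypothetical output location; point the LOAD DATA path at wherever this lands.
out_path = os.path.join(tempfile.gettempdir(), "data.txt")
write_rows(out_path, 10_000)
```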

When the author ran sampling queries against this bucketed table, the results did not match the expected proportions because the test data set was too small. Under the hood, Hive hashes the bucket column, computes hash % y to place each row in one of y partitions, and returns the rows that fall in partition x. Because x is counted from 1, these are the rows where hash % y equals x - 1. With only a handful of rows, the hash values are not spread evenly across the partitions, so the sample may not reflect the theoretical percentages.
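The effect of data volume can be illustrated with a short Python simulation of the hash % y rule. It assumes Hive hashes an INT column to its own value and masks the sign bit before taking the modulus. Both id lists are made up for the sketch, and the helper names are hypothetical.

```python
import random


def hive_int_bucket(value: int, n: int) -> int:
    # Assumption: INT hash is the value itself; bucket = (hash & MAX_INT) % n.
    return (value & 0x7FFFFFFF) % n


def tablesample(ids, x: int, y: int):
    """Rows that 'bucket x out of y' would keep: hash % y == x - 1."""
    return [i for i in ids if hive_int_bucket(i, y) == x - 1]


def sampled_share(ids, x: int, y: int) -> float:
    return len(tablesample(ids, x, y)) / len(ids)


small = [3, 7, 11, 12, 15, 19]      # hypothetical tiny data set
rng = random.Random(0)              # fixed seed so runs are repeatable
large = [rng.randrange(1_000_000) for _ in range(100_000)]

print(sampled_share(small, 1, 2))   # only id 12 is even, so 1 of 6 rows
print(sampled_share(large, 1, 2))   # close to the theoretical 0.5
```

With six rows, "bucket 1 out of 2" keeps a single row instead of the nominal half, while the larger set lands near 50%.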


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
