Querying Apache Hudi Tables on Amazon S3 Using Redshift Spectrum
This article explains how to use Amazon Redshift Spectrum to directly query Apache Hudi (and Delta Lake) tables stored in Amazon S3, covering supported formats, required DDL statements, partition handling, and common troubleshooting tips.
Previously the Apache Hudi community received many requests for Amazon Redshift support; now Redshift Spectrum can query Apache Hudi and Delta Lake tables stored in an Amazon S3 data lake.
Redshift Spectrum queries data in S3 directly, without loading it into Redshift. It supports lake‑house architectures, open formats such as Parquet, ORC, JSON, and CSV, and complex nested types such as struct, array, and map.
The feature reads the latest snapshot of Hudi 0.5.2 Copy‑On‑Write (CoW) tables and can also read Delta Lake 0.5.0 tables via manifest files.
To query Hudi CoW data, create an external table in Redshift Spectrum. Hudi CoW tables are stored as Apache Parquet files in S3; column mapping is performed column‑by‑column.
The DDL for a Hudi table is similar to the DDL for other Parquet tables. The INPUTFORMAT should be set to org.apache.hudi.hadoop.HoodieParquetInputFormat, and the LOCATION must point to the table's base folder, which contains the .hoodie directory. If a SELECT fails with “No valid Hudi commit timeline found”, verify that the .hoodie folder exists at that location and contains a valid commit timeline.
Note: Apache Hudi format is only supported when using the AWS Glue Data Catalog; it does not work with the Apache Hive metastore as an external catalog.
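Because the table definitions live in the AWS Glue Data Catalog, Redshift needs an external schema that points at a Glue database before any Hudi table can be created or queried. A minimal sketch follows; the schema name spectrum_hudi, the Glue database name hudi_db, and the IAM role ARN are hypothetical placeholders, not values from the original article:

```sql
-- Hypothetical names: spectrum_hudi (Redshift external schema),
-- hudi_db (Glue Data Catalog database), and the IAM role ARN below.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_hudi
FROM DATA CATALOG
DATABASE 'hudi_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

The role is assumed to grant read access to the S3 location of the Hudi table and access to the Glue Data Catalog.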
Example DDL for a non‑partitioned table:
CREATE EXTERNAL TABLE tbl_name (columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://s3-bucket/prefix'
Example DDL for a partitioned table:
CREATE EXTERNAL TABLE tbl_name (columns)
PARTITIONED BY (pcolumn1 type, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://s3-bucket/prefix'
To add a partition to a Hudi table, use the ALTER TABLE ADD PARTITION command, where the LOCATION points to the S3 sub‑folder for that partition:
ALTER TABLE tbl_name
ADD IF NOT EXISTS PARTITION (pcolumn1=pvalue1, ...)
LOCATION 's3://s3-bucket/prefix/partition-path'
Apache Hudi was first integrated into Amazon EMR and later into services such as Amazon Athena and Amazon Redshift, bridging cloud‑native data lakes and data warehouses. Developers are encouraged to contribute to the Apache Hudi project on GitHub.
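As a worked illustration of the partitioned DDL and ADD PARTITION pattern described above, the following sketch fills in the placeholders. Every name in it is hypothetical: the external schema spectrum_hudi, the table trips, its columns, the partition column ds, and the bucket path s3://example-bucket/hudi/trips are assumptions for illustration only, and the dataset is assumed to be laid out with one sub‑folder per date.

```sql
-- Hypothetical partitioned Hudi CoW table; all names, columns, and paths are placeholders.
CREATE EXTERNAL TABLE spectrum_hudi.trips (
    trip_id  VARCHAR(64),
    rider    VARCHAR(128),
    fare     DOUBLE PRECISION,
    ts       BIGINT
)
PARTITIONED BY (ds VARCHAR(10))
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
-- Base folder of the Hudi table (the folder that contains .hoodie).
LOCATION 's3://example-bucket/hudi/trips';

-- Register one partition; LOCATION is the S3 sub-folder holding that partition's files.
ALTER TABLE spectrum_hudi.trips
ADD IF NOT EXISTS PARTITION (ds='2020-09-01')
LOCATION 's3://example-bucket/hudi/trips/2020-09-01';

-- Query the registered partition.
SELECT ds, COUNT(*) AS trip_count
FROM spectrum_hudi.trips
WHERE ds = '2020-09-01'
GROUP BY ds;
```

Partitions that have not been registered with ALTER TABLE ADD PARTITION are not visible to queries, so each new partition folder written by Hudi needs its own ADD PARTITION statement.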
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
