Choosing the Right SQL Engine for Big Data: A Practical Guide

This article explores various SQL engines and storage options for big‑data workloads, compares their performance and capabilities, shows practical code examples, and offers guidance on writing efficient SQL in complex data environments.

Programmer DD

Nov 7, 2018

A junior engineer was scolded by the product team for taking too long to write SQL, which prompted the author to publish a detailed wiki post.

Where to Write SQL?

The real question is which SQL engine to use.

Common engines include SparkSQL, Hive, Phoenix, Drill, Impala, Presto, Druid, and Kylin.

Hive: parses SQL and, in its classic configuration, runs it as MapReduce jobs.

SparkSQL: parses SQL and runs it on Spark, which is typically faster than Hive on MapReduce.

Phoenix: an SQL layer on HBase that bypasses MapReduce.

Drill/Impala/Presto: interactive OLAP engines similar to Google Dremel.

Druid/Kylin: OLAP engines emphasizing pre‑computation.
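
To make the contrast between scanning at query time and pre-computation a little more concrete, here is a minimal Python sketch. It is not how Druid or Kylin are implemented; the rows, the dimension names, and the functions `scan_query`, `build_cube`, and `query_cube` are all made up for illustration. The idea it shows is only that aggregates can be built once per dimension combination so that later queries become lookups.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (city, device, revenue)
ROWS = [
    ("beijing", "ios", 10),
    ("beijing", "android", 5),
    ("shanghai", "ios", 7),
    ("shanghai", "ios", 3),
]
DIMENSIONS = ("city", "device")


def scan_query(rows, **filters):
    """Scan-at-query-time style: read every row for each query."""
    total = 0
    for city, device, revenue in rows:
        row = {"city": city, "device": device}
        if all(row[k] == v for k, v in filters.items()):
            total += revenue
    return total


def build_cube(rows):
    """Pre-computation style: sum revenue once for every subset of dimensions."""
    cube = defaultdict(int)
    for city, device, revenue in rows:
        row = {"city": city, "device": device}
        for n in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, n):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += revenue
    return dict(cube)


def query_cube(cube, **filters):
    """Answer an equality-filter query with a single dictionary lookup."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube.get(key, 0)


CUBE = build_cube(ROWS)
```

Both `scan_query(ROWS, city="shanghai")` and `query_cube(CUBE, city="shanghai")` return the same total; the difference is that the cube pays its cost up front and grows with the number of dimension combinations, which is the usual trade-off with pre-computation.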

Choosing an engine can take a month of research. The answer depends on whether the workload is real‑time or batch, whether the data is incremental or static, how much data there is, and what response time is acceptable. Functionality, performance, stability, operational cost, and development difficulty all have to be weighed as well.

Which Data Storage to Execute SQL?

Most of the tools mentioned are query engines; you also need to consider storage.

Relational databases like MySQL tightly couple query engine and storage for performance optimization, while big‑data SQL engines are usually decoupled from storage to gain flexibility for large‑scale data.

Understanding which storage systems each engine supports is essential for efficient querying.
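
The decoupling described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the two in-memory strings stand in for different storage systems, and `read_csv`, `read_jsonl`, and `count_by` are hypothetical names, not any engine's API. The point is that the "engine" function only sees rows, while small adapters deal with the storage format.

```python
import csv
import io
import json

# Two hypothetical "storage systems" holding the same logical table.
CSV_DATA = "id,city\n1,beijing\n2,shanghai\n3,beijing\n"
JSONL_DATA = (
    '{"id": 1, "city": "beijing"}\n'
    '{"id": 2, "city": "shanghai"}\n'
    '{"id": 3, "city": "beijing"}\n'
)


def read_csv(text):
    """Storage adapter: yield rows from CSV text as dicts."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "city": row["city"]}


def read_jsonl(text):
    """Storage adapter: yield rows from JSON-lines text as dicts."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def count_by(rows, column):
    """The 'engine': a GROUP BY count that does not know where rows came from."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts
```

Calling `count_by(read_csv(CSV_DATA), "city")` and `count_by(read_jsonl(JSONL_DATA), "city")` gives the same answer. The flexibility comes at a price: the engine cannot assume indexes or a layout tuned for it, which is why knowing what each engine can read efficiently matters.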

Which Syntax to Write SQL?

Not all engines support the same features: some lack JOIN, some return only approximate DISTINCT counts, and LIMIT‑based pagination is not universal. Complex scenarios often require custom SQL functions implemented in code.

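Examples (Hive‑style syntax):

-- Read one key from a map-typed column; backticks quote the column name `user`.
select `user`["user_id"] from tbl_test;

-- Rewrite a table in place, keeping only rows with id > 0.
insert overwrite table tbl_test select * from tbl_test where id > 0;

-- The same statement written in the FROM-first form.
from tbl_test insert overwrite table tbl_test select * where id > 0;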
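
The remark above that DISTINCT results may not be exact is easier to reason about with a toy estimator. The Python sketch below uses a k‑minimum‑values estimate, chosen here only because it is short; it is not claimed to be the algorithm any particular engine uses, and `approx_count_distinct` and its `k` parameter are hypothetical names. It keeps only the `k` smallest hash values, so memory stays bounded no matter how many rows are scanned.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) using MD5."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k=256):
    """Estimate the number of distinct values from the k smallest hashes."""
    heap = []        # negated hashes, so heap[0] is the largest one kept
    members = set()  # the same hashes, used to skip duplicates
    for value in values:
        h = _hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)
```

For inputs with fewer than `k` distinct values the result is exact; beyond that it is an estimate whose error shrinks as `k` grows. When a report needs exact numbers, check whether the engine's distinct count is exact or approximate before trusting it.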

How to Write SQL More Efficiently?

Understanding the execution engine (MapReduce, Spark, or other APIs), data flow, potential data skew, and resource usage (CPU, memory, I/O) is crucial for optimization.
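
Data skew in particular is easier to see with numbers. The Python sketch below is a simplified model, not MapReduce or Spark code: the key list, `SALT_BUCKETS`, and the functions `stage_one` and `stage_two` are assumptions made for illustration. It shows the common two‑stage "salting" idea, where a hot key is first split into several sub‑keys that can be aggregated in parallel and the partial results are merged afterwards.

```python
from collections import Counter

SALT_BUCKETS = 4  # assumed number of sub-keys per original key

# Hypothetical skewed input: one key dominates the data set.
KEYS = ["hot"] * 1000 + ["a", "b", "c"] * 10


def stage_one(keys, buckets=SALT_BUCKETS):
    """Partial counts per (key, salt); each pair is one unit of parallel work."""
    partial = Counter()
    for i, key in enumerate(keys):
        partial[(key, i % buckets)] += 1
    return partial


def stage_two(partial):
    """Drop the salt and merge the partial counts into final per-key counts."""
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return dict(final)
```

Without salting, the largest group a single worker must handle is all 1000 `"hot"` rows; after `stage_one` the largest group is a quarter of that, and `stage_two(stage_one(KEYS))` still yields the same final counts. The cost is an extra aggregation pass, so it only pays off when the skew is real.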

Conclusion

When new product demands arise that the current system cannot meet, the four questions above (which engine, which storage, which syntax, and how to write it efficiently) have to be answered again, and the cycle of writing SQL continues.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
