Choosing the Right SQL Engine for Big Data: A Practical Guide
This article explores various SQL engines and storage options for big‑data workloads, compares their performance and capabilities, shows practical code examples, and offers guidance on writing efficient SQL in complex data environments.
A junior engineer was criticized by the product team for taking too long to write SQL, which prompted the author to publish a detailed wiki post on the subject.
Where to Write SQL?
The real question is which SQL engine to use.
Common engines include SparkSQL, Hive, Phoenix, Drill, Impala, Presto, Druid, and Kylin.
Hive: parses SQL and traditionally runs it as MapReduce jobs.
SparkSQL: parses SQL and runs it on Spark, which is usually faster than Hive on MapReduce.
Phoenix: an SQL layer on HBase that bypasses MapReduce.
Drill/Impala/Presto: interactive OLAP engines similar to Google Dremel.
Druid/Kylin: OLAP engines emphasizing pre‑computation.
Choosing an engine often takes a month of research. The answer depends on whether the requirement is real-time or batch, whether the data is incremental or static, how large the data is, and what response time is acceptable. Functionality, performance, stability, operational cost, and development difficulty also have to be weighed.
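As a rough illustration of how the engine families listed above might map to requirements, here is a minimal sketch. The `Requirement` fields and the selection rules are simplifying assumptions made for this example, not a recommendation from the original post; a real evaluation weighs many more factors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirement:
    """Hypothetical, heavily simplified description of a workload."""
    storage_is_hbase: bool = False
    needs_interactive_latency: bool = False
    queries_known_in_advance: bool = False


def candidate_engines(req: Requirement) -> List[str]:
    """Return a first shortlist of engines to research (assumed rules)."""
    if req.storage_is_hbase:
        # Phoenix is the SQL layer that sits directly on HBase.
        return ["Phoenix"]
    if req.needs_interactive_latency and req.queries_known_in_advance:
        # Pre-computation pays off when the query shapes are predictable.
        return ["Druid", "Kylin"]
    if req.needs_interactive_latency:
        # Ad hoc interactive queries without pre-built aggregates.
        return ["Drill", "Impala", "Presto"]
    # Batch workloads where latency matters less.
    return ["Hive", "SparkSQL"]


batch_shortlist = candidate_engines(Requirement())
adhoc_shortlist = candidate_engines(Requirement(needs_interactive_latency=True))
```

The function only narrows the list of engines worth researching; it does not replace the research itself.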
Which Data Storage to Run SQL On?
Most of the tools mentioned are query engines; you also need to consider storage.
Relational databases like MySQL tightly couple query engine and storage for performance optimization, while big‑data SQL engines are usually decoupled from storage to gain flexibility for large‑scale data.
Understanding which storage systems each engine supports is essential for efficient querying.
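The decoupling described above can be pictured with a toy example: one query function that only sees an iterator of rows, and two interchangeable storage backends behind it. This is a sketch under stated assumptions, not how any of the engines above is implemented; the names `scan_memory`, `scan_csv`, and `count_where` are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]


def scan_memory(rows: Iterable[Row]) -> Iterator[Row]:
    """Storage backend 1: rows already held in memory."""
    yield from rows


def scan_csv(text: str) -> Iterator[Row]:
    """Storage backend 2: rows parsed from CSV text with a header line."""
    yield from csv.DictReader(io.StringIO(text))


def count_where(rows: Iterator[Row], predicate: Callable[[Row], bool]) -> int:
    """The 'engine': it consumes rows and knows nothing about the storage."""
    return sum(1 for row in rows if predicate(row))


def id_is_positive(row: Row) -> bool:
    return int(row["id"]) > 0


# The same query logic runs unchanged against both backends.
memory_count = count_where(
    scan_memory([{"id": "0"}, {"id": "1"}, {"id": "2"}]), id_is_positive
)
csv_count = count_where(scan_csv("id\n0\n1\n2\n"), id_is_positive)
```

The flexibility comes at a cost: because the engine cannot assume anything about the layout of the data, it also cannot apply the storage-specific optimizations that a tightly coupled database can.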
Which Syntax to Write SQL In?
Not all engines support the same features: some lack JOIN, some return only approximate results for DISTINCT counts, and LIMIT-based pagination is not universally supported. Complex scenarios often require custom SQL functions implemented in code.
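To make the idea of a custom SQL function implemented in code concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in that runs anywhere; it is not one of the engines discussed here, and each of them has its own registration mechanism. The table `tbl_users` and the function `domain_of` are hypothetical.

```python
import sqlite3


def domain_of(email):
    """Return the lower-cased part after '@', or None for non-email values."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as domain_of(x).
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE tbl_users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO tbl_users VALUES (?, ?)",
    [(1, "a@Example.com"), (2, "b@example.com"), (3, "c@test.org")],
)

# Once registered, the custom function is used like any built-in one.
rows = conn.execute(
    "SELECT domain_of(email) AS d, COUNT(*) FROM tbl_users GROUP BY d ORDER BY d"
).fetchall()
conn.close()
```

The pattern is the same elsewhere: write the logic in a general-purpose language, register it under a name, then call that name from SQL.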
Examples of engine-specific syntax (these are written in Hive's dialect):
select `user`["user_id"] from tbl_test;
insert overwrite table tbl_test select * from tbl_test where id > 0;
from tbl_test insert overwrite table tbl_test select * where id > 0;
How to Write SQL More Efficiently?
Understanding the execution engine (MapReduce, Spark, or other APIs), data flow, potential data skew, and resource usage (CPU, memory, I/O) is crucial for optimization.
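Data skew is easiest to see with a single hot key. The sketch below shows one common mitigation, two-stage aggregation with a random salt, in plain Python rather than on a real cluster. The function name `salted_count`, the salt count, and the sample data are assumptions chosen for illustration.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so no single group holds a whole hot key."""
    rng = random.Random(seed)
    # Stage 1: group by (key, salt). A hot key is spread over up to
    # num_salts groups, which separate workers could process in parallel.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and merge the much smaller partial counts.
    total = Counter()
    for (key, _salt), count in partial.items():
        total[key] += count
    return partial, total


# One key dominates the input, as in a typical skewed join or GROUP BY.
keys = ["hot"] * 1000 + ["a", "b", "c"]
partial, total = salted_count(keys)
largest_partial_group = max(partial.values())
```

Without the salt, one worker would receive all 1000 rows for the hot key. With it, the largest group in stage 1 is only a fraction of that, and the final totals are unchanged.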
Conclusion
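When a new product requirement arrives that the current system cannot meet, the four questions above (which engine, which storage, which syntax, and how to write it efficiently) have to be answered again, and the cycle of writing SQL continues.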
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
