Mastering Distributed Join Queries: MySQL Sharding-JDBC and Elasticsearch Strategies
This article explores the challenges of distributed join queries, detailing MySQL sharding‑jdbc join implementation, routing strategies, and code examples, then examines Elasticsearch‑SQL join capabilities, various join algorithms, and practical considerations for using nested types, offering insights for optimizing performance in distributed data environments.
Compared with single‑database queries, distributed queries pose many technical challenges. This article walks through how join queries are implemented with MySQL sharding‑jdbc and with Elasticsearch, in order to understand the design approaches used for data processing in distributed scenarios.
1. MySQL Sharding‑JDBC Join Query Scenario
In a sharded setup, the key questions are how query statements are dispatched and how the returned data is assembled. Compared with NoSQL databases, MySQL adapts more easily to distributed scenarios while staying within the SQL standard.
The overall design is examined here through the sharding‑jdbc middleware.
sharding‑jdbc
sharding‑jdbc wraps the original datasource and implements the JDBC specification, so dispatching SQL to the shards and assembling the results is transparent to the application layer.
Execution flow: SQL parsing → executor optimization → SQL routing → SQL rewriting → SQL execution → result merging (io.shardingsphere.core.executor.ExecutorEngine#execute).
The parsing of Join statements determines which instance nodes the SQL should be dispatched to, corresponding to SQL routing.
SQL rewriting replaces the original (logical) table name with the actual sharded table name.
In complex cases, the maximum number of Join query dispatches = number of database instances × number of shards of table A × number of shards of table B.
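The upper bound above can be made concrete with a small sketch. This is not sharding‑jdbc code: the class `CartesianRouteSketch` and its methods are hypothetical, and it assumes data sources named `db0..dbN` and actual tables named `<logical>_<index>`. The rewrite step is a toy string substitution; the real middleware rewrites the parsed statement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not sharding-jdbc code: worst-case (Cartesian) routing of a two-table join. */
final class CartesianRouteSketch {

    /** Toy rewrite: swap a logical table name for an actual one, matching whole words only. */
    static String rewrite(String sql, String logicalTable, String actualTable) {
        // \b keeps "order" from matching inside "order_item" or "order_id" ('_' is a word character).
        String wholeWord = "\\b" + Pattern.quote(logicalTable) + "\\b";
        return sql.replaceAll(wholeWord, Matcher.quoteReplacement(actualTable));
    }

    /** One rewritten statement per (data source, shard of A, shard of B) combination. */
    static List<String> route(String sql, int dbCount,
                              String tableA, int shardsA,
                              String tableB, int shardsB) {
        List<String> routed = new ArrayList<>();
        for (int db = 0; db < dbCount; db++) {
            for (int a = 0; a < shardsA; a++) {
                for (int b = 0; b < shardsB; b++) {
                    String actualSql = rewrite(sql, tableA, tableA + "_" + a);
                    actualSql = rewrite(actualSql, tableB, tableB + "_" + b);
                    routed.add("db" + db + " ::: " + actualSql);
                }
            }
        }
        return routed;
    }

    public static void main(String[] args) {
        String sql = "select * from order o inner join order_item oi on o.order_id = oi.order_id";
        List<String> routed = route(sql, 2, "order", 2, "order_item", 2);
        routed.forEach(System.out::println);
        // 2 data sources x 2 shards of order x 2 shards of order_item
        System.out.println("dispatches = " + routed.size());
    }
}
```

With two data sources and two shards per table, a single logical join already fans out into eight physical statements, which is why narrowing the route with a sharding key matters.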
Code Insight
Demo code repository: [email protected]:cluoHeadon/sharding-jdbc-demo.git
/**
 * Execute query SQL entry point, from here you can fully debug the execution process
 * @see ShardingPreparedStatement#execute()
 * @see ParsingSQLRouter#route(String, List, SQLStatement) Join query actual involved tables are matched in routing rules.
 */
public boolean execute() throws SQLException {
    try {
        // Determine shards based on parameters and specific SQL, match related actual tables.
        Collection<PreparedStatementUnit> preparedStatementUnits = route();
        // Use thread pool to dispatch execution and merge results.
        return new PreparedStatementExecutor(
                getConnection().getShardingContext().getExecutorEngine(),
                routeResult.getSqlStatement().getType(),
                preparedStatementUnits).execute();
    } finally {
        JDBCShardingRefreshHandler.build(routeResult, connection).execute();
        clearBatch();
    }
}
SQL Routing Strategy
Enable SQL printing to directly see the actual dispatched SQL.
sharding.jdbc.config.sharding.props.sql.show=true
sharding‑jdbc applies different routing strategies based on the SQL statement. For Join queries, two main strategies are relevant:
StandardRoutingEngine – binding‑tables mode
ComplexRoutingEngine – the most complex case, Cartesian product relationships
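The difference between the two strategies can be illustrated with a rough sketch of which actual tables get paired inside one data source. This is not the routing engines' real code: `RoutingModeSketch` and its methods are hypothetical names, and the sketch assumes that bound tables share the same sharding rule, so rows with the same key land in shards with the same index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not sharding-jdbc code: table pairings per data source for a two-table join. */
final class RoutingModeSketch {

    /** Binding tables share one sharding rule, so shard i of A only needs shard i of B. */
    static List<String> bindingPairs(String tableA, String tableB, int shards) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            pairs.add(tableA + "_" + i + " JOIN " + tableB + "_" + i);
        }
        return pairs;
    }

    /** Without a binding relationship, every shard of A is paired with every shard of B. */
    static List<String> cartesianPairs(String tableA, int shardsA, String tableB, int shardsB) {
        List<String> pairs = new ArrayList<>();
        for (int a = 0; a < shardsA; a++) {
            for (int b = 0; b < shardsB; b++) {
                pairs.add(tableA + "_" + a + " JOIN " + tableB + "_" + b);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println("binding:   " + bindingPairs("order", "order_item", 2));
        System.out.println("cartesian: " + cartesianPairs("order", 2, "order_item", 2));
    }
}
```

In this model the pairings per data source grow linearly with the shard count in binding mode, but with the product of the two shard counts in Cartesian mode.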
# Example SQL without clear sharding parameters
select * from order o inner join order_item oi on o.order_id = oi.order_id
-- Routing results (actual SQL after routing)
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id
-- ... (other routing results omitted for brevity)
2. Elasticsearch Join Query Scenario
For NoSQL databases, requiring Join queries may indicate a misuse of the technology, but some scenarios inevitably need this functionality. The Join implementation in Elasticsearch is closer to a SQL engine.
The solution examined here is based on the elasticsearch‑sql component.
elasticsearch‑sql
elasticsearch‑sql is an Elasticsearch plugin that offers SQL‑like queries through an HTTP service; recent Elasticsearch versions already include SQL query support.
Because Elasticsearch lacks native Join support, implementing SQL Join requires lower‑level functionality involving Join algorithms.
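As a rough illustration of what choosing a join algorithm means here: the plugin ships a hash‑join executor and a nested‑loops executor, and uses hash join when the join condition is an equality. The sketch below only models that one rule. The class, the enums, and the `choose` method are hypothetical and are not the plugin's API; the real factory may take more into account.

```java
import java.util.List;

/** Hypothetical sketch, not elasticsearch-sql code: pick a join algorithm from the join operators. */
final class JoinAlgorithmChoiceSketch {

    /** Stand-in for the operators of the ON conditions (not the plugin's Condition.OPER). */
    enum Operator { EQ, GT, LT }

    enum Algorithm { HASH_JOIN, NESTED_LOOPS }

    static Algorithm choose(List<Operator> joinOperators) {
        // A hash table can only answer "same key?" lookups, so every condition must be an equality.
        boolean equalityOnly = !joinOperators.isEmpty()
                && joinOperators.stream().allMatch(op -> op == Operator.EQ);
        return equalityOnly ? Algorithm.HASH_JOIN : Algorithm.NESTED_LOOPS;
    }

    public static void main(String[] args) {
        System.out.println(choose(List.of(Operator.EQ)));
        System.out.println(choose(List.of(Operator.EQ, Operator.GT)));
    }
}
```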
Code Insight
Source code repository: git@github.com:NLPchina/elasticsearch-sql.git
/**
 * Execute the ActionRequest and returns the REST response using the channel.
 * @see ElasticDefaultRestExecutor#execute
 * @see ESJoinQueryActionFactory#createJoinAction Join algorithm selection
 */
@Override
public void execute(Client client, Map<String, String> params, QueryAction queryAction, RestChannel channel) throws Exception {
    // sql parse
    SqlElasticRequestBuilder requestBuilder = queryAction.explain();
    // join query
    if (requestBuilder instanceof JoinRequestBuilder) {
        // Join algorithm selection, includes: HashJoinElasticExecutor, NestedLoopsElasticExecutor
        // If the join condition is equality (Condition.OPER.EQ), use HashJoinElasticExecutor
        ElasticJoinExecutor executor = ElasticJoinExecutor.createJoinExecutor(client, requestBuilder);
        executor.run();
        executor.sendResponse(channel);
    }
    // other query types ...
}
3. More Than Join
Join Algorithms
Three common Join algorithms: Nested Loop Join (NLJ), Hash Join, and Merge Join.
Before version 8.0.18, MySQL supported only NLJ and its variants; since 8.0.18 it also supports Hash Join.
NLJ works as two nested loops: the outer loop iterates rows of the first table, the inner loop iterates rows of the second table, comparing each pair and outputting matching rows.
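The two loops described above can be sketched in a few lines of Java. This is a minimal in‑memory illustration, not how MySQL implements it; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

/** Minimal in-memory sketch of a nested loop join (illustrative names only). */
final class NestedLoopJoinSketch {

    static <L, R> List<Map.Entry<L, R>> join(List<L> outer, List<R> inner,
                                             BiPredicate<L, R> condition) {
        List<Map.Entry<L, R>> matches = new ArrayList<>();
        for (L outerRow : outer) {          // outer loop: rows of the first table
            for (R innerRow : inner) {      // inner loop: rows of the second table
                if (condition.test(outerRow, innerRow)) {
                    matches.add(Map.entry(outerRow, innerRow));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Integer> orderIds = List.of(1, 2, 3);
        List<Integer> itemOrderIds = List.of(2, 3, 3, 4);
        System.out.println(join(orderIds, itemOrderIds, (o, i) -> o.equals(i)));
    }
}
```

Every pair of rows is compared, so the cost grows with the product of the two input sizes. In exchange, the condition can be any predicate, not only equality.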
Hash Join consists of two phases: build phase and probe phase.
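The two phases can be sketched as follows. Again this is a minimal in‑memory illustration with made‑up names, assuming the build side fits in memory; it is not the code of MySQL or of elasticsearch‑sql.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Minimal in-memory sketch of an equality hash join (illustrative names only). */
final class HashJoinSketch {

    static <L, R, K> List<Map.Entry<L, R>> join(List<L> buildSide, Function<L, K> buildKey,
                                                List<R> probeSide, Function<R, K> probeKey) {
        // Build phase: index the (ideally smaller) input by its join key.
        Map<K, List<L>> hashTable = new HashMap<>();
        for (L row : buildSide) {
            hashTable.computeIfAbsent(buildKey.apply(row), k -> new ArrayList<>()).add(row);
        }
        // Probe phase: look up each row of the other input by its join key.
        List<Map.Entry<L, R>> matches = new ArrayList<>();
        for (R row : probeSide) {
            List<L> candidates = hashTable.getOrDefault(probeKey.apply(row), Collections.emptyList());
            for (L candidate : candidates) {
                matches.add(Map.entry(candidate, row));
            }
        }
        return matches;
    }

    /** Demo helper: rows are "orderId:payload" strings, the join key is the part before ':'. */
    static String orderId(String row) {
        return row.substring(0, row.indexOf(':'));
    }

    public static void main(String[] args) {
        List<String> orders = List.of("1:order-A", "2:order-B");
        List<String> items = List.of("1:item-x", "1:item-y", "3:item-z");
        System.out.println(join(orders, HashJoinSketch::orderId, items, HashJoinSketch::orderId));
    }
}
```

Each input is scanned once, which is why hash join suits equality conditions on large inputs; it cannot evaluate range or inequality conditions through the hash table.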
You can use EXPLAIN to see which Join algorithm MySQL uses. Relevant syntax: FORMAT=JSON or FORMAT=TREE.
EXPLAIN FORMAT=JSON
SELECT * FROM sale_line_info u
JOIN sale_line_manager o ON u.sale_line_code = o.sale_line_code;
{
  "query_block": {
    "select_id": 1,
    // Used join algorithm: nested_loop
    "nested_loop": [
      // Involved tables and keys, other info similar to normal EXPLAIN
      {"table": {"table_name": "o", "access_type": "ALL"}},
      {"table": {"table_name": "u", "access_type": "ref"}}
    ]
  }
}
Elasticsearch Nested Type
Depending on the business data and use cases, another option is to store the related information directly inside the document. Elasticsearch then returns complete documents for query and retrieval, and Join techniques are avoided altogether.
Whether to use the nested type depends on several factors: whether the related data belongs to a single parent or is shared across parents, how large the related data is, and how often it is updated.
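One reason the nested type matters when related data is embedded: with the default object mapping, Elasticsearch flattens an array of objects into multi‑valued fields, so conditions on two fields can be satisfied by two different array elements. The nested type keeps each element independent. The sketch below models only that matching behaviour with plain Java collections. It does not call Elasticsearch, and the class, method, and field names (`sku`, `qty`) are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** In-memory model of flattened vs nested matching (illustrative only, no Elasticsearch calls). */
final class NestedTypeSketch {

    /** Flattened semantics: each condition may be satisfied by a different item. */
    static boolean matchesFlattened(List<Map<String, Object>> items, Map<String, Object> conditions) {
        return conditions.entrySet().stream().allMatch(condition ->
                items.stream().anyMatch(item ->
                        condition.getValue().equals(item.get(condition.getKey()))));
    }

    /** Nested semantics: all conditions must hold within one and the same item. */
    static boolean matchesNested(List<Map<String, Object>> items, Map<String, Object> conditions) {
        return items.stream().anyMatch(item ->
                conditions.entrySet().stream().allMatch(condition ->
                        condition.getValue().equals(item.get(condition.getKey()))));
    }

    public static void main(String[] args) {
        // One order document with two embedded line items.
        Map<String, Object> itemA = Map.of("sku", "A", "qty", 1);
        Map<String, Object> itemB = Map.of("sku", "B", "qty", 5);
        List<Map<String, Object>> items = List.of(itemA, itemB);

        // "Find orders containing sku A with qty 5": no single item satisfies both.
        Map<String, Object> conditions = Map.of("sku", "A", "qty", 5);
        System.out.println("flattened match: " + matchesFlattened(items, conditions));
        System.out.println("nested match:    " + matchesNested(items, conditions));
    }
}
```

In this model the flattened check reports a match even though no single line item has both values, while the nested check does not.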
Usage details are covered in the official documentation and are omitted here. Our current business feature uses the nested type, which resolved a major difficulty in querying and optimization.
Summary
Analyzing the execution principles gives a clear view of how each join query is actually processed.
That knowledge makes middleware tuning and technology selection more deliberate, and encourages more careful use of distributed joins.
Clear filtering criteria, smaller filter ranges, and appropriate LIMIT values can all reduce computational cost and improve performance.
JD Cloud Developers