
Mastering Distributed Join Queries: MySQL Sharding-JDBC and Elasticsearch Strategies

This article explores the challenges of distributed join queries, detailing MySQL sharding‑jdbc join implementation, routing strategies, and code examples, then examines Elasticsearch‑SQL join capabilities, various join algorithms, and practical considerations for using nested types, offering insights for optimizing performance in distributed data environments.

JD Cloud Developers

Jun 15, 2023

Compared with queries against a single database, distributed queries pose many technical challenges. This article walks through how Join queries are implemented in MySQL with sharding‑jdbc and in Elasticsearch, in order to understand how data processing is designed for distributed scenarios.

1. MySQL Sharding‑JDBC Join Query Scenario

In a sharded setup, the key questions are how query statements are dispatched to the shards and how the returned data is organized. Compared with NoSQL databases, MySQL adapts to distributed scenarios more easily because it stays within the SQL standard.

The sharding‑jdbc middleware serves here as the example for understanding the overall design.

sharding‑jdbc

sharding‑jdbc wraps the original datasource and implements the JDBC specification. It handles SQL dispatch and result assembly itself, so sharding stays transparent to the application layer.

Execution flow: SQL parsing → executor optimization → SQL routing → SQL rewriting → SQL execution → result merging (io.shardingsphere.core.executor.ExecutorEngine#execute).

Parsing a Join statement determines which instance nodes the SQL is dispatched to; this corresponds to the SQL routing step.

SQL rewriting replaces the original (logical) table name with the actual sharded table name.
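The rewriting step can be pictured with a small sketch. This is only an illustration under simplifying assumptions: the class name `TableNameRewriteSketch` and its `rewrite` method are hypothetical, and the sketch substitutes identifiers with a regular expression, whereas sharding‑jdbc rewrites SQL based on the tokens produced by its parser. It also ignores string literals and quoted identifiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the sharding-jdbc implementation.
class TableNameRewriteSketch {

    // Matches identifier-like tokens such as "order", "order_item" or "order_id".
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Replaces each identifier that is a known logical table name with its actual table name. */
    static String rewrite(String logicalSql, Map<String, String> logicalToActual) {
        Matcher matcher = IDENTIFIER.matcher(logicalSql);
        StringBuffer rewritten = new StringBuffer();
        while (matcher.find()) {
            String token = matcher.group();
            // Tokens that are not logical table names (keywords, aliases, columns) stay unchanged.
            String actual = logicalToActual.getOrDefault(token, token);
            matcher.appendReplacement(rewritten, Matcher.quoteReplacement(actual));
        }
        matcher.appendTail(rewritten);
        return rewritten.toString();
    }

    public static void main(String[] args) {
        Map<String, String> route = new LinkedHashMap<>();
        route.put("order", "order_1");
        route.put("order_item", "order_item_0");
        String sql = "select * from order o inner join order_item oi on o.order_id = oi.order_id";
        System.out.println(rewrite(sql, route));
    }
}
```

Matching whole identifiers matters here: the column `order_id` must not be touched when the logical table `order` is replaced.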

In complex cases, the maximum number of Join query dispatches = number of database instances × number of shards of table A × number of shards of table B.
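As a worked example of this formula, the sketch below enumerates every combination for an assumed layout of two database instances with two shards per table. The class and method names are hypothetical, and the layout is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the worst-case (Cartesian) dispatch count.
class CartesianRouteSketch {

    /** Upper bound from the formula: instances x shards of A x shards of B. */
    static int maxDispatches(int instances, int shardsOfA, int shardsOfB) {
        return instances * shardsOfA * shardsOfB;
    }

    /** Enumerates every (instance, shard of A, shard of B) combination. */
    static List<String> enumerate(List<String> instances, String tableA, int shardsOfA,
                                  String tableB, int shardsOfB) {
        List<String> routes = new ArrayList<>();
        for (String instance : instances) {
            for (int a = 0; a < shardsOfA; a++) {
                for (int b = 0; b < shardsOfB; b++) {
                    routes.add(instance + " ::: " + tableA + "_" + a + " join " + tableB + "_" + b);
                }
            }
        }
        return routes;
    }

    public static void main(String[] args) {
        List<String> routes = enumerate(Arrays.asList("db0", "db1"), "order", 2, "order_item", 2);
        routes.forEach(System.out::println);
        System.out.println(routes.size() + " dispatches, upper bound " + maxDispatches(2, 2, 2));
    }
}
```

With 2 instances and 2 shards per table, the bound is 2 × 2 × 2 = 8 dispatched statements for a single logical Join.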

Code Insight

Demo code repository: cluoHeadon/sharding-jdbc-demo

/**
 * Entry point for executing a query; set a breakpoint here to step through the whole execution process.
 * @see ShardingPreparedStatement#execute()
 * @see ParsingSQLRouter#route(String, List, SQLStatement) where the actual tables involved in a Join query are matched against the routing rules.
 */
public boolean execute() throws SQLException {
    try {
        // Determine shards based on parameters and specific SQL, match related actual tables.
        Collection<PreparedStatementUnit> preparedStatementUnits = route();
        // Use thread pool to dispatch execution and merge results.
        return new PreparedStatementExecutor(getConnection().getShardingContext().getExecutorEngine(), routeResult.getSqlStatement().getType(), preparedStatementUnits).execute();
    } finally {
        JDBCShardingRefreshHandler.build(routeResult, connection).execute();
        clearBatch();
    }
}
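The "dispatch on a thread pool, then merge" pattern mentioned in the comments above can be sketched in plain Java. This is a simplified stand‑in, not the sharding‑jdbc executor: the class `ScatterGatherSketch` is hypothetical, each shard is simulated by an in‑memory list instead of a JDBC statement, and merging is reduced to concatenating and sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative scatter-gather sketch; shards are simulated with in-memory lists.
class ScatterGatherSketch {

    /** Runs one task per shard on a thread pool, then merges and sorts all rows. */
    static List<Integer> queryAllShards(Map<String, List<Integer>> shards) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<Integer>>> tasks = new ArrayList<>();
            for (List<Integer> rows : shards.values()) {
                // Stand-in for executing the rewritten SQL against one actual table.
                tasks.add(() -> new ArrayList<>(rows));
            }
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            // Stand-in for result merging, e.g. restoring a global ORDER BY.
            Collections.sort(merged);
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while querying shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> shards = new LinkedHashMap<>();
        shards.put("db0.order_0", Arrays.asList(3, 1));
        shards.put("db1.order_1", Arrays.asList(2));
        System.out.println(queryAllShards(shards));
    }
}
```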

SQL Routing Strategy

Enable SQL printing to directly see the actual dispatched SQL.

sharding.jdbc.config.sharding.props.sql.show=true

sharding‑jdbc applies different routing strategies based on the SQL statement. For Join queries, two main strategies are relevant:

StandardRoutingEngine – binding‑tables mode

ComplexRoutingEngine – the most complex case, Cartesian product relationships

-- Example SQL without an explicit sharding condition
select * from order o inner join order_item oi on o.order_id = oi.order_id

-- Routing results (actual SQL after routing)
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id
-- ... (other routing results omitted for brevity)
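For contrast with the Cartesian result above, here is a minimal sketch of what binding‑tables routing amounts to, assuming `order` and `order_item` are sharded by the same rule on `order_id`, so that matching rows always live in tables with the same suffix. The class and method names are hypothetical; this is not the StandardRoutingEngine code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: binding tables pair shards by suffix instead of crossing them.
class BindingTableRouteSketch {

    /** Pairs shard i of table A only with shard i of table B on each instance. */
    static List<String> bindingRoutes(List<String> instances, String tableA, String tableB, int shards) {
        List<String> routes = new ArrayList<>();
        for (String instance : instances) {
            for (int suffix = 0; suffix < shards; suffix++) {
                routes.add(instance + " ::: " + tableA + "_" + suffix + " join " + tableB + "_" + suffix);
            }
        }
        return routes;
    }

    public static void main(String[] args) {
        // 2 instances x 2 shards: 4 statements instead of the 8 of the Cartesian case.
        bindingRoutes(Arrays.asList("db0", "db1"), "order", "order_item", 2)
                .forEach(System.out::println);
    }
}
```

Under that assumption the number of dispatched statements drops from instances × shards × shards to instances × shards.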

2. Elasticsearch Join Query Scenario

For NoSQL databases, a need for Join queries may indicate that the technology is being misused, but some scenarios cannot avoid it. Implementing Join for Elasticsearch looks much like what a SQL engine does internally.

The solution examined here is based on the elasticsearch‑sql component.

elasticsearch‑sql

elasticsearch‑sql is an Elasticsearch plugin that offers SQL‑like queries through an HTTP service; newer Elasticsearch versions ship with a built‑in SQL capability.

Because Elasticsearch has no native Join support, the plugin has to implement SQL Join itself, which means implementing Join algorithms at a lower level.

Code Insight

Source code repository: git@github.com:NLPchina/elasticsearch-sql.git

/**
 * Executes the ActionRequest and returns the REST response using the channel.
 * @see ElasticDefaultRestExecutor#execute
 * @see ESJoinQueryActionFactory#createJoinAction Join algorithm selection
 */
@Override
public void execute(Client client, Map<String, String> params, QueryAction queryAction, RestChannel channel) throws Exception {
    // sql parse
    SqlElasticRequestBuilder requestBuilder = queryAction.explain();

    // join query
    if (requestBuilder instanceof JoinRequestBuilder) {
        // Join algorithm selection, includes: HashJoinElasticExecutor, NestedLoopsElasticExecutor
        // If the join condition is equality (Condition.OPER.EQ), use HashJoinElasticExecutor
        ElasticJoinExecutor executor = ElasticJoinExecutor.createJoinExecutor(client, requestBuilder);
        executor.run();
        executor.sendResponse(channel);
    }
    // other query types ...
}

3. More Than Join

Join Algorithms

The three common Join algorithms are Nested Loop Join (NLJ), Hash Join, and Merge Join.

Before version 8.0.18, MySQL supported only NLJ and its variants (such as Block Nested Loop); since 8.0.18 it also supports Hash Join.

NLJ works as two nested loops: the outer loop iterates rows of the first table, the inner loop iterates rows of the second table, comparing each pair and outputting matching rows.
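A minimal sketch of the two nested loops, with rows modelled as `String[]` and the join column given by index. The class name and row format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative nested loop join over in-memory rows.
class NestedLoopJoinSketch {

    /** Compares every outer row with every inner row and keeps the pairs whose keys are equal. */
    static List<String> join(List<String[]> outer, int outerKeyIndex,
                             List<String[]> inner, int innerKeyIndex) {
        List<String> result = new ArrayList<>();
        for (String[] outerRow : outer) {          // outer loop: first table
            for (String[] innerRow : inner) {      // inner loop: second table
                if (outerRow[outerKeyIndex].equals(innerRow[innerKeyIndex])) {
                    result.add(String.join(",", outerRow) + "|" + String.join(",", innerRow));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[]{"1", "alice"}, new String[]{"2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[]{"10", "1"}, new String[]{"11", "1"}, new String[]{"12", "3"});
        // Join users.id (column 0) with orders.user_id (column 1).
        join(users, 0, orders, 1).forEach(System.out::println);
    }
}
```

The cost is proportional to the product of the two row counts, which is why indexes on the inner table matter so much for NLJ.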

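Hash Join consists of two phases. In the build phase, an in‑memory hash table is built from the rows of one input (usually the smaller one), keyed by the join column. In the probe phase, each row of the other input is looked up in that hash table.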
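The build and probe phases can be sketched as follows for an equality join. Rows are modelled as `String[]`, the class name is hypothetical, and the sketch assumes the build side fits in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory hash join for an equality condition.
class HashJoinSketch {

    static List<String> join(List<String[]> build, int buildKeyIndex,
                             List<String[]> probe, int probeKeyIndex) {
        // Build phase: index the build-side rows by their join key.
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] row : build) {
            table.computeIfAbsent(row[buildKeyIndex], key -> new ArrayList<>()).add(row);
        }
        // Probe phase: look up each probe-side row in the hash table.
        List<String> result = new ArrayList<>();
        for (String[] row : probe) {
            List<String[]> matches = table.getOrDefault(row[probeKeyIndex], Collections.emptyList());
            for (String[] match : matches) {
                result.add(String.join(",", match) + "|" + String.join(",", row));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[]{"1", "alice"}, new String[]{"2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[]{"10", "1"}, new String[]{"11", "1"}, new String[]{"12", "3"});
        join(users, 0, orders, 1).forEach(System.out::println);
    }
}
```

Each input is scanned once, so the cost grows with the sum of the row counts rather than their product; the trade‑off is that the technique only applies to equality conditions and needs memory for the build side.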

You can use EXPLAIN to see which Join algorithm MySQL chooses. The relevant syntax is FORMAT=JSON or FORMAT=TREE.

EXPLAIN FORMAT=JSON
SELECT * FROM sale_line_info u
JOIN sale_line_manager o ON u.sale_line_code = o.sale_line_code;

{
    "query_block": {
        "select_id": 1,
        // Used join algorithm: nested_loop
        "nested_loop": [
            // Involved tables and keys, other info similar to normal EXPLAIN
            {"table": {"table_name": "o", "access_type": "ALL"}},
            {"table": {"table_name": "u", "access_type": "ref"}}
        ]
    }
}

Elasticsearch Nested Type

Depending on the business data and use cases, another option is to store the related information directly inside the parent document. Elasticsearch then queries and returns complete documents, which avoids Join altogether.

Whether the nested type is a good fit depends on several factors: whether the related data belongs exclusively to one document or is shared across many, how large the related data is, and how frequently it is updated.

Further usage details can be found online and in the official documentation, so we do not discuss them here. Our current business feature uses the nested type, which resolved a major difficulty in querying and query optimization.
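One reason the nested type matters: with the default object mapping, Elasticsearch flattens an array of objects into per‑field value lists, so the association between fields of the same object is lost, while the nested type indexes each object separately. The sketch below simulates that difference in plain Java. It does not call any Elasticsearch API, and the class name, field names (`sku`, `status`), and sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of flattened object arrays versus nested objects.
class NestedVersusFlattenedSketch {

    /** Flattened: each field becomes an independent value set, so values from different items can combine. */
    static boolean matchesFlattened(List<String[]> items, String sku, String status) {
        Set<String> skus = new HashSet<>();
        Set<String> statuses = new HashSet<>();
        for (String[] item : items) {
            skus.add(item[0]);
            statuses.add(item[1]);
        }
        return skus.contains(sku) && statuses.contains(status);
    }

    /** Nested: both conditions must hold on the same item. */
    static boolean matchesNested(List<String[]> items, String sku, String status) {
        for (String[] item : items) {
            if (item[0].equals(sku) && item[1].equals(status)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One order document with two items: {sku, status}.
        List<String[]> items = Arrays.asList(
                new String[]{"A", "shipped"}, new String[]{"B", "returned"});
        System.out.println("flattened: " + matchesFlattened(items, "A", "returned"));
        System.out.println("nested:    " + matchesNested(items, "A", "returned"));
    }
}
```

In this example, item A was never returned, yet the flattened check matches because "A" and "returned" both appear somewhere in the document; the nested check correctly rejects it.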

Summary

Analyzing how these components execute a query gives a clear picture of the whole processing flow.

With that picture in mind, middleware tuning and technology selection become more purposeful, and the tools are used with more care.

Clear filtering criteria, smaller filter ranges, and appropriate LIMIT values can all reduce computational cost and improve performance.

