MSQL+: A Plugin Toolkit for Similarity Search in Distributed Relational Database Systems (VLDB 2018 Demo)
MSQL+ is a plugin toolkit that embeds similarity search directly into distributed relational databases such as Tencent's TDSQL. It combines B+-tree indexes on generated signatures, a customizable DIST function, and several pivot-selection strategies to support scalable, SQL-standard approximate queries across sharded data.
VLDB 2018 Demo Paper
The VLDB 2018 conference (August 27-31, Rio de Janeiro, Brazil) accepted a demonstration paper jointly authored by Tencent's TDSQL team, Renmin University, and Wuhan University. The paper is titled "MSQL+: a Plugin Toolkit for Similarity Search under Metric Spaces in Distributed Relational Database Systems".
Motivation and Background
Similarity search is a fundamental operation in many database applications (text retrieval, spell checking, fingerprint authentication, face recognition, etc.). Existing similarity‑search methods suffer from three major drawbacks:
Separation from existing DBMS: Many approaches require building new systems or indexes (e.g., M‑Tree, D‑Index, kd‑tree) that are hard to integrate with current RDBMS.
Limited data‑space coverage: Different applications define similarity differently, making it difficult to build a universal model.
Unsuitable for big‑data scenarios: Most methods are designed for centralized systems and do not scale to distributed environments.
MSQL+ is designed to overcome these issues by providing a plugin that works directly inside an RDBMS, follows the SQL standard, and leverages the distributed database TDSQL.
Key Features of MSQL+
MSQL+ consists of two main modules:
Index Building: For each data object a comparable signature is generated and a B+-tree index is built on these signatures. Objects whose signatures fall within a similarity range become candidates for similarity queries.
Query Processing: Users issue a SELECT‑FROM‑WHERE statement that includes two constraints: a user‑defined similarity function DIST(r[A], q[A], θ) and a similarity‑range filter. The range filter quickly narrows candidates; the similarity function refines the result set.
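The filter-then-refine flow of the two modules can be sketched in Python. This is a minimal illustration, not the MSQL+ implementation: it assumes the signature of an object is its distance to its nearest pivot, uses a sorted list in place of a B+-tree, and uses edit distance as an example DIST function. The names `build_index` and `range_query` are hypothetical.

```python
import bisect

def edit_distance(a, b):
    """Levenshtein distance, a metric on strings (example stand-in for DIST)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(objects, pivots, dist):
    """Assumed scheme: assign each object to its nearest pivot and store
    (signature, object) pairs sorted by signature, where the signature is
    the distance to that pivot. The sorted list stands in for a B+-tree."""
    index = {i: [] for i in range(len(pivots))}
    for obj in objects:
        pid = min(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))
        index[pid].append((dist(obj, pivots[pid]), obj))
    for entries in index.values():
        entries.sort()
    return index

def range_query(index, pivots, dist, q, theta):
    """Return all objects within distance theta of q."""
    results = []
    for pid, entries in index.items():
        dq = dist(q, pivots[pid])
        keys = [sig for sig, _ in entries]
        # Range filter: by the triangle inequality, any object o with
        # dist(o, q) <= theta satisfies |dist(o, p) - dist(q, p)| <= theta.
        lo = bisect.bisect_left(keys, dq - theta)
        hi = bisect.bisect_right(keys, dq + theta)
        # Refinement: evaluate the real distance only on the candidates.
        for _, obj in entries[lo:hi]:
            if dist(obj, q) <= theta:
                results.append(obj)
    return sorted(results)

words = ["apple", "apply", "ample", "maple", "grape", "angle"]
pivots = ["apple", "grape"]
index = build_index(words, pivots, edit_distance)
matches = range_query(index, pivots, edit_distance, "appla", 1)
```

The range scan only prunes; the call to `dist` in the refinement step is what decides the final answer, which is why the filter is safe for any distance function that satisfies the triangle inequality.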
Compared with traditional methods, MSQL+ offers:
Implementation using existing RDBMS functionality and B+-tree indexes.
Support for arbitrary data spaces via a customizable DIST function.
Ability to run on both single‑node and distributed RDBMS (TDSQL), enabling parallel processing and higher throughput.
Pivot Selection Strategies
MSQL+ partitions the dataset into shards using pivots. Four pivot‑selection strategies are proposed:
Random: Randomly pick objects as pivots.
MaxVariance: Choose a set of objects with maximal variance.
MaxProb: Select pivots that minimize the expected number of candidates.
Heuristic: A k-means-like heuristic that keeps elements in each partition close to its pivot.
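Three of the four strategies can be sketched in a few lines of Python. The sketch below is an interpretation under stated assumptions, not the paper's algorithms: MaxVariance is read here as "rank objects by the variance of their distances to all other objects", and the heuristic is written as a k-medoids-style loop. MaxProb is omitted because it depends on a cost model that the text does not describe. All function names are hypothetical.

```python
import random
import statistics

def random_pivots(objects, k, seed=0):
    """Random: sample k distinct objects as pivots."""
    return random.Random(seed).sample(objects, k)

def max_variance_pivots(objects, k, dist):
    """MaxVariance (one possible reading): keep the k objects whose
    distances to the whole dataset have the largest variance."""
    def spread(o):
        return statistics.pvariance([dist(o, x) for x in objects])
    return sorted(objects, key=spread, reverse=True)[:k]

def heuristic_pivots(objects, k, dist, rounds=10, seed=0):
    """Heuristic (k-medoids-style assumption): alternate between assigning
    objects to their nearest pivot and moving each pivot to the member
    that minimizes the total distance within its partition."""
    pivots = random_pivots(objects, k, seed)
    for _ in range(rounds):
        groups = [[] for _ in pivots]
        for o in objects:
            nearest = min(range(k), key=lambda i: dist(o, pivots[i]))
            groups[nearest].append(o)
        updated = [
            min(g, key=lambda c: sum(dist(c, x) for x in g)) if g else p
            for g, p in zip(groups, pivots)
        ]
        if updated == pivots:  # converged
            break
        pivots = updated
    return pivots

def absdiff(a, b):
    return abs(a - b)

points = [1, 2, 3, 10, 11, 12]
chosen = heuristic_pivots(points, 2, absdiff)
```

On this toy dataset the heuristic settles on the middle element of each cluster, so every object ends up within distance 1 of its pivot, which keeps the signature ranges narrow.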
Distributed Architecture on TDSQL
MSQL+ is integrated into Tencent’s distributed database TDSQL, which provides strong consistency, high availability, horizontal scalability, and enhanced security. The architecture includes:
Routing Node: Performs load balancing.
ZooKeeper: Maintains metadata (including pivot information) and synchronizes it to all local executors.
Global Executor: Receives similarity-search requests, generates execution plans, dispatches the requests to local executors, and aggregates the results.
Local Executor: Executes the query on its assigned data shards, builds signatures, constructs B+-tree indexes, and performs the final similarity filtering.
This design enables parallel similarity search across multiple nodes, dramatically improving query efficiency for large‑scale data.
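The dispatch-and-aggregate pattern between the global and local executors can be illustrated with a small scatter-gather sketch. It is a simplification under stated assumptions: shards are in-memory lists, threads stand in for separate nodes, and each local executor scans its shard directly instead of using a signature index. The names `local_execute` and `global_execute` are hypothetical and do not correspond to TDSQL APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def local_execute(shard, q, theta, dist):
    """Hypothetical local executor: returns the objects in its own shard
    that lie within theta of q (a real one would filter by index first)."""
    return [o for o in shard if dist(o, q) <= theta]

def global_execute(shards, q, theta, dist):
    """Hypothetical global executor: sends the query to every shard in
    parallel, then merges the partial results into one sorted answer."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: local_execute(s, q, theta, dist), shards)
        return sorted(o for part in partials for o in part)

shards = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
answer = global_execute(shards, 6, 1, lambda a, b: abs(a - b))
```

Because each shard is searched independently, the merge step only needs to concatenate and order the partial results.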
Conclusion
MSQL+ is a plugin‑style approximate query tool built on top of an RDBMS (TDSQL). It provides a generic, easy‑to‑use, and efficient similarity‑search capability that supports diverse data spaces, adheres to the SQL standard, and benefits from TDSQL’s distributed execution, load balancing, and strong consistency.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.