
MSQL+: A Plugin Toolkit for Similarity Search in Distributed Relational Database Systems (VLDB 2018 Demo)

MSQL+ is a plugin toolkit that embeds similarity search directly into distributed relational databases such as Tencent’s TDSQL. It builds B+‑tree indexes on generated signatures, accepts customizable DIST functions, and offers several pivot‑selection strategies, enabling scalable, SQL‑standard approximate queries across sharded data.

Tencent Cloud Developer

VLDB 2018 Demo Paper

The VLDB 2018 conference (August 27‑31, Rio de Janeiro, Brazil) accepted a demonstration paper jointly authored by Tencent’s TDSQL team, Renmin University, and Wuhan University. The paper is titled "MSQL+: a Plugin Toolkit for Similarity Search under Metric Spaces in Distributed Relational Database Systems".

Motivation and Background

Similarity search is a fundamental operation in many database applications (text retrieval, spell checking, fingerprint authentication, face recognition, etc.). Existing similarity‑search methods suffer from three major drawbacks:

Separation from existing DBMS: Many approaches require building new systems or indexes (e.g., M‑Tree, D‑Index, kd‑tree) that are hard to integrate with current RDBMS.

Limited data‑space coverage: Different applications define similarity differently, making it difficult to build a universal model.

Unsuitable for big‑data scenarios: Most methods are designed for centralized systems and do not scale to distributed environments.

MSQL+ is designed to overcome these issues by providing a plugin that works directly inside an RDBMS, follows the SQL standard, and leverages the distributed database TDSQL.

Key Features of MSQL+

MSQL+ consists of two main modules:

Index Building: For each data object a comparable signature is generated and a B+-tree index is built on these signatures. Objects whose signatures fall within a similarity range become candidates for similarity queries.

Query Processing: Users issue a SELECT‑FROM‑WHERE statement that includes two constraints: a user‑defined similarity function DIST(r[A], q[A], θ) and a similarity‑range filter. The range filter quickly narrows candidates; the similarity function refines the result set.
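The two modules can be illustrated with a small, self‑contained sketch. It is not the MSQL+ implementation: it uses SQLite as a stand‑in for TDSQL, a single hypothetical pivot, edit distance as the metric, and the distance to the pivot as the signature. The table name, column names, and the convention that DIST returns 1 when two objects are within θ are all assumptions made for illustration. The range filter relies on the triangle inequality: if d(r, q) ≤ θ, then d(r, pivot) must lie in [d(q, pivot) − θ, d(q, pivot) + θ].

```python
import sqlite3

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

PIVOT = "search"  # hypothetical pivot object (assumption)
WORDS = ["search", "serach", "searching", "march", "database", "seerch"]

def dist_within(r: str, q: str, theta: int) -> int:
    """Assumed DIST convention: 1 if r and q are within theta, else 0."""
    return 1 if edit_distance(r, q) <= theta else 0

conn = sqlite3.connect(":memory:")
conn.create_function("DIST", 3, dist_within)  # user-defined similarity function

# Index building: store a comparable signature per object and index it.
# SQLite indexes are B-trees, standing in here for the B+-tree index.
conn.execute("CREATE TABLE words (word TEXT, sig INTEGER)")
conn.execute("CREATE INDEX idx_sig ON words (sig)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [(w, edit_distance(w, PIVOT)) for w in WORDS],
)

def similarity_search(q: str, theta: int) -> list:
    """Range filter on the signature first, then refine with DIST."""
    q_sig = edit_distance(q, PIVOT)
    rows = conn.execute(
        "SELECT word FROM words "
        "WHERE sig BETWEEN ? AND ? "      # similarity-range filter
        "AND DIST(word, ?, ?) = 1 "       # exact refinement
        "ORDER BY word",
        (q_sig - theta, q_sig + theta, q, theta),
    )
    return [row[0] for row in rows]

if __name__ == "__main__":
    print(similarity_search("serch", 1))
```

The range filter can only admit false positives, never drop a true match, so the DIST check that follows is what guarantees the final result is correct.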

Compared with traditional methods, MSQL+ offers:

Implementation using existing RDBMS functionality and B+-tree indexes.

Support for arbitrary data spaces via a customizable DIST function.

Ability to run on both single‑node and distributed RDBMS (TDSQL), enabling parallel processing and higher throughput.

Pivot Selection Strategies

MSQL+ partitions the dataset into shards using pivots. Four pivot‑selection strategies are proposed:

Random: Randomly pick objects as pivots.

MaxVariance: Choose a set of objects with maximal variance.

MaxProb: Select pivots that minimize the expected number of candidates.

Heuristic: A k‑means‑like heuristic that keeps elements in each partition close to its pivot.
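As a rough illustration of two of these strategies, the sketch below implements Random selection and one possible reading of the k‑means‑like Heuristic. The article does not give the exact procedure, so the medoid update rule, the function names, and the parameters (rounds, seed) are assumptions; MaxVariance and MaxProb are not sketched because their precise definitions are not stated here.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def random_pivots(data: Sequence[T], k: int, seed: int = 0) -> List[T]:
    """Random strategy: pick k distinct objects as pivots."""
    return random.Random(seed).sample(list(data), k)

def assign(data: Sequence[T], pivots: Sequence[T],
           dist: Callable[[T, T], float]) -> List[List[T]]:
    """Put every object into the partition of its nearest pivot."""
    parts: List[List[T]] = [[] for _ in pivots]
    for obj in data:
        nearest = min(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))
        parts[nearest].append(obj)
    return parts

def heuristic_pivots(data: Sequence[T], k: int,
                     dist: Callable[[T, T], float],
                     rounds: int = 10, seed: int = 0) -> List[T]:
    """k-means-like heuristic (assumed variant): alternate between assigning
    objects to the nearest pivot and moving each pivot to the member that
    minimizes the total distance to the rest of its partition (the medoid).
    Only a distance function is needed, so it works in any metric space."""
    pivots = random_pivots(data, k, seed)
    for _ in range(rounds):
        parts = assign(data, pivots, dist)
        new_pivots = []
        for pivot, part in zip(pivots, parts):
            if not part:                     # keep a pivot with no members
                new_pivots.append(pivot)
                continue
            new_pivots.append(
                min(part, key=lambda c: sum(dist(c, o) for o in part))
            )
        if new_pivots == pivots:             # converged
            break
        pivots = new_pivots
    return pivots

if __name__ == "__main__":
    values = [1, 2, 3, 50, 51, 52]
    chosen = heuristic_pivots(values, 2, lambda a, b: abs(a - b))
    print(chosen, assign(values, chosen, lambda a, b: abs(a - b)))
```

Keeping objects close to their pivot narrows the range of signature values inside each partition, which is what lets the range filter discard more non‑matching objects.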

Distributed Architecture on TDSQL

MSQL+ is integrated into Tencent’s distributed database TDSQL, which provides strong consistency, high availability, horizontal scalability, and enhanced security. The architecture includes:

Routing Node: Performs load balancing.

ZooKeeper: Maintains metadata (including pivot information) and synchronizes it to all local executors.

Global Executor: Receives similarity‑search requests, generates execution plans, dispatches them to local executors, and aggregates the results.

Local Executor: Executes the query on its assigned data shards, builds signatures, constructs B+-tree indexes, and performs the final similarity filtering.
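The interaction between the global and local executors follows a scatter‑gather pattern, which the sketch below imitates in a single process. It is only an analogy: threads stand in for nodes, plain lists stand in for shards, the local search is a linear scan rather than an index lookup, and the function names are hypothetical rather than TDSQL or MSQL+ APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

def local_search(shard: Sequence[int], query: int, theta: int,
                 dist: Callable[[int, int], int]) -> List[int]:
    """Local executor (simplified): return this shard's matches.
    A real local executor would use its signature index; this scans."""
    return [obj for obj in shard if dist(obj, query) <= theta]

def global_search(shards: Sequence[Sequence[int]], query: int, theta: int,
                  dist: Callable[[int, int], int]) -> List[int]:
    """Global executor (simplified): dispatch the query to every shard
    in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(
            lambda shard: local_search(shard, query, theta, dist), shards
        )
        merged = [obj for part in partials for obj in part]
    return sorted(merged)

if __name__ == "__main__":
    shards = [[1, 5, 9], [10, 12], [30]]
    print(global_search(shards, 10, 2, lambda a, b: abs(a - b)))
```

Because each shard is searched independently, the merge step is a plain concatenation, and no coordination between local executors is needed while a query runs.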

This design runs similarity search in parallel across multiple nodes, which improves query efficiency on large‑scale data.

Conclusion

MSQL+ is a plugin‑style approximate query tool built on top of an RDBMS (TDSQL). It provides a generic, easy‑to‑use, and efficient similarity‑search capability that supports diverse data spaces, adheres to the SQL standard, and benefits from TDSQL’s distributed execution, load balancing, and strong consistency.
