Boost Full‑Text Search with Search SQL: Tokenization, CONTAINS, NEAR & FUZZY

This article explains how Search SQL enables easy full‑text search on Transwarp Search by using standard SQL syntax, covering tokenization, analyzer configuration, CONTAINS queries, and advanced NEAR and FUZZY operators to improve performance and query semantics.

StarRing Big Data Open Lab

Sep 15, 2017

Background

Some full‑text search products such as Elasticsearch and older versions of Transwarp Search only expose REST APIs, requiring users to master Lucene query syntax, which raises development and operational costs.

Search SQL Solution

Search SQL allows developers to perform full‑text search on Transwarp Search using familiar SQL, greatly reducing application development effort.

Implementation Steps

Text Tokenization – Specify an analyzer for the text column when creating a table; the analyzer tokenizes the text, builds an inverted index, and normalizes each word for searching.

Query Condition Tokenization – The query string is tokenized with the same analyzer, then matched against the inverted index using the CONTAINS syntax.
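The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Transwarp Search is implemented: the whitespace `analyze` function stands in for a real analyzer such as ik or mmseg, and the rule that every query token must be present is an assumption made for the sketch.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Step 1: map each normalized token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def contains(index, query):
    """Step 2: analyze the query the same way, then intersect posting sets.

    Requiring every query token to match is an assumption of this sketch.
    """
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Big data applications in finance",
    2: "Search SQL for full-text search",
    3: "Big data search",
}
index = build_index(docs)
print(contains(index, "big data"))
```

Because the same `analyze` function is applied to both the stored text and the query, a query written as "SEARCH" still matches documents that contain "search".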

Specifying Analyzer

When creating a table, you can assign an analyzer to a column using WITH (search‑only) or APPEND (search plus exact match). Example:

CREATE TABLE <tableName> (
  <id> STRING,
  <column> STRING <WITH|APPEND> ANALYZER 'ZH'|'EN' <ANALYZER_NAME>
) STORED AS ES
[WITH SHARD NUMBER <m>]
[REPLICATION <n>];

Key parts: WITH / APPEND keyword, language code 'ZH' (Chinese) or 'EN' (English), and analyzer name (e.g., ik, mmseg, standard, english).

CONTAINS Syntax

After tokenization, use CONTAINS to search:

CONTAINS([schema.]column, '<text_query>')

column : the tokenized column to query.

text_query : the search expression, enclosed in single quotes.

Example:

select * from news_analyze_zh where contains(content, '大数据');

NEAR Operator

The NEAR operator limits the distance between tokens, improving relevance.

CONTAINS(<column>, 'NEAR((token1, token2[, token3, ...]), slop[, in_order])')

Parameters:

token : words that must exist in the inverted index; they are not further tokenized.

slop : maximum number of intervening tokens allowed between tokens.

in_order (optional): boolean indicating whether tokens must appear in the given order (default false).

Example:

select * from news_analyze_zh where contains(content, 'near((大数据,应用),1,false)');
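To make the roles of slop and in_order concrete, here is a minimal Python sketch of a proximity check over an already tokenized document. It assumes that slop counts the unmatched tokens inside the smallest span covering all query tokens, which is one common interpretation; the article does not spell out how slop is applied with more than two tokens, so treat the `near` function and its sample data as hypothetical.

```python
from itertools import product


def near(doc_tokens, terms, slop, in_order=False):
    """Return True if all terms occur close together in doc_tokens.

    Assumption: slop is the number of other tokens allowed inside the
    span that covers one occurrence of every term.
    """
    positions = []
    for term in terms:
        hits = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not hits:
            return False  # a term that is not in the document can never match
        positions.append(hits)

    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # two terms cannot share one position
        if in_order and list(combo) != sorted(combo):
            continue  # occurrences must follow the order of the query terms
        gaps = max(combo) - min(combo) + 1 - len(combo)
        if gaps <= slop:
            return True
    return False


# Hypothetical tokenized document.
doc = ["big", "data", "platform", "applications"]
print(near(doc, ["data", "applications"], 1))
```

With one token ("platform") between "data" and "applications", the check passes for a slop of 1 and fails for a slop of 0. Reversing the terms still matches unless in_order is set to true.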

FUZZY Operator

The FUZZY operator enables fuzzy phrase search: it matches text that differs from the query phrase by no more than a given edit distance.

CONTAINS(<column>, 'fuzzy(phrase, fuzziness)')

Parameters:

phrase : tokenized phrase; all tokens must appear in the result.

fuzziness : maximum edit distance (Levenshtein distance) allowed.

Example:

select * from news_analyze_zh where contains(content, 'fuzzy(大数据应用,5)');
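The fuzziness parameter is expressed as a Levenshtein distance, so a short Python sketch of that metric may help when choosing a value. The article does not say whether the distance is measured per token or over the whole phrase, so the helper `within_fuzziness` below is a hypothetical illustration of the metric only.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


def within_fuzziness(candidate, target, fuzziness):
    """Hypothetical helper: is candidate within the allowed edit distance?"""
    return levenshtein(candidate, target) <= fuzziness


print(levenshtein("kitten", "sitting"))
```

A larger fuzziness admits more distant variants, so a small value is usually a safer starting point for short phrases.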

Advantages

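Better query performance : LIKE '%word%' generally cannot use an ordinary index and must scan every row, so its cost grows linearly with the amount of data, O(n). A tokenized search instead looks the term up in the inverted index, which is typically closer to O(log n) in the number of indexed terms, so it executes faster.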

Richer retrieval semantics : NEAR and FUZZY express proximity and approximate matching, which the plain substring matching of LIKE cannot.
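The performance point can be illustrated with a rough Python sketch that contrasts a row-by-row substring scan with a binary search over a sorted term dictionary. Real engines use more compact dictionary structures, and the functions `like_scan` and `term_lookup` are hypothetical names used only for this comparison.

```python
from bisect import bisect_left


def like_scan(rows, word):
    """LIKE '%word%' style: inspect every row, so work grows with row count."""
    return [i for i, text in enumerate(rows) if word in text]


def term_lookup(sorted_terms, postings, word):
    """Binary-search a sorted term dictionary, then read the posting list."""
    pos = bisect_left(sorted_terms, word)
    if pos < len(sorted_terms) and sorted_terms[pos] == word:
        return postings[word]
    return []


# Hypothetical sample rows, tokenized on whitespace (assumption).
rows = ["big data platform", "search sql", "data search"]
postings = {}
for row_id, text in enumerate(rows):
    for token in text.split():
        ids = postings.setdefault(token, [])
        if not ids or ids[-1] != row_id:
            ids.append(row_id)  # record each row at most once per token
sorted_terms = sorted(postings)

print(like_scan(rows, "data"), term_lookup(sorted_terms, postings, "data"))
```

Both approaches return the same rows for a whole word, but they are not interchangeable: the scan also matches the fragment "dat", while the term lookup only matches complete tokens produced by the analyzer.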

Future articles will demonstrate concrete results for these operators.
