Boost Full‑Text Search with Search SQL: Tokenization, CONTAINS, NEAR & FUZZY
This article explains how Search SQL enables easy full‑text search on Transwarp Search by using standard SQL syntax, covering tokenization, analyzer configuration, CONTAINS queries, and advanced NEAR and FUZZY operators to improve performance and query semantics.
Background
Some full‑text search products such as Elasticsearch and older versions of Transwarp Search only expose REST APIs, requiring users to master Lucene query syntax, which raises development and operational costs.
Search SQL Solution
Search SQL allows developers to perform full‑text search on Transwarp Search using familiar SQL, greatly reducing application development effort.
Implementation Steps
Text Tokenization – Specify an analyzer for the text column when creating a table; the analyzer splits the text into tokens, normalizes each token, and stores the results in an inverted index for searching.
Query Condition Tokenization – The query string is tokenized with the same analyzer, then matched against the inverted index using the CONTAINS syntax.
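The two steps above can be illustrated with a minimal Python sketch. It is not how Transwarp Search is implemented: the `analyze` function is a hypothetical stand-in for a real analyzer such as ik or standard, and the sketch assumes that every query token must match, which the source does not specify.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer (assumption): lowercase, then split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def contains(index, query):
    """Tokenize the query with the same analyzer and intersect the posting sets."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {
    1: "Big data applications in finance",
    2: "Big cities and small towns",
    3: "Data quality matters",
}
index = build_inverted_index(docs)
matches = contains(index, "big DATA")  # only document 1 holds both tokens
```

Because the query passes through the same analyzer as the stored text, "big DATA" is normalized to the tokens that were indexed, so case differences do not prevent a match.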
Specifying Analyzer
When creating a table, you can assign an analyzer to a column using WITH (search‑only) or APPEND (search plus exact match). Syntax:
CREATE TABLE <tableName> (
<id> STRING,
<column> STRING <WITH|APPEND> ANALYZER 'ZH'|'EN' <ANALYZER_NAME>
) STORED AS ES
[WITH SHARD NUMBER <m>]
[REPLICATION <n>];
Key parts: the WITH / APPEND keyword, the language code 'ZH' (Chinese) or 'EN' (English), and the analyzer name (e.g., ik, mmseg, standard, english).
CONTAINS Syntax
After tokenization, use CONTAINS to search:
CONTAINS([schema.]column, '<text_query>')
column : the tokenized column to query.
text_query : the search expression, enclosed in single quotes.
Example:
select * from news_analyze_zh where contains(content, '大数据');
NEAR Operator
The NEAR operator limits the distance between tokens, improving relevance.
CONTAINS(<column>, 'NEAR((token1, token2[,token3,...]), slop[, in_order])')
Parameters:
token : words that must exist in the inverted index; they are not further tokenized.
slop : maximum number of intervening tokens allowed between tokens.
in_order (optional): boolean indicating whether tokens must appear in the given order (default false).
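To make slop and in_order concrete, here is a small Python sketch of a proximity check over one tokenized document. It is an illustration only: the function name `near` and the rule for three or more tokens (counting all non-query tokens inside the matched span) are assumptions, since the source defines slop only informally.

```python
from itertools import product

def near(doc_tokens, terms, slop, in_order=False):
    """Return True if every term occurs and the matched span has at most
    `slop` intervening tokens (assumed semantics for 3+ terms)."""
    positions = [[i for i, tok in enumerate(doc_tokens) if tok == term]
                 for term in terms]
    if any(not p for p in positions):
        return False  # a term is missing from the document
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # one position cannot satisfy two terms
        if in_order and list(combo) != sorted(combo):
            continue  # terms must appear in the order given
        intervening = (max(combo) - min(combo) + 1) - len(combo)
        if intervening <= slop:
            return True
    return False

doc = ["big", "data", "drives", "modern", "applications"]
loose = near(doc, ["data", "applications"], 2)        # two tokens in between
tight = near(doc, ["data", "applications"], 1)        # slop too small
ordered = near(doc, ["applications", "data"], 2, in_order=True)
```

The brute-force search over position combinations is fine for a short example; a real engine would walk sorted position lists instead.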
Example:
select * from news_analyze_zh where contains(content, 'near((大数据,应用),1,false)');
FUZZY Operator
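The FUZZY operator enables fuzzy phrase search: it tolerates small differences between the query phrase and the indexed text, bounded by an edit distance.
CONTAINS(<column>, 'fuzzy(phrase, fuzziness)')
Parameters: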
phrase : tokenized phrase; all tokens must appear in the result.
fuzziness : maximum edit distance (Levenshtein distance) allowed.
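For readers unfamiliar with edit distance, the following Python sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. The helper `within_fuzziness` is hypothetical; the source does not say whether Transwarp Search applies the limit per token or to the whole phrase, so the sketch simply compares two strings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def within_fuzziness(candidate, phrase, fuzziness):
    """Hypothetical check: is the candidate within the allowed edit distance?"""
    return levenshtein(candidate, phrase) <= fuzziness

d_en = levenshtein("kitten", "sitting")
d_zh = levenshtein("大数据应用", "大数据的应用")  # one inserted character
```

A larger fuzziness value admits more distant variants, which raises recall at the cost of precision.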
Example:
select * from news_analyze_zh where contains(content, 'fuzzy(大数据应用,5)');
Advantages
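Better query performance : LIKE '%word%' must test every row for the substring, so its cost grows linearly, O(n), with the amount of data; a tokenized search instead looks the term up in the inverted index, which takes roughly O(log n) and therefore executes faster on large tables.
Richer retrieval semantics : NEAR and FUZZY express proximity and approximate matching, which the plain substring matching of LIKE cannot.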
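The performance contrast can be sketched in Python as a substring scan versus a binary search over a sorted term dictionary. This is only an analogy under simplifying assumptions: the names `like_scan` and `term_lookup` are hypothetical, tokens are split on whitespace, and real engines use more elaborate index structures.

```python
from bisect import bisect_left

def like_scan(rows, word):
    """Analogue of LIKE '%word%': substring test on every row, O(n) in rows."""
    return [i for i, text in enumerate(rows) if word in text]

def term_lookup(sorted_terms, postings, word):
    """Binary search in the sorted term dictionary, O(log n) in distinct terms."""
    pos = bisect_left(sorted_terms, word)
    if pos < len(sorted_terms) and sorted_terms[pos] == word:
        return postings[word]
    return []

rows = ["big data platform", "small data set", "bigger picture"]

# Build the postings once, at indexing time.
postings = {}
for i, text in enumerate(rows):
    for tok in text.split():
        ids = postings.setdefault(tok, [])
        if not ids or ids[-1] != i:
            ids.append(i)  # record each row at most once per token
sorted_terms = sorted(postings)

scan_hits = like_scan(rows, "big")                       # also matches "bigger"
index_hits = term_lookup(sorted_terms, postings, "big")  # whole-token match only
```

Besides the cost difference, the two results differ in meaning: the scan matches "big" inside "bigger", while the index lookup matches only the exact token.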
Future articles will demonstrate concrete results of these syntaxes.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]
