Scalar‑Vector Hybrid Search in a Data Lake with One SQL on EMR Serverless Spark
EMR Serverless Spark now supports scalar‑vector hybrid search via DLF Global Index, allowing a single Spark SQL statement to perform vector similarity and scalar filtering together, eliminating data movement, reducing latency, and boosting performance for scenarios such as autonomous driving, e‑commerce, and knowledge‑base retrieval.
Background
Traditional large‑scale retrieval often requires two separate systems: a vector database for semantic similarity and a relational database for attribute filtering. This leads to data duplication, high latency, and consistency challenges.
Problem Statement
For example, autonomous‑driving engineers need to find historical frames that match both a weather condition (e.g., heavy rain) and a road type (e.g., urban) while also being semantically similar to a target scene. The conventional workflow takes two steps: first a top‑K vector query, then a relational filter. This means multiple data transfers between systems and an unpredictable result count, because the filter runs after the top‑K cut and may leave fewer than K rows, or none at all.
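To make the result-count problem concrete, here is a minimal Python sketch of the two-step workflow on a toy in-memory corpus. The frame data, the cosine helper, and the function names are all hypothetical; they only illustrate the effect and are not part of EMR Serverless Spark or any vector database.

```python
import math

# Toy corpus: (id, weather, road_type, embedding). All values are made up.
FRAMES = [
    (1, "heavy_rain", "urban",   [0.9, 0.1]),
    (2, "clear",      "urban",   [1.0, 0.0]),
    (3, "clear",      "highway", [0.95, 0.05]),
    (4, "heavy_rain", "urban",   [0.2, 0.8]),
    (5, "clear",      "urban",   [0.98, 0.02]),
    (6, "heavy_rain", "highway", [0.6, 0.4]),
]
QUERY = [1.0, 0.0]  # hypothetical embedding of the target scene


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def two_step(query, k):
    """Step 1: top-K by similarity. Step 2: scalar filter on the survivors."""
    ranked = sorted(FRAMES, key=lambda r: cosine(query, r[3]), reverse=True)
    candidates = ranked[:k]  # what the vector store would return
    return [r[0] for r in candidates
            if r[1] == "heavy_rain" and r[2] == "urban"]


for k in (3, 4, 6):
    print(k, two_step(QUERY, k))
```

In this toy data the three nearest frames are all clear-weather scenes, so K = 3 returns nothing after filtering and the caller has to guess a larger K and retry.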
Solution Architecture
EMR Serverless Spark integrates a scalar‑vector hybrid search capability using the DLF Global Index. The vector_search function is combined with standard WHERE clauses, enabling Spark to push down both vector and B‑tree index operations and execute them jointly.
The core components are:
Vector index built on a Paimon table column (e.g., embedding)
B‑tree scalar index on columns such as weather, road_type, etc.
AI functions (ai_embedding_multimodal, ai_query) that generate embeddings and extract scalar tags directly in SQL.
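As a rough mental model of how a scalar index and a vector ranking can cooperate, the Python sketch below narrows the candidates with an index lookup and then ranks only those candidates by distance. It is an illustration under simplifying assumptions: a dict of sets stands in for the B‑tree index, brute-force squared Euclidean distance stands in for the vector index, and the rows and function names are hypothetical. It does not describe how the DLF Global Index is implemented.

```python
from collections import defaultdict

# Hypothetical rows keyed by id.
ROWS = {
    1: {"weather": "heavy_rain", "road_type": "urban",   "embedding": [0.9, 0.1]},
    2: {"weather": "clear",      "road_type": "urban",   "embedding": [1.0, 0.0]},
    3: {"weather": "heavy_rain", "road_type": "highway", "embedding": [0.8, 0.2]},
    4: {"weather": "heavy_rain", "road_type": "urban",   "embedding": [0.1, 0.9]},
    5: {"weather": "heavy_rain", "road_type": "urban",   "embedding": [0.7, 0.3]},
}


def build_scalar_index(rows, columns):
    """Map (column, value) -> set of row ids; a stand-in for a B-tree index."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for col in columns:
            index[(col, row[col])].add(row_id)
    return index


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hybrid_search(rows, index, query, k, **filters):
    """Intersect the scalar index hits, then rank the survivors by distance."""
    candidate_ids = set(rows)
    for col, value in filters.items():
        candidate_ids &= index.get((col, value), set())
    ranked = sorted(
        candidate_ids,
        key=lambda i: (squared_l2(query, rows[i]["embedding"]), i),
    )
    return ranked[:k]


INDEX = build_scalar_index(ROWS, ["weather", "road_type"])
print(hybrid_search(ROWS, INDEX, [1.0, 0.0], 2,
                    weather="heavy_rain", road_type="urban"))
```

Because the scalar predicates are applied before the top-K cut, the closest row overall (id 2, clear weather) never competes for a slot, and the call returns K matching rows whenever at least K exist.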
Implementation Steps
1. Create a Paimon table with both indexes
CREATE TABLE ai_dataset.scene_vectors (
  id BIGINT,
  path STRING,
  weather STRING,
  road_type STRING,
  speed_range STRING,
  embedding ARRAY<FLOAT>
) USING paimon
TBLPROPERTIES (
  'row-tracking.enabled'='true',
  'data-evolution.enabled'='true',
  'morax.lumina-index.enabled'='true',
  'global-index.lumina.index-column'='embedding',
  'lumina.index.dimension'='1152',
  -- B-tree scalar index; same property as used for driving_scenes in step 2
  'global-index.btree.index-columns'='weather,road_type'
);
The table properties trigger automatic index construction; the indexes are stored in OSS and managed by DLF.
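Since the DDL declares a fixed 'lumina.index.dimension' of 1152, it seems prudent to check embeddings against that number before writing them. The following Python sketch shows one way to do so on the client side; the function name and the error handling are hypothetical, and whether the index itself rejects mismatched vectors is not stated in the source.

```python
import math

EXPECTED_DIM = 1152  # mirrors 'lumina.index.dimension' in the DDL


def check_embedding(vector, expected_dim=EXPECTED_DIM):
    """Raise ValueError if the vector has the wrong length or non-finite values."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    for position, value in enumerate(vector):
        if not math.isfinite(value):
            raise ValueError(f"non-finite value at position {position}")
    return True


print(check_embedding([0.0] * EXPECTED_DIM))
```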
2. Ingest data and generate embeddings
CREATE TABLE IF NOT EXISTS ad_dataset.driving_scenes (
  id BIGINT,
  path STRING,
  weather STRING,
  lighting STRING,
  road_type STRING,
  objects ARRAY<STRING>,
  risks ARRAY<STRING>,
  scene_tag STRING,
  sensor_type STRING,
  embedding ARRAY<FLOAT>
) USING paimon
TBLPROPERTIES (
  'row-tracking.enabled'='true',
  'data-evolution.enabled'='true',
  'morax.lumina-index.enabled'='true',
  'global-index.lumina.index-column'='embedding',
  'lumina.index.dimension'='1152',
  'global-index.btree.index-columns'='weather,road_type,lighting,objects,risks,scene_tag'
);
WITH raw AS (
  SELECT
    monotonically_increasing_id() AS id,
    path,
    ai_query('...prompt...', content) AS scene_json,
    ai_embedding_multimodal(content) AS embedding
  FROM read_files('oss://ad-team-raw/camera_front/2025-*/', suffix=>'jpg,png')
)
INSERT INTO ad_dataset.driving_scenes
SELECT
  id,
  path,
  get_json_object(scene_json, '$.weather') AS weather,
  get_json_object(scene_json, '$.lighting') AS lighting,
  get_json_object(scene_json, '$.road_type') AS road_type,
  from_json(get_json_object(scene_json, '$.objects'), 'ARRAY<STRING>') AS objects,
  from_json(get_json_object(scene_json, '$.risks'), 'ARRAY<STRING>') AS risks,
  'normal' AS scene_tag,
  'camera_front' AS sensor_type,
  embedding
FROM raw;
This single SQL pipeline reads OSS images, calls AI functions to produce tags and embeddings, and writes the enriched rows into the Paimon table, where both indexes are automatically maintained.
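The SELECT list implies that ai_query returns a JSON object with weather, lighting, road_type, objects, and risks keys. The prompt is elided in the source, so the exact response shape is an assumption. The Python sketch below shows that assumed shape and the same field extraction outside Spark; the sample payload and the function name are hypothetical.

```python
import json

# Hypothetical ai_query response; the real shape depends on the prompt.
SAMPLE_SCENE_JSON = """
{
  "weather": "heavy_rain",
  "lighting": "night",
  "road_type": "urban",
  "objects": ["car", "pedestrian"],
  "risks": ["low_visibility"]
}
"""


def scene_json_to_columns(scene_json):
    """Extract the scalar tag columns; missing keys become None."""
    doc = json.loads(scene_json)
    return {
        "weather": doc.get("weather"),
        "lighting": doc.get("lighting"),
        "road_type": doc.get("road_type"),
        "objects": doc.get("objects"),
        "risks": doc.get("risks"),
    }


print(scene_json_to_columns(SAMPLE_SCENE_JSON))
```

Running a check like this on a handful of model responses before a large ingest can reveal prompts that produce unexpected keys or value spellings (for example "heavy rain" versus "heavy_rain"), which would otherwise silently miss the equality filters used at query time.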
3. Perform hybrid retrieval
SELECT id, path, weather, road_type, lighting, objects, risks
FROM vector_search(
  'ad_dataset.driving_scenes',
  'embedding',
  array(0.12F, 0.34F, ...),
  10
)
WHERE weather = 'heavy_rain' AND road_type = 'urban';
The vector_search function returns the top‑K nearest vectors (K = 10 here); the WHERE clause applies scalar filters using the B‑tree index. Spark executes both paths in one pass, avoiding cross‑system data movement.
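In practice the query vector has as many elements as the index dimension, so the array literal is usually generated rather than typed. The Python sketch below builds the statement text from a list of floats, following the vector_search argument order shown in the query (table, column, vector, K). The helper names are hypothetical, and submitting the resulting string to Spark is left out.

```python
import math
import re

# Accept only plain, optionally dot-qualified identifiers.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def to_sql_float_array(vector):
    """Render floats as a Spark SQL literal such as array(0.12F, 0.34F)."""
    if len(vector) == 0:
        raise ValueError("embedding must not be empty")
    parts = []
    for x in vector:
        value = float(x)
        if not math.isfinite(value):
            raise ValueError("embedding contains NaN or infinity")
        parts.append(f"{value!r}F")
    return "array(" + ", ".join(parts) + ")"


def build_hybrid_query(table, column, vector, k):
    """Return the hybrid query text with the same filters as the article."""
    if not _IDENT.fullmatch(table) or not _IDENT.fullmatch(column):
        raise ValueError("unexpected characters in table or column name")
    return (
        "SELECT id, path, weather, road_type, lighting, objects, risks\n"
        f"FROM vector_search('{table}', '{column}', "
        f"{to_sql_float_array(vector)}, {int(k)})\n"
        "WHERE weather = 'heavy_rain' AND road_type = 'urban'"
    )


print(build_hybrid_query("ad_dataset.driving_scenes", "embedding", [0.12, 0.34], 10))
```

The identifier check is there because the table and column names are interpolated into string literals; the scalar filter values are kept fixed for the same reason.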
Use Cases
Autonomous driving: Retrieve corner‑case scenes (e.g., heavy rain + urban road) for model retraining.
Embodied AI: Find robot skill demonstrations that match a task description while satisfying hardware constraints.
E‑commerce: Search visually similar products while enforcing price, brand, and stock filters.
Content safety: Locate semantically similar prohibited content with additional metadata constraints.
Medical imaging: Retrieve similar scans that also match patient age, body part, and diagnosis criteria.
Advantages
Zero data movement: Vector and scalar data reside in the same Paimon table; queries run entirely inside the data lake.
SQL‑native expressiveness: One statement combines vector similarity (vector_search) and traditional filters, enabling downstream joins, window functions, and aggregations without leaving Spark.
Batch‑processing friendly: Spark’s native scalability handles billions of rows for embedding generation, index building, and large‑scale hybrid retrieval.
Serverless compute + managed index: Compute resources scale on demand; DLF automatically builds and updates the Global Index, eliminating operational overhead.
Performance FAQ
Q1: How fast is hybrid search? With DLF’s storage‑compute separation, the index is loaded on‑demand from OSS. For billion‑row tables, queries typically finish in seconds, offering orders‑of‑magnitude speedup over full‑table scans.
Q2: Which distance metrics are supported? Cosine similarity and Euclidean distance can be selected via the lumina.index.metric-type table property.
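For readers choosing between the two metrics, the plain-Python sketch below computes both using their standard definitions. It is independent of the index implementation, and the accepted values of the table property are not shown because the source does not list them.

```python
import math


def cosine_similarity(a, b):
    """Higher is more similar; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Lower is more similar; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same direction, different length: cosine sees them as identical,
# Euclidean distance does not.
print(cosine_similarity([1.0, 0.0], [10.0, 0.0]))
print(euclidean_distance([1.0, 0.0], [10.0, 0.0]))
```

For unit-length vectors the two metrics produce the same ranking, since the squared Euclidean distance equals 2 minus twice the cosine similarity. The choice mainly matters when the embeddings are not normalized.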
Q3: Does index building affect write throughput? No. Index construction runs asynchronously in the DLF background, decoupled from the Spark write job.
Conclusion
EMR Serverless Spark’s scalar‑vector hybrid search brings three key benefits to data lakes: simplified architecture (no external vector store), full SQL expressiveness for combined semantic and attribute queries, and native batch processing for massive datasets. By unifying vector similarity and scalar filtering in a single SQL statement, data engineers can build intelligent retrieval pipelines with minimal operational complexity.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
