Optimization of Local Vector Retrieval: Filtering, Storage, and Sorting Strategies
This article presents a comprehensive study of local vector retrieval optimization, covering memory‑based filtering techniques, Redis‑backed vector storage designs, and various sorting algorithms—including radix and heap‑based approaches—to achieve lower latency and higher throughput for large‑scale ad recommendation systems.
1 Background
The previous article introduced accelerated vector similarity computation; a complete vector retrieval pipeline also requires result sorting and optional pre‑filtering of vectors in the database.
Local vector retrieval still has optimization potential in both filtering and sorting, which this article explores.
2 Filtering Optimization
Although vectors are loaded into memory and SIMD‑optimized for fast computation, further reduction of the candidate set is possible.
Example: a user vector originally compares against 1,000 vectors; attribute‑based filtering can halve this to 500, speeding up retrieval.
2.1 Vector Filtering
Ad vectors are associated with metadata (model version, creative type, platform, template, media). Filtering can be performed on these attributes before similarity calculation.
Memory Object: store ad info as objects with vector attribute; iterate and filter.
Memory Bitmap: generate a bitmap for each attribute value; combine bitmaps via set operations to obtain matching ads.
Memory Inverted Index: two‑level maps—first map indexes by combined attributes to ad ID lists; second map stores vectors by model version and ad ID.
Performance tests rank the three approaches as follows — QPS: Inverted Index > Bitmap > Object; CPU usage: Bitmap > Object > Inverted Index; latency: Object > Bitmap > Inverted Index. The inverted index wins on every metric.
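The bitmap approach can be sketched as a bitset per attribute value combined with bitwise AND; the type name adBitmap and the attribute values below are illustrative assumptions, not the production schema:

```go
package main

import (
	"fmt"
	"math/bits"
)

// adBitmap is a fixed-capacity bitset over ad slots (illustrative type).
type adBitmap []uint64

func newAdBitmap(capacity int) adBitmap { return make(adBitmap, (capacity+63)/64) }

func (b adBitmap) set(id int) { b[id/64] |= 1 << (id % 64) }

// and intersects two bitmaps, keeping only ads that match both attributes.
func (b adBitmap) and(other adBitmap) adBitmap {
	out := make(adBitmap, len(b))
	for i := range b {
		out[i] = b[i] & other[i]
	}
	return out
}

// ids expands the bitmap back into ad IDs by walking the set bits.
func (b adBitmap) ids() []int {
	var ids []int
	for i, word := range b {
		for word != 0 {
			ids = append(ids, i*64+bits.TrailingZeros64(word))
			word &= word - 1 // clear the lowest set bit
		}
	}
	return ids
}

func main() {
	// One bitmap per attribute value: ads 0,1,2 are video; ads 1,2,3 run on iOS.
	video, ios := newAdBitmap(128), newAdBitmap(128)
	for _, id := range []int{0, 1, 2} {
		video.set(id)
	}
	for _, id := range []int{1, 2, 3} {
		ios.set(id)
	}
	fmt.Println(video.and(ios).ids()) // [1 2]
}
```

The inverted-index variant replaces the bitmaps with a map from the combined attribute key to an ad-ID list, trading set operations for a single lookup, which matches its lower CPU usage in the tests above.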
2.2 Vector Storage
Two‑stage process: offline loading into Redis and online loading from Redis to memory.
Offline stage offers two schemes:
Scheme 1: a single Redis hash where each field stores a compressed vector list (zip + base64); an ad appearing under multiple attribute combinations has its vector duplicated across fields.
Scheme 2: one hash for attribute‑to‑ad‑ID mapping and separate keys for each ad’s vector; avoids large keys and balances load.
Benchmark with 100k ads using Scheme 2 consumes ~270 MiB for vectors and 3 MiB for the inverted index; with four online versions the total memory is about 1 GiB.
Overall comparison:
Scheme 1 – fewer Redis calls but complex parsing, higher memory, hard to incrementally update.
Scheme 2 – simpler parsing, lower memory, easy incremental updates, at the cost of more Redis calls.
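Scheme 2 stores each ad's vector under its own key as a raw value. A minimal sketch of the per-ad serialization and key naming — the "vec:{version}:{adID}" pattern is our illustrative convention, not the team's actual schema:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeVector packs a float32 vector into little-endian bytes suitable for
// a Redis string value; one small key per ad avoids the big-hash hot spot
// of Scheme 1 and makes incremental updates a single SET.
func encodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// decodeVector is the online-loading inverse of encodeVector.
func decodeVector(buf []byte) []float32 {
	v := make([]float32, len(buf)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(buf[4*i:]))
	}
	return v
}

// vectorKey builds the per-ad key, namespaced by model version so several
// online versions can coexist (illustrative naming).
func vectorKey(version string, adID int64) string {
	return fmt.Sprintf("vec:%s:%d", version, adID)
}

func main() {
	v := []float32{0.1, -0.2, 0.3}
	fmt.Println(vectorKey("v20240101", 42)) // vec:v20240101:42
	fmt.Println(decodeVector(encodeVector(v)))
}
```

At 4 bytes per dimension this layout is consistent with the benchmark above: ~270 MiB for 100k ads implies vectors of a few hundred dimensions each.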
3 Sorting Optimization
After filtering and similarity calculation, extracting the Top‑K results can be a bottleneck.
3.1 Full‑list Sorting
Go's standard sort (an introsort-style hybrid of quicksort, heapsort, and insertion sort) is slower than the SIMD similarity computation itself. Alternative algorithms evaluated:
Heap sort (common Top‑K method)
Floating‑point radix sort (non‑comparison)
Parallel floating‑point radix sort (divide‑and‑conquer)
Radix sort converts floats to binary, buckets by segments, and handles sign bits to place negatives correctly.
Theoretical complexity O(d·(n+2^(32/d))+n), where d is the number of segments, guides segment-count selection; experiments show fewer segments perform better on larger datasets, while 4-segment radix sort excels below 50k items.
Parallel radix sort outperforms the others, achieving up to 11× speed‑up over Golang’s default sort, though for very small data the overhead can make it slower.
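The sign-bit handling mentioned above is the usual IEEE-754 key transform: flip all bits of a negative float (its raw bit pattern sorts in reverse), and set the sign bit of a non-negative one so it lands above every negative. A minimal sketch, with the function name sortableKey being ours:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sortableKey maps a float32 to a uint32 whose unsigned order matches the
// float order, which is what lets a radix sort bucket floats correctly.
func sortableKey(f float32) uint32 {
	u := math.Float32bits(f)
	if u&0x80000000 != 0 {
		return ^u // negative: flip every bit to reverse the order
	}
	return u | 0x80000000 // non-negative: move above all negatives
}

func main() {
	vals := []float32{0.5, -1.5, 0, 2.25, -0.25}
	// Sorting by the transformed keys reproduces the float order; a radix
	// sort would bucket these keys segment by segment instead of comparing.
	sort.Slice(vals, func(i, j int) bool {
		return sortableKey(vals[i]) < sortableKey(vals[j])
	})
	fmt.Println(vals) // [-1.5 -0.25 0 0.5 2.25]
}
```

Because the transform is branch-light and per-element, it parallelizes trivially, which is one reason the divide-and-conquer variant scales well.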
3.2 Partial (Local) Sorting
When only the Top‑K is needed, three local‑sorting strategies were tested:
Bubble sort Top‑K – O(n·k), fast for very small n.
Quick sort‑based Top‑K – O(n·log n), but performed poorly in benchmarks.
Heap sort Top‑K – O(n·log k), stable and fastest for medium‑to‑large n.
Benchmarks (k = 1000) indicate bubble sort is best under 20 k items, while heap sort dominates for larger volumes, outperforming both radix‑based full sorts for up to 100 k items.
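The heap-based Top-K keeps a min-heap of size k whose root is the smallest retained score, so each of the n elements costs at most one O(log k) fix-up. A sketch using Go's container/heap (the scoreHeap type is ours):

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// scoreHeap is a min-heap of similarity scores; the root is always the
// smallest of the current Top-K candidates, so it is evicted first.
type scoreHeap []float32

func (h scoreHeap) Len() int           { return len(h) }
func (h scoreHeap) Less(i, j int) bool { return h[i] < h[j] }
func (h scoreHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *scoreHeap) Push(x any)        { *h = append(*h, x.(float32)) }
func (h *scoreHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// topK returns the k largest scores in descending order in O(n·log k).
func topK(scores []float32, k int) []float32 {
	h := &scoreHeap{}
	for _, s := range scores {
		if h.Len() < k {
			heap.Push(h, s)
		} else if s > (*h)[0] {
			(*h)[0] = s    // replace the current minimum...
			heap.Fix(h, 0) // ...and restore the heap property
		}
	}
	out := append([]float32(nil), *h...)
	sort.Slice(out, func(i, j int) bool { return out[i] > out[j] })
	return out
}

func main() {
	fmt.Println(topK([]float32{0.3, 0.9, 0.1, 0.7, 0.5, 0.8}, 3)) // [0.9 0.8 0.7]
}
```

Most elements fail the `s > (*h)[0]` check and cost O(1), which is why this approach stays ahead of full sorts once n is large relative to k.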
4 Methodology
The practical experience can be abstracted into reusable methods:
Local vector retrieval is well suited to sub-million-scale datasets.
Custom SIMD programming can accelerate other math-heavy workloads.
Memory-based inverted indexes and bitmap filtering apply to generic data-filtering scenarios.
Floating-point radix sort and partial-sort algorithms can accelerate diverse sorting tasks.
5 Conclusion
After applying the described filtering, storage, and sorting optimizations, both recall and coarse‑ranking service latencies drop significantly, allowing the system to handle higher QPS and larger ad inventories while maintaining stable performance.
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.