How Xianyu Scales Billion‑Item Geo Matching with Fast GeoHash Algorithms
This article explains how Xianyu uses GeoHash‑based spatial indexing, precise and approximate matching, and a rapid neighbor‑search algorithm to efficiently associate billions of GPS‑tagged items with tens of thousands of city‑level business districts, reducing computation from quadrillions to billions of operations.
Abstract
Xianyu divides cities into business districts (AOIs) based on traffic, mall, and residential distributions, then matches user‑posted GPS items to the district they fall into. With over 1 billion items and ~10 000 districts, a naïve point‑in‑polygon calculation would require astronomically many operations.
GeoHash Principle for Point Data
GeoHash encodes latitude and longitude into a string by repeatedly halving the latitude and longitude intervals and recording 0 for left/bottom and 1 for right/top. The longer the string, the finer the grid; each GeoHash cell represents a rectangular area covering all points sharing that prefix.
Polygon GeoHash Encoding
To encode a polygon, compute its minimum bounding rectangle (MBR), encode the southwest corner, then decode that cell to obtain the base GeoHash block. By iteratively moving east and north, adjacent blocks of the same size are generated until they completely cover the MBR. Blocks that do not intersect the polygon are discarded, reducing subsequent calculations.
Fast Neighbor GeoHash Algorithm
Instead of decoding then re‑encoding a neighbor cell, the algorithm uses the regularity of GeoHash characters and the Z‑order curve. Odd‑position characters map to "lon‑lat‑lon‑…" bits, even‑position characters to "lat‑lon‑…" bits. Pre‑computed tables give the eight neighboring characters for each GeoHash symbol; when a neighbor crosses a table boundary, the algorithm propagates the carry to higher‑order characters, effectively performing a fast lookup without full decode‑encode cycles.
Efficient Large‑Scale Point‑Polygon Relationship Building
Each item receives a unique GeoHash; each district receives one or more GeoHash strings (fully or partially contained). A two‑step join is performed: first, items are joined with districts on fully‑contained GeoHashes to establish definite matches; second, items are joined on partially‑contained GeoHashes, followed by precise geometric checks only for these candidates, dramatically reducing the Cartesian product size.
In practice, Xianyu processed 1 billion items and 10 000 districts with roughly 1.8 × 10¹² basic operations—far less than the 2 × 10²⁰ operations required by a naïve approach—completing the computation in under a day on Alibaba’s offline cluster.
Conclusion
The GeoHash‑based spatial index, combined with the fast neighbor lookup, enables scalable point‑in‑polygon matching for massive datasets and can be extended to other spatial relationships such as point‑point, line‑line, and polygon‑polygon queries.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
