
Efficient GeoHash-Based Point‑in‑Polygon Matching for Massive Datasets

By encoding billions of GPS points and ten thousand district polygons into GeoHash cells, using exact matches, approximate filtering, neighbor‑cell lookup tables, and a final precise geometry test, the authors cut the required operations from 2×10^20 to about 1.8×10^12, enabling full processing in under a day.

Xianyu Technology

The Xianyu app divides cities into business districts and needs to recommend items (GPS points) that fall inside each district (polygons). With over 1 billion items and ~10 000 districts, a naïve point‑in‑polygon computation would require on the order of 2×10^20 basic operations and is infeasible.

By combining exact GeoHash matching, approximate GeoHash filtering, and a small amount of precise geometry calculation, the authors reduced the workload dramatically: the same data can be processed within one day.

GeoHash encodes latitude and longitude into a base‑32 string by interleaving binary bits (even bits for longitude, odd bits for latitude). Longer strings represent smaller rectangles and higher precision.

Example: the coordinate (30.280245, 120.027162) is encoded step by step into the GeoHash string wtmk72. The process involves binary subdivision of the latitude and longitude intervals, interleaving the resulting bits, and converting each 5-bit group to a base-32 character.
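The interleaving procedure described above can be sketched in a few lines of Python. This is a minimal encoder for illustration, not the authors' production code:

```python
# Minimal GeoHash encoder: interleave longitude/latitude bits
# (longitude first), then map each 5-bit group to a base-32 character.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    use_lon = True                        # even bit positions hold longitude
    while len(bits) < precision * 5:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid                  # keep the upper half-interval
        else:
            bits.append(0)
            rng[1] = mid                  # keep the lower half-interval
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):      # 5 bits -> one base-32 character
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

print(geohash_encode(30.280245, 120.027162))  # -> wtmk72
```

Running the encoder on the article's example coordinate reproduces wtmk72, matching the manual calculation.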

The result can be validated on geohash.org, where the first characters match the manual calculation.

To handle polygons, the algorithm first finds the minimum bounding rectangle (MBR), encodes its southwest corner, and then expands eastward and northward to generate neighboring GeoHash cells until the whole MBR is covered. Cells that do not intersect the polygon are discarded, greatly reducing the number of geometry checks.
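The coverage step can be sketched as follows, under simplifying assumptions: pure Python, a fixed precision, and without the final polygon-intersection filter (which would require a geometry library). Starting from the MBR's southwest corner, we step east and north by one cell size until the rectangle is covered:

```python
# Cover a polygon's minimum bounding rectangle (MBR) with GeoHash cells
# by stepping east and north from the southwest corner, one cell at a time.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    for i in range(precision * 5):
        rng, val = (lon_rng, lon) if i % 2 == 0 else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits.append(1 if val >= mid else 0)
        rng[0 if val >= mid else 1] = mid
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

def cover_mbr(min_lat, min_lon, max_lat, max_lon, precision=6):
    total = precision * 5
    lon_step = 360.0 / (1 << ((total + 1) // 2))   # cell width in degrees
    lat_step = 180.0 / (1 << (total // 2))         # cell height in degrees
    cells = set()
    lat = min_lat
    while True:
        lon = min_lon
        while True:
            cells.add(geohash_encode(lat, lon, precision))
            if lon >= max_lon:
                break
            lon = min(lon + lon_step, max_lon)      # expand eastward
        if lat >= max_lat:
            break
        lat = min(lat + lat_step, max_lat)          # expand northward
    return cells

# An MBR inside a single precision-6 cell yields exactly that cell:
print(cover_mbr(30.280, 120.026, 30.281, 120.027))  # -> {'wtmk72'}
```

In the real pipeline, each returned cell would then be tested against the polygon itself and discarded if it does not intersect it; that filter is omitted here.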

Images illustrate the MBR expansion, the final set of GeoHash cells (blue = intersecting, orange = fully contained), and the overall workflow.

A fast neighbor‑cell algorithm is introduced that uses lookup tables derived from the Z‑order curve. Instead of decoding and re‑encoding a cell, the tables directly provide the eight adjacent GeoHash strings. The method is demonstrated with the cell wtmk72, yielding its eight neighbors such as wtmk71, wtmk73, …, wtmk5x.
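A lookup-table neighbor computation of this kind can be sketched with the tables popularized by the geohash.js library (the article's own tables may differ in layout, but the idea is the same): each table maps a cell's last character directly to the last character of its neighbor, recursing into the parent only when the cell sits on the parent's border.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

# Per direction: (table for even-length cells, table for odd-length cells).
NEIGHBORS = {
    "n": ("p0r21436x8zb9dcf5h7kjnmqesgutwvy", "bc01fg45238967deuvhjyznpkmstqrwx"),
    "s": ("14365h7k9dcfesgujnmqp0r2twvyx8zb", "238967debc01fg45kmstqrwxuvhjyznp"),
    "e": ("bc01fg45238967deuvhjyznpkmstqrwx", "p0r21436x8zb9dcf5h7kjnmqesgutwvy"),
    "w": ("238967debc01fg45kmstqrwxuvhjyznp", "14365h7k9dcfesgujnmqp0r2twvyx8zb"),
}
# Characters lying on the parent cell's border in each direction.
BORDERS = {
    "n": ("prxz", "bcfguvyz"),
    "s": ("028b", "0145hjnp"),
    "e": ("bcfguvyz", "prxz"),
    "w": ("0145hjnp", "028b"),
}

def adjacent(cell, direction):
    """Neighboring cell in direction 'n', 's', 'e' or 'w'."""
    last, parent = cell[-1], cell[:-1]
    t = len(cell) % 2                    # 0 = even length, 1 = odd length
    if last in BORDERS[direction][t] and parent:
        parent = adjacent(parent, direction)   # crossed the parent's edge
    return parent + BASE32[NEIGHBORS[direction][t].index(last)]

def neighbors(cell):
    """All eight adjacent cells."""
    n, s = adjacent(cell, "n"), adjacent(cell, "s")
    return [adjacent(n, "w"), n, adjacent(n, "e"),
            adjacent(cell, "w"), adjacent(cell, "e"),
            adjacent(s, "w"), s, adjacent(s, "e")]

print(neighbors("wtmk72"))
```

For wtmk72 this yields wtmk71, wtmk73, wtmk70, wtmk78, wtmk79, wtmk5p, wtmk5r, and wtmk5x, consistent with the neighbors named in the article.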

To build the point‑polygon relationship, each point receives a unique GeoHash, while each polygon receives one or more GeoHash strings (fully or partially covering). A join on fully‑contained strings determines definite matches; a join on partially‑contained strings followed by a precise geometry test resolves the remaining candidates. This reduces the computational complexity from a full Cartesian product to a few million checks.
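The two-stage join can be illustrated with a toy in-memory example. All IDs and hashes below are hypothetical, and precise_test is a stub; in production the joins would run as distributed table joins and the stub would be a real geometry check:

```python
from collections import defaultdict

# Hypothetical inputs: cells fully inside each polygon, and boundary
# (partially covered) cells, as produced by the MBR-covering step.
poly_full = {"districtA": {"wtmk72"}}
poly_partial = {"districtA": {"wtmk73"}, "districtB": {"wtmk73"}}

points = {"item1": "wtmk72", "item2": "wtmk73", "item3": "wtmk5r"}

# Invert the polygon cell lists into hash -> polygon indexes (join keys).
full_idx, partial_idx = defaultdict(set), defaultdict(set)
for pid, cells in poly_full.items():
    for h in cells:
        full_idx[h].add(pid)
for pid, cells in poly_partial.items():
    for h in cells:
        partial_idx[h].add(pid)

def precise_test(point_id, poly_id):
    # Stand-in for the exact point-in-polygon test (e.g. ray casting);
    # pretend item2 lies inside districtA but not districtB.
    return poly_id == "districtA"

matches = defaultdict(set)
for pt, h in points.items():
    matches[pt] |= full_idx.get(h, set())          # stage 1: definite matches
    matches[pt] |= {pid for pid in partial_idx.get(h, set())
                    if precise_test(pt, pid)}      # stage 2: resolved candidates
```

Only points landing in boundary cells (item2 here) ever reach the precise test, which is why the expensive geometry work shrinks to a small fraction of the data.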

In practice, processing 1 billion points and 10 000 polygons required roughly 1.8 × 10^12 basic operations—orders of magnitude less than the naïve approach—and completed in under a day on Alibaba’s offline compute platform.

The article also notes that other spatial indexes (R‑tree, quadtree, K‑D tree, grid) can be used for similar large‑scale spatial problems, and mentions the classic ray‑casting method for point‑in‑polygon tests.
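The ray-casting test mentioned above fits in a few lines. This is a minimal sketch; production code would also need to decide how to treat points lying exactly on an edge or vertex:

```python
# Classic ray-casting point-in-polygon test: shoot a ray east from the
# point and count how many polygon edges it crosses; odd count = inside.
def point_in_polygon(point, polygon):
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:          # crossing lies to the east of the point
                inside = not inside
    return inside

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(point_in_polygon((0.5, 0.5), square))   # True
print(point_in_polygon((1.5, 0.5), square))   # False
```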

Tags: algorithm, Big Data, geohash, point-in-polygon, spatial indexing
Written by Xianyu Technology

Official account of the Xianyu technology team