ElasticSearch Powers Ride‑Matching at Haro Mobility: Architecture & Lessons
This article details how Haro Mobility built a search‑driven ride‑matching platform using ElasticSearch and Flink, covering the business background, architectural evolution, data‑sync challenges, performance tuning, stability measures, and the resulting improvements in order completion and user engagement.
Background
Haro Mobility operates multiple on‑demand services (bike sharing, e‑scooters, ride‑hailing, car‑pooling, ticketing, rentals). In 2019 each business unit ran its own Elasticsearch (ES) cluster, manually provisioning machines and synchronizing data from the source databases. This fragmented approach caused high operational overhead, limited horizontal scalability, and made it difficult to provide unified search and recommendation capabilities across services.
Initial Architecture
The first use case was a driver‑passenger matching engine for the car‑pooling service. Passengers create orders with origin, destination, and departure time; drivers submit a matching request and the engine returns the most suitable passenger orders. The original system used PostgreSQL for offline matching and cached the results. This design suffered from stale data (new passenger orders were invisible until the next cache refresh) and could not scale horizontally.
Migration to an ES‑Based Architecture
To achieve real‑time matching and elastic scalability, the team replaced the PostgreSQL‑based pipeline with an ES‑driven solution. The new architecture required:
Near‑real‑time synchronization of source data (PostgreSQL, later MySQL, RocketMQ) into ES.
Ability to express complex business logic (e.g., route scoring, filtering) within the search layer.
Robustness and high availability for a core transaction path.
Data Synchronization Pipeline
Data is ingested through a hybrid batch‑plus‑stream approach:
Batch load : Historical records are extracted from PostgreSQL and bulk‑indexed into ES.
Streaming load : Change data capture (CDC) from the database binlog is published to Kafka. Flink consumes the Kafka topics, transforms the records using its Table API (SQL‑like syntax), and writes them to ES.
Flink was chosen for:
Built‑in high availability and automatic failover.
Exactly‑once processing guarantees (the job runs at “at‑least‑once” in production but can be switched to exactly‑once).
Rich Table API that lets developers write business rules as SQL statements, simplifying maintenance and reuse.
Flink Dual‑Stream Join Challenge
The matching logic requires joining two high‑volume streams (driver requests and passenger orders) with a state window of 4–16 days. Storing the full stream in Flink state would exhaust memory. The solution:
Separate INSERT events from UPDATE events.
Store only insert events in Flink state; updates are applied directly to ES using its partial‑update API.
When a join occurs, the result is written to ES without expanding the state size.
This approach limits state growth while preserving the ability to handle frequent updates.
Custom ES Plugin for Route Scoring
A proprietary ES plugin encapsulates the core route‑scoring algorithm. By running the algorithm inside the ES cluster, the system gains:
Distributed computation without external services.
Hot‑deployment capability – the plugin can be updated without restarting the ES nodes, ensuring zero‑downtime releases.
Performance Tuning
Initial benchmarks showed CPU utilization plateauing at ~50 % due to lock contention, I/O bottlenecks, and oversized route point sets. The following optimizations were applied:
Enable mmap for faster segment loading.
Activate point‑set compression and route thinning (removing intermediate points on long straight segments) to shrink the payload size.
Adjust shard allocation and JVM/G1GC settings to reduce garbage‑collection pauses.
After tuning, throughput increased fourfold, average latency dropped by ~50 %, and CPU usage reached the expected 70–80 % range.
Stability Measures
To guarantee production‑grade reliability, a three‑layer strategy was implemented:
Monitoring : Fine‑grained metrics at cluster, node, and index levels (throughput, latency, error rates, JVM health).
Alerting : Threshold‑based alerts with automated escalation paths.
Remediation Playbooks : Pre‑defined procedures for index rebuilds, traffic shifting between parallel indices, and multi‑region active‑active failover.
The platform also supports on‑demand index recreation and rapid traffic cut‑over, enabling quick recovery from data corruption or node failures.
Results
After deployment, the platform delivered a 49.8 % increase in completed orders and a 37 % rise in active drivers. The unified search layer now serves more than 60 internal services, including car‑pooling matching, economic‑car dispatch, and other transaction‑critical scenarios.
Future Directions
Planned enhancements include:
Integrating additional algorithmic components such as intent recognition and spelling correction, exposed as plug‑and‑play modules.
Extending the CDC pipeline to ingest data from MySQL and RocketMQ sources.
Deploying cross‑region multi‑active ES clusters to achieve geo‑redundancy and lower latency for users in different regions.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
