How Agoda Scales Apache Kafka: Two‑Step Logging, Monitoring, and Cost Attribution
This article details Agoda's evolution of Apache Kafka usage—from a two‑step logging architecture that separates developer concerns, through cluster layout, scaling metrics, monitoring and audit pipelines, to cost attribution, authentication, ACLs, and automation tools—highlighting trade‑offs and operational lessons learned.
Background
Agoda processes roughly 1.8 trillion events per day using Apache Kafka. The platform adopted Kafka in 2015 for analytics pipelines and data‑lake ingestion and has grown at an average 2× year‑over‑year rate, reaching the 1.8 trillion‑event scale in 2023. The rapid growth forced a redesign of the Kafka infrastructure to improve scalability, reliability, and operational manageability.
Two‑Step Logging Architecture
Agoda introduced a two‑step logging pattern:
Client library – runs inside each application, writes events to local files, handles file rotation and determines write locations.
Forwarder daemon – a lightweight service deployed on every node, reads the files, extracts metadata (topic, payload type, etc.), sends the records to Kafka, tracks file offsets, and deletes files that have been fully forwarded.
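The two steps above can be sketched roughly as follows. This is a minimal illustration, not Agoda's implementation: the JSON-lines file format, the function names (log_event, forward_file, delete_if_forwarded), and the send callback standing in for a Kafka producer are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def log_event(log_dir: Path, topic: str, payload: dict) -> Path:
    """Client-library side: append one event to a local file (no Kafka calls)."""
    path = log_dir / f"{topic}.log"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"topic": topic, "payload": payload}) + "\n")
    return path


def forward_file(path: Path, offsets: Dict[str, int],
                 send: Callable[[str, bytes], None]) -> int:
    """Forwarder side: send records not yet forwarded; return how many were sent."""
    sent = 0
    with path.open("rb") as f:
        f.seek(offsets.get(str(path), 0))
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a record the client is still writing
            record = json.loads(line)
            # `send` is a hypothetical stand-in for a Kafka producer call.
            send(record["topic"], json.dumps(record["payload"]).encode("utf-8"))
            offsets[str(path)] = f.tell()  # advance only after a successful send
            sent += 1
    return sent


def delete_if_forwarded(path: Path, offsets: Dict[str, int]) -> bool:
    """Delete a file once every byte in it has been forwarded."""
    if offsets.get(str(path), 0) != path.stat().st_size:
        return False
    path.unlink()
    offsets.pop(str(path), None)
    return True
```

Because the offset advances only after send returns, a forwarder that crashes mid-file would resume from the last forwarded record. In this sketch that means a record could be sent twice but not skipped.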
Benefits
Simplified producer API – developers do not need Kafka knowledge.
Serialization standards (e.g., Avro) are enforced by the client library.
Disk buffering adds resilience during Kafka outages.
Operational concerns (batch size, compression, latency, routing) can be changed without touching application code.
Latency Trade‑off
The forwarder adds extra latency. For the majority of analytics workloads the 99th‑percentile end‑to‑end latency (disk write → forwarder → Kafka → ready for consumption) is ~10 seconds, which is acceptable. Latency‑critical applications bypass the two‑step path and write directly to Kafka, achieving sub‑second latency.
Cluster Layout and Scaling Strategy
Instead of a single large cluster per data‑center, Agoda runs multiple smaller Kafka clusters, each dedicated to a specific use case (e.g., analytics, async API, cross‑DC replication, ML pipelines). This isolation limits the blast radius of failures, allows heterogeneous hardware configurations, and simplifies management. The forwarder contains routing logic that directs events to the appropriate cluster without requiring producer changes.
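As an illustration of forwarder-side routing, the sketch below maps topic-name patterns to clusters. The patterns, the cluster addresses, and the first-match-wins rule are hypothetical; the source does not describe how the routing rules are actually expressed.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical routing table: (topic pattern, cluster bootstrap address).
# Entries are checked in order and the first matching pattern wins.
ROUTES: List[Tuple[str, str]] = [
    ("ml.*", "kafka-ml:9092"),
    ("api.async.*", "kafka-async-api:9092"),
    ("*", "kafka-analytics:9092"),  # default cluster for everything else
]


def route(topic: str, routes: List[Tuple[str, str]] = ROUTES) -> str:
    """Return the cluster address that should receive events for `topic`."""
    for pattern, cluster in routes:
        if fnmatchcase(topic, pattern):
            return cluster
    raise LookupError(f"no route configured for topic {topic!r}")
```

Keeping a table like this in the forwarder rather than in applications is what would let operators move a topic to another cluster without a producer deployment.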
ZooKeeper Considerations
Dedicated SSD‑backed nodes host the ZooKeeper logs and snapshots, physically separated from the Kafka brokers to isolate potential issues. Agoda plans to migrate away from ZooKeeper by moving to Kafka releases that support KRaft.
Monitoring, Auditing, and Observability
Metrics are collected via JMXTrans, stored in Graphite, and visualized in Grafana. To guarantee data completeness and timeliness, an audit pipeline runs in the client library’s background thread, aggregates message counts over configurable intervals, and forwards audit events to a dedicated audit Kafka cluster. Audits are consumed by internal analytics platforms (Whitefalcon) and Hadoop, enabling high‑level health dashboards and SLO tracking.
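The count aggregation in the audit pipeline might look like the sketch below. The class name, the 60-second default interval, and the shape of the emitted audit events are assumptions for illustration. The lock is there because the text says aggregation runs in a background thread alongside application threads.

```python
import threading
from collections import Counter
from typing import Dict, List


class AuditCounter:
    """Counts messages per (topic, time bucket) and emits one audit event per bucket."""

    def __init__(self, interval_seconds: int = 60) -> None:
        self.interval = interval_seconds
        self._counts: Counter = Counter()
        self._lock = threading.Lock()

    def record(self, topic: str, event_time: float) -> None:
        """Called on every logged event; `event_time` is a Unix timestamp."""
        bucket_start = int(event_time // self.interval) * self.interval
        with self._lock:
            self._counts[(topic, bucket_start)] += 1

    def flush(self) -> List[Dict[str, object]]:
        """Called periodically by the background thread; returns and resets the counts."""
        with self._lock:
            items = sorted(self._counts.items())
            self._counts.clear()
        return [
            {"topic": topic, "bucket_start": start, "count": count}
            for (topic, start), count in items
        ]
```

Each dictionary returned by flush would be sent to the audit cluster. A downstream job could then compare these producer-side counts with what it actually received for the same topic and bucket.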
Capacity Planning and Alerting
For each cluster Agoda monitors five resource metrics:
Disk usage (percentage of allocated storage).
Network throughput.
CPU utilization.
Total number of partitions.
Average request‑handler idle percentage.
Each metric is compared against a predefined limit; the ratio (current/limit) is expressed as a percentage. The overall cluster capacity is defined as the maximum of these percentages, yielding a single “capacity‑percentage” number. Alerts fire when this number exceeds configurable thresholds, prompting investigation or physical scaling.
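A worked sketch of the calculation follows. The metric names, limits, and alert thresholds are invented for the example. It also assumes the request-handler idle percentage is first converted to a busy percentage (100 minus idle) so that, like the other metrics, a higher value means less headroom; the source does not say how that metric is normalized.

```python
from typing import Dict, Tuple


def capacity_percentage(current: Dict[str, float],
                        limits: Dict[str, float]) -> Tuple[float, str]:
    """Return (capacity percentage, name of the metric closest to its limit)."""
    ratios = {name: 100.0 * current[name] / limits[name] for name in limits}
    bottleneck = max(ratios, key=ratios.get)
    return ratios[bottleneck], bottleneck


def alert_level(capacity: float, warn: float = 70.0, critical: float = 85.0) -> str:
    """Map a capacity percentage to an alert level (thresholds are hypothetical)."""
    if capacity >= critical:
        return "critical"
    if capacity >= warn:
        return "warning"
    return "ok"


# Hypothetical readings for one cluster.
current = {
    "disk_pct": 60.0,
    "network_mbps": 400.0,
    "cpu_pct": 35.0,
    "partitions": 18000.0,
    "handler_busy_pct": 100.0 - 70.0,  # assumption: derived from a 70% idle reading
}
limits = {
    "disk_pct": 80.0,
    "network_mbps": 1000.0,
    "cpu_pct": 70.0,
    "partitions": 20000.0,
    "handler_busy_pct": 60.0,
}
capacity, bottleneck = capacity_percentage(current, limits)
```

With these numbers the ratios are 75, 40, 50, 90, and 50 percent, so the cluster is reported at 90 percent capacity with the partition count as the limiting resource. Taking the maximum means one exhausted resource is enough to flag the cluster, even when the others have plenty of headroom.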
Cost Attribution to Teams
To curb data‑lake bloat, Agoda assigns a monetary cost to Kafka usage based on bytes produced. Each Kafka topic has an owning team; total cluster cost is prorated by the derived capacity percentages, producing a per‑byte cost estimate. Teams can view their per‑topic byte usage and associated cost, incentivizing them to evaluate the value of the data they emit.
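One simple way to turn bytes produced into a per-team cost is sketched below. It assumes a flat per-byte rate (cluster cost divided by total bytes produced) and leaves out the weighting by capacity percentages, which the source mentions but does not detail. The function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def attribute_costs(cluster_cost: float,
                    topic_bytes: Dict[str, int],
                    topic_owner: Dict[str, str]) -> Dict[str, float]:
    """Split `cluster_cost` across owning teams in proportion to bytes produced."""
    total_bytes = sum(topic_bytes.values())
    if total_bytes == 0:
        return {}
    cost_per_byte = cluster_cost / total_bytes
    team_costs: Dict[str, float] = defaultdict(float)
    for topic, produced in topic_bytes.items():
        team_costs[topic_owner[topic]] += produced * cost_per_byte
    return dict(team_costs)
```

For example, if a cluster costs 1000 units and a team's topics account for 90 percent of the bytes produced, this scheme charges that team 900 units.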
Authentication, Authorization, and ACLs
In 2021 Agoda introduced a credential‑generation service and a self‑service portal that allow teams to request Kafka credentials and ACLs. The system manages user lifecycles, credential expiry, and fine‑grained topic permissions, reducing the risk of unauthorized data access.
Automation and Tooling
Managing dozens of clusters requires automation. Agoda uses open‑source tools such as Cruise Control for load‑aware rebalancing and Kafka‑UI for cluster inspection, alongside internal deployment scripts for broker configuration propagation and rolling restarts, and a custom UI for ACL management. Continuous evaluation of new tools is part of the operational scalability strategy.
Future Work – Forwarder Acknowledgments
A current limitation of the two‑step architecture is that the client library receives no acknowledgment that Kafka has persisted the record (the client only knows that the event was written to disk). Agoda is redesigning the forwarder to expose an endpoint that returns a success response after Kafka confirms the write, enabling end‑to‑end reliability guarantees for producers that require them.