
How Agoda Scales Apache Kafka: Two‑Step Logging, Monitoring, and Cost Attribution

This article details Agoda's evolution of Apache Kafka usage—from a two‑step logging architecture that separates developer concerns, through cluster layout, scaling metrics, monitoring and audit pipelines, to cost attribution, authentication, ACLs, and automation tools—highlighting trade‑offs and operational lessons learned.

dbaplus Community

Background

Agoda processes roughly 1.8 trillion events per day using Apache Kafka. The platform adopted Kafka in 2015 for analytics pipelines and data‑lake ingestion and has grown at an average 2× year‑over‑year rate, reaching the 1.8 trillion‑event scale in 2023. The rapid growth forced a redesign of the Kafka infrastructure to improve scalability, reliability, and operational manageability.

Two‑Step Logging Architecture

Agoda introduced a two‑step logging pattern:

Client library – runs inside each application, writes events to local files, handles file rotation and determines write locations.

Forwarder daemon – a lightweight service deployed on every node, reads the files, extracts metadata (topic, payload type, etc.), sends the records to Kafka, tracks file offsets, and deletes files that have been fully forwarded.
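
The forwarder's read, send, and offset-tracking loop can be sketched roughly as follows. This is a minimal illustration, not Agoda's implementation: the JSON-lines file format, the `topic` and `payload` field names, the offset file, and the `send` callback (standing in for a real Kafka producer) are all assumptions made for the example.

```python
import json
import os
from typing import Callable


def forward_file(path: str, offset_path: str,
                 send: Callable[[str, bytes], None]) -> int:
    """Forward unread records from a log file; return how many were sent.

    Assumes (hypothetically) one JSON object per line with "topic" and
    "payload" keys. `send` stands in for a Kafka producer call.
    """
    # Resume from the last committed byte offset, if one was saved.
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)

    sent = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line: retry later
            record = json.loads(line)
            send(record["topic"], json.dumps(record["payload"]).encode())
            sent += 1
            # Persist progress only after the send returned (at-least-once).
            offset = f.tell()
            with open(offset_path, "w") as out:
                out.write(str(offset))
    return sent
```

Because the offset is saved only after `send` returns, a crash between the two steps would resend one record rather than lose it. Deleting fully forwarded files is left out of the sketch, since it depends on how the client library signals rotation.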

Figure: Two‑Step Logging Architecture

Benefits

Simplified producer API – developers do not need Kafka knowledge.

Enforced serialization standards (e.g., Avro) via the client library.

Disk buffering adds resilience during Kafka outages.

Operational concerns (batch size, compression, latency, routing) can be changed without touching application code.

Latency Trade‑off

The forwarder adds extra latency. For the majority of analytics workloads the 99th‑percentile end‑to‑end latency (disk write → forwarder → Kafka → ready for consumption) is ~10 seconds, which is acceptable. Latency‑critical applications bypass the two‑step path and write directly to Kafka, achieving sub‑second latency.

Cluster Layout and Scaling Strategy

Instead of a single large cluster per data‑center, Agoda runs multiple smaller Kafka clusters, each dedicated to a specific use case (e.g., analytics, async API, cross‑DC replication, ML pipelines). This isolation limits the blast radius of failures, allows heterogeneous hardware configurations, and simplifies management. The forwarder contains routing logic that directs events to the appropriate cluster without requiring producer changes.
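
One way to picture the forwarder's routing logic is a lookup from topic to cluster that lives entirely in forwarder configuration. The sketch below assumes prefix-based routing; the prefixes, cluster addresses, and the `resolve_cluster` helper are hypothetical and not taken from Agoda's setup.

```python
from typing import Dict

# Hypothetical routing table: topic prefix -> bootstrap address of a cluster.
ROUTES: Dict[str, str] = {
    "analytics.": "kafka-analytics:9092",
    "analytics.ml.": "kafka-ml:9092",
    "api.": "kafka-async-api:9092",
}


def resolve_cluster(topic: str, routes: Dict[str, str], default: str) -> str:
    """Return the cluster for a topic using the longest matching prefix."""
    matches = [prefix for prefix in routes if topic.startswith(prefix)]
    if not matches:
        return default
    return routes[max(matches, key=len)]
```

With a table like this, moving a use case to a different cluster is a configuration change in the forwarder, and producers keep writing to the same topic name.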

Figure: Smaller Kafka Clusters per Use Case

ZooKeeper Considerations

Dedicated SSD‑backed nodes host ZooKeeper logs and snapshots, physically separated from Kafka brokers to isolate potential issues. Agoda plans to migrate away from ZooKeeper by moving to Kafka releases that support KRaft.

Monitoring, Auditing, and Observability

Metrics are collected via JMXTrans, stored in Graphite, and visualized in Grafana. To guarantee data completeness and timeliness, an audit pipeline runs in the client library’s background thread, aggregates message counts over configurable intervals, and forwards audit events to a dedicated audit Kafka cluster. Audits are consumed by internal analytics platforms (Whitefalcon) and Hadoop, enabling high‑level health dashboards and SLO tracking.
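
The counting side of such an audit can be sketched as a small in-memory aggregator that buckets messages by topic and time window. The class name, the 60-second default, and the shape of the emitted audit event are assumptions for illustration; the source does not describe Agoda's actual audit schema.

```python
from collections import Counter
from typing import Dict, List


class AuditCounter:
    """Counts messages per (topic, time bucket) and emits closed buckets."""

    def __init__(self, interval_s: int = 60) -> None:
        self.interval_s = interval_s
        self._counts: Counter = Counter()

    def _bucket(self, ts: float) -> int:
        # Start of the interval that contains the timestamp.
        return int(ts // self.interval_s) * self.interval_s

    def record(self, topic: str, ts: float) -> None:
        self._counts[(topic, self._bucket(ts))] += 1

    def flush(self, now: float) -> List[Dict]:
        """Return audit events for buckets that ended before `now`."""
        current = self._bucket(now)
        closed = sorted(k for k in self._counts if k[1] < current)
        return [
            {"topic": t, "bucket_start": b, "count": self._counts.pop((t, b))}
            for t, b in closed
        ]
```

Comparing the counts emitted at each stage of the pipeline for the same topic and bucket would then show whether messages were lost or delayed between stages.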

Figure: Generating Audits throughout the Pipeline

Capacity Planning and Alerting

For each cluster Agoda monitors five resource metrics:

Disk usage (percentage of allocated storage).

Network throughput.

CPU utilization.

Total number of partitions.

Average request‑handler idle percentage.

Each metric is compared against a predefined limit; the ratio (current/limit) is expressed as a percentage. The overall cluster capacity is defined as the maximum of these percentages, yielding a single “capacity‑percentage” number. Alerts fire when this number exceeds configurable thresholds, prompting investigation or physical scaling.
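
As a worked sketch of this calculation, the function below takes current values and limits per metric and returns the largest ratio along with the metric that produced it. The metric names and all numbers are made up for the example. It also assumes the request-handler idle percentage is first converted to a "busy" value (100 minus idle) so that a higher number always means less headroom; the source does not say how that metric is normalized.

```python
from typing import Dict, Tuple


def capacity_percentage(current: Dict[str, float],
                        limits: Dict[str, float]) -> Tuple[float, str]:
    """Return (capacity percentage, bottleneck metric) for one cluster."""
    ratios = {name: 100.0 * current[name] / limits[name] for name in limits}
    bottleneck = max(ratios, key=ratios.get)
    return ratios[bottleneck], bottleneck


# Hypothetical readings for one cluster.
current = {
    "disk_pct": 60.0,
    "network_mbps": 400.0,
    "cpu_pct": 45.0,
    "partitions": 20000.0,
    "handler_busy_pct": 100.0 - 70.0,  # assumption: busy = 100 - idle
}
limits = {
    "disk_pct": 80.0,
    "network_mbps": 1000.0,
    "cpu_pct": 70.0,
    "partitions": 40000.0,
    "handler_busy_pct": 60.0,
}

capacity, bottleneck = capacity_percentage(current, limits)
print(f"{capacity:.1f}% (limited by {bottleneck})")
```

Taking the maximum rather than an average means the cluster is reported as nearly full as soon as any single resource is, which matches the intent of a single alerting number.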

Figure: Capacity Calculation Diagram

Cost Attribution to Teams

To curb data‑lake bloat, Agoda assigns a monetary cost to Kafka usage based on bytes produced. Each Kafka topic has an owning team; total cluster cost is prorated by the derived capacity percentages, producing a per‑byte cost estimate. Teams can view their per‑topic byte usage and associated cost, incentivizing them to evaluate the value of the data they emit.
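
A simplified sketch of per-byte attribution is shown below. It spreads a cluster's total cost evenly over all bytes produced and sums the result per owning team. This deliberately ignores the capacity-based proration mentioned above, whose exact formula the source does not give; the function name, topics, teams, and figures are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple


def attribute_costs(cluster_cost: float,
                    topic_bytes: Dict[str, int],
                    topic_owner: Dict[str, str]) -> Tuple[float, Dict[str, float]]:
    """Return (cost per byte, cost per team) for one cluster and period."""
    total_bytes = sum(topic_bytes.values())
    per_byte = cluster_cost / total_bytes
    team_cost: Dict[str, float] = defaultdict(float)
    for topic, produced in topic_bytes.items():
        # Each topic's share is proportional to the bytes it produced.
        team_cost[topic_owner[topic]] += produced * per_byte
    return per_byte, dict(team_cost)
```

For example, with a cluster cost of 1000 and topics producing 600, 300, and 100 bytes, a team owning the first two topics would be attributed 900 and the owner of the third would be attributed 100.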

Figure: Cost Attribution per Byte

Authentication, Authorization, and ACLs

In 2021 Agoda introduced a credential‑generation service and a self‑service portal that allow teams to request Kafka credentials and ACLs. The system manages user lifecycles, credential expiry, and fine‑grained topic permissions, reducing the risk of unauthorized data access.

Figure: Authentication & Authorization Components

Automation and Tooling

Managing dozens of clusters requires automation. Agoda combines open‑source tools, such as Cruise Control for load‑aware rebalancing and Kafka‑UI for cluster inspection, with internal tooling: deployment scripts for broker configuration propagation and rolling restarts, and a custom UI for ACL management. Continuous evaluation of new tools is part of the operational scalability strategy.

Future Work – Forwarder Acknowledgments

A current limitation of the two‑step architecture is that the client library receives no acknowledgment that Kafka has persisted the record (the client only knows that the event was written to disk). Agoda is redesigning the forwarder to expose an endpoint that returns a success response after Kafka confirms the write, enabling end‑to‑end reliability guarantees for producers that require them.
