Design and Implementation of a Kafka Gatekeeper for FinOps Billing Data Quality Governance
This article describes the data‑quality challenges in Ctrip's hybrid‑cloud FinOps billing system and presents the design, implementation, and high‑availability deployment of a custom Kafka Gatekeeper proxy. Gatekeeper pre‑validates messages against configurable rules and provides self‑service dashboards and automated alerts, improving coverage, timeliness, and responsibility attribution.
To manage cloud costs effectively, Ctrip’s Hybrid Cloud team built a FinOps billing system that integrates dozens of self‑built PaaS services and multiple public clouds. The system uses a custom protocol (TripCostAllocationProtocol) to ingest usage data via Kafka, then performs recursive settlement and stores results in an internal data warehouse.
After deployment, the team observed persistent data‑quality issues: low coverage of validation rules, delayed alerts, inefficient root‑cause analysis, and unclear responsibility, which consumed significant engineering effort.
The solution is a new data‑quality governance capability centered on a Kafka Gatekeeper component. Gatekeeper acts as a proxy between Kafka clients and brokers, pre‑validating incoming messages and providing configurable validation rules, self‑service dashboards, and automatic alerts that pinpoint the responsible team.
Key capabilities of Gatekeeper:
Pre‑validation before data enters the billing pipeline.
Configurable rules that can be updated at any time for full coverage.
Self‑service query board showing error counts and root‑cause information.
Automatic alerts to the data source team when non‑compliant data is detected.
The design includes a decoder that parses Kafka messages based on the protocol’s ApiKey and ApiVersion, a validator that applies the configured rules, and a mapping layer that maintains the relationship between client‑side Bootstrap addresses and actual broker endpoints.
Kafka protocol background: the article explains the binary TCP‑based request/response format, focusing on the Metadata and Produce APIs, and shows how the decoder extracts the 4‑byte size, ApiKey, ApiVersion, CorrelationId, ClientId, and payload.
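To make the decoding step concrete, here is a minimal sketch of parsing the common Kafka request header from a raw frame. It follows the classic (non‑flexible) header layout described above; newer "flexible" protocol versions use compact strings and tagged fields, which this sketch does not handle, and the sample frame at the bottom is fabricated for illustration.

```python
import struct

def decode_request_header(frame: bytes):
    """Decode the common Kafka request header from one raw frame.

    Layout: a 4-byte big-endian size prefix, then ApiKey (int16),
    ApiVersion (int16), CorrelationId (int32), and a nullable ClientId
    string (int16 length + UTF-8 bytes); the rest is the API-specific
    payload.
    """
    size = struct.unpack_from(">i", frame, 0)[0]
    api_key, api_version, correlation_id = struct.unpack_from(">hhi", frame, 4)
    client_id_len = struct.unpack_from(">h", frame, 12)[0]
    if client_id_len < 0:  # length -1 marks a null ClientId
        client_id, payload_off = None, 14
    else:
        client_id = frame[14:14 + client_id_len].decode("utf-8")
        payload_off = 14 + client_id_len
    return {
        "size": size,
        "api_key": api_key,          # e.g. 0 = Produce, 3 = Metadata
        "api_version": api_version,
        "correlation_id": correlation_id,
        "client_id": client_id,
        "payload": frame[payload_off:size + 4],
    }

# Build a tiny fake Produce-style frame and decode it.
client_id = b"billing-producer"
body = struct.pack(">hhih", 0, 9, 42, len(client_id)) + client_id + b"\x00\x01"
frame = struct.pack(">i", len(body)) + body
hdr = decode_request_header(frame)
```

Because every Kafka request carries this same header prefix, the proxy can dispatch on ApiKey (e.g., intercept Produce and Metadata) without understanding every API's payload.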
Gatekeeper’s architecture (Figure 2‑4) consists of a decoder, validator, and address‑mapping module. The decoder handles different protocol versions, while the validator checks fields such as required strings and optional timestamps against rules expressed in CEL syntax.
Example schema and configuration:

schema {
    Name: "",       // required
    TimeStamp: 0,   // optional
    ...
}

"Topics": [
  {
    "Name": "fake.topic",
    "Owner": {"Key": "Name"},
    "SchemaRules": [
      {"Name": "Name", "Type": "string", "Optional": false},
      {"Name": "TimeStamp", "Type": "int", "Optional": true, "Rule": "TimeStamp > 0"}
    ]
  }
]

When a message violates a rule (e.g., TimeStamp = 0), Gatekeeper identifies the owning service (e.g., Service A) and sends an alert, enabling immediate remediation.
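A minimal sketch of how such a validator might apply the configured rules and attribute a failure to the owning team. The topic, field names, and owner mapping mirror the sample configuration; the rule expression is evaluated here with a restricted eval() as a simplified stand‑in for a real CEL engine, and none of this reflects Ctrip's actual implementation.

```python
# Per-topic rules, mirroring the sample configuration above (assumed shape).
TOPIC_RULES = {
    "fake.topic": {
        "Owner": {"Key": "Name"},  # the field whose value names the owning service
        "SchemaRules": [
            {"Name": "Name", "Type": str, "Optional": False},
            {"Name": "TimeStamp", "Type": int, "Optional": True,
             "Rule": "TimeStamp > 0"},
        ],
    }
}

def validate(topic: str, message: dict):
    """Return a list of violations; each names the field, owner, and reason."""
    cfg = TOPIC_RULES[topic]
    owner = message.get(cfg["Owner"]["Key"], "<unknown>")
    violations = []
    for rule in cfg["SchemaRules"]:
        name, value = rule["Name"], message.get(rule["Name"])
        if value is None:
            if not rule["Optional"]:
                violations.append((name, owner, "required field missing"))
            continue
        if not isinstance(value, rule["Type"]):
            violations.append((name, owner, "wrong type"))
        elif "Rule" in rule and not eval(rule["Rule"], {"__builtins__": {}},
                                         {name: value}):
            violations.append((name, owner, "rule failed: " + rule["Rule"]))
    return violations

# Service A produced a record with TimeStamp = 0 -> alert goes to Service A.
bad = validate("fake.topic", {"Name": "ServiceA", "TimeStamp": 0})
ok = validate("fake.topic", {"Name": "ServiceA", "TimeStamp": 5})
```

Keeping the owner key in the topic configuration is what lets the alert name the responsible team directly from the offending message, without a separate lookup.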
High‑availability deployment follows Ctrip’s best practices: multiple Gatekeeper instances are placed across AZs behind a fixed load‑balancer address, ensuring that client‑side Bootstrap addresses remain stable even if individual Gatekeeper pods restart.
Technical challenges such as client‑side metadata refresh behavior (especially for Java clients) are addressed by always returning a reachable load‑balancer address in metadata responses, avoiding connection failures caused by stale IPs.
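The address‑mapping idea can be sketched as follows: whatever real broker endpoints a Metadata response advertises, the proxy rewrites them to its own stable load‑balancer address, so a client's metadata refresh never caches a pod IP. The broker list, node ids, and load‑balancer hostname below are made up for illustration, and the real proxy would of course operate on wire‑format Metadata responses rather than tuples.

```python
# Assumed stable load-balancer endpoint fronting the Gatekeeper instances.
LB_HOST, LB_PORT = "gatekeeper.lb.internal", 9092

def rewrite_brokers(brokers):
    """Replace each advertised (host, port) with the LB endpoint.

    Node ids are preserved, and the real endpoints are remembered so the
    proxy can still forward traffic to the broker the client intended.
    """
    mapping = {}
    rewritten = []
    for node_id, host, port in brokers:
        mapping[node_id] = (host, port)          # remember the real endpoint
        rewritten.append((node_id, LB_HOST, LB_PORT))
    return rewritten, mapping

# Two fabricated broker endpoints as a Metadata response might list them.
brokers = [(0, "10.0.1.5", 9092), (1, "10.0.2.7", 9092)]
advertised, real = rewrite_brokers(brokers)
```

Since clients (Java clients in particular) reconnect to whatever addresses the latest metadata advertises, always answering with the load‑balancer address keeps those reconnects working even while individual Gatekeeper pods restart.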
In summary, Gatekeeper provides a configurable, automated, and self‑service data‑quality layer for Kafka‑based FinOps billing, improving coverage, timeliness, and responsibility attribution, and its architecture can be generalized to other data‑validation scenarios.
Ctrip Technology