How Ctrip’s Kafka Gatekeeper Boosts FinOps Data Quality and Automates Cost Governance
This article explains how Ctrip’s hybrid‑cloud FinOps billing system uses a custom Kafka Gatekeeper to detect, locate, and automatically remediate data‑quality issues across dozens of self‑built PaaS services, improving coverage, timeliness, and responsibility attribution while supporting high‑availability deployments.
Background and Problems
The FinOps billing system at Ctrip aggregates cost data from multiple cloud providers and dozens of self‑built PaaS services through a TripCostAllocationProtocol . Billing data is ingested via Kafka, processed, and stored in an internal data warehouse. Existing validation rules in the warehouse still leave high data‑quality issues:
Low coverage – errors that still satisfy business logic are missed.
Delayed alerts – alerts fire only after billing results are generated.
Unclear ownership – it is difficult to pinpoint the team responsible for a data problem.
These issues force developers to spend excessive time troubleshooting data quality instead of building product features.
Kafka Gatekeeper Design
Kafka protocol basics
Kafka uses a TCP‑based binary protocol. Each request begins with a 4‑byte size field followed by a fixed header ( ApiKey, ApiVersion, CorrelationId, ClientId) and then the request payload. The two APIs relevant to the billing pipeline are:
Metadata (ApiKey 3) – used by clients to discover broker addresses and topic metadata.
Produce (ApiKey 0) – used to send records to a topic.
Responses contain the same CorrelationId and a payload specific to the API.
Architecture
Gatekeeper is a transparent proxy placed between Kafka clients and brokers. Clients only need to change their bootstrap address to the Gatekeeper endpoint. Gatekeeper performs three core functions:
Decode incoming Kafka messages according to the protocol version.
Validate the decoded payload against a configurable rule set.
Rewrite broker addresses in Metadata responses so that subsequent client connections also pass through Gatekeeper.
The proxy runs as a set of pods behind a stable load‑balancer IP, providing a single logical address for all clients.
Decoder
The decoder reads the 4‑byte size, extracts ApiKey and ApiVersion, selects the appropriate schema (e.g., Metadata v0, Produce v1), and deserializes the payload. For Produce requests it extracts the topic name, partition, and the record set; for Metadata it extracts the list of broker nodes.
Validator
Validation rules are expressed in JSON. Each rule defines the field name, type, optional flag, and an optional CEL‑style expression. Example for the TripCostAllocationProtocol:
{
"Topics": [
{
"Name": "fake.topic",
"Owner": {"Key":"Name"},
"SchemaRules": [
{"Name":"Name","Type":"string","Optional":false},
{"Name":"Timestamp","Type":"int","Optional":true,"Rule":"Timestamp>0"}
]
}
]
}When a Produce request is decoded, the validator checks each field against the configured rules. If a rule fails (e.g., Timestamp <= 0), Gatekeeper records the failure, increments a Prometheus counter, and emits an alert that includes the Owner (derived from the message) so the responsible team can be notified.
Mapping and High‑Availability
Gatekeeper maintains a mapping table between the client‑visible bootstrap address and the real broker address. In the Metadata response it replaces each broker’s IP with the Gatekeeper’s IP + port. Because the load‑balancer address is stable, Java clients that cache the broker address continue to reach a reachable endpoint after a Gatekeeper pod restart.
Data‑Quality Workflow
Pre‑validation of every Kafka message before it enters the billing pipeline.
Rules can be updated without redeploying Gatekeeper (the JSON file is reloaded on change).
Self‑service dashboards expose error counts, failure rates, and the owning team.
Automatic alerts (e.g., via Prometheus Alertmanager) contain the team identifier from the Owner field.
Implementation Details
Connection handling
Gatekeeper opens a local TCP socket, accepts client connections, establishes a corresponding connection to the real broker, and forwards bytes in both directions after applying decoding/validation or address rewriting.
Configurable validation
Supported rule features include:
Type checking (string, int, etc.).
Presence validation (required vs. optional).
CEL‑style expressions for complex constraints (e.g., Timestamp>0).
Deployment
Gatekeeper pods are replicated across multiple availability zones. A Kubernetes Service of type LoadBalancer provides a single stable IP and port. The service forwards traffic to ready Gatekeeper pods; health checks ensure only healthy instances receive traffic.
Technical challenges
Java clients cache the broker address returned in the first Metadata response. After a Gatekeeper restart the cached IP may become stale, causing connection failures. The solution is to expose a stable load‑balancer address in the Metadata response, ensuring the cached address remains reachable.
Conclusion
Kafka Gatekeeper adds a configurable pre‑validation layer to Kafka streams, catching data‑quality violations at the source, attributing responsibility via the Owner field, and providing dashboards and alerts for rapid remediation. Although originally built for Ctrip’s FinOps billing system, its generic Kafka‑centric design makes it reusable for broader data‑quality governance scenarios.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
