Backend Development 19 min read

How Ctrip’s Kafka Gatekeeper Boosts FinOps Data Quality and Automates Cost Governance

This article explains how Ctrip’s hybrid‑cloud FinOps billing system uses a custom Kafka Gatekeeper to detect, locate, and automatically remediate data‑quality issues across dozens of self‑built PaaS services, improving coverage, timeliness, and responsibility attribution while supporting high‑availability deployments.

dbaplus Community

Apr 16, 2025

How Ctrip’s Kafka Gatekeeper Boosts FinOps Data Quality and Automates Cost Governance

Background and Problems

The FinOps billing system at Ctrip aggregates cost data from multiple cloud providers and dozens of self‑built PaaS services through a TripCostAllocationProtocol . Billing data is ingested via Kafka, processed, and stored in an internal data warehouse. Existing validation rules in the warehouse still leave high data‑quality issues:

Low coverage – errors that still satisfy business logic are missed.

Delayed alerts – alerts fire only after billing results are generated.

Unclear ownership – it is difficult to pinpoint the team responsible for a data problem.

These issues force developers to spend excessive time troubleshooting data quality instead of building product features.

Kafka Gatekeeper Design

Kafka protocol basics

Kafka uses a TCP‑based binary protocol. Each request begins with a 4‑byte size field followed by a fixed header ( ApiKey, ApiVersion, CorrelationId, ClientId) and then the request payload. The two APIs relevant to the billing pipeline are:

Metadata (ApiKey 3) – used by clients to discover broker addresses and topic metadata.

Produce (ApiKey 0) – used to send records to a topic.

Responses contain the same CorrelationId and a payload specific to the API.

Architecture

Gatekeeper is a transparent proxy placed between Kafka clients and brokers. Clients only need to change their bootstrap address to the Gatekeeper endpoint. Gatekeeper performs three core functions:

Decode incoming Kafka messages according to the protocol version.

Validate the decoded payload against a configurable rule set.

Rewrite broker addresses in Metadata responses so that subsequent client connections also pass through Gatekeeper.

The proxy runs as a set of pods behind a stable load‑balancer IP, providing a single logical address for all clients.

Decoder

The decoder reads the 4‑byte size, extracts ApiKey and ApiVersion, selects the appropriate schema (e.g., Metadata v0, Produce v1), and deserializes the payload. For Produce requests it extracts the topic name, partition, and the record set; for Metadata it extracts the list of broker nodes.

Validator

Validation rules are expressed in JSON. Each rule defines the field name, type, optional flag, and an optional CEL‑style expression. Example for the TripCostAllocationProtocol:

{
  "Topics": [
    {
      "Name": "fake.topic",
      "Owner": {"Key":"Name"},
      "SchemaRules": [
        {"Name":"Name","Type":"string","Optional":false},
        {"Name":"Timestamp","Type":"int","Optional":true,"Rule":"Timestamp>0"}
      ]
    }
  ]
}

When a Produce request is decoded, the validator checks each field against the configured rules. If a rule fails (e.g., Timestamp <= 0), Gatekeeper records the failure, increments a Prometheus counter, and emits an alert that includes the Owner (derived from the message) so the responsible team can be notified.

Mapping and High‑Availability

Gatekeeper maintains a mapping table between the client‑visible bootstrap address and the real broker address. In the Metadata response it replaces each broker’s IP with the Gatekeeper’s IP + port. Because the load‑balancer address is stable, Java clients that cache the broker address continue to reach a reachable endpoint after a Gatekeeper pod restart.

Data‑Quality Workflow

Pre‑validation of every Kafka message before it enters the billing pipeline.

Rules can be updated without redeploying Gatekeeper (the JSON file is reloaded on change).

Self‑service dashboards expose error counts, failure rates, and the owning team.

Automatic alerts (e.g., via Prometheus Alertmanager) contain the team identifier from the Owner field.

Implementation Details

Connection handling

Gatekeeper opens a local TCP socket, accepts client connections, establishes a corresponding connection to the real broker, and forwards bytes in both directions after applying decoding/validation or address rewriting.

Configurable validation

Supported rule features include:

Type checking (string, int, etc.).

Presence validation (required vs. optional).

CEL‑style expressions for complex constraints (e.g., Timestamp>0).

Deployment

Gatekeeper pods are replicated across multiple availability zones. A Kubernetes Service of type LoadBalancer provides a single stable IP and port. The service forwards traffic to ready Gatekeeper pods; health checks ensure only healthy instances receive traffic.

Technical challenges

Java clients cache the broker address returned in the first Metadata response. After a Gatekeeper restart the cached IP may become stale, causing connection failures. The solution is to expose a stable load‑balancer address in the Metadata response, ensuring the cached address remains reachable.

Conclusion

Kafka Gatekeeper adds a configurable pre‑validation layer to Kafka streams, catching data‑quality violations at the source, attributing responsibility via the Owner field, and providing dashboards and alerts for rapid remediation. Although originally built for Ctrip’s FinOps billing system, its generic Kafka‑centric design makes it reusable for broader data‑quality governance scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Backend Cloud Native Kafka Data Quality FinOps Gatekeeper

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.