Twitter’s Kappa Architecture: Scaling Real-Time Processing of Billions of Events
Twitter migrated from a Lambda-based dual‑pipeline system to a Kappa architecture that relies on a single real‑time stream using Kafka, Google Pub/Sub, Dataflow, and BigTable, dramatically reducing latency, increasing throughput, and improving data accuracy for processing billions of daily events.
Background
Twitter’s original software stack was built on a Lambda architecture that combined a batch layer and a speed (real‑time) layer. The platform needed to process up to 4 trillion events per day, generating petabytes of data from sources such as distributed databases, Kafka, and the Twitter event bus.
Old Architecture
The legacy solution used two independent pipelines:
Batch layer : Scalding jobs pre‑processed raw logs from HDFS, client events, or tweet events and stored results in the Manhattan distributed storage system. The batch pipeline ran in a single data center with replicas in two others.
Speed layer : Kafka topics fed a real‑time stream into the Summingbird platform via Heron. Heron’s output was cached in Twitter’s Nighthawk distributed cache and deployed across three data centers.
Supporting components : Summingbird, TimeSeries AggregatoR, and a Data Access Layer provided unified query services on top of both batch and real‑time stores.
While robust, the system faced pressure from ever‑growing data volumes. The interaction and engagement pipeline, which aggregates multi‑dimensional tweet and user interaction metrics, suffered from back‑pressure in Heron topologies, leading to latency spikes and occasional data loss when containers were restarted.
New Architecture (Kappa)
To simplify the pipeline, Twitter adopted a Kappa architecture that relies on a single real‑time stream. The new flow uses internal preprocessing services to translate Kafka events into Google Pub/Sub messages, then processes them with Cloud Dataflow and stores results in BigTable.
The steps are:
Consume raw events from source Kafka topics, transform and remap fields, and write to an intermediate Kafka topic.
Convert the intermediate Kafka records to Pub/Sub format, attaching a UUID for deduplication and additional metadata.
Publish the enriched events to a Google Pub/Sub topic, with aggressive retries to guarantee at‑least‑once delivery.
Dataflow jobs read from Pub/Sub, perform real‑time deduplication and aggregation.
The aggregated results are written to BigTable and also exported to BigQuery for analytics.
Evaluation
Latency dropped from a variable 10 seconds‑10 minutes range to a stable ~10 seconds.
Throughput increased from ~100 MB/s to roughly 1 GB/s.
At‑least‑once publishing combined with Dataflow deduplication yields near‑exactly‑once processing.
Significant cost savings by eliminating a separate batch pipeline.
Higher aggregation accuracy and the ability to handle delayed events.
No event loss during container restarts.
Monitoring Duplicate Percentage
Twitter runs two independent Dataflow pipelines: one routes raw Pub/Sub data directly to BigQuery, while the other writes deduplicated event counts to BigQuery. This setup enables continuous monitoring of the duplicate‑event percentage.
Comparing Old Batch and New Dataflow Results
Both the new pipeline’s deduplicated output and the legacy batch results are loaded into BigQuery. A scheduled query compares duplicate counts, revealing that over 95 % of the new pipeline’s results match the old batch pipeline. The remaining ~5 % discrepancy is mainly due to the old batch pipeline discarding late‑arriving events, which the new real‑time pipeline captures.
Conclusion
By migrating to a Kappa architecture, Twitter achieved substantial improvements in latency and correctness while simplifying the data pipeline to a single streaming flow.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
