Blog Logo

Dec 2 2024 ~ 5 min read

Twitter's New System Design: Processing Billions of Events in Real Time


twitter-sys-hero

Overview

Every day, Twitter processes approximately 400 billion events, generating petabyte-scale data from diverse sources like Hadoop, Kafka, Google Cloud Storage (GCS), BigQuery, and more. This immense data flow powers critical metrics for ads and engagement services. Transitioning to a more efficient architecture was essential to meet the demands of real-time analytics while reducing complexity and costs.

This blog explores Twitter’s architectural journey, detailing how the adoption of Kappa Architecture enabled it to streamline operations, improve accuracy, and deliver real-time insights.

What Is the Kappa Architecture?

The Kappa Architecture is a modern approach to handling streaming data, designed to unify real-time and historical data processing on a single technology stack. Unlike the Lambda Architecture, which separates batch and streaming components, Kappa treats all data as a continuous stream, offering simplicity and flexibility.

Key Features of Kappa Architecture

  1. Unified Technology Stack

    • Both real-time and historical data processing use the same framework.
    • Data is processed as it arrives, enabling seamless analytics.
  2. Core Components

    • Messaging Engine: Systems like Apache Kafka store incoming data streams.
    • Stream Processing Engine: Processes and analyzes the data in near real-time.
    • Analytics Database: Stores transformed data for querying and analysis.
  3. On-Demand Analytics

    • Historical data can be queried ad hoc for rapid insights, eliminating the need for complex batch systems.

Old Architecture: The Lambda Approach

Twitter initially used the Lambda Architecture, combining batch and real-time processing.

Workflow

  1. Batch Processing:

    • Used Scalding on HDFS for hourly processing.
    • Results stored in Manhattan, a distributed storage system.
  2. Real-Time Processing:

    • Leveraged Heron (on Kafka) to process streaming data.
    • Results cached in Nighthawk for fast access.
  3. Query Service:

    • TSAR consolidated data from both systems for customer-facing services.

Challenges

  • Data Loss: Back pressure in real-time pipelines caused event loss.
  • Latency: Recovering from lags in Heron pipelines was slow.
  • Batch Costs: Managing petabyte-scale batch pipelines increased costs and complexity.

Transition to Kappa Architecture

To address these challenges, Twitter adopted the Kappa Architecture, embracing a streaming-only model that eliminated batch systems.

twitter-sys-design

Pipeline Workflow

On-Premise: Kafka to PubSub

  • Preprocessing Pipelines: Events from Kafka were transformed and enriched.
    Unique identifiers (UUIDs) ensured deduplication.

  • Internal PubSub:
    Guaranteed at-least-once semantics, ensuring all events were processed.

Google Cloud: Dataflow on PubSub

  • Deduplication and Aggregation: Dataflow workers ensured real-time processing with high accuracy.

  • Storage: Aggregated counts were stored in Bigtable for high-throughput querying.

Advantages of Kappa Architecture for Twitter

Simplicity

  • Unified processing pipeline eliminates the need for separate batch and stream systems.
  • Reduced operational overhead with a single stack.

Flexibility

  • Processing logic can be easily modified and re-run on historical data streams.
  • Real-time and historical analytics are seamlessly integrated.

Low Latency

  • Designed for near real-time analytics, reducing delays in processing and querying.

Serving Layer

Twitter’s serving layer integrates Bigtable (for low-latency access) and BigQuery (for in-depth analytics). It supports:

  • Millions of events per second.
  • Sub-10-second latency for customer queries.

Validation and Metrics

Deduplication Accuracy: Compared raw vs. deduplicated events using BigQuery.

Counts Comparison: Matched real-time counts with legacy batch counts, showing <5% discrepancy due to late events missed by batch systems.

MetricImprovement
LatencySignificantly reduced vs. Heron pipelines.
ThroughputHandles high traffic with stability.
Event LossVirtually eliminated with deduplication.

Comparison: Kappa vs. Lambda Architecture

FeatureKappa ArchitectureLambda Architecture
Processing ModesUnified streaming and historicalSeparate batch and streaming
Technology StackSingle stackMultiple technologies
ComplexitySimplerMore complex
Historical AnalyticsStream reprocessingBatch tools (e.g., Hadoop)
LatencyLowerHigher
ScalabilityOptimized for real-timeBetter for large batch data

Benefits for Twitter

Accuracy: Achieved near exactly-once processing with deduplication, improving accuracy over the Lambda approach.

Latency: Eliminated delays by focusing on streaming-only architecture.

Cost Efficiency: Simplified operations, reducing resource demands and operational costs.

Limitations of Kappa Architecture

Batch Data at Scale: Less efficient for analyzing massive datasets compared to batch systems like Hadoop.

Performance Demands: Requires high-speed processing engines to handle both real-time and historical data with low latency.

Final Thoughts

By transitioning to the Kappa Architecture, Twitter has achieved:

  • Low latency and high accuracy.
  • Greater scalability and stability.
  • Simplified operations, leading to significant cost savings.

The Kappa Architecture has proven to be a powerful solution for modern analytics, balancing simplicity, flexibility, and performance. For organizations managing streaming data, it offers an efficient path to real-time insights and streamlined operations.

Sources

You may also like


Headshot of Samarth

Hi, I'm Samarth. I'm a software engineer based in Los Angeles. You can follow me on Twitter, see some of my work on GitHub, or read more about me on LinkedIn.