Inside Stack Overflow’s 2016 Architecture: Handling 61 Million Daily Requests

The article details Stack Overflow’s 2016 infrastructure (hardware, networking, load balancing, caching, database, and service layers), which served about 61 million more daily HTTP requests than in 2013 while spending hundreds of hours less on processing.

21CTO

First, a snapshot of key metrics from February 9, 2016. Figures in parentheses show the change since November 2013.

HTTP requests: 209,420,973 (+61,336,090)

Page loads: 66,294,789 (+30,199,477)

Outbound traffic: 1,240,266,346,053 bytes (1.24 TB)

Inbound traffic: 569,449,470,023 bytes (569 GB)

SQL queries (HTTP requests): 504,816,843 (+170,244,740)

Redis hits: 5,831,683,114 (+5,418,818,063)

Elastic queries: 17,158,874 (not tracked in 2013)

Tag engine requests: 3,661,134 (+57,716)

SQL query time: 607,073,066 ms (168 h)

Redis hit time: 10,396,073 ms (2.8 h)

The .NET stack now handles about 61 million more requests per day than in 2013 while spending roughly 757 fewer hours of processing time, thanks to early-2015 hardware upgrades and extensive software performance tuning.
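To put the snapshot in perspective, the raw counters can be turned into rates and averages. The sketch below uses only the figures listed above; the function and key names are illustrative, and the results are daily averages, so peak load will be higher than these numbers suggest.

```python
# Figures copied from the February 9, 2016 snapshot above.
HTTP_REQUESTS = 209_420_973
HTTP_REQUESTS_DELTA = 61_336_090   # increase since November 2013
SQL_QUERIES = 504_816_843
SQL_TIME_MS = 607_073_066
REDIS_HITS = 5_831_683_114
REDIS_TIME_MS = 10_396_073
OUTBOUND_BYTES = 1_240_266_346_053

SECONDS_PER_DAY = 24 * 60 * 60
MS_PER_HOUR = 60 * 60 * 1000


def ms_to_hours(ms: int) -> float:
    """Convert a millisecond total into hours."""
    return ms / MS_PER_HOUR


def summarize() -> dict:
    """Derive daily averages from the snapshot counters."""
    baseline_2013 = HTTP_REQUESTS - HTTP_REQUESTS_DELTA
    return {
        "requests_per_second": HTTP_REQUESTS / SECONDS_PER_DAY,
        "request_growth_pct": 100 * HTTP_REQUESTS_DELTA / baseline_2013,
        "sql_hours": ms_to_hours(SQL_TIME_MS),
        "redis_hours": ms_to_hours(REDIS_TIME_MS),
        "avg_sql_ms": SQL_TIME_MS / SQL_QUERIES,
        "avg_redis_ms": REDIS_TIME_MS / REDIS_HITS,
        "avg_outbound_mbps": OUTBOUND_BYTES * 8 / SECONDS_PER_DAY / 1e6,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4f}")
```

On these numbers the load balancers see roughly 2,400 requests per second on average, a SQL query averages about 1.2 ms, and average outbound traffic is a little under 115 Mbps.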

Hardware upgrades include:

4 SQL Server database servers (two with updated hardware)

11 IIS web servers (all upgraded)

2 Redis cache/message servers (upgraded)

3 tag‑engine application servers (two new)

3 Elasticsearch search servers (same as 2013)

4 HAProxy load‑balancer servers (two added for CloudFlare CDN)

2 Cisco Nexus 5596 switches with 10 Gbps NICs, plus 2 Fortinet 800C firewalls (replacing Cisco ASA)

2 Cisco ASR‑1001 routers (replacing Cisco 3945)

2 Cisco ASR‑1001‑x routers

Core operational rules include mandatory backups, at least 2 × 10 Gbps bandwidth for all servers and switches, dual power supplies with UPS, rack‑level redundancy, and active‑active disaster recovery across New York and Colorado data centers.

Network Services

Users reach Stack Overflow via the Internet, accelerated by CloudFlare’s global CDN. Traffic enters through four ISPs (Level 3, Zayo, Cogent, Lightower) and four routers, with BGP load-balancing. Engineers recommend a 10 Gbps MPLS link between the two data centers for rapid failover.

Load Balancing (HAProxy)

HAProxy 1.5.15 on CentOS 7 terminates TLS/SSL; a later HAProxy upgrade is planned to add HTTP/2 support. Each load balancer has two 10 Gbps NICs, and TLS session caching in RAM reduces CPU overhead. HAProxy monitors traffic, enforces rate limits, and logs alerts.
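The source does not say how the rate limits are configured. As a rough illustration of the general idea only, here is a minimal per-client sliding-window limiter in Python. The class name, the limit, and the window are all assumptions made for the example, and this is not HAProxy's implementation.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # client key -> timestamps of requests accepted inside the window
        self._hits = defaultdict(deque)

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject, do not record
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowRateLimiter(limit=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0, 10.0):
        print(t, limiter.allow("203.0.113.7", t))
```

Passing the timestamp in explicitly keeps the sketch deterministic and easy to test; a real deployment would read a monotonic clock and would also need to evict idle clients.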

Web Layer (IIS 8.5, ASP.NET MVC 5.2.3, .NET 4.6.1)

Traffic is distributed to nine primary web servers and two auxiliary servers for metadata. All Stack Overflow, Careers, and Stack Exchange sites run on the primary fleet. Monitoring via Opserver shows the distribution (see Figure 3).

Service Layer (IIS, ASP.NET MVC, .NET, HTTP.SYS)

The service layer runs on Windows Server 2012 R2, providing tag‑engine services (via http.sys) and APIs (via IIS). Redundant deployments ensure up to nine‑fold fault tolerance. Tag data is refreshed every two minutes, and the layer also serves search and navigation data.

Caching & Pub/Sub (Redis)

Redis clusters with 256 GB memory operate in master/slave mode, handling ~160 billion ops per month (consistent with the roughly 5.8 billion daily Redis hits listed above) with CPU usage under 2%. L1 cache resides on web servers; L2 cache is Redis. Misses in L1 fall back to L2, where values are stored as Protobuf using protobuf-net and accessed via StackExchange.Redis. If both caches miss, the system queries the primary data source and repopulates the caches.
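The lookup order described above (L1, then L2, then the primary data source) can be sketched as follows. This is a simplified illustration, not Stack Overflow's code: a plain dictionary stands in for Redis, JSON stands in for Protobuf serialization, and the class and parameter names are hypothetical. Expiry and invalidation are left out.

```python
import json
from typing import Any, Callable, Dict


class TwoLevelCache:
    """L1 is local to one web server; L2 is shared between servers."""

    def __init__(self, l2: Dict[str, bytes]) -> None:
        self.l1: Dict[str, Any] = {}  # in-process cache on this web server
        self.l2 = l2                  # shared store; stand-in for Redis
        self.source_reads = 0         # how often we fell through to the source

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        if key in self.l1:            # L1 hit: no network, no deserialization
            return self.l1[key]
        raw = self.l2.get(key)
        if raw is not None:           # L2 hit: deserialize the shared copy
            value = json.loads(raw)
        else:                         # both missed: query the primary source
            value = load(key)
            self.source_reads += 1
            self.l2[key] = json.dumps(value).encode("utf-8")
        self.l1[key] = value          # repopulate L1 for the next request
        return value


if __name__ == "__main__":
    shared = {}
    web1, web2 = TwoLevelCache(shared), TwoLevelCache(shared)
    load = lambda key: {"id": key, "title": "example"}
    web1.get("question:1", load)  # misses both levels, loads from the source
    web2.get("question:1", load)  # misses L1 but finds the value in L2
    print(web1.source_reads, web2.source_reads)
```

The point of the layering is that a value loaded by one web server becomes available to all the others through L2, so the primary data source is queried once rather than once per server.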

WebSockets (NetGain)

NetGain provides a lightweight, high‑performance real‑time messaging middleware. During peak periods it supports up to 500 k concurrent WebSocket connections, delivering notifications, votes, answers, and comments.
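The fan-out pattern behind such notifications can be illustrated with a small in-memory publish/subscribe hub. This is only a sketch of the pattern: it does not use NetGain's API or real sockets, and the class name and channel names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class PubSubHub:
    """Deliver each published message to every subscriber of its channel."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        # .get avoids creating empty entries for channels nobody listens to.
        handlers = self._subscribers.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)  # number of subscribers that were notified


if __name__ == "__main__":
    hub = PubSubHub()
    inbox_a, inbox_b = [], []
    # Each list stands in for one open WebSocket connection.
    hub.subscribe("question:42", inbox_a.append)
    hub.subscribe("question:42", inbox_b.append)
    print(hub.publish("question:42", "new answer"))
    print(inbox_a, inbox_b)
```

In a real deployment each handler would write to a client connection, and the hub itself would be fed from a shared message bus so that every server sees every event.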

Search (Elasticsearch)

Each data center runs an Elasticsearch 1.4 cluster with three SSD‑backed nodes (192 GB RAM, dual 10 Gbps NICs). The custom StackExchange.Elastic client, not open‑sourced, powers question‑answer search, chosen for scalability and cost efficiency.

Database (SQL Server)

SQL Server is the sole source of truth; all Elastic and Redis data originate here. Two AlwaysOn Availability Groups provide primary‑secondary replication across New York and Colorado, with asynchronous backups.

Cluster 1: Dell R720xd, 384 GB RAM, 4 TB SSD, dual 12-core CPUs.

Cluster 2: Dell R730xd, 768 GB RAM, 4 TB SSD, dual 8-core CPUs.

CPU utilization remains low under normal load, spiking only during cache‑maintenance tasks (see Figure 7).

Source: http://www.infoq.com/cn/news/2016/03/Stack-Overflow-architecture-insi