Inside Stack Overflow’s Redundant Architecture: How It Scales to 170 Million Daily Visits
This article dissects Stack Overflow’s end‑to‑end architecture—covering its dual‑data‑center redundancy, physical and logical server layout, load balancing, web and service tiers, caching strategy, push system, search cluster, database design, and monitoring—showcasing how the platform achieves massive scalability and high availability.
Architecture Overview
Stack Overflow, the programming Q&A community founded by Jeff Atwood and Joel Spolsky in 2008, ranks among the world's most visited sites, with over 170 million page views per day. Its architecture combines outsourced services with a large set of open‑source components and can be broken down into eight layers:
Internet
Load Balancing
Web Tier
Service Tier
Cache
Push
Search
Database
Architecture diagram:
Architecture Principles
Everything is redundant. All critical components are duplicated across two data centers (New York and Colorado) with continuous backup.
Physical Architecture
4 Microsoft SQL Server instances (2 on new hardware)
11 IIS web servers (new hardware)
2 Redis servers (new hardware)
3 tag‑engine servers (2 on new hardware)
3 Elasticsearch nodes (new hardware)
4 HAProxy load‑balancers (2 added for CloudFlare support)
2 network devices (Nexus 5596 core + 2232TM Fabric Extender, upgraded to 10 Gbps)
2 Fortinet 800C firewalls (replacing Cisco ASA)
2 Cisco ASR‑1001 routers (replacing Cisco 3945)
2 Cisco ASR‑1001‑x routers
Logical Architecture
The Internet
DNS services: outsourced to CloudFlare, with an in‑house DNS server kept as a fallback.
Load Balancers
HAProxy 1.5.15 on CentOS 7, which also terminates TLS traffic. At the time of writing, the team planned to evaluate HAProxy 1.7 for HTTP/2 support.
Web Tier
IIS 8.5, ASP.NET MVC 5.2.3, .NET 4.6.1.
Service Tier
IIS, ASP.NET MVC 5.2.3, .NET 4.6.1, and HTTP.SYS.
Cache
Caching has two levels: L1 is the HTTP cache on each web server, and L2 is Redis. If both miss, the database is queried and the result populates both caches. Cache invalidation follows a publish/subscribe model so that the L1 caches on all web servers stay consistent. Redis CPU usage stays below 2%.
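The read path and the invalidation flow described above can be sketched in a few lines. This is a minimal in-process model, not Stack Overflow's code: plain dictionaries stand in for the web-server HTTP cache and for Redis, and the `Bus` and `WebServerCache` classes, the `load_from_db` callback, and the `question:1` key are all hypothetical names chosen for illustration.

```python
class Bus:
    """Tiny in-process stand-in for a pub/sub channel (hypothetical)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)


class WebServerCache:
    """Local L1 dict in front of a shared L2 dict (standing in for Redis)."""

    def __init__(self, l2, bus, load_from_db):
        self.l1 = {}
        self.l2 = l2
        self.bus = bus
        self.load_from_db = load_from_db
        # Every web server listens for invalidation messages.
        bus.subscribe(self._evict_local)

    def _evict_local(self, key):
        self.l1.pop(key, None)

    def get(self, key):
        if key in self.l1:                    # L1 hit
            return self.l1[key]
        if key in self.l2:                    # L2 hit
            value = self.l2[key]
        else:                                 # both missed: query the database
            value = self.load_from_db(key)
            self.l2[key] = value
        self.l1[key] = value                  # populate L1 on the way out
        return value

    def invalidate(self, key):
        # Drop the shared copy, then tell every server to drop its local copy.
        self.l2.pop(key, None)
        self.bus.publish(key)


database = {"question:1": "v1"}
db_reads = []


def load_from_db(key):
    db_reads.append(key)
    return database[key]


shared_l2 = {}
bus = Bus()
web1 = WebServerCache(shared_l2, bus, load_from_db)
web2 = WebServerCache(shared_l2, bus, load_from_db)

first = web1.get("question:1")     # misses L1 and L2, reads the database
second = web2.get("question:1")    # misses its own L1, hits the shared L2

database["question:1"] = "v2"
web1.invalidate("question:1")      # both servers drop their local copy
third = web2.get("question:1")     # re-reads the database and sees the new value
```

Without the publish step, `web2` would keep serving the stale value from its local cache after the update, which is the inconsistency the publish/subscribe invalidation is meant to prevent.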
Push
The open‑source library NetGain uses WebSockets to push real‑time updates (notifications, vote counts, new navigation counts, answers, and comments). At peak, about 500,000 concurrent WebSocket connections are maintained, and some have stayed open for over 18 months.
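The fan-out behind this kind of push system can be modeled without any network code. The sketch below does not use NetGain or real WebSockets; `Connection`, `PushHub`, and the channel names are hypothetical, and each connection simply records the messages that would be written to its socket.

```python
import json
from collections import defaultdict


class Connection:
    """Stand-in for one open WebSocket; 'sent' records outgoing messages."""

    def __init__(self, name):
        self.name = name
        self.sent = []

    def send(self, text):
        self.sent.append(text)


class PushHub:
    """Maps channel names to the set of connections subscribed to them."""

    def __init__(self):
        self._channels = defaultdict(set)

    def subscribe(self, conn, channel):
        self._channels[channel].add(conn)

    def disconnect(self, conn):
        # Remove a closed connection from every channel it joined.
        for subscribers in self._channels.values():
            subscribers.discard(conn)

    def publish(self, channel, payload):
        """Send one message to every subscriber; return how many received it."""
        message = json.dumps({"channel": channel, "data": payload}, sort_keys=True)
        subscribers = list(self._channels.get(channel, ()))
        for conn in subscribers:
            conn.send(message)
        return len(subscribers)


hub = PushHub()
alice, bob = Connection("alice"), Connection("bob")
hub.subscribe(alice, "question:42:votes")
hub.subscribe(bob, "question:42:votes")
hub.subscribe(bob, "inbox:bob")

delivered = hub.publish("question:42:votes", {"score": 17})
hub.disconnect(alice)
delivered_after = hub.publish("question:42:votes", {"score": 18})
```

A real deployment would also need heartbeats and cleanup for connections that stay open for very long periods, which this sketch leaves out.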
Search
Elasticsearch, with three nodes per cluster. Solr was not chosen because, at decision time, it did not support searching across many indexes at once. The clusters had not yet moved to Elasticsearch 2.x because that upgrade required a full re‑index.
Database
SQL Server is used with a deliberately simple schema—only one stored procedure, slated for removal in favor of pure code.
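To illustrate what keeping query logic in application code rather than in stored procedures can look like, here is a small sketch. It is an assumption-laden stand-in: SQLite from the Python standard library replaces SQL Server, and the `answers` table, its columns, and the `top_answers` function are hypothetical.

```python
import sqlite3


def top_answers(conn, question_id, limit=3):
    """Query kept in application code, with bound parameters instead of a stored procedure."""
    rows = conn.execute(
        "SELECT id, score FROM answers WHERE question_id = ? "
        "ORDER BY score DESC, id ASC LIMIT ?",
        (question_id, limit),
    )
    return rows.fetchall()


# Hypothetical schema and sample rows, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (id INTEGER PRIMARY KEY, question_id INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO answers (id, question_id, score) VALUES (?, ?, ?)",
    [(1, 42, 5), (2, 42, 12), (3, 42, 7), (4, 7, 99)],
)

result = top_answers(conn, 42, limit=2)
```

Because the query lives next to the code that calls it, it can be versioned, reviewed, and deployed together with the application.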
Monitoring System
Opserver, a lightweight monitoring tool built on ASP.NET MVC, tracks:
Servers
SQL clusters/instances
Redis
Elasticsearch
Exception logs
HAProxy
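A dashboard that covers this many component types usually reduces many individual checks to one overall status. The sketch below shows that idea only; it is not Opserver's API, and `Status`, `poll`, `overall`, and the sample checks are hypothetical names.

```python
from enum import IntEnum


class Status(IntEnum):
    GOOD = 0
    WARNING = 1
    CRITICAL = 2


def poll(checks):
    """Run every check; a check that raises is reported as CRITICAL."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            results[name] = Status.CRITICAL
    return results


def overall(results):
    """The overall status is the worst individual status."""
    return max(results.values(), default=Status.GOOD)


def failing_check():
    raise ConnectionError("node unreachable")


checks = {
    "redis": lambda: Status.GOOD,
    "elasticsearch": failing_check,
    "haproxy": lambda: Status.WARNING,
}

results = poll(checks)
worst = overall(results)
```

Treating an exception as CRITICAL means that a monitor which cannot reach its target never shows up as healthy.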