How a Small E‑commerce Site Scaled to 10 Million Daily Visits: Real‑World Architecture Lessons
This article details a small‑to‑mid‑size e‑commerce platform’s journey from a few thousand daily page views to ten million, covering business challenges, three architecture evolution stages, key technical solutions, performance optimizations, cost‑control strategies, and practical automation tips.
Preface: As an operations engineer with five years of experience at a small‑to‑mid‑size company, I watched the website grow from a few thousand daily PV to ten million. This article shares that real‑world architecture evolution as a case study, with practical lessons for fellow ops engineers.
Business Background and Challenges
The platform started at roughly 5,000 daily PV and reached ten million within 18 months. Four constraints shaped every architecture decision:
Budget constraints – cannot spend like large enterprises.
Small team – must consider maintenance cost.
Unpredictable growth – architecture must be flexible.
Stability – the site must run 24/7.
Architecture Evolution – Three Stages
Stage 1: Monolithic Application (0‑100k PV/day)
Architecture:
Nginx + PHP‑FPM + MySQL + Redis
Single 4‑core 8 GB server handles everything
Pain Points:
MySQL slow‑query hell: No proper indexes, queries become extremely slow after 1 M rows.
PHP‑FPM process mis‑configuration: Insufficient processes under high concurrency, causing many 502 errors.
Log files fill disk: Missing log rotation, causing midnight alerts.
Solutions:
Enable slow‑query monitoring and regularly optimize SQL and indexes.
Adjust pm.max_children according to server capacity.
Use logrotate for log management and set disk‑usage alerts.
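As a rough illustration of sizing pm.max_children, a common rule of thumb is to divide the memory left over for PHP‑FPM by the average memory of one worker. The sketch below assumes that rule; the function name and the example numbers (memory reserved for other services, per‑worker footprint) are hypothetical and should be replaced with values measured on the actual server.

```python
def estimate_max_children(total_ram_mb: int, reserved_mb: int, avg_worker_mb: int) -> int:
    """Estimate pm.max_children from the memory left over for PHP-FPM workers."""
    if avg_worker_mb <= 0:
        raise ValueError("avg_worker_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    # Never return less than 1, otherwise PHP-FPM could not serve any request.
    return max(1, available_mb // avg_worker_mb)


# Hypothetical numbers: an 8 GB server, 4 GB reserved for MySQL, Redis, Nginx
# and the OS, and roughly 40 MB of resident memory per PHP-FPM worker.
print(estimate_max_children(total_ram_mb=8192, reserved_mb=4096, avg_worker_mb=40))
```

On a single box that also runs MySQL and Redis, the reserved share dominates the result, so it is worth measuring rather than guessing.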
Stage 2: Vertical Scaling (100k‑1M PV/day)
When the single server's CPU and memory utilization reached 80 %, we scaled vertically:
Web server upgraded from 4‑core 8 GB to 8‑core 16 GB.
Database server moved to a dedicated 16‑core 32 GB machine with SSD.
Architecture Adjustments:
Frontend: Nginx + CDN static assets
Application: PHP‑FPM on dedicated servers
Data: MySQL master‑slave + Redis cluster
Monitoring: Zabbix + custom scripts
Key Optimizations:
Database read/write separation – master for writes, slaves for reads; QPS increased by 60 %.
Redis caching – hot data cached, proper TTL, cache‑penetration protection.
CDN integration – all static resources served via CDN, bandwidth cost reduced by 70 %.
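Read/write separation can be pictured as a small routing decision in front of the database connections. The following is a minimal sketch, not the routing layer actually used on the site: the class name, the host names, and the rule of keeping in‑transaction reads on the master are all assumptions for illustration.

```python
import itertools
from typing import Sequence


class ReadWriteRouter:
    """Route SQL statements to the master or to a replica (round-robin)."""

    WRITE_PREFIXES = ("insert", "update", "delete", "replace")

    def __init__(self, master: str, replicas: Sequence[str]):
        self.master = master
        # Fall back to the master when no replica is configured.
        self._replicas = itertools.cycle(list(replicas) or [master])

    def pick(self, sql: str, in_transaction: bool = False) -> str:
        is_write = sql.lstrip().lower().startswith(self.WRITE_PREFIXES)
        # Reads inside a transaction stay on the master so they see their own writes.
        if is_write or in_transaction:
            return self.master
        return next(self._replicas)


router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
print(router.pick("SELECT * FROM products WHERE id = 1"))
print(router.pick("UPDATE products SET stock = stock - 1 WHERE id = 1"))
```

Pinning reads that follow a write to the master is one common way to soften the replication‑lag inconsistency described under the pain points.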
Pain Points:
Master‑slave replication lag caused data inconsistency.
Redis memory shortage led to frequent evictions and cache miss spikes.
Excessive CDN back‑origin traffic saturated the origin bandwidth.
Stage 3: Horizontal Scaling (1M‑10M PV/day)
As the business kept growing, the monolithic architecture became the bottleneck, so we moved to a service‑oriented design:
Final Architecture Diagram:
                        [User]
                           |
                      [CDN + WAF]
                           |
                 [Load Balancer Nginx]
                  /        |        \
             [Web1]     [Web2]     [Web3]            (PHP app servers)
                  \        |        /
                     [API Gateway]
                  /        |        \
  [User Service]  [Product Service]  [Order Service]  (micro‑services)
         |                 |                |
      [MySQL]           [MySQL]          [MySQL]      (business DBs)
                  \        |        /
            [Redis Cluster + MQ Cluster]              (cache & messaging)
                           |
              [Monitoring + Log System]
Core Technical Solutions
1. Load Balancing Strategy
Using Nginx upstream configuration:
upstream backend {
    server 10.0.1.10:9000 weight=3;
    server 10.0.1.11:9000 weight=2;
    server 10.0.1.12:9000 weight=1 backup;
    keepalive 32;
    keepalive_requests 1000;
}
Why:
Weight distribution based on actual server performance tests.
Backup server for failover.
Keepalive reduces TCP connection overhead.
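To see what the 3:2 weighting means in practice, the sketch below simulates a smooth weighted round‑robin schedule, the style of weighted rotation Nginx applies by default. It is an illustration of the idea only, not Nginx's source code, and the function name is made up for this example.

```python
from typing import Dict, List


def smooth_weighted_round_robin(weights: Dict[str, int], n: int) -> List[str]:
    """Return the first n picks of a smooth weighted round-robin schedule."""
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server gains its weight; the current leader is picked
        # and then penalized by the total weight.
        for server, weight in weights.items():
            current[server] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# The backup server (10.0.1.12) is left out: it only receives traffic
# when the primary servers are unavailable.
schedule = smooth_weighted_round_robin({"10.0.1.10": 3, "10.0.1.11": 2}, 5)
print(schedule)
```

Over any five consecutive requests the first server handles three and the second two, interleaved rather than sent in bursts.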
2. MySQL Optimization
Sharding & Partitioning:
User table sharded into 4 databases by user‑id modulo.
Order table partitioned monthly.
Product table read/write separated, master‑slave delay kept under 1 s.
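The two routing rules above (user id modulo 4, one order partition per month) fit in a few lines. This is a sketch under assumed naming: the database names `user_db_N` and the partition names `pYYYYMM` are hypothetical, and only the modulo and monthly rules come from the setup described here.

```python
from datetime import date

SHARD_COUNT = 4  # the user table is split across 4 databases


def user_shard(user_id: int) -> str:
    """Map a user id to its database by modulo."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # Database names are hypothetical; only the modulo rule comes from the article.
    return f"user_db_{user_id % SHARD_COUNT}"


def order_partition(order_date: date) -> str:
    """Name of the monthly partition an order falls into (hypothetical naming)."""
    return f"p{order_date:%Y%m}"


print(user_shard(10))
print(order_partition(date(2024, 3, 15)))
```

Modulo sharding is simple to operate, but changing the shard count later means moving most rows, which is worth weighing before picking the initial number.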
Key Configuration:
# InnoDB buffer pool set to 70 % of memory
innodb_buffer_pool_size = 22G
# Adjust connections based on load
max_connections = 2000
# Slow query threshold
long_query_time = 0.5
3. Redis Cluster Design
Using Redis Sentinel for high availability:
3 Redis instances for master‑slave.
3 Sentinel nodes for monitoring.
Automatic failover – Sentinel promotes a replica, and clients look up the current master through Sentinel.
Cache Design Principles:
Different TTLs for hot data to avoid cache avalanche.
Bloom filter to prevent cache penetration.
Distributed lock to solve cache breakdown.
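Two of these principles are easy to show in isolation: spreading TTLs with random jitter so keys do not expire together, and a Bloom filter that rejects lookups for ids that were never stored. The sketch below is a minimal in‑memory version for illustration; the 10 % spread, the filter size, and the hash count are assumed values, and a production setup would more likely keep the filter in Redis or a shared service.

```python
import hashlib
import random


def ttl_with_jitter(base_seconds: int, spread: float = 0.1) -> int:
    """Return the base TTL plus or minus a random share, so keys do not all expire together."""
    delta = int(base_seconds * spread)
    return base_seconds + random.randint(-delta, delta)


class BloomFilter:
    """Tiny in-memory Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1 << 16, hashes: int = 4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from seeded SHA-256 digests of the item.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


known_products = BloomFilter()
known_products.add("product:1")
print(known_products.might_contain("product:1"))
print(ttl_with_jitter(3600))
```

A request for an id the filter has never seen can be rejected before it touches Redis or MySQL, which is what protects the database from cache penetration.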
4. Monitoring & Alerting System
Three‑layer monitoring:
Infrastructure monitoring – CPU, memory, disk, network (Zabbix).
Application monitoring – response time, error rate, QPS (custom).
Business monitoring – order volume, payment success rate, user activity.
Alert Levels:
P0 – immediate phone + SMS.
P1 – SMS within 5 minutes.
P2 – email notification.
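The application layer of this monitoring (response time, error rate, QPS) and the mapping to alert levels can be sketched as follows. The thresholds are hypothetical placeholders, not the values used in production, and the `Request` record and function names exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    timestamp: float    # seconds since epoch
    status: int         # HTTP status code
    duration_ms: float  # response time in milliseconds


def summarize(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute QPS, 5xx error rate and p95 latency for one time window."""
    total = len(requests)
    if total == 0:
        return {"qps": 0.0, "error_rate": 0.0, "p95_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    p95 = durations[min(total - 1, int(total * 0.95))]
    return {"qps": total / window_seconds, "error_rate": errors / total, "p95_ms": p95}


def alert_level(metrics: Dict[str, float]) -> Optional[str]:
    """Map metrics to P0/P1/P2 using hypothetical thresholds."""
    if metrics["error_rate"] >= 0.05:
        return "P0"  # immediate phone + SMS
    if metrics["error_rate"] >= 0.01:
        return "P1"  # SMS within 5 minutes
    if metrics["p95_ms"] >= 1000:
        return "P2"  # email notification
    return None


window = [Request(float(i), 502 if i < 10 else 200, 100.0) for i in range(100)]
print(alert_level(summarize(window, window_seconds=10)))
```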
Cost‑Control Practices
Cloud Server Cost Optimization
Hybrid cloud – core services on cloud VMs, edge services on physical machines.
Use spot instances for dev/test environments, cutting their cost by 60 %.
Resource Utilization
Containerization – three‑fold increase in deployment density.
Auto‑scaling – expand during peaks, shrink during lows.
Resource pooling – mix CPU‑intensive and I/O‑intensive workloads.
CDN Cost Reduction
Image compression & WebP conversion.
Appropriate cache TTL settings.
Smart DNS for nearest‑node access.
Cold‑hot data separation – cold data stored in object storage.
Performance Optimization Cases
Case 1: API Response Time
Problem: Product detail API latency rose from 200 ms to 2 s.
Investigation:
Monitoring showed MySQL CPU at 99 %.
Slow‑query log revealed an unindexed join.
The execution plan (EXPLAIN) confirmed that the query was not using an index.
Solution:
Add composite index.
Rewrite SQL to avoid unnecessary joins.
Introduce caching layer.
Result: Latency reduced to 50 ms.
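The effect of a composite index on the execution plan can be reproduced on a laptop. The sketch below uses SQLite from the Python standard library as a stand‑in for MySQL, with a simplified single‑table query instead of the original join; the table, column, and index names are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, category_id INTEGER, status INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO products (category_id, status, name) VALUES (?, ?, ?)",
    [(i % 50, i % 2, f"item-{i}") for i in range(5000)],
)

QUERY = "SELECT name FROM products WHERE category_id = ? AND status = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return the plan steps SQLite reports for QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (7, 1)).fetchall()
    # The last column of each row is the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no usable index yet, so a full table scan
conn.execute(
    "CREATE INDEX idx_products_category_status ON products (category_id, status)"
)
plan_after = query_plan(conn)   # the plan should now reference the composite index

print(plan_before)
print(plan_after)
```

In MySQL the equivalent check is running EXPLAIN on the slow statement before and after adding the index and comparing the `key` column.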
Case 2: Cache Breakdown
Problem: Hot product cache expiration caused massive DB load.
Solution: Distributed lock with fallback and retry.
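// Method of a class that holds $this->redis (phpredis client) and $this->db.
public function getProductInfo($productId, $retries = 20) {
    $key = "product:$productId";
    $lockKey = "lock:product:$productId";
    $data = $this->redis->get($key);
    if ($data !== false) {
        return json_decode($data, true);
    }
    // Distributed lock to prevent cache breakdown: only one request rebuilds the cache
    if ($this->redis->set($lockKey, 1, ['nx', 'ex' => 10])) {
        try {
            $data = $this->db->getProduct($productId);
            $this->redis->setex($key, 3600, json_encode($data));
            return $data;
        } finally {
            // Release the lock even if the DB query throws
            $this->redis->del($lockKey);
        }
    }
    if ($retries <= 0) {
        // Fallback: stop waiting and read from the DB directly
        return $this->db->getProduct($productId);
    }
    // Other requests wait 50 ms and retry
    usleep(50000);
    return $this->getProductInfo($productId, $retries - 1);
}
Automation Toolchain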
Deployment Pipeline
GitLab CI/CD + Ansible for automated deployment:
stages:
  - test
  - build
  - deploy

deploy_production:
  stage: deploy
  script:
    - ansible-playbook -i inventory/prod deploy.yml
  only:
    - tags
  when: manual
Deployment Strategies:
Blue‑green deployment for zero downtime.
Canary releases to reduce risk.
One‑click rollback.
Log Collection & Analysis (ELK Stack)
Filebeat collects application logs.
Logstash processes and transforms logs.
Elasticsearch stores and indexes logs.
Kibana visualizes log data.
Standardize logs in JSON, normalize key fields, and mask sensitive information.
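On the application side, JSON logs with masked fields can be produced by a custom formatter before Filebeat picks the files up. The sketch below is one possible shape, assuming that 11‑digit phone numbers are the sensitive data to hide; the field names and the masking rule are illustrative, not the site's actual log schema.

```python
import json
import logging
import re

# Hypothetical masking rule: hide the middle four digits of 11-digit phone numbers.
PHONE_RE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")


def mask(text: str) -> str:
    """Replace the middle digits of phone-like numbers with asterisks."""
    return PHONE_RE.sub(r"\1****\2", text)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": mask(record.getMessage()),
        }
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("payment confirmed for user phone %s", "13812345678")
```

Keeping the key set fixed across services is what makes the fields directly searchable in Elasticsearch without per‑service Logstash parsing rules.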
Key Takeaways & Best Practices
Progressive Evolution: Start simple and evolve architecture as business grows.
Cost‑Benefit Balance: Choose technologies that match team capability and maintenance cost.
Monitoring First: A system without monitoring is flying blind; monitoring matters more than new features.
Automation Priority: Automate everything possible to reduce human error.
Future Planning
Cloud‑Native Transformation: Containerization + Kubernetes for better resource utilization.
Mid‑Platform Architecture: Shared business and data platforms (a "middle platform") that support multiple business lines.
AI‑Driven Operations: Intelligent alerts and automated fault diagnosis.
Active‑Active Multi‑Region: Cross‑region deployment for higher availability.
