
How a Small E‑commerce Site Scaled to 10 Million Daily Visits: Real‑World Architecture Lessons

This article details a small‑to‑mid‑size e‑commerce platform’s journey from a few thousand daily page views to ten million, covering business challenges, three architecture evolution stages, key technical solutions, performance optimizations, cost‑control strategies, and practical automation tips.

Ops Community

Jul 24, 2025

Preface: As an operations engineer with five years at a small-to-mid-size company, I watched our website grow from a few thousand daily page views (PV) to ten million. This article shares that architecture evolution as a real-world case study, with practical lessons for fellow ops engineers.

Business Background and Challenges

The platform started at about 5,000 daily PV and grew to ten million within 18 months. The core challenges were a limited budget, a small team, unpredictable traffic growth, and the need for 24/7 stability:

Budget constraints – cannot spend like large enterprises.

Small team – must consider maintenance cost.

Unpredictable growth – architecture must be flexible.

Stability – the site must run 24/7.

Architecture Evolution – Three Stages

Stage 1: Monolithic Application (0‑100k PV/day)

Architecture:

Nginx + PHP‑FPM + MySQL + Redis
Single 4‑core 8 GB server handles everything

Pain Points:

MySQL slow-query hell: Without proper indexes, queries became extremely slow once tables passed 1 M rows.

PHP-FPM process misconfiguration: Too few worker processes under high concurrency, causing many 502 errors.

Log files filling the disk: Log rotation was missing, which triggered midnight disk alerts.

Solutions:

Enable slow‑query monitoring and regularly optimize SQL and indexes.

Adjust pm.max_children according to server capacity.

Use logrotate for log management and set disk‑usage alerts.
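A common rule of thumb for sizing pm.max_children is to divide the memory left over for PHP-FPM by the average memory of one worker. The sketch below shows that arithmetic only; the reserved memory and the per-worker size are hypothetical figures, not measurements from this site, and should be replaced with values observed on the actual server.

```python
def estimate_max_children(total_ram_mb: int, reserved_mb: int, avg_worker_mb: int) -> int:
    """Rough upper bound for pm.max_children.

    total_ram_mb  - physical memory of the server
    reserved_mb   - memory kept for MySQL, Redis, Nginx and the OS (assumption)
    avg_worker_mb - average resident memory of one PHP-FPM worker (assumption)
    """
    if avg_worker_mb <= 0:
        raise ValueError("avg_worker_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    # Never return less than one worker, even on a badly over-committed box.
    return max(1, available_mb // avg_worker_mb)


# Hypothetical example: 8 GB server, 4 GB reserved for other services, 60 MB per worker.
print(estimate_max_children(total_ram_mb=8192, reserved_mb=4096, avg_worker_mb=60))
```

On a single server that also hosts MySQL and Redis, the reserved share is usually the dominant term, which is one reason the later stages move the database to its own machine.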

Stage 2: Vertical Scaling (100k‑1M PV/day)

When the single server's CPU and memory utilization reached 80 %, we scaled vertically:

Web server upgraded from 4 cores / 8 GB to 8 cores / 16 GB.

Database server moved to a dedicated 16‑core 32 GB machine with SSD.

Architecture Adjustments:

Frontend: Nginx + CDN static assets
Application: PHP‑FPM on dedicated servers
Data: MySQL master‑slave + Redis cluster
Monitoring: Zabbix + custom scripts

Key Optimizations:

Database read/write separation – master for writes, slaves for reads, raising QPS by 60 %.

Redis caching – hot data cached, proper TTL, cache‑penetration protection.

CDN integration – all static resources served via CDN, bandwidth cost reduced by 70 %.
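Read/write separation can be pictured as a small router in the data-access layer. The following is a minimal sketch, not the site's actual code: the class name, the session-based "sticky" window, and the one-second default are assumptions added for illustration. The sticky window sends a session's reads to the master for a short time after it writes, which is one common way to hide replication lag from the user who just made a change.

```python
import random
import time


class ReadWriteRouter:
    """Route writes to the master and reads to replicas (illustrative sketch)."""

    def __init__(self, master, replicas, sticky_seconds=1.0, clock=time.monotonic):
        self.master = master
        self.replicas = list(replicas)
        self.sticky_seconds = sticky_seconds  # assumption: tune to observed replication lag
        self.clock = clock
        self._last_write = {}  # session id -> time of last write (unbounded in this sketch)

    def for_write(self, session_id):
        # Remember when this session last wrote so its next reads can be pinned.
        self._last_write[session_id] = self.clock()
        return self.master

    def for_read(self, session_id):
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.sticky_seconds:
            return self.master  # read-your-own-writes: replica may still be behind
        if not self.replicas:
            return self.master
        return random.choice(self.replicas)


router = ReadWriteRouter("master", ["replica-1", "replica-2"])
print(router.for_read("session-a"))   # one of the replicas
print(router.for_write("session-a"))  # master
print(router.for_read("session-a"))   # master, because the session wrote just now
```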

Pain Points:

Master‑slave replication lag caused data inconsistency.

Redis memory shortage led to frequent evictions and cache miss spikes.

Excessive CDN back‑origin traffic saturated the origin bandwidth.

Stage 3: Horizontal Scaling (1M‑10M PV/day)

As the business kept growing, the monolithic architecture became the bottleneck, so we moved to a service-oriented design:

Final Architecture Diagram:

                [User]
                  |
               [CDN + WAF]
                  |
            [Load Balancer Nginx]
               /   |   \
          [Web1] [Web2] [Web3] (PHP app servers)
               \   |   /
               [API Gateway]
               /   |   \
   [User Service] [Product Service] [Order Service] (micro‑services)
               |      |      |
          [MySQL] [MySQL] [MySQL] (business DBs)
               \      |      /
                [Redis Cluster + MQ Cluster] (cache & messaging)
                                 |
                         [Monitoring + Log System]

Core Technical Solutions

1. Load Balancing Strategy

Using Nginx upstream configuration:

upstream backend {
    server 10.0.1.10:9000 weight=3;
    server 10.0.1.11:9000 weight=2;
    server 10.0.1.12:9000 weight=1 backup;
    keepalive 32;
    keepalive_requests 1000;
}

Why:

Weight distribution based on actual server performance tests.

Backup server for failover.

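Keepalive reduces TCP connection overhead by reusing upstream connections. For HTTP upstreams this typically also requires proxy_http_version 1.1 and an empty Connection header in the proxying location; for FastCGI backends, fastcgi_keep_conn on.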
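To see what the weights mean in practice, the sketch below simulates smooth weighted round-robin, one common way to implement weighted balancing. It is an illustration of the idea rather than a copy of Nginx internals, and it leaves out the backup server, which only receives traffic when the primary servers are unavailable.

```python
def smooth_weighted_round_robin(weights, n):
    """Return n picks using smooth weighted round-robin over {server: weight}."""
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server gains its weight, the leader is picked and pays back the total.
        for server, weight in weights.items():
            current[server] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Weights taken from the upstream block above (backup server excluded).
picks = smooth_weighted_round_robin({"10.0.1.10": 3, "10.0.1.11": 2}, 5)
print(picks)
```

Over any five consecutive requests the weight-3 server receives three and the weight-2 server receives two, interleaved rather than in bursts.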

2. MySQL Optimization

Sharding & Partitioning:

User table sharded into 4 databases by user‑id modulo.

Order table partitioned monthly.

Product table read/write separated, master‑slave delay kept under 1 s.
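The routing rules above are simple enough to express in a few lines. In this sketch the database names (user_db_0 to user_db_3) and the partition naming scheme (pYYYYMM) are hypothetical; only the "user id modulo 4" and "one partition per month" rules come from the article.

```python
from datetime import date

SHARD_COUNT = 4  # user table is split across 4 databases


def user_shard(user_id: int) -> str:
    """Pick the database for a user by user-id modulo (database names are hypothetical)."""
    return f"user_db_{user_id % SHARD_COUNT}"


def order_partition(order_date: date) -> str:
    """Name of the monthly partition an order falls into (naming scheme is hypothetical)."""
    return f"p{order_date:%Y%m}"


print(user_shard(10))                        # user_db_2
print(order_partition(date(2025, 7, 24)))    # p202507
```

Modulo sharding is easy to reason about, but changing the shard count later means moving most rows, so the count is worth choosing with future growth in mind.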

Key Configuration:

# InnoDB buffer pool set to 70 % of memory
innodb_buffer_pool_size = 22G
# Adjust connections based on load
max_connections = 2000
# Slow query threshold
long_query_time = 0.5

3. Redis Cluster Design

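Using Redis Sentinel (rather than Redis Cluster mode) for high availability:

3 Redis instances in a master-slave replication group.

3 Sentinel nodes for monitoring and failover.

Sentinel-aware clients that switch to the new master automatically after a failover.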

Cache Design Principles:

Staggered TTLs for hot data, so keys do not all expire at once (cache avalanche).

Bloom filter to prevent cache penetration.

Distributed lock to solve cache breakdown.
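Two of these principles can be sketched with the standard library alone. The snippet below is an illustration under assumptions: the 10 % jitter ratio, the filter size, and the hash count are arbitrary example values, and a production setup would more likely use a Bloom filter provided by Redis or a maintained library than a hand-rolled one.

```python
import hashlib
import random


def jittered_ttl(base_seconds: int, jitter_ratio: float = 0.1) -> int:
    """Spread expirations around the base TTL so hot keys do not expire together."""
    spread = int(base_seconds * jitter_ratio)
    return base_seconds + random.randint(-spread, spread)


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 8192, hash_count: int = 4):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive hash_count bit positions by salting one hash function.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


known_products = BloomFilter()
known_products.add("product:1001")
# Only look up the cache and database when the id might exist at all.
print(known_products.might_contain("product:1001"))
print(jittered_ttl(3600))
```

Requests for ids the filter has never seen can be rejected before they reach Redis or MySQL, which is what blunts cache penetration.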

4. Monitoring & Alerting System

Three‑layer monitoring:

Infrastructure monitoring – CPU, memory, disk, network (Zabbix).

Application monitoring – response time, error rate, QPS (custom).

Business monitoring – order volume, payment success rate, user activity.

Alert Levels:

P0 – immediate phone + SMS.

P1 – SMS within 5 minutes.

P2 – email notification.
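The alert levels map naturally onto a small routing table. This is a sketch only: the dictionary layout and the function name are hypothetical, and the real delivery of calls, SMS, or email would go through whatever notification gateway the team already uses.

```python
# Channels and deadlines per level, as listed above (None = no deadline stated).
ALERT_POLICY = {
    "P0": {"channels": ["phone", "sms"], "notify_within_seconds": 0},
    "P1": {"channels": ["sms"], "notify_within_seconds": 300},
    "P2": {"channels": ["email"], "notify_within_seconds": None},
}


def route_alert(level: str, message: str) -> list[str]:
    """Return one formatted notification per channel configured for the level."""
    policy = ALERT_POLICY.get(level)
    if policy is None:
        raise ValueError(f"unknown alert level: {level}")
    return [f"{channel}: [{level}] {message}" for channel in policy["channels"]]


for line in route_alert("P0", "MySQL master unreachable"):
    print(line)
```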

Cost‑Control Practices

Cloud Server Cost Optimization

Hybrid cloud – core services on cloud VMs, edge services on physical machines.

Use spot instances for dev/test environments to cut costs by 60 %.

Resource Utilization

Containerization – three‑fold increase in deployment density.

Auto‑scaling – expand during peaks, shrink during lows.

Resource pooling – mix CPU‑intensive and I/O‑intensive workloads.
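Auto-scaling decisions often follow a proportional rule: scale the instance count by the ratio of observed load to target load, then clamp the result. The sketch below assumes CPU utilization as the signal; the 60 % target and the minimum and maximum instance counts are hypothetical values, not the site's settings.

```python
import math


def desired_instances(current: int, cpu_percent: float, target_percent: float = 60.0,
                      min_instances: int = 2, max_instances: int = 10) -> int:
    """Proportional scaling rule, clamped to a safe range (all thresholds are assumptions)."""
    raw = math.ceil(current * cpu_percent / target_percent)
    return max(min_instances, min(max_instances, raw))


print(desired_instances(current=3, cpu_percent=90))  # scale out during a peak
print(desired_instances(current=3, cpu_percent=20))  # scale in, but never below the minimum
```

The lower bound keeps enough capacity for failover during quiet hours, and the upper bound caps the bill if a metric misbehaves.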

CDN Cost Reduction

Image compression & WebP conversion.

Appropriate cache TTL settings.

Smart DNS for nearest‑node access.

Cold‑hot data separation – cold data stored in object storage.

Performance Optimization Cases

Case 1: API Response Time

Problem: Product detail API latency rose from 200 ms to 2 s.

Investigation:

Monitoring showed MySQL CPU at 99 %.

Slow‑query log revealed an unindexed join.

The execution plan (EXPLAIN) confirmed that the join was not using an index.

Solution:

Add composite index.

Rewrite SQL to avoid unnecessary joins.

Introduce caching layer.

Result: Latency reduced to 50 ms.
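The effect of a composite index on the execution plan can be reproduced on a laptop. The sketch below uses SQLite from the Python standard library as a stand-in for MySQL, so the plan wording differs from MySQL's EXPLAIN output; the table and column names are hypothetical, since the article does not show the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_sku ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, status INTEGER, price REAL)"
)
query = "SELECT price FROM product_sku WHERE product_id = ? AND status = ?"


def plan(sql: str) -> str:
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (1, 1)).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(query)  # full table scan: no usable index yet
conn.execute("CREATE INDEX idx_product_status ON product_sku (product_id, status)")
after = plan(query)   # the composite index now covers both filter columns

print("before:", before)
print("after: ", after)
```

Checking the plan before and after a change, rather than only the latency, makes it clear whether the index is the reason a query got faster.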

Case 2: Cache Breakdown

Problem: Hot product cache expiration caused massive DB load.

Solution: Distributed lock with fallback and retry.

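// $redis is a connected phpredis client; $db is the site's data-access object.
function getProductInfo($redis, $db, $productId, $maxRetries = 20) {
    $key = "product:$productId";
    $lockKey = "lock:product:$productId";
    for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
        $data = $redis->get($key);
        if ($data !== false) {
            return json_decode($data, true);
        }
        // Distributed lock to prevent cache breakdown: only one request rebuilds the cache
        $token = bin2hex(random_bytes(16));
        if ($redis->set($lockKey, $token, ['nx', 'ex' => 10])) {
            try {
                $data = $db->getProduct($productId);
                $redis->setex($key, 3600, json_encode($data));
                return $data;
            } finally {
                // Release the lock only if we still own it, even when the DB call throws.
                // (get + del is not atomic; a Lua script would close that gap.)
                if ($redis->get($lockKey) === $token) {
                    $redis->del($lockKey);
                }
            }
        }
        // Other requests wait 50 ms and retry (bounded loop instead of unbounded recursion)
        usleep(50000);
    }
    // Fallback: the cache was not rebuilt in time, read from the database directly
    return $db->getProduct($productId);
}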

Automation Toolchain

Deployment Pipeline

GitLab CI/CD + Ansible for automated deployment:

stages:
- test
- build
- deploy

deploy_production:
  stage: deploy
  script:
  - ansible-playbook -i inventory/prod deploy.yml
  only:
  - tags
  when: manual

Deployment Strategies:

Blue‑green deployment for zero downtime.

Canary releases to reduce risk.

One‑click rollback.
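A canary release needs a stable way to decide which users see the new version. One simple approach, sketched below, hashes the user id into a bucket from 0 to 99 and compares it with the rollout percentage. The function name and the use of the user id as the routing key are assumptions; the article does not say how canary traffic was selected.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group for a given rollout percentage."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
share = sum(in_canary(u, 10) for u in users) / len(users)
print(f"canary share at 10 %: {share:.1%}")
```

Because the bucket depends only on the user id, a user who is in the canary group at 10 % stays in it when the rollout widens to 50 %, which keeps their experience consistent during the release.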

Log Collection & Analysis (ELK Stack)

Filebeat collects application logs.

Logstash processes and transforms logs.

Elasticsearch stores and indexes logs.

Kibana visualizes log data.

Standardize logs in JSON, normalize key fields, and mask sensitive information.
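On the application side, the JSON and masking rules can be enforced in one place, the log formatter. The sketch below uses Python's standard logging module for illustration (the site itself runs PHP); the field names, the list of sensitive keys, and the "keep the last four characters" masking rule are assumptions.

```python
import json
import logging

SENSITIVE_KEYS = {"password", "phone", "id_card", "card_number"}  # hypothetical list


def mask(value) -> str:
    """Hide all but the last four characters of a sensitive value."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    return "*" * (len(text) - 4) + text[-4:]


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with normalized key fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra business fields are passed via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = mask(value) if key in SENSITIVE_KEYS else value
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order paid", extra={"fields": {"order_id": 42, "phone": "13800001234"}})
```

Masking at the source means sensitive values never reach Filebeat, Logstash, or Elasticsearch in clear text.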

Key Takeaways & Best Practices

Progressive Evolution: Start simple and evolve architecture as business grows.

Cost‑Benefit Balance: Choose technologies that match team capability and maintenance cost.

Monitoring First: A system without monitoring is flying blind; monitoring matters more than new features.

Automation Priority: Automate everything possible to reduce human error.

Future Planning

Cloud‑Native Transformation: Containerization + Kubernetes for better resource utilization.

Mid-Platform Architecture: Shared business and data platforms (a "middle platform") to support multiple business lines.

AI‑Driven Operations: Intelligent alerts and automated fault diagnosis.

Active‑Active Multi‑Region: Cross‑region deployment for higher availability.
