Operations 28 min read

How E‑Commerce Platforms Achieve High Availability and Scalability: Architecture Practices

This article outlines comprehensive e‑commerce platform architecture practices—including caching strategies, indexing, parallel and distributed computing, load balancing, sharding, high availability, monitoring, resource optimization, and messaging—to improve system performance, scalability, and reliability under high concurrency.

ITFLY8 Architecture Home

Oct 15, 2016

How E‑Commerce Platforms Achieve High Availability and Scalability: Architecture Practices

Design Principles

1. Space for Time

1) Multi‑level cache, staticization

Client‑side page cache (HTTP headers contain Expires/Cache‑Control, Last‑Modified (304), server returns no body, allowing the client to reuse the cache and reduce traffic, ETag, etc.).

Reverse‑proxy cache.

Application‑side cache (memcache).

In‑memory database.

Buffer, cache mechanisms (database, middleware, etc.).

2) Index

Hash, B‑tree, inverted index, bitmap.

Hash index combines array addressing and linked‑list insertion for fast data access.

B‑tree index suits query‑driven scenarios, reducing I/O.

Inverted index maps words to documents, widely used in search.

Bitmap is a concise, fast data structure that optimizes both space and speed for massive data calculations.

2. Parallel and Distributed Computing

1) Task splitting, divide‑and‑conquer (MapReduce)

Large‑scale data exhibits locality; using locality, massive data problems are divided and conquered.

MapReduce is a stateless architecture: data sets are distributed to nodes; each node reads local data (map), merges (combine), sorts (shuffle and sort), then distributes to reduce nodes, reducing data transfer and improving efficiency.

2) Multi‑process, multi‑thread parallel execution (MPP)

Parallel Computing uses multiple CPUs/processes/threads to solve a problem simultaneously, improving speed and capacity. It differs from MapReduce by focusing on problem decomposition rather than data decomposition.

3. Multi‑dimensional Availability

1) Load balancing, disaster recovery, backup

As platform concurrency grows, nodes are added to a cluster and requests are distributed via load‑balancing devices, which also provide health checks. Disaster‑recovery backups (online or offline) are needed to prevent unavailability when nodes fail.

2) Read‑write separation

For databases, separating reads from writes improves data‑access availability under high concurrency; consistency must be considered, with emphasis on availability in the CAP theorem.

3) Dependency management

Modules should be loosely coupled, communicating via messaging components; asynchronous operations increase overall system availability.

Asynchronous processing often requires acknowledgment mechanisms (confirm, ack). If a request is processed but the acknowledgment fails (e.g., network instability), the request must be retried, and idempotency must be considered.

4) Monitoring

Monitoring across multiple dimensions provides runtime transparency and white‑box visibility.

4. Scalability

1) Splitting

Both business logic and databases should be split. Long‑running monolithic tasks block resources under high concurrency; logical segmentation and asynchronous non‑blocking execution improve throughput.

When data volume and concurrency increase, read‑write separation alone is insufficient; sharding (database and table) is required, adding routing logic.

2) Statelessness

Stateless modules allow horizontal scaling by simply adding nodes.

5. Optimizing Resource Utilization

1) System capacity limits

Capacity and concurrency are finite; traffic control (rate limiting, queuing, alerts, or dropping) prevents crashes from spikes or attacks.

2) Atomic operations and concurrency control

Shared resource access requires concurrency control and transactional guarantees. Common high‑performance techniques include optimistic lock, latch, mutex, copy‑on‑write, CAS, and MVCC for databases.

3) Logic‑based strategies

Different business logic types (CPU‑intensive, I/O‑intensive) need tailored strategies: event‑driven asynchronous non‑blocking for I/O, single‑threaded to reduce context switches, or multi‑threaded for compute‑heavy tasks. Prioritize resources based on business priority.

4) Fault isolation

Isolate erroneous requests to avoid affecting normal traffic; temporary disabling of faulty modules may be needed. Transient failures (e.g., network instability) should trigger retries.

5) Resource release

Always release resources after processing, regardless of success or failure, to enable timely reuse. Communication architectures should include timeout controls.

Static Architecture Blueprint

The architecture is layered and distributed: vertically CDN, load balancer/reverse proxy, web application, business layer, infrastructure services, data storage; horizontally configuration management, deployment, and monitoring.

Architecture Analysis

1. CDN

CDN redirects user requests to the nearest node based on traffic, node load, distance, and response time, reducing latency and alleviating Internet congestion.

Large e‑commerce platforms (e.g., Taobao, JD) build their own CDN; smaller companies can use third‑party providers (e.g., Blue Cloud, Wangsu, Kuaiyun).

When selecting a CDN provider, consider operating history, scalable bandwidth, flexible traffic options, stable nodes, and cost‑effectiveness.

2. Load Balancing & Reverse Proxy

Large platforms have many business domains, each with its own cluster. DNS can distribute traffic but lacks flexibility due to caching. Commercial hardware (F5, NetScaler) or open‑source software (LVS) operate at Layer 4; redundancy (e.g., LVS + keepalived) provides active‑passive failover.

After Layer 4 distribution, traffic reaches web servers (nginx, HAProxy) at Layer 7 for load balancing or reverse proxy to application nodes.

Choosing a load balancer depends on high‑concurrency performance, session persistence, algorithms, compression, and memory usage. Common solutions:

LVS: Layer 4, high performance, supports NAT, DR, IP tunneling, with keepalived or Heartbeat for redundancy.

nginx: Layer 7, event‑driven, asynchronous, multi‑process, supports domain, path, regex routing, health checks, and session sticky via IP hash or cookie.

HAProxy: Supports Layers 4 and 7, session persistence, cookie‑based routing, rich algorithms (RR, weighted).

Static assets often use dedicated domains and distributed image servers (e.g., mogileFS) with Varnish caching.

3. Application Access

Applications run in containers such as JBoss or Tomcat, providing front‑end shopping, user services, and back‑end systems. Protocols include HTTP and JSON.

Servlet 3.0 asynchronous servlets improve throughput.

HTTP requests pass through Nginx, which balances them to application nodes, enabling simple horizontal scaling.

Session data is stored centrally (e.g., Redis, Memcached) to keep the app layer stateless, allowing horizontal scaling.

4. Business Services

Domain‑specific services (user, product, order, payment, etc.) are modularized with high cohesion and low coupling, improving availability. Large‑scale deployments isolate services per node.

High concurrency uses NIO‑based RPC frameworks (Netty, Mina).

Availability is achieved via multi‑node redundancy, automatic load transfer, and failover using VIP + heartbeat or Zookeeper for leader election.

5. Infrastructure Middleware

1) Communication Component

Handles internal service calls, requiring high concurrency and throughput. Uses long‑lived connections with connection pools, heartbeat maintenance, and timeout handling. Serialization uses efficient Hessian. Server side employs event‑driven NIO (Mina).

2) Router

Database sharding routes queries based on user ID hash (consistent hash). Routing tables map user IDs to shards (leader for writes, replica for reads). Clients maintain shard connection pools; to reduce connection count, routing can be performed at the business service layer.

Router state is stored in MongoDB with replica sets for high availability; clients watch Zookeeper /routers nodes for live router lists.

3) HA

Traditional HA uses virtual IP failover with Heartbeat, keepalived, or VRRP. LVS pairs well with keepalived; HAProxy and Nginx can use Layer 7 failover. Zookeeper provides distributed coordination, leader election, and distributed locks for master‑master or master‑slave setups.

4) Messaging

Asynchronous inter‑system communication uses MQ components. Two major open‑source solutions: RabbitMQ (AMQP, Erlang‑based) and Kafka (LinkedIn, stream processing). RabbitMQ provides reliable delivery with acknowledgments; Kafka uses log‑based pull with LSN for high‑throughput streaming.

Message persistence can be in‑memory or on disk, with clustering and mirroring for high availability.

5) Cache & Buffer

Cache System

Caches reduce backend load and improve read throughput. Cache hit rate, invalidation, and consistency are challenges. Common eviction algorithms include LRU. Distributed caches use consistent hashing to minimize impact of node failures. High‑availability caches replicate data across nodes.

Memcached offers simple protocol, event‑driven libevent handling. Redis and MongoDB provide richer APIs and persistence; Memcached is suited for caching relational data without persistence.

Buffer System

Buffers batch write operations to reduce database pressure, flushing when thresholds are reached.

6) Search

Search is critical for e‑commerce (category navigation, autocomplete, ranking). Open‑source engines include Lucene, Sphinx, Solr. Important non‑functional requirements: distributed indexing, real‑time indexing, performance.

Solr (based on Lucene) offers HTTP/JSON APIs, SolrCloud for distributed indexing with sharding and leader‑replica models, managed by Zookeeper.

Lucene index readers rely on snapshots; commits are costly, affecting real‑time search. Solr4 introduced NRT soft‑commit for near‑real‑time visibility, with periodic hard commits for durability.

Platform stores data in HBase; secondary indexing is limited, so Solr is used for multi‑dimensional search. Consistency between HBase and Solr is ensured via a confirm mechanism that removes items from a pending‑index queue after successful indexing.

7) Log Collection

Transaction logs are collected into distributed storage for centralized analysis. Core components: agent (collects data), collector (aggregates and forwards), store (e.g., HDFS).

Popular open‑source solutions: Cloudera Flume and Facebook Scribe. FlumeNG adds architecture changes, supporting sink groups, load balancing, failover, and configurable channels (memory or file) for reliability.

Key requirements: decoupling application and analysis systems, horizontal scalability, near‑real‑time collection, fault tolerance, transaction support (Flume uses acknowledgments), recoverability, and timed/size‑based rolling for archival.

Source: http://blog.csdn.net/yangbutao/article/details/12242441

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems e-commerce architecture High Availability Caching

Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.