
From Legacy to Scalable: How TianpiaoChe Revamped Its Ops Architecture

Li Qiang, Operations Director at TianpiaoChe, walks through the step-by-step transformation of a legacy e-commerce infrastructure. He covers network latency fixes, hardware re-allocation, OS tuning, open-source component upgrades, virtualization changes, and future plans, with practical lessons for large-scale site operations.

dbaplus Community

Motivation for Architecture Refactoring

The legacy infrastructure inherited from a previous system exhibited severe latency, chaotic cabling, undersized uplinks, and mismatched hardware and software configurations. A systematic redesign was undertaken to improve performance, reliability, and operational efficiency.

Network Layer Issues

High latency: End-to-end round-trip times reached 200 ms, caused by poor fiber links and a single 1 Gbps uplink per cabinet.

“Spider-web” cabling (literally “Pansi Cave”): Unmanaged, tangled cabling that made troubleshooting difficult.

Switch oversubscription (convergence) ratio: At 24:1 (24 server ports aggregated onto a single 1 Gbps uplink), the uplink limited throughput.

Firewall placement: Hardware firewalls sat in front of the load balancers and became bottlenecks; the fix was to remove most hardware firewalls and rely on cloud-based firewalls.

Server Hardware Issues

Database servers (Dell R820) had only 32 GB RAM, 4 CPU cores, and three 600 GB SAS disks, which was insufficient for the workload.

Virtualization hosts used older E5‑2603/2609 CPUs, while distribution servers used higher‑end E5‑2650 CPUs, leading to uneven performance.

Hardware was re‑planned: database nodes now use SAS + SSD with Facebook FlashCache; all NICs upgraded to Intel i350; CPU‑memory pairing standardized.

Operating System Issues

The TCP/IP stack ran with default parameters (e.g., a 60 s TIME_WAIT), so closed sockets accumulated excessively.

System limits (maximum file handles and processes) were left at their defaults, limiting concurrency.

The irqbalance service was often disabled on servers running data-intensive services.

Unnecessary services and open ports increased the attack surface.

Bonding mode 0 (balance-rr), used on many servers, caused packet loss and TCP retransmissions.
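To see how many sockets are piling up in TIME_WAIT (or any other state) on a Linux host, one option is to count the state column of /proc/net/tcp. The sketch below is an illustration rather than the tooling TianpiaoChe used; the function name and the sample rows are made up, and the rows are abbreviated to the first few columns.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts

# Made-up, abbreviated sample rows (real rows carry more columns).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 0100007F:C350 06 00000000:00000000
   2: 0100007F:1F90 0100007F:C351 06 00000000:00000000
   3: 0100007F:1F90 0100007F:C352 08 00000000:00000000
"""

for state, n in sorted(count_tcp_states(SAMPLE).items()):
    print(f"{state:12} {n}")
```

On a real host the same function can be fed `open("/proc/net/tcp").read()`; IPv6 sockets are listed separately in /proc/net/tcp6.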

Open‑Source Component Issues

Component versions were inconsistent across servers (e.g., Nginx ranging from 0.8 to 1.26, alongside Tomcat 6).

Memcached was overused as a generic cache; it was replaced by Redis with a more disciplined data model.

Image storage evolved from NFS → FastDFS → TFS to achieve scalable, low‑latency access.

MySQL migrated to Percona for enhanced performance and online DDL.

Read/write splitting introduced via OneProxy.

Asynchronous processing added with a RabbitMQ cluster (long‑lived TCP connections).

Internal DNS provided by PowerDNS; backup storage via LizardFS (MooseFS derivative); deployment automation via Rundeck; API gateway via Kong.
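Read/write splitting sends writes to the primary database and spreads reads across replicas. The sketch below shows only the routing rule in its simplest form; it is not how OneProxy is implemented, and the class name and host names are hypothetical.

```python
import itertools

class ReadWriteRouter:
    """Toy router: plain SELECTs go to replicas, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; None means "no replicas configured".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql, in_transaction=False):
        text = sql.strip()
        verb = text.split(None, 1)[0].upper() if text else ""
        locking_read = text.upper().rstrip(" ;").endswith("FOR UPDATE")
        if (verb == "SELECT" and not locking_read
                and not in_transaction and self._replicas):
            return next(self._replicas)
        # Writes, locking reads, and anything inside a transaction
        # must see the primary's current state.
        return self.primary

router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route("SELECT * FROM orders WHERE id = 1"))
print(router.route("UPDATE orders SET status = 'paid' WHERE id = 1"))
```

A real proxy parses SQL properly and also has to account for replication lag; matching on the first keyword is only enough to show the idea.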

Re‑engineered Architecture

Network Redesign

Removed all hardware firewalls; VPN provides secure remote access.

Consolidated cabinet switches from two H3C units to a single Huawei data‑center switch.

Enabled LACP with dual 10 Gbps uplinks per cabinet and activated jumbo frames.

Improved the switch oversubscription (convergence) ratio from 24:1 to 6:5, allowing near line-rate traffic flow.
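The two ratios follow directly from the port counts. The sketch below reproduces the arithmetic, assuming each of the 24 server-facing ports runs at 1 Gbps (the port speed is an assumption; the source gives only the port count and the uplink speeds).

```python
from fractions import Fraction

def oversubscription(server_ports, port_gbps, uplinks, uplink_gbps):
    """Return downstream:upstream bandwidth as a reduced (num, den) pair."""
    ratio = Fraction(server_ports * port_gbps, uplinks * uplink_gbps)
    return ratio.numerator, ratio.denominator

# 24 x 1 Gbps server ports behind a single 1 Gbps uplink.
before = oversubscription(24, 1, 1, 1)
# Same ports behind two 10 Gbps uplinks bundled with LACP.
after = oversubscription(24, 1, 2, 10)

print(f"before {before[0]}:{before[1]}, after {after[0]}:{after[1]}")
# prints: before 24:1, after 6:5
```

At 6:5 the servers can offer at most 24 Gbps against 20 Gbps of uplink, so the uplink is only a bottleneck when nearly every port is saturated at once.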

Server Hardware Upgrade

Re‑planned 28 servers according to workload: database nodes receive SAS + SSD + FlashCache; other nodes receive appropriate CPU/RAM.

All core‑business NICs upgraded to Intel i350.

Operating System Customisation

Kernel trimmed to 2.6.32‑431.29.2 with unused drivers removed; packaged as a custom RPM.

KickStart scripts automate network bonding, SSH, DNS, and NIC parameter configuration.

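TCP parameters tuned (e.g., net.ipv4.tcp_fin_timeout = 5, net.ipv4.tcp_tw_reuse = 1) to reduce the buildup of lingering sockets: tcp_fin_timeout shortens how long orphaned connections stay in FIN_WAIT_2, and tcp_tw_reuse lets new outbound connections reuse sockets still in TIME_WAIT.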

Unnecessary services and ports disabled; redundant RPMs removed; directories for binaries, configs, logs, and PIDs standardised.

All open‑source components built as “green” binaries that do not depend on system libraries.
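Settings like the two TCP parameters above are easier to keep consistent when they are rendered from one definition and checked against the live system. The following is a minimal sketch of that idea, not the KickStart tooling described here; the function names are hypothetical, and only the two values quoted in the list come from the source.

```python
# The two values quoted in the article; any others would be assumptions.
TCP_TUNING = {
    "net.ipv4.tcp_fin_timeout": 5,
    "net.ipv4.tcp_tw_reuse": 1,
}

def render_sysctl(settings):
    """Render settings as the 'key = value' lines used in sysctl.conf."""
    return "".join(f"{key} = {value}\n" for key, value in sorted(settings.items()))

def sysctl_path(key):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return "/proc/sys/" + key.replace(".", "/")

def find_drift(settings, read_value):
    """Return {key: (wanted, actual)} for keys whose live value differs.

    read_value is any callable that takes a path and returns its text,
    which keeps the check testable without touching /proc.
    """
    drift = {}
    for key, wanted in settings.items():
        actual = read_value(sysctl_path(key)).strip()
        if actual != str(wanted):
            drift[key] = (str(wanted), actual)
    return drift

print(render_sysctl(TCP_TUNING), end="")
```

On a Linux host, `find_drift(TCP_TUNING, lambda p: open(p).read())` would report any parameter that has drifted from the intended value.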

Component Stack Changes

Layer‑4 load balancing switched from LVS to HAProxy (flexible CPU allocation, proxy protocol support).

Varnish upgraded from 3.0 to 4.1 to eliminate CLOSE‑WAIT socket leaks.

Reverse proxy path: Nginx → Tengine → OpenResty (adds Lua scripting and module ecosystem).

Virtualisation migrated from VMware to Proxmox KVM.

Caching layer: Memcached → Redis (object cache with explicit data‑model).

Image storage: NFS → FastDFS → TFS (dynamic scaling, 5‑node DS cluster).

Database: MySQL → Percona (online DDL, improved replication).

Read/write splitting via OneProxy.

Message queue: RabbitMQ cluster with persistent TCP connections.

Internal DNS: PowerDNS.

Backup: LizardFS (distributed file system).

Automation: Rundeck.

API gateway: Kong (built on OpenResty).
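The proxy protocol support mentioned for HAProxy matters because a layer-4 proxy otherwise hides the client address from the servers behind it. In version 1 of the protocol, the proxy prepends one human-readable line to the TCP stream. The sketch below parses that line; the type and function names are hypothetical and the addresses are documentation examples.

```python
from typing import NamedTuple, Optional

class ProxyInfo(NamedTuple):
    family: str      # "TCP4" or "TCP6"
    src_addr: str    # original client address
    dst_addr: str    # address the client connected to
    src_port: int
    dst_port: int

def parse_proxy_v1(header: bytes) -> Optional[ProxyInfo]:
    """Parse a PROXY protocol v1 line; return None for the UNKNOWN family."""
    if not header.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = header[:-2].decode("ascii").split(" ")
    if parts[0] != "PROXY":
        raise ValueError("missing PROXY signature")
    if len(parts) >= 2 and parts[1] == "UNKNOWN":
        return None  # the proxy could not determine the addresses
    if len(parts) != 6 or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("malformed PROXY v1 header")
    return ProxyInfo(parts[1], parts[2], parts[3], int(parts[4]), int(parts[5]))

info = parse_proxy_v1(b"PROXY TCP4 203.0.113.7 10.0.0.5 56324 80\r\n")
print(info.src_addr, info.src_port)
```

In practice the backend (Nginx, Varnish, and similar) does this parsing itself once it is told to expect the header; the code only illustrates what travels on the wire.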

Future Outlook

Network: Deploy two Huawei CE switches in a stacked configuration for redundancy; avoid VXLAN due to operational complexity.

Log Collection & Analysis: Replace ELK with a Graylog-based platform built on V8 templates for flexible ingestion.

Dockerisation: Containerise peripheral services to improve isolation and deployment speed.

Web Front-End Validation: Verify that ECMP + HAProxy can sustain one million concurrent requests.

Additional Ops Initiatives:

Asset management and VPN integration.

Performance monitoring with Zabbix + OneProxy sharding.

GlusterFS clustering for shared storage.

OVS‑DPDK to achieve near‑line‑rate NIC performance in containerised environments.
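The ECMP + HAProxy plan depends on per-flow hashing: the router hashes each connection's addresses and ports to pick one of several equal-cost next hops, so every packet of a flow reaches the same HAProxy node. The sketch below illustrates the idea only; real switches use vendor-specific hash functions, and the function name and node names are hypothetical.

```python
import hashlib

def pick_next_hop(flow, next_hops):
    """Pick a next hop for a flow given as a 5-tuple.

    flow: (src_ip, src_port, dst_ip, dst_port, protocol)
    The same flow always maps to the same hop as long as the hop list is unchanged.
    """
    key = "|".join(str(part) for part in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

haproxy_nodes = ["haproxy-1", "haproxy-2", "haproxy-3", "haproxy-4"]
flow = ("198.51.100.20", 40312, "203.0.113.10", 443, "tcp")

# Repeated lookups for one flow return the same node.
print(pick_next_hop(flow, haproxy_nodes) == pick_next_hop(flow, haproxy_nodes))
```

Because the index is taken modulo the number of hops, adding or removing a node remaps many existing flows, which is one of the behaviours such a validation would need to measure.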
