
How Ctrip Built a Seamless Multi‑Region Dual‑Active Call Center

This article details Ctrip's evolution from a single‑site call‑center to a fully dual‑active, multi‑region architecture, covering the overall system design, public network, application, and client layers, unified login mechanisms, heartbeat monitoring, and future software‑only and mobile‑first directions.

Efficient Ops

Preface

After reading several chapters of "Google SRE", the author strongly agrees with the automation‑driven operations and rapid fault‑response concepts, and will illustrate them with a practical Ctrip case study.

1. Call Center in Ctrip

The call center is the core of Ctrip's business, handling over 70% of order volume. Its basic architecture includes a PBX system for voice media and queuing, a CTI system for phone‑computer integration (including IVR and recording), and a CRM system for order management.

The architecture also supports remote and home‑office agents, which reduces commuting costs.

Ctrip has undergone three major upgrades:

2007: Relocation caused a two‑hour outage due to hardware limits.

2010: New building in Nantong enabled a dual‑active design with SIP‑based modular routing.

2022: Client‑side redesign achieved dual‑active agent access despite legacy analog lines.

2. Dual‑Active Architecture Overview

A dual‑active design is judged against two criteria: both sites must carry live traffic at the same time (rather than one site standing by purely for disaster recovery), and fault recovery must be fast enough to be transparent to users.

The system is divided into three layers:

Public Network Access Layer

Application Layer

Agent Access Layer

Public Network Access Layer

This layer is implemented through carrier‑level configuration: dual‑site voice trunks, intelligent routing (by percentage or by caller area), and SIP voice trunks. Ctrip worked with four carriers (China Telecom and China Unicom, each in Shanghai and Nantong) to obtain dual‑site SIP trunk groups, which allow rapid capacity expansion and automatic failover without extra hardware.
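The two routing modes can be modeled in a few lines. This is only a sketch of the decision logic: the real rules live in the carriers' configuration, and the site names, area prefixes, and function names below are hypothetical.

```python
import random

def route_by_percentage(weights, rng=random):
    """Pick a site according to percentage weights, e.g. {"shanghai": 70, "nantong": 30}."""
    sites = list(weights)
    return rng.choices(sites, weights=[weights[s] for s in sites], k=1)[0]

def route_by_caller_area(caller_number, area_map, default_site=None):
    """Pick a site from the longest caller-number prefix that matches."""
    for prefix in sorted(area_map, key=len, reverse=True):
        if caller_number.startswith(prefix):
            return area_map[prefix]
    return default_site

def route_call(caller_number, weights, area_map, healthy):
    """Apply the area rule first, then the percentage split; skip sites that are down."""
    site = route_by_caller_area(caller_number, area_map)
    if site in healthy:
        return site
    # No area rule matched, or the preferred site is down: split across live sites.
    live = {s: w for s, w in weights.items() if s in healthy and w > 0}
    if not live:
        raise RuntimeError("no healthy site available")
    return route_by_percentage(live)

# Illustrative data only (assumed, not taken from Ctrip's configuration).
AREA_MAP = {"021": "shanghai", "0513": "nantong"}
WEIGHTS = {"shanghai": 50, "nantong": 50}
```

Calling `route_call("02112345678", WEIGHTS, AREA_MAP, {"nantong"})` falls through to the percentage split over the remaining live site, which is the automatic‑failover behavior described above.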

Application Layer

This layer mirrors a typical web‑application design with static routing: traffic prefers the local cluster and is routed to the remote cluster only when the local one fails. The four core clusters are mirrored and fully deployed at both sites, so the failure of either site can be absorbed without service impact. Regular disaster‑recovery drills validate this approach.
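A minimal sketch of the local‑first rule, assuming each service is deployed in every site and that some health check is available. The service name, site names, and the `is_healthy` callback are illustrative, not Ctrip's actual interfaces.

```python
def pick_cluster(service, local_site, clusters, is_healthy):
    """Return the local cluster for `service` if it is healthy, otherwise a remote one."""
    # Sort so the local site is tried first; remote sites keep their listed order.
    candidates = sorted(clusters[service], key=lambda site: site != local_site)
    for site in candidates:
        if is_healthy(service, site):
            return site
    raise RuntimeError(f"{service}: no healthy cluster in any site")

# Hypothetical deployment map: every core service exists in both sites.
CLUSTERS = {"cti": ["shanghai", "nantong"], "crm": ["shanghai", "nantong"]}
```

With a health check that reports the Shanghai CTI cluster as down, `pick_cluster("cti", "shanghai", CLUSTERS, is_healthy)` returns `"nantong"`, while CRM traffic in Shanghai stays local.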

Agent Access Layer

Three techniques ensure dual‑active client access:

Dual‑center connection – agents' phones register to both centers simultaneously.

Polling – automatic failover to the secondary application server.

Load balancing – standard web‑style distribution.
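The polling technique can be illustrated as a client that walks an ordered server list and uses the first one that accepts a connection. This is a simplified sketch: the host names are placeholders, and a plain TCP connect stands in for whatever application‑level check the real client performs.

```python
import socket

def first_reachable(endpoints, timeout=2.0, connect=socket.create_connection):
    """Poll endpoints in order and return the first (host, port) that accepts a TCP connection."""
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            continue  # unreachable, refused, or timed out: try the next server
        conn.close()
        return host, port
    return None  # every server in the list failed

# Hypothetical primary and secondary application servers.
SERVERS = [("cti-primary.example", 5060), ("cti-secondary.example", 5060)]
```

The `connect` parameter exists only so the function can be exercised without a network; in normal use the default `socket.create_connection` is enough.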

Why Dual‑Active Agent Access Is Critical

With over 10,000 agents, a single‑site outage can halt business for hours. Past incidents (power‑line short‑circuit, typhoon‑induced water leakage) demonstrated the need for agents to log into the remote site instantly, avoiding massive downtime.

3. Dual‑Active Agent Access Implementation

Prerequisites:

Multi‑site voice routing with global distribution.

Agents sign in at a single location but receive global calls.

IP‑based phones.

Challenges:

Phone registration across two sites.

Client login continuity.

Resource configuration synchronization.

Unified Login

Agents use a single account regardless of location. The solution integrates:

ITDB – a resource database linking MAC addresses of phones and PCs to extension numbers.

IP‑phone MAC ↔ extension mapping.

Virtual agent IDs decoupled from CTI/PBX.

Domain‑account ↔ employee skill‑group mapping.

Dynamic ID pool similar to DHCP.

When an agent logs in, the client obtains the PC’s MAC, retrieves the associated extension from ITDB, and uses the virtual ID to log into CTI, achieving automatic, zero‑touch login.
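The login flow above can be sketched as follows. ITDB and the skill‑group mapping are modeled as plain dictionaries and the ID pool as a small lease table; all class names, field names, and sample values are assumptions made for illustration.

```python
import uuid

class VirtualIdPool:
    """Leases virtual agent IDs on login and reclaims them on logout (DHCP-like)."""

    def __init__(self, ids):
        self._free = list(ids)
        self._leases = {}  # domain account -> virtual ID

    def acquire(self, account):
        if account in self._leases:
            return self._leases[account]  # a repeated login keeps the same lease
        if not self._free:
            raise RuntimeError("virtual ID pool exhausted")
        vid = self._free.pop(0)
        self._leases[account] = vid
        return vid

    def release(self, account):
        vid = self._leases.pop(account, None)
        if vid is not None:
            self._free.append(vid)

def local_mac():
    """Best-effort MAC of this PC as aa:bb:cc:dd:ee:ff (uuid.getnode may fall back to a random value)."""
    node = uuid.getnode()
    return ":".join(f"{(node >> shift) & 0xff:02x}" for shift in range(40, -1, -8))

def unified_login(account, pc_mac, itdb, skill_groups, pool):
    """Resolve extension, skill group, and virtual ID without any input from the agent."""
    return {
        "account": account,
        "extension": itdb[pc_mac.lower()],     # ITDB: MAC -> extension
        "skill_group": skill_groups[account],  # domain account -> skill group
        "virtual_id": pool.acquire(account),   # leased, decoupled from CTI/PBX
    }
```

A client would call `unified_login(account, local_mac(), itdb, skill_groups, pool)` and pass the resulting virtual ID and extension to CTI; releasing the lease on logout returns the ID to the pool for the next agent.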

Resource Configuration

Configure virtual IDs for all agents in the unified login platform.

Share IP‑phone MAC information between the two centers.

Maintain independent extension numbers.

Heartbeat‑Based Failover Strategy

Client‑CTI‑PBX‑IP‑phone linkage.

Two‑step confirmation to avoid false alarms.

Automatic remote login upon confirmed fault.

Fully transparent to agents.

The original diagram (not reproduced here) shows three login states: normal, fault‑triggered remote login, and complete failure of both sites.
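The three login states and the two‑step confirmation can be sketched as a small state machine. The source does not spell out what the two steps are, so this sketch assumes step one is a run of missed heartbeats and step two is an independent check (for example, the IP phone's registration state); the callbacks, the threshold, and the automatic return to the local site on recovery are all assumptions.

```python
NORMAL, REMOTE, BOTH_DOWN = "normal", "remote", "both_down"

class FailoverMonitor:
    """Tracks the client's login state from periodic heartbeat results."""

    def __init__(self, heartbeat_ok, confirm_local_down, remote_ok, misses_to_suspect=3):
        self.heartbeat_ok = heartbeat_ok              # step 1 probe: local CTI heartbeat
        self.confirm_local_down = confirm_local_down  # step 2 probe: independent check
        self.remote_ok = remote_ok                    # is the remote site reachable?
        self.misses_to_suspect = misses_to_suspect
        self.state = NORMAL
        self._misses = 0

    def tick(self):
        """Run one heartbeat cycle and return the resulting state."""
        if self.heartbeat_ok():
            self._misses = 0
            self.state = NORMAL  # assumed: fall back to the local site once it recovers
            return self.state
        self._misses += 1
        if self._misses < self.misses_to_suspect:
            return self.state  # step 1 not satisfied yet: may be a transient blip
        if self.confirm_local_down():  # step 2: confirm before switching anyone
            self.state = REMOTE if self.remote_ok() else BOTH_DOWN
        return self.state
```

Requiring both a run of missed heartbeats and a second, independent confirmation means a single dropped packet or a brief network stall does not move agents to the other site.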

Technical Highlights

Automatic dual‑active switch for online agents during faults.

Planned manual switch by system, region, or skill‑group.

Supports switching 1,000+ concurrently logged‑in agents in under two minutes.
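A planned switch by system, region, or skill group amounts to filtering the online agents and moving them in batches. The sketch below shows only the selection and batching; the `Agent` fields, sample values, and batch size are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    account: str
    system: str
    region: str
    skill_group: str

def select_for_switch(agents, **criteria):
    """Return agents matching every given criterion (system, region, skill_group)."""
    unknown = set(criteria) - {"system", "region", "skill_group"}
    if unknown:
        raise ValueError(f"unsupported criteria: {sorted(unknown)}")
    return [a for a in agents if all(getattr(a, k) == v for k, v in criteria.items())]

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

AGENTS = [
    Agent("alice", "cti-a", "shanghai", "hotel"),
    Agent("bob", "cti-a", "nantong", "flight"),
    Agent("carol", "cti-b", "nantong", "hotel"),
]
```

For example, `select_for_switch(AGENTS, region="nantong", skill_group="hotel")` returns only `carol`, and `batches(selected, 100)` would let an operator move a large group in controlled steps.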

Future Directions

Fully software‑based client, eliminating hardware phones.

Mobile client enabling agents to work from any location.

Software‑only clients run on virtual desktops; however, voice processing on PCs raises performance concerns, which are being evaluated. A mobile app prototype currently supports outbound calls, with inbound call handling under development.

Q&A

The author answered audience questions about full‑software solutions, SIP trunk usage, security, outbound‑call blocking, multi‑carrier strategies, dual‑connection mechanisms for phones, network topology, VLAN separation, and automatic extension mapping.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
