How We Built Same‑City Active‑Active Architecture for a High‑Volume Transaction Platform

This article details the background, design principles, overall architecture, concrete refactoring steps, launch process, results, and emerging challenges of implementing a same‑city active‑active solution to improve reliability, load balancing, disaster recovery, and cost efficiency for a large‑scale transaction system.

dbaplus Community

Apr 25, 2024

Background

In 2022 the transaction platform team began exploring an active‑active deployment across data‑center zones to improve system stability. Those early attempts were never validated under large‑scale traffic, and rapid business iteration gradually eroded what had been built. Recent outages at peer companies underlined the need to recover quickly and to limit the scope of impact.

The team launched a same‑city active‑active project that keeps the transaction mainline available under extreme conditions by dynamically switching traffic between two zones with minimal added cost and complexity.

Design Idea

Two logical clusters (blue and green) are deployed at the application layer across multiple availability zones (AZs) within a single cloud region. Existing blue‑green deployment mechanisms are reused for HTTP, RPC and DMQ (RocketMQ/Kafka) traffic switching. Storage components (Redis, DB, HBase) remain in a single zone to avoid data‑sync complexity.

Overall Architecture

Access Layer: DNS, primary‑backup SLB, DLB and a DAG router that routes requests based on user ID and traffic ratio, supporting multi‑zone deployment.

Application Layer: Services are split into logical blue and green clusters; a blue‑green coordination layer masks cross‑zone calls.

Middleware Layer: Each middleware component (DLB, Rainbow Bridge, DMQ, Kafka/ZooKeeper, Elasticsearch, custom service registry) has its own cross‑AZ deployment, data‑sync and failover strategy.

Data Layer: A single copy of DB/Redis/HBase with automatic or manual master‑slave switching across zones.
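
The access‑layer routing by user ID and traffic ratio can be illustrated with a minimal sketch. The article does not describe the DAG router's internals, so the hashing scheme, the 100‑bucket granularity and the name route_colour are assumptions made purely for illustration.

```python
import hashlib


def route_colour(user_id: str, blue_percent: int) -> str:
    """Return 'blue' or 'green' for a user (hypothetical helper).

    blue_percent is the share of users, 0 to 100, that should go to blue.
    """
    if not 0 <= blue_percent <= 100:
        raise ValueError("blue_percent must be between 0 and 100")
    # A stable hash keeps a given user in the same bucket on every request.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets below the threshold go to blue, the rest to green.
    return "blue" if bucket < blue_percent else "green"
```

Because the bucket depends only on the user ID, raising blue_percent moves additional users to blue without moving anyone who was already there back to green.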

Refactoring Plan

1. Transaction Application Refactoring

Scope: All transaction services participate because of complex inter‑service dependencies and frequent business changes.

Approach: Decompose complex topologies into atomic A‑B‑C chains where A and C are active‑active across zones and B stays single‑zone. All services must recognize a blue‑green tag in the request context and obey traffic scheduling.

Key actions:

Upgrade service JARs to include blue‑green flow‑control APIs.

Integrate a zero‑configuration blue‑green component that injects zone metadata into pods.

Deploy pods in both zones with an environment variable (e.g., ZONE=A or ZONE=B) for zone‑aware monitoring and logging.
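
As an illustration of zone‑aware logging, the sketch below reads the ZONE environment variable and stamps it on every log line. Only the variable name ZONE comes from the article; the helper name, the log format and the "unknown" fallback are assumptions.

```python
import logging
import os
import sys


def build_zone_logger(name, stream=None, environ=None):
    """Return a logger whose every line carries the pod's zone."""
    env = os.environ if environ is None else environ
    zone = env.get("ZONE", "unknown")  # fallback value is an assumption
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(
        logging.Formatter("zone=%(zone)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    # LoggerAdapter injects the zone into each record as an extra field.
    return logging.LoggerAdapter(logger, {"zone": zone})
```

Calling build_zone_logger("orders").info("order created") in a pod started with ZONE=A would emit a line beginning with zone=A, which lets dashboards and log queries be split by zone.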

2. Dependent Application Refactoring

External services tightly coupled with the transaction flow (e.g., supply‑chain services) are also migrated to the same‑city active‑active model using the same A‑B‑C abstraction.

3. Middleware & Core Components

Common actions:

Mark all compute resources with a ZONE environment variable to enable zone‑aware observability.

Define Recovery Time Objective (RTO) targets for each middleware to guarantee availability when a zone fails.

Component‑specific changes:

DLB (custom traffic gateway): Stateless and deployed in both zones. Failed endpoints are removed from the SLB, giving sub‑second failover.

Rainbow Bridge (distributed DB proxy): Operates in manual failover mode; traffic can be switched within minutes after a master‑slave DB switchover.

DMQ (RocketMQ): Broker shards are spread across zones; a zone failure reduces the number of available shards, but the cluster remains operational.

Kafka / ZooKeeper: ZooKeeper nodes are deployed in a 2N:2N:1 ratio across three zones. With 4N+1 nodes in total the quorum is 2N+1, so losing any single zone still leaves a majority. Kafka partition leaders are spread across zones.

Elasticsearch: Data nodes are balanced across zones; master nodes span at least three zones to ensure resilient election.

Service Registry: A custom Raft‑based registry is deployed multi‑zone to keep RPC service discovery alive during zone loss.
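
The quorum arithmetic behind the 2N:2N:1 ZooKeeper layout can be checked with a short calculation. The function names below are hypothetical; the only rule used is that a majority quorum needs more than half of all voting nodes.

```python
def layout_2n_2n_1(n):
    """Node counts per zone for the 2N:2N:1 layout."""
    return [2 * n, 2 * n, 1]


def survives_zone_loss(nodes_per_zone):
    """For each zone, report whether a majority remains if that zone is lost."""
    total = sum(nodes_per_zone)
    quorum = total // 2 + 1  # strict majority of all voting nodes
    return [total - lost >= quorum for lost in nodes_per_zone]
```

For N = 1 the layout is [2, 2, 1] with a quorum of 3, and every entry in the result is True. A two‑zone split such as [3, 2] fails the check for the larger zone, which is why a third zone is needed even if it hosts only one node.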

4. Traffic Allocation Strategy

RPC traffic: The DAG router attaches a blue‑green identifier to each request based on user ID. Requests that reach a service without a tag are either assigned a colour at random or have the tag recomputed from the user ID.

MQ traffic ratio: Blue and green producers publish to their own half of the broker queues (blue → first half, green → second half). Consumers follow the same split, so adjusting the producers' blue‑green ratio automatically changes the overall MQ traffic proportion, with a lag of 5‑10 s.
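
The queue split described above can be sketched as follows. This is not the DMQ client API: the function names are hypothetical, and giving green the extra queue when the count is odd is an assumption, since the article does not cover that case.

```python
def queues_for_colour(queue_ids, colour):
    """Return the half of the queues that a given colour may use."""
    ordered = sorted(queue_ids)
    half = len(ordered) // 2
    if colour == "blue":
        selected = ordered[:half]   # first half
    elif colour == "green":
        selected = ordered[half:]   # second half (plus the odd one out)
    else:
        raise ValueError(f"unknown colour: {colour!r}")
    if not selected:
        raise ValueError("not enough queues to split between colours")
    return selected


def pick_queue(queue_ids, colour, sequence):
    """Round-robin over the colour's own queues for the given message number."""
    selected = queues_for_colour(queue_ids, colour)
    return selected[sequence % len(selected)]
```

Since producers and consumers of one colour share the same half, the volume flowing through each half follows whatever ratio the upstream router assigns to that colour, and no separate MQ ratio setting is needed.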

Launch Steps

1. Preparation Phase

Define the approach: reuse existing blue‑green deployment as the active‑active switch, keep the data layer unchanged.

Catalogue business scenarios, MQ usage, container deployment status, and DB/Redis master‑slave zone distribution.

Confirm participation of upstream/downstream services and required JAR upgrades.

Assess cross‑zone call latency impact and plan optimisations.

2. Development & Verification Phase

Upgrade service JARs to support blue‑green flow control and MQ blue‑green publishing/consumption.

Build a blue‑green testing environment with zone‑aware pod ratios, version checks and automated branch merging.

Run regression tests for normal business flow, blue‑green switching, MQ publishing/consumption and record latency metrics.

Validate channel priority (release channel > global channel) and perform staged roll‑outs in pre‑release environments.
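
The article does not spell out what the two channels contain. Assuming each channel carries a desired blue traffic percentage, and that the release channel is only set while a release is in progress, the priority rule (release channel > global channel) could look like this minimal sketch. The function and parameter names are hypothetical.

```python
from typing import Optional


def effective_blue_percent(release_channel: Optional[int],
                           global_channel: int) -> int:
    """Pick the blue traffic percentage that should apply right now.

    release_channel is None when no release is in progress (assumption).
    """
    # Compare against None explicitly: a release may legitimately set 0,
    # meaning "drain blue completely", and that must still win.
    if release_channel is not None:
        return release_channel
    return global_channel
```

The explicit None check matters in this sketch: a truthiness test would treat a release setting of 0 as absent and fall back to the global ratio.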

3. Production Preparation & Rollout Phase

Integrate logging, monitoring, tracing and container upgrades to propagate blue‑green tags.

Switch production DMQ to the blue‑green 2.0 version for zone‑aware consumption.

Ensure DB and Redis master nodes reside exclusively in either zone A or zone B.

Manually split services into blue‑green clusters; gradually increase green (zone A) capacity to 100% while keeping blue (zone B) at 50%, and observe for five days.

Address latency spikes with targeted optimisations and enhance the release platform for automated active‑active support.

Enable the container platform to orchestrate multi‑zone deployments and rapid scaling.

Project Outcomes

The same‑city active‑active solution went live on 2023‑12‑14 after roughly 100 days of preparation. During a five‑day observation period, in which traffic peaked at 77.8% of Double‑11 levels, no major anomalies were observed. Key results:

Traffic split achieved roughly 50:50 between zones (minor deviation for RocketMQ).

Core metrics (QPS, latency, error rate) remained stable; latency increased by ~7‑8 ms for calls originating from zone B due to cross‑zone data access.

Cost impact was minimal: existing zone A resources were on a subscription model, so the temporary parallel deployment (A at 100% + B at 50%) incurred only a small incremental cost before zone A was scaled down.

New Issues and Future Work

Blue‑green releases can cause traffic skew when downstream services are active‑active but not part of the release channel; capacity planning is required for full‑zone traffic.

Services that do not use near‑read for DB/Redis/HBase see higher latency when called from the non‑primary zone; optimisation is ongoing.

Container orchestration must reliably handle zone‑level failures and support rapid scaling during incidents.

Ensuring sufficient resources for rapid zone‑wide scaling, especially during peak events.

Coordinating active‑active traffic switching across multiple large domains (e.g., transaction and search recommendation) and aligning blue‑green identifiers.

Developing loss‑less, production‑grade rehearsal procedures for active‑active failover.
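
The capacity‑planning concern in the list above comes down to simple arithmetic: if one zone must be able to take all traffic, it needs enough instances for the full peak plus some headroom. The sketch below uses hypothetical function names and a default 20% headroom chosen only for illustration.

```python
def instances_needed(peak_qps, qps_per_instance, headroom_percent=20):
    """Instances one zone needs to serve the full peak with headroom."""
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    load = peak_qps * (100 + headroom_percent)
    per_instance = qps_per_instance * 100
    # Integer ceiling division avoids floating-point rounding surprises.
    return (load + per_instance - 1) // per_instance


def zone_can_absorb_all(instances_in_zone, peak_qps, qps_per_instance,
                        headroom_percent=20):
    """True if the zone alone could carry 100% of peak traffic."""
    needed = instances_needed(peak_qps, qps_per_instance, headroom_percent)
    return instances_in_zone >= needed
```

With made‑up figures of 10,000 QPS at peak and 500 QPS per instance, a zone needs 24 instances to take over alone; a zone sized for only half the traffic (12 instances) would fail the check and would have to scale out before or during a switch.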

These challenges drive continued engineering research and incremental improvements.
