Scaling a Mobile Trading App: Ops, CMDB, and Full‑Chain Stress Testing
In this talk, Guoxin Securities’ operations expert Cai Hua outlines the architecture and massive user base of the Gold Sun mobile trading app, examines the unique operational challenges of securities platforms, and details the company’s comprehensive technical‑operations framework—including CMDB accuracy, automation, standardized release pipelines, multi‑layer monitoring, and full‑chain capacity testing—to ensure stability, scalability, and future innovation.
1. Gold Sun APP Introduction
Gold Sun mobile APP is Guoxin Securities' self‑developed financial investment software integrating market data, trading, and wealth management. It has over 14 million registered users, more than 1.2 million daily active users, and accounts for over 80% of the company's total transaction volume.
The app undergoes more than 3,000 changes per year across 350 system components and is deployed in a two‑site, three‑center architecture using Alibaba Cloud, Microsoft Cloud, Amazon Cloud, and Shanghai Stock Exchange Cloud.
Gold Sun APP covers most of Guoxin's business, including Hong Kong and A‑share trading, market quotes, self‑selected stocks, industry chain maps, account opening, intelligent monitoring, options, wealth management, institutional account opening, and dozens of other services.
2. Operations Challenges
Key operational difficulties for securities businesses include:
Frequent market fluctuations: Unpredictable market spikes can occur at any time, making traffic surges hard to anticipate.
Complex business systems: Numerous subsystems, each with its own monitoring and deployment tools, lead to fragmented tooling and intricate architectures.
Difficulty in fault analysis and location: Distributed systems, scattered data, and long service chains make root-cause analysis challenging.
Strict regulation: Financial transactions must be error-free; any failure triggers customer complaints, compensation claims, and stringent accountability.
The industry’s push for digital transformation adds micro‑service migration pressure, coupling of legacy and new systems, rapid iteration, and increasingly frequent releases, all of which amplify operational complexity.
3. Technical Operations System Construction
Facing these challenges, Guoxin has built a comprehensive technical‑operations framework. Development focuses on business logic, architecture optimization, and performance, while operations handle IaaS (network, storage, servers, virtualization), PaaS (foundation components, OS), and ensure business deployment and availability. We do not write code; we are code movers.
Operations prioritize availability through performance, security, efficiency, change management, disaster recovery, continuous delivery, fault handling, and capacity planning.
Four main contradictions were identified:
3.1 Inaccurate Configuration Information
The first contradiction is that CMDB data is often outdated; operational systems are fragmented and not built on a reliable CMDB.
Resource data relies on manually maintained Excel sheets, leading to high maintenance cost and low efficiency. Lack of automated discovery hampers accurate asset management.
Inconsistent manual changes introduce CMDB inaccuracies. Because automation depends on CMDB data, any data anomaly can trigger a major incident.
We launched a CMDB accuracy project: standardize manual entry, collect machine data in real time, and cross‑verify with ITIL, basic monitoring, Kunlun big‑data monitoring, event platforms, continuous delivery, and capacity systems.
CMDB architecture diagram:
Resource lifecycle management covers servers, IP addresses, infrastructure, and application resources.
The logical layer includes model management, infrastructure object management, PaaS object management, and application management.
Automatic collection and validation are provided for public cloud, private cloud, servers, and network resources.
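The cross-verification idea above can be sketched in a few lines: manually entered CMDB records are diffed against facts collected from the machines themselves, and any disagreement (or host unknown to the CMDB) is surfaced for review. The record shapes and field names here are illustrative assumptions, not Guoxin's actual schema.

```python
# Minimal sketch: cross-verify manually entered CMDB records against
# agent-collected facts. Field names are illustrative, not the real schema.

def diff_records(cmdb: dict, collected: dict, fields=("ip", "os", "cpu_cores", "owner")):
    """Return the fields where the CMDB entry disagrees with collected data."""
    return {
        f: {"cmdb": cmdb.get(f), "collected": collected.get(f)}
        for f in fields
        if cmdb.get(f) != collected.get(f)
    }

def reconcile(cmdb_entries: dict, collected_entries: dict):
    """Yield (host, mismatches) pairs needing review; also flag hosts the
    collection agents see but the CMDB does not know about."""
    for host, facts in collected_entries.items():
        entry = cmdb_entries.get(host)
        if entry is None:
            yield host, {"_status": "missing_from_cmdb"}
        else:
            mismatches = diff_records(entry, facts)
            if mismatches:
                yield host, mismatches

cmdb = {"web-01": {"ip": "10.0.0.5", "os": "CentOS 7", "cpu_cores": 8, "owner": "appteam"}}
collected = {
    "web-01": {"ip": "10.0.0.5", "os": "CentOS 7", "cpu_cores": 16, "owner": "appteam"},
    "web-02": {"ip": "10.0.0.6", "os": "CentOS 7", "cpu_cores": 8, "owner": "appteam"},
}
anomalies = dict(reconcile(cmdb, collected))
# web-01 disagrees on cpu_cores; web-02 is unknown to the CMDB
```

In practice the "collected" side would come from the real-time machine collectors and the cloud-management platform, with each anomaly routed to the event platform rather than fixed silently.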
Overall DevOps operations architecture:
The cloud‑management platform enables automated provisioning and asset management, while the automation and continuous delivery platforms provide end‑to‑end deployment capabilities. The big‑data platform ingests logs and integrates with CMDB; a centralized event platform and unified alarm channel handle monitoring and alerts. All operational capabilities expose unified APIs to the operations portal.
3.2 Inadequate Standardization
Standardization gaps lead to manual releases, errors, low deployment efficiency, and reliance on specific personnel.
Most incidents stem from release changes. Our continuous release platform aims for controllable, fast, and safe changes, minimizing impact on availability. Control spans process (ITIL), version, configuration, and rollback.
Standardization covers four areas:
Delivery standards: directory standardization, configuration specs, version control, file format, encoding.
Deployment standards: fixed steps, start/stop script norms, verification script norms, deployment checks.
Vendor application integration standards: directory standardization, log separation, version control, start/stop and verification scripts.
Database script standards: version control, SQL syntax, change procedures, backup/check protocols.
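A delivery standard is only useful if it can be checked mechanically. As a minimal sketch, the layout below (`bin/` start/stop/verify scripts, a `conf` directory, a numeric `VERSION` tag) is an assumed example layout, not Guoxin's published standard; the point is that compliance becomes a list of machine-verifiable rules.

```python
# Illustrative compliance check for a delivery standard: required directory
# layout, start/stop/verification scripts, and a numeric dotted version tag.
# The exact layout (bin/, conf/, VERSION) is an assumption for this sketch.
from pathlib import Path
import tempfile

REQUIRED = ["bin/start.sh", "bin/stop.sh", "bin/check.sh", "conf", "VERSION"]

def check_artifact(root) -> list:
    """Return human-readable violations; an empty list means compliant."""
    base = Path(root)
    problems = [f"missing: {p}" for p in REQUIRED if not (base / p).exists()]
    version = base / "VERSION"
    if version.exists():
        tag = version.read_text(encoding="utf-8").strip()
        if not all(part.isdigit() for part in tag.split(".")):  # e.g. 2.14.0
            problems.append(f"non-numeric version tag: {tag!r}")
    return problems

# Build a toy artifact that follows the layout, then verify it passes.
root = Path(tempfile.mkdtemp())
(root / "bin").mkdir()
for script in ("start.sh", "stop.sh", "check.sh"):
    (root / "bin" / script).write_text("#!/bin/sh\n")
(root / "conf").mkdir()
(root / "VERSION").write_text("2.14.0\n")
violations = check_artifact(root)
```

A check like this would run as a pipeline gate, so non-compliant packages never reach the artifact repository.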
Application delivery flow: developers perform CI, artifacts are stored in a unified repository, testers deploy to test environments, security scans are performed, then artifacts flow to production with staged and gray releases.
The artifact repository manages programs, configurations, and deployment scripts across development, test, pre‑release, and production environments, providing unified storage, versioning, and configuration separation.
The delivery pipeline consists of five stages: build, compliance check (version, config, component), test deployment with test-result storage, security scan, and production promotion. Artifacts remain immutable throughout.
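The immutability guarantee behind those stage gates can be sketched as follows: the artifact is built once, fingerprinted, and the same bytes are re-verified at every later stage, so only metadata and environment change between stages. The checksum mechanics and stage names are an illustrative assumption of how such a gate might work.

```python
# Sketch of immutable artifact promotion: build once, fingerprint, and
# refuse any promotion that skips a stage or presents different bytes.
import hashlib

STAGES = ["build", "compliance_check", "test_deploy", "security_scan", "production"]

def fingerprint(artifact_bytes: bytes) -> str:
    return hashlib.sha256(artifact_bytes).hexdigest()

def promote(record: dict, artifact_bytes: bytes, next_stage: str) -> dict:
    """Advance an artifact record to the next stage, enforcing immutability
    and strict stage ordering."""
    if fingerprint(artifact_bytes) != record["sha256"]:
        raise ValueError("artifact mutated since build; promotion refused")
    if STAGES.index(next_stage) != STAGES.index(record["stage"]) + 1:
        raise ValueError(f"cannot jump from {record['stage']} to {next_stage}")
    return {**record, "stage": next_stage}

pkg = b"app-2.14.0 package bytes"
record = {"name": "goldsun-app", "sha256": fingerprint(pkg), "stage": "build"}
record = promote(record, pkg, "compliance_check")
```

Because the fingerprint is computed at build time and checked at every gate, "what was tested is what ships" holds by construction rather than by convention.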
The platform provides a standardized release process: host selection linked to CMDB, batch strategy definition, version and config selection, pre‑release checks, and automated deployment of applications, including start/stop and health checks.
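The batch strategy and health-check loop above can be sketched as a small rolling release: hosts come from the CMDB, a small gray (canary) batch goes first, and each batch must pass its verification before the next proceeds. The batch sizes and the `deploy`/`health_check` callables are stand-ins for the platform's real deployment and verification steps.

```python
# Minimal sketch of a batched (gray) release with per-batch health gating.
def batches(hosts, first=1, size=3):
    """Gray strategy: a small canary batch first, then fixed-size batches."""
    yield hosts[:first]
    for i in range(first, len(hosts), size):
        yield hosts[i:i + size]

def rolling_release(hosts, deploy, health_check):
    """Deploy batch by batch; halt the release as soon as a batch fails."""
    done = []
    for batch in batches(hosts):
        for host in batch:
            deploy(host)
        if not all(health_check(h) for h in batch):
            return {"status": "halted", "deployed": done, "failed_batch": batch}
        done.extend(batch)
    return {"status": "success", "deployed": done}

hosts = [f"app-{n:02d}" for n in range(1, 8)]
result = rolling_release(hosts, deploy=lambda h: None, health_check=lambda h: True)
```

Halting at the first failed batch bounds the blast radius of a bad release to the canary plus one batch, which is the main point of gray releasing.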
The continuous delivery platform has integrated end‑to‑end deployment for Gold Sun APP, PC version, options system, centralized trading, and more, dramatically improving release efficiency, safety, and system stability.
3.3 Difficulty in Fault Localization
The third contradiction is the proliferation of monitoring systems and long business chains, making fault localization difficult.
Fault handling aims to reduce impact, prevent known faults, enable rapid switching, and quickly locate unknown faults; the principle is to restore service first, then analyze.
Fault handling stages:
Prevention: capacity planning, monitoring coverage, CI, stress testing.
Detection: alerts, inspections, customer feedback.
Localization: log analysis, monitoring analysis, root‑cause tracing.
Recovery: rollback, throttling, degradation, circuit breaking, disaster‑recovery switch.
Improvement: post‑mortem, verification, continuous operation.
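The recovery tactics above (throttling, degradation, circuit breaking) can be sketched as a tiny circuit breaker: after a run of consecutive failures it stops calling the faulty dependency and serves a degraded fallback instead. The threshold and fallback here are illustrative choices, and a production breaker would also add a half-open state that retries after a timeout, omitted for brevity.

```python
# Minimal circuit-breaker sketch: fail fast after N consecutive failures
# and serve a degraded result instead of hammering a broken dependency.
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, func, fallback):
        if self.open:              # breaker open: skip the dependency entirely
            return fallback()
        try:
            result = func()
            self.failures = 0      # any success resets the failure count
            return result
        except Exception:
            self.failures += 1
            return fallback()

def flaky():
    raise RuntimeError("backend down")

breaker = CircuitBreaker(threshold=2)
degraded = lambda: "cached quote"
```

This matches the "restore service first, then analyze" principle: callers keep getting a (degraded) answer while the root cause is investigated.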
Monitoring challenges include multiple vendor tools, lack of unified view, fragmented configurations, and missing alerts.
We address monitoring through six aspects:
Resource integration: unify vendor monitoring tools into a single visual platform.
Log management: collect system and business logs, link with CMDB, convert logs to metrics.
Metric & alarm management: multi‑dimensional business metrics, dynamic thresholds, intelligent prediction.
Scenario monitoring: customizable monitoring views for specific business scenarios.
Capacity management: combine QPS, resource usage, network bandwidth, and stress‑test data for forecasting.
Call‑chain analysis: leverage micro‑service call chains for rapid fault localization.
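Two of these aspects, converting logs to metrics and dynamic thresholds, can be sketched together: access-log lines are reduced to an error-rate metric, and an alert fires only when the current value exceeds the historical mean plus a few standard deviations. The log format and the 3-sigma rule are illustrative assumptions, not the Kunlun platform's actual configuration.

```python
# Sketch: turn logs into a metric (5xx error rate), then apply a simple
# dynamic threshold (mean + 3 sigma over recent history) instead of a
# hand-tuned static limit.
from statistics import mean, pstdev

def error_rate(lines):
    """lines like '09:31 GET /trade 500' -> fraction of 5xx responses."""
    statuses = [int(line.rsplit(" ", 1)[1]) for line in lines]
    return sum(s >= 500 for s in statuses) / len(statuses)

def dynamic_threshold(history, sigmas=3.0):
    return mean(history) + sigmas * pstdev(history)

history = [0.01, 0.02, 0.01, 0.015, 0.012]   # recent per-minute error rates
current = error_rate([
    "09:31 GET /trade 200",
    "09:31 GET /trade 500",
    "09:31 GET /quote 200",
    "09:31 GET /quote 503",
])
alert = current > dynamic_threshold(history)
```

A threshold derived from history adapts to each metric's normal band, which matters for market-driven traffic where a fixed limit is either too noisy at the open or too lax overnight.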
Kunlun operations big‑data platform architecture: data sources (network, servers, system logs, business logs, CMDB, cloud‑management), logical layer (data integration, storage, computing, services), delivering business panorama monitoring, log search, call‑chain analysis, alarm management, and capacity monitoring.
The platform consolidates monitoring, logs, and events, providing unified visualization and event management.
Scenario monitoring offers customized curves for risk analysis, fault investigation, performance optimization, and trend analysis, improving efficiency and knowledge accumulation.
Fault localization drills down from system topology to component trends, service topology, host alerts, log search, and call-chain analysis, moving from visualization to resolution.
3.4 Unknown System Capacity
The fourth contradiction is the lack of accurate capacity assessment for sudden market spikes or data‑center failures.
Full‑link stress testing aims to evaluate the maximum capacity of the Gold Sun mobile system and its backend centralized trading system, meeting regulatory requirements, identifying bottlenecks, and exposing high‑concurrency defects.
Traditional testing uses low‑configuration environments that differ from production, leading to inaccurate capacity estimates.
We adopt production‑like environments with real user behavior data, identical packages, and full‑chain testing.
Log‑replay based full‑link capacity testing: collect peak‑period logs, analyze to build data models, feed into a data pool, generate traffic via a test engine, and drive requests through the entire stack.
Testing covers firewall, load balancer, WAF, access service, API gateway, micro‑service clusters, middleware, centralized trading, databases, and Redis cache.
Automation spans model analysis, data extraction, script generation, dynamic traffic control, call‑chain analysis, and result storage.
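The log-replay idea can be sketched briefly: peak-period access logs are reduced to a weighted request model (the data pool), and the test engine draws requests in those observed proportions. The endpoints, weights, and log format below are illustrative, not the actual Gold Sun traffic mix.

```python
# Sketch of log-replay traffic modeling: build a weighted request mix from
# peak logs, then generate replay requests in the same proportions.
import random
from collections import Counter

def build_model(log_lines):
    """Count requests per endpoint to capture the peak traffic mix."""
    return Counter(line.split()[1] for line in log_lines)

def replay(model, n, rng=None):
    """Draw n requests following the observed traffic proportions."""
    rng = rng or random.Random(0)   # fixed seed: repeatable test runs
    endpoints = list(model)
    weights = [model[e] for e in endpoints]
    return rng.choices(endpoints, weights=weights, k=n)

peak_logs = ["GET /quote 200"] * 70 + ["POST /order 200"] * 25 + ["GET /asset 200"] * 5
model = build_model(peak_logs)
requests = replay(model, 1000)
```

Replaying the measured mix, rather than a synthetic uniform load, is what lets a production-like test expose the same bottlenecks a real market spike would.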
Performance testing is fully automated; daily pre‑release testing and monthly production testing provide accurate capacity assessments.
4. Future Considerations
Future improvements focus on further standardization, building efficient and secure automated operations platforms, full‑business continuous delivery, fully automated pipelines, console‑less operations, and containerization to accelerate deployment and achieve elastic scaling.
We plan to develop an operations middle‑platform exposing scenario capabilities, and enhance intelligent log and metric analysis with call‑chain and topology for smarter fault localization.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.