
Improving Stability and High Availability of an Advertising Billing System: Architecture Upgrade and Optimizations

This article describes the background and problems of an advertising billing system and the architectural upgrades made to improve its stability, scalability, and high availability: MQ replacement, thread‑pool isolation, Redis/TiKV redundancy, and Spark‑based compensation.

Zhuanzhuan Tech

1 Background Introduction

Service stability and high availability are critical for modern businesses, directly affecting user experience, business continuity, and company reputation. In advertising billing, any interruption can cause significant financial loss.

This article introduces the optimizations and upgrades made to improve the stability and availability of the advertising billing system.

1.1 Advertising Billing Models

Common billing models include CPT, CPM, CPC, CPA, and CPS, each affecting revenue differently for advertisers and platforms.

1.2 Advertising Billing System Functions

The system consists of two main phases:

Ad Retrieval Phase: Generates a unique billing credential; its reliability directly impacts revenue.

Ad Billing Phase: Handles anti‑fraud, billing operations, and post‑processing, ensuring no missed charges and real‑time billing for downstream services.

2 Upgrade Background

2.1 Initial Process

The original billing flow used asynchronous threads and stored failed requests in Redis for later retry via a scheduled task.

Advantages: Improved throughput and response time.

Disadvantages: Under high concurrency, billing data could be lost and compensation was delayed. The scheduled retry task ran on a single server, causing uneven load. Complex thread‑pool usage created a risk of deadlocks.

2.2 Upgrade Trigger

2.2.1 Problem Discovery

Alarms indicated that the thread‑pool task queue size far exceeded thresholds, and related business metrics (clicks, revenue) dropped sharply.

2.2.2 Root Cause Analysis

Thread dumps showed all billing threads blocked on countDownLatch.await(). The causes were:

1. Slow database queries for data extraction, increasing order‑creation latency.

2. Parent and child tasks sharing the same thread pool, leading to deadlock when parent tasks wait for unfinished child tasks.

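In the deadlock scenario, two parent billing tasks occupy the pool's core threads while their anti‑fraud child tasks wait in the queue for a free thread. The parents cannot finish until the children run, and the children cannot run until a parent releases its thread, so no task makes progress and the growing backlog eventually causes an OOM error.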

2.2.3 Solution

1. Restarting the service to quickly restore normal operation.

2. Isolating parent and child tasks into separate thread pools to avoid deadlocks.
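The effect of isolating the pools can be reproduced in a small experiment. The sketch below is illustrative only: the class name, pool sizes, and timeout are assumptions, and a start barrier is added so that both parent tasks are guaranteed to hold a thread before any child is submitted, which mimics the system under load. With one shared pool the parents never finish; with separate pools they do.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; names and sizes are illustrative, not the production code.
class PoolIsolationDemo {

    /**
     * Runs `parents` billing tasks on parentPool. Each parent submits one
     * anti-fraud child task to childPool and blocks on a CountDownLatch until
     * the child has run. Returns true if all parents finish within timeoutMs.
     * Note: `parents` must not exceed the parent pool size, because of the start barrier.
     */
    static boolean runBilling(ExecutorService parentPool, ExecutorService childPool,
                              int parents, long timeoutMs) {
        CountDownLatch allStarted = new CountDownLatch(parents); // barrier simulating load
        CountDownLatch allDone = new CountDownLatch(parents);
        for (int i = 0; i < parents; i++) {
            parentPool.execute(() -> {
                try {
                    allStarted.countDown();
                    allStarted.await();                      // every parent now holds a thread
                    CountDownLatch childDone = new CountDownLatch(1);
                    childPool.execute(childDone::countDown); // stand-in for the anti-fraud check
                    childDone.await();                       // parent blocks, keeping its thread
                    allDone.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            return allDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Parents and children share one 2-thread pool: the children stay queued forever. */
    static boolean sharedPoolCompletes() {
        ExecutorService shared = Executors.newFixedThreadPool(2);
        try {
            return runBilling(shared, shared, 2, 500);
        } finally {
            shared.shutdownNow(); // interrupts the stuck parents
        }
    }

    /** Children get their own pool, so a free thread is always available to them. */
    static boolean isolatedPoolsComplete() {
        ExecutorService parentPool = Executors.newFixedThreadPool(2);
        ExecutorService childPool = Executors.newFixedThreadPool(2);
        try {
            return runBilling(parentPool, childPool, 2, 500);
        } finally {
            parentPool.shutdownNow();
            childPool.shutdownNow();
        }
    }
}
```

Calling sharedPoolCompletes() times out and returns false, while isolatedPoolsComplete() returns true. The only difference between the two is which pool receives the child tasks.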

2.2.4 Reflection

The existing compensation mechanism was insufficient: scheduled retries could not handle sudden failures promptly, and a restart before failed data was persisted to Redis caused revenue loss.

3 Upgraded Version

3.1 Revised Process

The billing flow remains asynchronous, but after logging the request, an MQ message is sent; the billing action is considered complete once the message is queued.

3.2 Improvement 1: MQ Replacement

Replacing the async thread pool with a message queue improves decoupling and reliability. MQ ensures durable delivery and retry, balancing load across consumers and guaranteeing eventual consistency.
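To make the retry behavior concrete, here is a minimal in‑memory sketch of redelivery. It is not a real MQ client: the class, the attempt limit, and the dead‑letter list are assumptions chosen for illustration, and a production broker would handle persistence and redelivery itself.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for broker-side redelivery.
class RedeliveryDemo {

    /**
     * Drains the queue, passing each message to the handler. A message whose
     * handler returns false (no acknowledgement) is re-queued until it has been
     * attempted maxAttempts times; after that it is set aside as a dead letter.
     * Attempts are counted per message string, so the demo assumes unique messages.
     * Returns the dead-lettered messages.
     */
    static List<String> drain(Deque<String> queue, Predicate<String> handler, int maxAttempts) {
        Map<String, Integer> attempts = new HashMap<>();
        List<String> deadLetters = new ArrayList<>();
        while (!queue.isEmpty()) {
            String msg = queue.pollFirst();
            int n = attempts.merge(msg, 1, Integer::sum);
            if (handler.test(msg)) {
                continue;                 // acknowledged, nothing more to do
            }
            if (n < maxAttempts) {
                queue.addLast(msg);       // redeliver later
            } else {
                deadLetters.add(msg);     // give up; leave for offline compensation
            }
        }
        return deadLetters;
    }
}
```

A message that fails once and then succeeds is simply processed on its second delivery, while one that keeps failing ends up in the returned list instead of being silently dropped.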

3.3 Improvement 2: Degradation Strategy

If MQ becomes unavailable, the system falls back to asynchronous thread processing, preserving availability while accepting possible latency.
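One way to express this degradation path is a dispatcher that tries the MQ first and falls back to a local executor when the send fails. The sketch below rests on assumptions: MqProducer is a hypothetical interface rather than a real client API, and treating any exception from send as "MQ unavailable" is a simplification.

```java
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical dispatcher; the interface and return values are illustrative.
class BillingDispatcher {

    /** Hypothetical MQ producer abstraction; a real client API will differ. */
    interface MqProducer {
        void send(String billingRequest) throws Exception;
    }

    private final MqProducer producer;
    private final Executor fallbackExecutor;       // async thread pool used when MQ is down
    private final Consumer<String> billingHandler; // the local billing logic

    BillingDispatcher(MqProducer producer, Executor fallbackExecutor,
                      Consumer<String> billingHandler) {
        this.producer = producer;
        this.fallbackExecutor = fallbackExecutor;
        this.billingHandler = billingHandler;
    }

    /** Returns "MQ" if the message was queued, "FALLBACK" if it was handed to the thread pool. */
    String dispatch(String billingRequest) {
        try {
            producer.send(billingRequest);
            return "MQ";
        } catch (Exception e) {
            // Assumption: any send failure means the MQ is unavailable, so degrade.
            fallbackExecutor.execute(() -> billingHandler.accept(billingRequest));
            return "FALLBACK";
        }
    }
}
```

The caller sees the same non‑blocking behavior in both cases; only the path the request takes changes.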

3.4 Improvement 3: Redis/TiKV Redundancy

Billing credentials are stored in Redis, with TiKV as a backup for when Redis is down. Each write goes to Redis synchronously and is replicated to TiKV asynchronously, so the backup write stays off the request path and continuity is preserved without a performance penalty.
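The write and read paths can be sketched as follows. KvStore and InMemoryStore are hypothetical stand‑ins for the Redis and TiKV clients, and the choice to tolerate a failed primary write as long as the backup write is attempted is an assumption, not something the original design states.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// Hypothetical store wrapper; not the production implementation.
class RedundantCredentialStore {

    /** Hypothetical minimal key-value interface standing in for a Redis or TiKV client. */
    interface KvStore {
        void put(String key, String value);
        String get(String key); // null when absent; throws RuntimeException when the store is down
    }

    /** In-memory store with a switch to simulate an outage, for demonstration only. */
    static class InMemoryStore implements KvStore {
        private final Map<String, String> data = new ConcurrentHashMap<>();
        volatile boolean down = false;

        public void put(String key, String value) {
            if (down) throw new IllegalStateException("store unavailable");
            data.put(key, value);
        }

        public String get(String key) {
            if (down) throw new IllegalStateException("store unavailable");
            return data.get(key);
        }
    }

    private final KvStore primary;      // plays the Redis role
    private final KvStore backup;       // plays the TiKV role
    private final Executor replicator;  // runs backup writes off the request path

    RedundantCredentialStore(KvStore primary, KvStore backup, Executor replicator) {
        this.primary = primary;
        this.backup = backup;
        this.replicator = replicator;
    }

    /** Writes to the primary synchronously and replicates to the backup asynchronously. */
    CompletableFuture<Void> save(String credentialId, String payload) {
        try {
            primary.put(credentialId, payload);
        } catch (RuntimeException e) {
            // Assumption: a primary failure is tolerated because the backup write still runs.
        }
        return CompletableFuture.runAsync(() -> backup.put(credentialId, payload), replicator);
    }

    /** Reads from the primary, falling back to the backup on a miss or an outage. */
    Optional<String> load(String credentialId) {
        try {
            String value = primary.get(credentialId);
            if (value != null) {
                return Optional.of(value);
            }
        } catch (RuntimeException e) {
            // primary unavailable: fall through to the backup
        }
        return Optional.ofNullable(backup.get(credentialId));
    }
}
```

Because replication is asynchronous, a credential read from the backup immediately after a write may not be there yet; the returned future lets a caller or test wait for replication when that matters.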

3.5 Improvement 4: Spark Compensation

Beyond MQ retries, a Spark job processes large batches of failed billing records by extracting key information from logs and re‑sending MQ messages, automating recovery and reducing manual intervention.
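The per‑record logic of such a job might look like the plain‑Java sketch below, which a Spark job would apply across log partitions. The log line format, field names, and message payload are all hypothetical, since the article does not describe them; de‑duplicating by credential ID is likewise an assumption to avoid re‑sending the same record twice.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extraction step; in practice this would run inside the Spark job.
class BillingLogCompensation {

    // Assumed log format: "<timestamp> BILLING_FAILED credentialId=<id> adId=<id>"
    private static final Pattern FAILED =
            Pattern.compile("BILLING_FAILED\\s+credentialId=(\\S+)\\s+adId=(\\S+)");

    /**
     * Scans log lines for failed billing records and builds one message payload
     * ("credentialId:adId") per distinct credential, in first-seen order.
     */
    static List<String> messagesToResend(List<String> logLines) {
        Set<String> seenCredentials = new HashSet<>();
        List<String> messages = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FAILED.matcher(line);
            if (m.find() && seenCredentials.add(m.group(1))) {
                messages.add(m.group(1) + ":" + m.group(2));
            }
        }
        return messages;
    }
}
```

The resulting payloads would then be published back to the MQ so that the normal billing consumers process them.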

4 Conclusion

Ensuring the stability of the advertising billing system is vital for trustworthy settlements and business continuity. Continuous optimization through MQ, thread‑pool isolation, Redis/TiKV redundancy, and Spark‑based compensation improves efficiency and resource utilization and supports the platform's competitiveness in the ad industry.

About the Author

Rong Zhang, Senior Development Engineer at Zhuanzhuan Commercial, responsible for ad retrieval, billing, and feature engineering systems.
