How We Cut Publishing Latency by 600ms: A Real‑World Backend Optimization Case Study
Through profiling with flame graphs, log analysis, and targeted refactoring—including async task handling, rule‑engine tuning, data‑load reduction, and cache redesign—we reduced the 95th‑percentile publishing latency on Baixing.com from around 3 seconds to under 1 second, achieving near‑instant, second‑level publishing.
Background
When users click the publish button on Baixing.com, the request passes through a risk‑control system that performs extensive risk and quality analysis before returning a publishing status. This heavy analysis lengthened the response time, prompting the technical team to launch a "Second‑Kill" optimization project at the end of July.
Current State and Goal
Current State
The 95th‑percentile latency for publish & update operations was about 3 seconds.
Goal
Reduce the 95th‑percentile latency to under 1 second.
Problem Identification
Using flame‑graph profiling and historical slow‑query logs, the team pinpointed two major hot‑spots: the cloud‑association analysis module (data loading) and the keyword‑matching module (matching algorithm).
In a flame graph, the Y‑axis shows call‑stack depth (which function called which); the X‑axis shows the proportion of samples each frame appeared in—width, not left‑to‑right order, indicates time share.
Flame‑Graph Tool
The flame graph revealed the call stacks that consumed the most time, highlighting the cloud‑association analysis and keyword modules as primary targets.
Log Data
Analysis of slow‑query logs exposed inefficient data structures and inadequate caching, while timing logs around key algorithms quantified the performance gap.
Optimization Plan
Asynchronously process risk‑control sub‑services that can be queued.
Optimize usage of the rule‑engine module on the business side.
Improve the cloud‑association analysis module, which has the highest potential but also the highest difficulty.
Optimize the keyword‑matching module.
Acceptance Environment
Before development, real‑time timing points were added to all publishing modules and key risk‑control components to enable precise measurement of improvements.
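The timing points might look like the following. This is a minimal Python sketch (the production system is PHP); the names timing_point and timings are hypothetical, and a real system would ship these measurements to a metrics store rather than a dict:

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed milliseconds (illustrative in-memory store)

@contextmanager
def timing_point(name):
    """Record wall-clock time spent in a named publishing stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

# Usage: wrap each risk-control stage so latency changes can be attributed precisely.
with timing_point("keyword_matching"):
    time.sleep(0.01)  # stand-in for the real work
```

Wrapping every stage this way is what lets a later section claim "this change saved 600 ms" with confidence.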
Development and Incremental Improvements
6.1 Extracting Asynchronous Tasks
Identified and async‑ified two risk‑control sub‑services. The async services reduced their 95th‑percentile latency to under 20 ms, but the overall publishing curve showed little change due to noise.
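The async extraction can be sketched as follows—a hedged Python illustration (the real system is PHP and would use a proper message queue, not an in‑process thread); publish and risk_subservice_worker are hypothetical names:

```python
import queue
import threading

task_queue = queue.Queue()
results = []

def risk_subservice_worker():
    # Background worker: drains queued risk-control checks off the publish path.
    while True:
        listing_id = task_queue.get()
        if listing_id is None:  # sentinel to stop the worker
            break
        results.append(("checked", listing_id))  # stand-in for the real analysis
        task_queue.task_done()

threading.Thread(target=risk_subservice_worker, daemon=True).start()

def publish(listing_id):
    # The publish path only enqueues; it no longer waits for the sub-service.
    task_queue.put(listing_id)
    return "published"

status = publish(42)
task_queue.join()  # production code would NOT wait; we join here only to observe
```

The user-facing response returns as soon as the task is enqueued, which is why the sub-services' own p95 dropped below 20 ms.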
6.2 Optimizing Rule‑Engine Usage
Removed obsolete rules and moved some checks to post‑publish, gaining roughly 300 ms of latency reduction.
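One way to express this split is to tag each rule with a stage and keep only blocking rules on the publish path. A hypothetical Python sketch (the rule names and registry shape are invented for illustration):

```python
# Hypothetical rule registry: each rule is (name, stage, check_fn).
rules = [
    ("banned_words", "pre", lambda text: "spam" not in text),
    ("image_quality", "post", lambda text: True),    # deferred to after publish
    ("legacy_check", "obsolete", lambda text: True), # candidate for removal
]

def pre_publish_rules():
    # Only 'pre' rules stay on the blocking publish path.
    return [(name, fn) for name, stage, fn in rules if stage == "pre"]

def run_blocking_checks(text):
    return all(fn(text) for _, fn in pre_publish_rules())
```

Moving a rule from "pre" to "post" removes its full cost from user-perceived latency, which is where the ~300 ms came from.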
6.3 Cloud‑Association Analysis
6.3.1 Reducing Redundant Data Loads
The module loaded excessive data via Data::load. Refactoring the data‑loading logic cut about 600 ms from the overall latency.
6.3.2 Parallelizing Search Requests
Attempted to parallelize HTTP calls to the search service using curl_multi. Although parallelism sped up the search phase in tests, the single‑threaded multiplexing model and increased CPU usage limited real‑world gains.
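For comparison, the same fan-out pattern in Python uses real threads rather than curl_multi's single-threaded multiplexing—an illustrative sketch only (search simulates the HTTP call; the query strings are invented):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def search(query):
    # Stand-in for one HTTP call to the search service (~20 ms each).
    time.sleep(0.02)
    return f"results:{query}"

queries = ["phones", "bikes", "sofas"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(search, queries))
parallel_ms = (time.perf_counter() - start) * 1000.0
# With true parallelism this approaches the slowest single call, not the sum.
```

With curl_multi, by contrast, all transfers are driven by one thread, so response parsing and CPU-bound work still serialize—one plausible reason the in-test speedup did not fully survive production load.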
6.3.3 Additional Attempts
Implemented cloud pre‑warming and cloud weighting: pre‑warming yielded a 50‑100 ms improvement, while weighting had negligible impact on speed.
6.4 Keyword Matching Optimization
6.4.1 Matching Algorithm
Replaced the original mb_strpos() approach with a Trie‑tree algorithm, achieving O(m) matching time, where m is the text length.
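A minimal Python sketch of the idea (the production code is PHP). Note this plain trie scan from each position is O(m·k) for maximum keyword length k; reaching the O(m) bound mentioned above requires adding failure links, as in Aho‑Corasick:

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks end of a keyword."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root

def match_keywords(trie, text):
    """Return keywords found in text by walking the trie from each position."""
    hits = set()
    for i in range(len(text)):
        node = trie
        for ch in text[i:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                hits.add(node["$"])
    return hits

trie = build_trie(["spam", "scam", "free"])
found = match_keywords(trie, "free spam offer")
```

Unlike repeated mb_strpos() calls (one pass per keyword), the trie scans the text once regardless of how many keywords are loaded.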
6.4.2 Cache Structure
Serialized the Trie‑tree into Redis, but deserialization via json_decode() became a bottleneck (≈52 ms). Switching to serialize()/unserialize() was slower; using Memcache for direct object storage reduced latency, with Redis as a fallback.
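The trade-off is easy to measure. This Python sketch compares json against pickle decoding of a nested-dict trie—an analogue of the PHP json_decode() vs. unserialize() comparison, not the original benchmark; the sample trie is invented:

```python
import json
import pickle
import time

# A small nested dict standing in for the serialized Trie-tree.
trie = {"s": {"p": {"a": {"m": {"$": "spam"}}}},
        "f": {"r": {"e": {"e": {"$": "free"}}}}}

json_blob = json.dumps(trie)
pickle_blob = pickle.dumps(trie)

def time_decode(fn, blob, n=1000):
    start = time.perf_counter()
    for _ in range(n):
        fn(blob)
    return (time.perf_counter() - start) * 1000.0

json_ms = time_decode(json.loads, json_blob)
pickle_ms = time_decode(pickle.loads, pickle_blob)
# Relative cost varies with payload shape and size; benchmark with the real
# Trie before choosing a format or store.
```

The general lesson matches the article: for a large in-memory structure, deserialization cost on every request can dwarf the lookup itself, which is why direct object storage (or a resident process) wins.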
After these changes, the keyword module contributed an additional 300‑400 ms improvement.
Project Results
Timeline
July 31 – August 16, 2017 (13 workdays). One engineer full‑time, plus business support.
Performance Gains
Across all platforms, the 95th‑percentile latency dropped from ~3 seconds to under 1 second, with individual gains of 600 ms (cloud‑association), 300 ms (rule‑engine), and 300‑400 ms (keyword module).
Further Improvement Thoughts
Turn the keyword module into a long‑running service to keep the Trie‑tree resident in memory.
Explore more efficient Trie implementations to reduce size and boost speed.
Modularize the risk‑control system by media type (text, image, audio, video) to enable finer‑grained concurrency optimizations.
Conclusion
Thorough analysis with flame graphs and real‑time metrics is essential before tackling performance problems.
Establish a solid acceptance environment to attribute latency changes to specific code changes.
For deeper dives into search performance, PHP concurrency, Trie algorithms, and serialization overhead, refer to the sections on cloud‑association and keyword optimizations.
Baixing.com Technical Team