Investigation of Midnight Interface Timeout in Vivo E‑commerce Activity System
The article details how a midnight interface timeout in Vivo’s e‑commerce activity system was traced to a logging bottleneck: a synchronous Log4j call blocked all threads while a cron‑driven log‑rotation script copied a 2.6 GB file, and the issue was resolved by switching to asynchronous logging with a non‑blocking appender.
This article documents a detailed investigation of a severe interface timeout that occurred at midnight (00:00) in Vivo's online mall activity system. The author, Jiang Ye from the Vivo Internet Server Team, describes the detection, diagnosis, root‑cause analysis, and resolution of the issue.
Problem Discovery
The activity system, which handles marketing events such as red‑packet rain and coupon grabs that start at 00:00, began generating service‑timeout alerts at midnight. The alerts persisted for several days, prompting a deeper look.
Investigation Overview
The author first examined per‑minute P95 response times using the company’s application monitoring platform and identified the product‑list API as the slowest endpoint (up to 8 s). A simplified request flow was presented, highlighting three stages: user request, application logic, and downstream RPC/database calls.
Downstream Dependency Checks
The team used tracing (call‑chain) tools such as Zipkin and Pinpoint to locate slow downstream services. Because call‑chain data is sampled, some traces were missing, so additional methods were employed:
Inspecting Dubbo timeout logs.
Checking for slow SQL via MySQL slow‑query logs and Druid connection‑pool logs.
Considering other storage services (Redis, ES, Mongo) similarly.
Application‑Side Checks
Potential internal causes were examined:
Heavy application logic (e.g., excessive serialization, reflection).
Garbage‑collection pauses (Full GC/Old GC).
Thread synchronization blocking (threads stuck in BLOCKED state).
GC logs showed no Full or Old GC at the problematic time.
Thread‑Blocking Analysis
Monitoring revealed many threads in BLOCKED state between 00:00:00 and 00:02:00. Stack traces from a representative machine were collected (timestamp 2022‑08‑02 00:00:20):
// 日志打印操作,被线程catalina‑exec‑408阻塞
"catalina‑exec‑99" Id=506 BLOCKED on org.apache.log4j.spi.RootLogger@15f368fa owned by "catalina‑exec‑408" Id=55204
at org.apache.log4j.Category.callAppenders(Category.java:204)
- blocked on org.apache.log4j.spi.RootLogger@15f368fa
at org.apache.log4j.Category.forcedLog$original$mp4HwCYF(Category.java:391)
...
// (similar entries for other threads)The analysis showed that all blocked threads were waiting for the logging thread catalina‑exec‑408 .
Root‑Cause Identification
Further inspection of the logging framework (SLF4J + Log4j) pinpointed the synchronized block in Category.callAppenders as the bottleneck:
public void callAppenders(LoggingEvent event) {
int writes = 0;
for (Category c = this; c != null; c = c.parent) {
// Protected against simultaneous call to addAppender, removeAppender,...
// 这是204行,加了 synchronized
synchronized(c) {
if (c.aai != null) {
writes += c.aai.appendLoopOnAppenders(event);
}
if (!c.additive) {
break;
}
}
}
if (writes == 0) {
repository.emitNoAppenderWarning(this);
}
}The synchronized block iterates over all configured appenders (console and a JSON appender) and writes the log event:
public int appendLoopOnAppenders(LoggingEvent event) {
int size = 0;
Appender appender;
if (appenderList != null) {
size = appenderList.size();
for (int i = 0; i < size; i++) {
appender = (Appender) appenderList.elementAt(i);
appender.doAppend(event);
}
}
return size;
}High I/O usage at midnight was observed, with disk I/O reaching ~99.6 % around 00:00:20. The team discovered a cron job that runs a log‑rotation script at 00:00:
# crontab entry
00 00 * * * /bin/sh /scripts/cutlog.shThe script copies catalina.out to a timestamped archive and truncates the original file:
#!/bin/bash
files=(
xxx
)
time=$(date +%F-%H)
for file in ${files[@]}
do
dir=$(dirname ${file})
filename=$(echo "xxx"|awk -F'/' '{print $NF}')
# Archive catalina.out and clear it
cd ${dir} && cp ${filename} ${filename}-${time} && > ${filename}
doneThe copy operation on a 2.6 GB catalina.out file consumed roughly 104 MB/s, taking about 25 seconds. During this period, all threads attempting to log were blocked, causing the observed interface timeouts.
Resolution
Several mitigation options were considered:
Disable console logging in production (rejected due to debugging needs).
Configure asynchronous logging using Log4j’s AsyncAppender (chosen).
The final solution set the async appender buffer size to 2048 and disabled blocking ( blocking=false ), accepting occasional log loss for higher throughput.
Takeaways
The case emphasizes the importance of treating alerts seriously, performing thorough root‑cause analysis, and ensuring observability and proper logging configuration in high‑traffic systems.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.