Backend Development 17 min read

Investigation of Midnight Interface Timeout in Vivo E‑commerce Activity System

The article details how a midnight interface timeout in Vivo’s e‑commerce activity system was traced to a logging bottleneck: a synchronous Log4j call blocked all threads while a cron‑driven log‑rotation script copied a 2.6 GB file, and the issue was resolved by switching to asynchronous logging with a non‑blocking appender.

vivo Internet Technology
vivo Internet Technology
vivo Internet Technology
Investigation of Midnight Interface Timeout in Vivo E‑commerce Activity System

This article documents a detailed investigation of a severe interface timeout that occurred at midnight (00:00) in Vivo's online mall activity system. The author, Jiang Ye from the Vivo Internet Server Team, describes the detection, diagnosis, root‑cause analysis, and resolution of the issue.

Problem Discovery

The activity system, which handles marketing events such as red‑packet rain and coupon grabs that start at 00:00, began generating service‑timeout alerts at midnight. The alerts persisted for several days, prompting a deeper look.

Investigation Overview

The author first examined per‑minute P95 response times using the company’s application monitoring platform and identified the product‑list API as the slowest endpoint (up to 8 s). A simplified request flow was presented, highlighting three stages: user request, application logic, and downstream RPC/database calls.

Downstream Dependency Checks

The team used tracing (call‑chain) tools such as Zipkin and Pinpoint to locate slow downstream services. Because call‑chain data is sampled, some traces were missing, so additional methods were employed:

Inspecting Dubbo timeout logs.

Checking for slow SQL via MySQL slow‑query logs and Druid connection‑pool logs.

Considering other storage services (Redis, ES, Mongo) similarly.

Application‑Side Checks

Potential internal causes were examined:

Heavy application logic (e.g., excessive serialization, reflection).

Garbage‑collection pauses (Full GC/Old GC).

Thread synchronization blocking (threads stuck in BLOCKED state).

GC logs showed no Full or Old GC at the problematic time.

Thread‑Blocking Analysis

Monitoring revealed many threads in BLOCKED state between 00:00:00 and 00:02:00. Stack traces from a representative machine were collected (timestamp 2022‑08‑02 00:00:20):

// 日志打印操作,被线程catalina‑exec‑408阻塞
"catalina‑exec‑99" Id=506 BLOCKED on org.apache.log4j.spi.RootLogger@15f368fa owned by "catalina‑exec‑408" Id=55204
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    -  blocked on org.apache.log4j.spi.RootLogger@15f368fa
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF(Category.java:391)
    ...
// (similar entries for other threads)

The analysis showed that all blocked threads were waiting for the logging thread catalina‑exec‑408 .

Root‑Cause Identification

Further inspection of the logging framework (SLF4J + Log4j) pinpointed the synchronized block in Category.callAppenders as the bottleneck:

public void callAppenders(LoggingEvent event) {
  int writes = 0;
  for (Category c = this; c != null; c = c.parent) {
    // Protected against simultaneous call to addAppender, removeAppender,...
    // 这是204行,加了 synchronized
    synchronized(c) {
      if (c.aai != null) {
        writes += c.aai.appendLoopOnAppenders(event);
      }
      if (!c.additive) {
        break;
      }
    }
  }
  if (writes == 0) {
    repository.emitNoAppenderWarning(this);
  }
}

The synchronized block iterates over all configured appenders (console and a JSON appender) and writes the log event:

public int appendLoopOnAppenders(LoggingEvent event) {
  int size = 0;
  Appender appender;
  if (appenderList != null) {
    size = appenderList.size();
    for (int i = 0; i < size; i++) {
      appender = (Appender) appenderList.elementAt(i);
      appender.doAppend(event);
    }
  }
  return size;
}

High I/O usage at midnight was observed, with disk I/O reaching ~99.6 % around 00:00:20. The team discovered a cron job that runs a log‑rotation script at 00:00:

# crontab entry
00 00 * * * /bin/sh /scripts/cutlog.sh

The script copies catalina.out to a timestamped archive and truncates the original file:

#!/bin/bash
files=(
  xxx
)

time=$(date +%F-%H)

for file in ${files[@]}
do
  dir=$(dirname ${file})
  filename=$(echo "xxx"|awk -F'/' '{print $NF}')
  # Archive catalina.out and clear it
  cd ${dir} && cp ${filename} ${filename}-${time} && > ${filename}

done

The copy operation on a 2.6 GB catalina.out file consumed roughly 104 MB/s, taking about 25 seconds. During this period, all threads attempting to log were blocked, causing the observed interface timeouts.

Resolution

Several mitigation options were considered:

Disable console logging in production (rejected due to debugging needs).

Configure asynchronous logging using Log4j’s AsyncAppender (chosen).

The final solution set the async appender buffer size to 2048 and disabled blocking ( blocking=false ), accepting occasional log loss for higher throughput.

Takeaways

The case emphasizes the importance of treating alerts seriously, performing thorough root‑cause analysis, and ensuring observability and proper logging configuration in high‑traffic systems.

BackendmonitoringjavaloggingTomcatperformance troubleshootingshell script
vivo Internet Technology
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.