How to Diagnose Disk‑Space Exhaustion During a Traffic Surge and Build a Dynamic Log‑Level Degradation Tool
During a high-traffic promotion, a service ran out of disk space because massive log files were not cleaned up. The investigation found a lingering SLS process holding deleted files open. This article walks through the root-cause analysis, the kill-process fix, and a Spring Boot starter that dynamically degrades log levels to prevent a recurrence.
Incident Overview
During a major promotion an online application suddenly generated a large number of alerts indicating that disk usage had spiked to over 80%. The ops team logged into the affected machines and ran df to confirm the high usage.
Initial Diagnosis
The df output showed the root filesystem at 93% usage. Because the promotion caused a surge in request volume, the team first suspected excessive log generation. The machines are configured to automatically compress and clean logs once a file reaches a certain size or the overall usage hits a threshold, but the cleanup did not trigger on the promotion day.
Running du -sm * revealed several service.log files each hundreds of megabytes in size, confirming that logs were the primary consumer of disk space.
Why Deleting Logs Didn't Free Space
Ops manually removed some log files with rm service.log.20201105193331, yet the usage reported by df did not drop and kept climbing. The team ran lsof | grep deleted to list open file descriptors that point to deleted files and discovered a long-running SLS (Alibaba Log Service) process still holding deleted log files open:
lsof | grep deleted
SLS 11526 root 3r REG 253,0 2665433605 104181296 /home/admin/.../service.log.20201205193331 (deleted)
Because the file descriptor remained open, the inode's in-memory reference count was not zero even though its directory entry was gone, so the space was not reclaimed.
Background Knowledge
In Linux, a file's data is freed only when both its i_count (in-memory reference count) and i_nlink (hard-link count) drop to zero. Deleting a file with rm merely removes a directory entry, decrementing i_nlink. If a process still has the file open, i_count stays non-zero and the blocks stay allocated.
The SLS agent continuously reads logs for collection, which kept the deleted log files alive and prevented disk space from being released.
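The behaviour described above can be reproduced in a few lines. The following is a minimal sketch, assuming a Linux (or other POSIX) filesystem; the class name DeletedFileDemo and the temp-file prefix are illustrative and not part of the original tooling. It opens a file, deletes its directory entry, and then still reads the data through the open stream, which is the same situation the SLS agent created.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DeletedFileDemo {

    /**
     * Writes a temp file, opens it, deletes it, and then reads it through the
     * still-open stream. Returns the text that was read after the deletion.
     */
    static String readAfterDelete(String content) {
        try {
            Path file = Files.createTempFile("service", ".log");
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            try (InputStream in = Files.newInputStream(file)) {
                // Removes the directory entry only; the open descriptor keeps the inode alive.
                Files.delete(file);
                // The blocks are still allocated, so the data remains readable.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } // Closing the stream drops the last reference; only now is the space freed.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterDelete("still readable after rm"));
    }
}
```

While the stream is open, the file no longer appears in a directory listing but would still be counted by df and listed by lsof as (deleted).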
Immediate Fix
After confirming the culprit, the team killed the offending SLS process:
kill -9 11526
Running df again showed filesystem usage dropping to 80%, confirming that the space had finally been reclaimed.
Long‑Term Prevention – Log‑Level Degradation Strategy
To avoid repeating the issue, the team designed a log‑level degradation mechanism that can dynamically lower the verbosity of application logs when disk usage approaches a critical threshold. The solution consists of three parts:
A Spring Boot service (LoggerLevelSettingService) that wraps org.springframework.boot.logging.LoggingSystem to change logger levels at runtime.
A configuration object (LoggerConfig) that carries loggerName and the desired level.
A Spring Boot starter that registers the service and a listener (DegradationSwitchInitializer) which reacts to changes in a central configuration center.
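The tool itself is driven by the configuration center, so something (an operator or a scheduled job) still has to decide which level to push. As a rough sketch of that decision, the class below maps disk usage to a target level. The class name DiskUsageLevelPolicy and the 80% and 90% thresholds are assumptions for illustration only; the article does not specify them.

```java
import java.io.File;

class DiskUsageLevelPolicy {

    /** Returns the used fraction (0.0 to 1.0) of the partition that holds the given path. */
    static double usedFraction(File path) {
        long total = path.getTotalSpace();
        if (total == 0L) {
            return 0.0; // size unknown (for example, the path does not exist)
        }
        return 1.0 - (double) path.getUsableSpace() / total;
    }

    /** Maps a used fraction to a log level name; the thresholds are hypothetical. */
    static String levelFor(double usedFraction) {
        if (usedFraction >= 0.90) {
            return "ERROR"; // critical: keep only errors
        }
        if (usedFraction >= 0.80) {
            return "WARN"; // alert threshold reached: drop INFO and below
        }
        return "INFO"; // normal operation
    }

    public static void main(String[] args) {
        double used = usedFraction(new File("."));
        System.out.println(levelFor(used));
    }
}
```

Keeping the policy a pure function of the used fraction makes the thresholds easy to test and to tune per machine.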
Core Implementation
The service obtains the current LoggerConfiguration, validates the requested level against the list returned by loggingSystem.getSupportedLogLevels(), and calls loggingSystem.setLogLevel() to apply the change. Example code:
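// Uses LoggingSystem, LoggerConfiguration and LogLevel from org.springframework.boot.logging.
public void setRootLoggerLevel(String level) {
    LoggerConfiguration loggerConfiguration = loggingSystem.getLoggerConfiguration(ROOT_LOGGER_NAME);
    if (loggerConfiguration == null) {
        // Report the logger name that could not be resolved, not the requested level.
        LOGGER.error("no loggerConfiguration with loggerName " + ROOT_LOGGER_NAME);
        return;
    }
    // Reject anything outside the levels the active logging backend supports.
    if (!supportLevels().contains(level)) {
        LOGGER.error("current level is not supported: " + level);
        return;
    }
    LogLevel oldLevel = loggerConfiguration.getEffectiveLevel();
    LogLevel newLevel = LogLevel.valueOf(level);
    // Only touch the logging system when the level actually changes.
    if (oldLevel != newLevel) {
        // Log before applying so the message is not filtered out by a stricter new level.
        LOGGER.info("setRootLoggerLevel success, old level '" + oldLevel + "', new level '" + level + "'");
        loggingSystem.setLogLevel(ROOT_LOGGER_NAME, newLevel);
    }
}
Bulk updates are handled by iterating over a list of LoggerConfig objects and applying the same validation logic.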
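The bulk path is not shown in the article. The sketch below illustrates the idea under stated assumptions: to stay self-contained it targets java.util.logging directly rather than Spring Boot's LoggingSystem, the nested LoggerConfig class is a hypothetical stand-in for the article's object, and the mapping from level names such as WARN to JUL levels is chosen for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

class BulkLevelUpdater {

    /** Hypothetical stand-in for the article's LoggerConfig: a logger name plus a level name. */
    static final class LoggerConfig {
        final String loggerName;
        final String level;

        LoggerConfig(String loggerName, String level) {
            this.loggerName = loggerName;
            this.level = level;
        }
    }

    // Assumed mapping from payload-style level names to java.util.logging levels.
    private static final Map<String, Level> SUPPORTED = new LinkedHashMap<>();
    static {
        SUPPORTED.put("DEBUG", Level.FINE);
        SUPPORTED.put("INFO", Level.INFO);
        SUPPORTED.put("WARN", Level.WARNING);
        SUPPORTED.put("ERROR", Level.SEVERE);
        SUPPORTED.put("OFF", Level.OFF);
    }

    // JUL holds loggers weakly, so keep strong references or a configured level may be lost on GC.
    private static final Map<String, Logger> PINNED = new ConcurrentHashMap<>();

    /** Applies every valid entry and returns how many loggers were updated. */
    static int applyAll(List<LoggerConfig> configs) {
        int applied = 0;
        for (LoggerConfig config : configs) {
            Level target = config.level == null ? null : SUPPORTED.get(config.level);
            if (config.loggerName == null || target == null) {
                continue; // skip invalid entries instead of failing the whole batch
            }
            Logger logger = PINNED.computeIfAbsent(config.loggerName, Logger::getLogger);
            logger.setLevel(target);
            applied++;
        }
        return applied;
    }

    public static void main(String[] args) {
        int applied = applyAll(Arrays.asList(
                new LoggerConfig("com.example.orders", "WARN"),
                new LoggerConfig("com.example.payments", "VERBOSE"))); // unsupported, skipped
        System.out.println(applied); // prints 1
    }
}
```

Skipping bad entries rather than aborting means one typo in a pushed payload does not block the remaining loggers from being degraded during an incident.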
Configuration Center Integration
The listener receives JSON/YAML payloads from the configuration center, parses them into LoggerConfig instances, and invokes the service to adjust logger levels on‑the‑fly. Sample payload:
[{"loggerName":"com.hollis.degradation.core.logger.LoggerLevelSettingService","level":"WARN"}]
When the payload is applied, the specified logger’s output level changes immediately without restarting the application.
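To make the payload shape concrete, here is a minimal sketch that turns the sample into loggerName-to-level pairs. It is deliberately dependency-free and handles only this flat shape with regular expressions; a real listener would more likely use a JSON library such as Jackson. The class name PayloadParser is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PayloadParser {

    // Matches one flat JSON object; nested objects are out of scope for this sketch.
    private static final Pattern OBJECT = Pattern.compile("\\{([^{}]*)\\}");
    private static final Pattern NAME = Pattern.compile("\"loggerName\"\\s*:\\s*\"([^\"]*)\"");
    private static final Pattern LEVEL = Pattern.compile("\"level\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns loggerName -> level in payload order; entries missing either field are skipped. */
    static Map<String, String> parse(String payload) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher object = OBJECT.matcher(payload);
        while (object.find()) {
            String body = object.group(1);
            Matcher name = NAME.matcher(body);
            Matcher level = LEVEL.matcher(body);
            if (name.find() && level.find()) {
                result.put(name.group(1), level.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String payload = "[{\"loggerName\":\"com.hollis.degradation.core.logger."
                + "LoggerLevelSettingService\",\"level\":\"WARN\"}]";
        System.out.println(parse(payload));
    }
}
```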
Spring‑Boot Starter Packaging
The starter defines two beans: LoggerLevelSettingService and DegradationSwitchInitializer. Conditional annotations ensure the beans are created only when the application enables the degradation feature via properties:
hollis.degradation.enable = true
project.name = test
A spring.factories entry registers the auto-configuration class so that downstream projects can simply add the starter dependency and enable the feature.
Benefits and Considerations
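Generality: Works with Logback, Log4j2, and JDK logging (java.util.logging) because it relies on Spring Boot's abstract LoggingSystem. Log4j 1.x is covered only on old Spring Boot releases that still shipped a LoggingSystem implementation for it.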
Configurability: Log levels can be pushed from any external configuration center, enabling rapid response during incidents.
Ease of Use: Packaging as a starter makes adoption trivial for existing Spring‑Boot services.
Non‑intrusiveness: The tool does not modify application code beyond adding the starter dependency.
Conclusion
The article demonstrates a complete workflow: from diagnosing a disk‑space outage caused by lingering log file handles, to killing the offending process, and finally building a reusable, dynamic log‑level degradation framework that can be controlled via configuration. This approach turns a one‑off incident response into a systematic, automated safeguard for future high‑traffic events.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
