Why Is Airflow Draining CPU? A Step‑by‑Step Diagnosis and Fix
A high‑CPU anomaly on a Spark‑enabled machine was traced through application checks, network TIME_WAIT analysis, and Airflow inspection, leading to kernel tweaks and an Airflow configuration change that finally restored normal CPU usage.
1. Symptoms
Machine A runs Spark Master, Airflow, Hive, Sqoop and other heavy workloads, so memory and CPU usage are normally high. Over the past three days, however, CPU usage stayed above 95% for most of the day, especially after 18:00, when few Spark tasks are running.
2. Investigation Process
2.1 Check Applications
At around 09:30 the CPU was high while five SparkSubmit tasks were running; no abnormal applications were found and no single app showed excessive CPU or memory consumption.
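A per-process check like the one described here can be sketched in Python. This is a minimal sketch, not the procedure the author used: the function names top_cpu_processes and read_ps are hypothetical, and it assumes a Linux host where the procps `ps -eo pid,pcpu,args` command is available. Note that the %CPU column from ps is averaged over each process's lifetime, so a tool such as top is still needed for an instantaneous view.

```python
import subprocess


def top_cpu_processes(ps_output, limit=5):
    """Return the `limit` busiest processes as (pid, cpu_percent, command) tuples."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, %cpu, full command line
        if len(parts) < 3:
            continue
        pid, cpu, command = parts
        rows.append((int(pid), float(cpu), command.strip()))
    # Highest CPU percentage first
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def read_ps():
    """Capture a process snapshot (assumes procps `ps` on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example use on a live machine:
#   for pid, cpu, command in top_cpu_processes(read_ps()):
#       print(pid, cpu, command)
```

If no single process stands out in such a listing, as was the case here, the load is likely spread across many short-lived or moderately busy processes, which is why the investigation moved on to other signals.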
2.2 Check Network Connections
netstat showed more than 3,700 connections in the TIME_WAIT state, most of them to MySQL on hadoop11. To address this, the following kernel parameters were set in /etc/sysctl.conf:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
After applying the change with /sbin/sysctl -p, TCP connections normalized but CPU remained high, indicating the issue was not caused by network sockets.
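The TIME_WAIT tally from this step can be reproduced with a small parser over `netstat -ant` output. The following is a sketch under assumptions: count_tcp_states is a hypothetical helper, the six-column layout is the one printed by net-tools netstat on Linux, and the addresses in the usage comment are placeholders rather than values from the machines in this article.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally (remote_endpoint, state) pairs from `netstat -ant` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip banner, column header, and non-TCP rows
        remote, state = fields[4], fields[5]
        counts[(remote, state)] += 1
    return counts


# Example use, feeding in text captured from `netstat -ant`:
#   counts = count_tcp_states(captured_text)
#   for (remote, state), n in counts.most_common(5):
#       print(remote, state, n)
```

Grouping by remote endpoint as well as state makes it easy to see whether the TIME_WAIT sockets cluster on one peer, such as a database port, rather than being spread across many hosts.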
2.3 Check Airflow
Airflow is the only application on Machine A (hadoop16) that connects to MySQL on hadoop11. Airflow runs webserver, scheduler, master, and worker processes, using CeleryExecutor with a parallelism of 16.
(1) Confirm Airflow as the cause
Stopping Airflow made CPU usage drop, and usage spiked again as soon as Airflow was back up. This confirmed that Airflow was involved, but a restart alone did not address the root cause.
(2) Research similar issues
References include a StackOverflow discussion and the Airflow documentation on min_file_process_interval.
(3) Apply fix
The Airflow configuration file airflow.cfg was updated:
min_file_process_interval = 10
After restarting Airflow, CPU usage returned to normal, consistent with the new file‑scan interval setting.
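To check which value a given airflow.cfg actually carries, the file can be read with Python's standard configparser. This is a minimal sketch: scheduler_scan_interval is a hypothetical helper, and it assumes the option sits in the [scheduler] section, which is where Airflow 1.x and 2.x place it (the article does not name the section). In the Airflow 1.x defaults this interval was 0, meaning DAG files were re-parsed continuously.

```python
import configparser


def scheduler_scan_interval(cfg_text):
    """Return min_file_process_interval from airflow.cfg text, or None if unset."""
    # interpolation=None: airflow.cfg values may contain literal '%' characters
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # Assumption: the option is under [scheduler], as in Airflow 1.x / 2.x
    if not parser.has_option("scheduler", "min_file_process_interval"):
        return None
    return parser.getint("scheduler", "min_file_process_interval")


# Example use (the path is a placeholder for the local AIRFLOW_HOME):
#   with open("/path/to/airflow.cfg") as handle:
#       print(scheduler_scan_interval(handle.read()))
```

A return value of None means the file does not set the option, in which case Airflow falls back to the default for the installed version.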
Data Thinking Notes
