Why Spark Jobs Keep Running After You Kill Them: Daemon Threads and Driver Behavior
This article investigates why Spark tasks that appear killed in the Web UI continue running on the driver, analyzes the role of daemon versus non‑daemon threads and SparkContext shutdown mechanisms, reproduces the issue with sample code, and provides practical solutions such as using daemon threads or checking SparkContext status.
Problem Phenomenon
Two tasks killed via the Spark Web UI still have backend processes running, causing the driver machine's CPU usage to remain high.
A process listing on the driver machine shows the backend process still running even though the Web UI reports the job as killed.
Problem Reproduction
Test Program
The demo program consists of three parts:
(1) rdd.map operation – simulates a Spark distributed computation task.
<code>//section 1:
//do something for a long time
val rdd = sparkSession.sparkContext.parallelize(List(1, 2, 3, 4, 5)).repartition(4)
val rdd2 = rdd.map(x => {
  for (i <- 1 to 100) {
    for (j <- 1 to 999999999) {
    }
    if (i % 10 == 0) {
      println(i + " rdd map process running!")
    }
  }
  x * 2
})
rdd2.take(10).foreach(println)
</code>(2) Nested loop on the driver – like the code around an rdd.collect, this computation runs on the driver itself.
<code>//section 2:
//do something for a long time in driver
for (i <- 1 to 100) {
  for (j <- 1 to 999999999) {
  }
  if (i % 10 == 0) {
    println(i + " main process running!")
  }
}
</code>(3) Multi‑threaded operation – tests manually created threads on the driver.
<code>//section 3 multi-thread
import java.util.concurrent.TimeUnit

def runThread() = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      while (true) {
        println("Running something!")
        TimeUnit.SECONDS.sleep(1)
      }
    }
  })
  if (daemonFlag.equals("1")) {
    t.setDaemon(true)
  }
  t.start()
}
</code>The program accepts two arguments: daemonFlag (0 – do not set the thread as daemon, 1 – set it as daemon) and multi (0 – do not start a thread, 1 – start the thread).
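As a rough sketch of how the two flags might be wired up (the article does not show the demo's entry point, so the object and method names here are invented; only the daemonFlag and multi arguments come from the text):

```scala
// Hypothetical sketch of argument parsing for the demo program.
// KillDemo and parseFlags are names assumed for illustration.
object KillDemo {
  // Returns (setDaemon, startThread) from the two string flags.
  def parseFlags(args: Array[String]): (Boolean, Boolean) = {
    val daemonFlag = args.lift(0).contains("1") // "1" => mark the thread as daemon
    val multi      = args.lift(1).contains("1") // "1" => start the extra thread
    (daemonFlag, multi)
  }
}
```

For example, `KillDemo.parseFlags(Array("1", "1"))` would select a daemon thread and start it, matching the third test case below.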
Test Results
Whether the backend process remains after killing the job depends on the flag combination. The key observations are:
- With no manually started thread (multi = 0), killing the job while the rdd.map stage is running stops the backend process, but killing it during the driver-side computation leaves the backend alive.
- With a non-daemon thread started (multi = 1, daemonFlag = 0), the backend process persists regardless of when the job is killed.
- With a daemon thread started (multi = 1, daemonFlag = 1), the behavior matches the first case: the backend process stops.
Problem Cause
Manual Thread Start
A daemon thread provides background services (e.g., garbage collection) and does not prevent the JVM from exiting. The JVM terminates only once all non-daemon threads have finished, which explains why a manually started non-daemon thread keeps the driver process alive after a kill.
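A minimal sketch of the difference (standalone, not from the article's demo):

```scala
// A daemon thread does not keep the JVM alive: once the last non-daemon
// thread (here, main) finishes, the JVM exits and the worker is discarded.
val worker = new Thread(new Runnable {
  override def run(): Unit = while (true) Thread.sleep(1000)
})
worker.setDaemon(true) // must be called before start()
worker.start()
// main can now return even though worker loops forever.
// Without setDaemon(true), the JVM would wait for worker to finish —
// which it never does, so the process would hang exactly as observed.
```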
In Spark's source, stopInNewThread also creates a daemon thread:
<code>private[spark] def stopInNewThread(): Unit = {
  new Thread("stop-spark-context") {
    setDaemon(true)
    override def run(): Unit = {
      try {
        SparkContext.this.stop()
      } catch {
        case e: Throwable =>
          logError(e.getMessage, e)
          throw e
      }
    }
  }.start()
}
</code>Solution: When the driver program manually starts threads, set them as daemon threads.
Driver‑Side Program
The Web UI kill action eventually calls StandaloneSchedulerBackend.dead, which invokes sc.stopInNewThread() in a finally block when the backend is not already stopping:
<code>override def dead(reason: String) {
  notifyContext()
  if (!stopping.get) {
    launcherBackend.setState(SparkAppHandle.State.KILLED)
    logError("Application has been killed. Reason: " + reason)
    try {
      scheduler.error(reason)
    } finally {
      // Ensure the application terminates, as we can no longer run jobs.
      sc.stopInNewThread()
    }
  }
}
</code>The TaskSchedulerImpl.error method further propagates the error:
<code>def error(message: String) {
  synchronized {
    if (taskSetsByStageIdAndAttempt.nonEmpty) {
      for {
        attempts <- taskSetsByStageIdAndAttempt.values
        manager <- attempts.values
      } {
        try {
          manager.abort(message)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      throw new SparkException(s"Exiting due to error from cluster scheduler: $message")
    }
  }
}
</code>Because sc.stopInNewThread() runs in a newly created daemon thread, it shuts down the SparkContext asynchronously but never interrupts the driver's main thread, so driver-side computation can keep running after the UI kill.
Solution: Add explicit checks for SparkContext state in driver code, e.g.:
<code>//section 2 (modified)
for (i <- 1 to 100 if !sparkSession.sparkContext.isStopped) {
  for (j <- 1 to 999999999 if !sparkSession.sparkContext.isStopped) {
  }
  if (i % 10 == 0) {
    println(i + " main process running!")
  }
}
</code>These checks ensure the driver loop exits promptly when the Spark application is killed.
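Note that the guarded for-comprehension re-evaluates isStopped on every inner iteration, which adds overhead inside a hot loop. A lighter alternative is to check once per outer iteration and break out of the loop with scala.util.control.Breaks. This is a sketch only: runUntilStopped is a name invented here, and the stop condition is abstracted into a function so it can stand in for sparkSession.sparkContext.isStopped from the demo.

```scala
import scala.util.control.Breaks.{break, breakable}

// Runs the driver-side loop but polls a stop condition once per outer
// iteration, breaking out as soon as it fires. Pass
// () => sparkSession.sparkContext.isStopped as the condition in Spark code.
def runUntilStopped(isStopped: () => Boolean, iterations: Int = 100): Int = {
  var completed = 0
  breakable {
    for (i <- 1 to iterations) {
      if (isStopped()) break()
      // ... long computation for this iteration would go here ...
      completed = i
    }
  }
  completed // number of iterations that actually ran
}
```

Checking once per outer iteration is usually a good trade-off: the loop still exits within one iteration of the context stopping, without paying for a method call a billion times in the inner loop.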
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.