Why Spark Jobs Keep Running After You Kill Them: Daemon Threads and Driver Behavior
This article investigates why Spark tasks that appear killed in the Web UI continue running on the driver, analyzes the role of daemon versus non‑daemon threads and SparkContext shutdown mechanisms, reproduces the issue with sample code, and provides practical solutions such as using daemon threads or checking SparkContext status.
Problem Phenomenon
Two tasks killed via the Spark Web UI still have backend processes running, causing the driver machine's CPU usage to remain high.
A process listing on the driver machine shows the backend process still running even though the Web UI reports the job as killed.
Problem Reproduction
Test Program
The demo program consists of three parts:
(1) rdd.map operation – simulates a Spark distributed computation task.
<code>//section 1:
//do something for a long time
val rdd = sparkSession.sparkContext.parallelize(List(1, 2, 3, 4, 5)).repartition(4)
val rdd2 = rdd.map(x => {
  for (i <- 1 to 100) {
    for (j <- 1 to 999999999) {
    }
    if (i % 10 == 0) {
      println(i + " rdd map process running!")
    }
  }
  x * 2
})
rdd2.take(10).foreach(println)
</code>(2) Nested loop on the driver – like the code around an rdd.collect, this computation runs on the driver itself.
<code>//section 2:
//do something for a long time in driver
for (i <- 1 to 100) {
  for (j <- 1 to 999999999) {
  }
  if (i % 10 == 0) {
    println(i + " main process running!")
  }
}
</code>(3) Multi‑threaded operation – tests manually created threads on the driver.
<code>//section 3 multi-thread
import java.util.concurrent.TimeUnit

def runThread() = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      while (true) {
        println("Running something!")
        TimeUnit.SECONDS.sleep(1)
      }
    }
  })
  if (daemonFlag.equals("1")) {
    t.setDaemon(true)
  }
  t.start()
}
</code>The program accepts two arguments: daemonFlag (0 – do not set the thread as daemon, 1 – set it as daemon) and multi (0 – do not start a thread, 1 – start the thread).
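As a rough sketch of how the two flags might be wired up (the article does not show the demo's entry point, so the object and method names here are invented; only the daemonFlag and multi arguments come from the text):

```scala
// Hypothetical sketch of argument parsing for the demo program.
// KillDemo and parseFlags are names assumed for illustration.
object KillDemo {
  // Returns (setDaemon, startThread) from the two string flags.
  def parseFlags(args: Array[String]): (Boolean, Boolean) = {
    val daemonFlag = args.lift(0).contains("1") // "1" => mark the thread as daemon
    val multi      = args.lift(1).contains("1") // "1" => start the extra thread
    (daemonFlag, multi)
  }
}
```

For example, `KillDemo.parseFlags(Array("1", "1"))` would select a daemon thread and start it, matching the third test case below.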
Test Results
Whether the backend process remains after killing the job depends on the flag combination. The key observations are:
- With no manually started thread (multi = 0), killing the job while the rdd.map stage is running stops the backend process, but killing it during the driver-side computation leaves the backend alive.
- With a non-daemon thread started (multi = 1, daemonFlag = 0), the backend process persists regardless of when the job is killed.
- With a daemon thread started (multi = 1, daemonFlag = 1), the behavior matches the first case: the backend process stops.
Problem Cause
Manual Thread Start
A daemon thread provides background services (e.g., garbage collection) and does not prevent the JVM from exiting. The JVM terminates only once all non-daemon threads have finished, which explains why a manually started non-daemon thread keeps the driver process alive after a kill.
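A minimal sketch of the difference (standalone, not from the article's demo):

```scala
// A daemon thread does not keep the JVM alive: once the last non-daemon
// thread (here, main) finishes, the JVM exits and the worker is discarded.
val worker = new Thread(new Runnable {
  override def run(): Unit = while (true) Thread.sleep(1000)
})
worker.setDaemon(true) // must be called before start()
worker.start()
// main can now return even though worker loops forever.
// Without setDaemon(true), the JVM would wait for worker to finish —
// which it never does, so the process would hang exactly as observed.
```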
In Spark's source, stopInNewThread also creates a daemon thread:
<code>private[spark] def stopInNewThread(): Unit = {
  new Thread("stop-spark-context") {
    setDaemon(true)
    override def run(): Unit = {
      try {
        SparkContext.this.stop()
      } catch {
        case e: Throwable =>
          logError(e.getMessage, e)
          throw e
      }
    }
  }.start()
}
</code>Solution: When the driver program manually starts threads, set them as daemon threads.
Driver‑Side Program
The Web UI kill action eventually calls StandaloneSchedulerBackend.dead, which invokes sc.stopInNewThread() in a finally block when the backend is not already stopping:
<code>override def dead(reason: String) {
  notifyContext()
  if (!stopping.get) {
    launcherBackend.setState(SparkAppHandle.State.KILLED)
    logError("Application has been killed. Reason: " + reason)
    try {
      scheduler.error(reason)
    } finally {
      // Ensure the application terminates, as we can no longer run jobs.
      sc.stopInNewThread()
    }
  }
}
</code>The TaskSchedulerImpl.error method further propagates the error:
<code>def error(message: String) {
  synchronized {
    if (taskSetsByStageIdAndAttempt.nonEmpty) {
      for {
        attempts <- taskSetsByStageIdAndAttempt.values
        manager <- attempts.values
      } {
        try {
          manager.abort(message)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      throw new SparkException(s"Exiting due to error from cluster scheduler: $message")
    }
  }
}
</code>Because sc.stopInNewThread() runs in a newly created daemon thread, it shuts down the SparkContext asynchronously but never interrupts the driver's main thread, so driver-side computation can keep running after the UI kill.
Solution: Add explicit checks for SparkContext state in driver code, e.g.:
<code>//section 2 (modified)
for (i <- 1 to 100 if !sparkSession.sparkContext.isStopped) {
  for (j <- 1 to 999999999 if !sparkSession.sparkContext.isStopped) {
  }
  if (i % 10 == 0) {
    println(i + " main process running!")
  }
}
</code>These checks ensure the driver loop exits promptly when the Spark application is killed.
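Note that the guarded for-comprehension re-evaluates isStopped on every inner iteration, which adds overhead inside a hot loop. A lighter alternative is to check once per outer iteration and break out of the loop with scala.util.control.Breaks. This is a sketch only: runUntilStopped is a name invented here, and the stop condition is abstracted into a function so it can stand in for sparkSession.sparkContext.isStopped from the demo.

```scala
import scala.util.control.Breaks.{break, breakable}

// Runs the driver-side loop but polls a stop condition once per outer
// iteration, breaking out as soon as it fires. Pass
// () => sparkSession.sparkContext.isStopped as the condition in Spark code.
def runUntilStopped(isStopped: () => Boolean, iterations: Int = 100): Int = {
  var completed = 0
  breakable {
    for (i <- 1 to iterations) {
      if (isStopped()) break()
      // ... long computation for this iteration would go here ...
      completed = i
    }
  }
  completed // number of iterations that actually ran
}
```

Checking once per outer iteration is usually a good trade-off: the loop still exits within one iteration of the context stopping, without paying for a method call a billion times in the inner loop.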
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.