
Why Do Spark Card Queries Take 10 Seconds? Uncovering a NAS Mount Issue

A customer’s Spark card queries were consistently taking around 10 seconds, prompting a step‑by‑step investigation that revealed a misconfigured NAS mount option (lookupcache=none) as the root cause of the severe slowdown.

GuanYuan Data Tech Team

Problem Overview

The customer reported that card‑query performance in their environment was extremely slow, often taking about 10 seconds per query.

Problem Analysis

The team began with key diagnostic questions: was this truly a performance problem? Was it persistent, or a recent regression? What was its impact scope, and how often did it occur? The answers: a persistent, non-regression issue affecting all card queries, with the counterintuitive twist that queries over larger data sets completed faster.

Investigation Process

The team examined the Spark UI and noted that most queries lasted around 9 seconds, while a few larger-data queries completed in 2–3 seconds. Screenshots of the Spark UI showed many tasks each taking about 1 second; the overall query duration was high because of the sheer number of tasks.

To dig deeper, they used Alibaba's Java diagnostic tool Arthas to trace the executor threads. By filtering for long-running calls (e.g., `trace scala.collection.Iterator hasNext '#cost > 10'`), they identified the method `readCurrentFile` as taking ~123 ms per call, far higher than the ~15 ms observed in a healthy environment.
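For reference, a minimal Arthas session along these lines looks roughly as follows. The executor PID is a placeholder; the `trace` filter and `-n` sample limit follow Arthas's documented syntax.

```shell
# Download and attach Arthas to the Spark executor JVM
# (arthas-boot lists local Java processes if no PID is given).
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar <executor-pid>

# Inside the Arthas console: trace Iterator.hasNext, report only
# invocations costing more than 10 ms, and stop after 5 samples.
trace scala.collection.Iterator hasNext '#cost > 10' -n 5
```

The `#cost > 10` condition keeps the output focused on the slow path instead of flooding the console with sub-millisecond calls.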

The analysis linked the delay to reading Parquet files. With `spark.sql.shuffle.partitions = 64`, each query had to read 64 Parquet files, which at ~140 ms per file comes to roughly 9 seconds in total. Queries that went through a `coalesce` step ran with fewer tasks, so fewer files were read per query; this explained why the larger-data queries finished faster.
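The arithmetic behind that estimate can be checked directly (the ~140 ms figure bundles the ~123 ms `readCurrentFile` cost with per-file overhead):

```shell
partitions=64     # spark.sql.shuffle.partitions
per_file_ms=140   # ~123 ms readCurrentFile plus per-file overhead
total_ms=$((partitions * per_file_ms))
echo "estimated query latency: ${total_ms} ms"   # 8960 ms, i.e. ~9 s
```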

Root Cause

The underlying storage was an Alibaba Cloud NAS mounted with the option `lookupcache=none`. This parameter appeared neither in other customers' mounts nor in the official documentation. With lookup caching disabled, every file open or stat makes a round trip to the NAS server instead of hitting the kernel's dentry cache, and that per-file cost multiplies across the 64 Parquet files a query touches. Remounting a test volume with `lookupcache=none` reproduced the slowdown, confirming it as the culprit.
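A reproduction along these lines is sketched below. The mount target and mount point are placeholders; the remaining options follow the standard NFSv3 recipe for Alibaba Cloud NAS, with the suspect option appended.

```shell
# Problematic mount: lookupcache=none disables the kernel's directory
# lookup cache, so every file open/stat issues NFS LOOKUP RPCs to the
# NAS server on each access.
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport,lookupcache=none \
    <nas-mount-target>:/ /mnt/nas
```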

Resolution

Removing the

lookupcache=none

option from the NAS mount restored normal query performance.
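The fix, correspondingly, was a remount without that option (mount target and mount point are placeholders, as in the reproduction above):

```shell
sudo umount /mnt/nas
# Without lookupcache=none, the default (lookupcache=all) lets the kernel
# cache directory lookups, eliminating the per-file LOOKUP round trips.
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
    <nas-mount-target>:/ /mnt/nas
```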

Tags: debugging, performance, big data, Arthas, Spark, NAS