
Why Do Spark Card Queries Take 10 Seconds? Uncovering a NAS Mount Issue

A customer’s Spark card queries were consistently taking around 10 seconds, prompting a step‑by‑step investigation that revealed a misconfigured NAS mount option (lookupcache=none) as the root cause of the severe slowdown.

GuanYuan Data Tech Team

Problem Overview

The customer reported that card‑query performance in their environment was extremely slow, often taking about 10 seconds per query.

Problem Analysis

The team began with key diagnostic questions: was this truly a performance problem? Was it persistent, or a recent regression? What was its impact scope, and how often did it occur? The answers: a persistent, non-regression issue affecting all card queries, with the counterintuitive twist that queries over larger data sets completed faster.

Investigation Process

The team examined the Spark UI and noted that most queries lasted around 9 seconds, while a few larger-data queries completed in 2–3 seconds. Screenshots of the Spark UI showed many tasks each taking about 1 second; the overall query duration was high because of the sheer number of tasks.

To dig deeper, they used Alibaba's Java diagnostic tool Arthas to trace the executor threads. By filtering for long-running calls (e.g., `trace scala.collection.Iterator hasNext '#cost > 10'`), they identified the method `readCurrentFile` as taking ~123 ms per call, far higher than the ~15 ms observed in a healthy environment.
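For reference, a minimal Arthas session along these lines looks roughly as follows. The executor PID is a placeholder; the `trace` filter and `-n` sample limit follow Arthas's documented syntax.

```shell
# Download and attach Arthas to the Spark executor JVM
# (arthas-boot lists local Java processes if no PID is given).
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar <executor-pid>

# Inside the Arthas console: trace Iterator.hasNext, report only
# invocations costing more than 10 ms, and stop after 5 samples.
trace scala.collection.Iterator hasNext '#cost > 10' -n 5
```

The `#cost > 10` condition keeps the output focused on the slow path instead of flooding the console with sub-millisecond calls.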

The analysis linked the delay to reading Parquet files. With `spark.sql.shuffle.partitions = 64`, each query had to read 64 Parquet files, which at ~140 ms per file comes to roughly 9 seconds in total. Queries that went through a `coalesce` step ran with fewer tasks, so fewer files were read per query; this explained why the larger-data queries finished faster.
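The arithmetic behind that estimate can be checked directly (the ~140 ms figure bundles the ~123 ms `readCurrentFile` cost with per-file overhead):

```shell
partitions=64     # spark.sql.shuffle.partitions
per_file_ms=140   # ~123 ms readCurrentFile plus per-file overhead
total_ms=$((partitions * per_file_ms))
echo "estimated query latency: ${total_ms} ms"   # 8960 ms, i.e. ~9 s
```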

Root Cause

The underlying storage was an Alibaba Cloud NAS mounted with the option `lookupcache=none`. This parameter appeared neither in other customers' mounts nor in the official documentation. With lookup caching disabled, every file open or stat makes a round trip to the NAS server instead of hitting the kernel's dentry cache, and that per-file cost multiplies across the 64 Parquet files a query touches. Remounting a test volume with `lookupcache=none` reproduced the slowdown, confirming it as the culprit.
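A reproduction along these lines is sketched below. The mount target and mount point are placeholders; the remaining options follow the standard NFSv3 recipe for Alibaba Cloud NAS, with the suspect option appended.

```shell
# Problematic mount: lookupcache=none disables the kernel's directory
# lookup cache, so every file open/stat issues NFS LOOKUP RPCs to the
# NAS server on each access.
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport,lookupcache=none \
    <nas-mount-target>:/ /mnt/nas
```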

Resolution

Removing the

lookupcache=none

option from the NAS mount restored normal query performance.
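The fix, correspondingly, was a remount without that option (mount target and mount point are placeholders, as in the reproduction above):

```shell
sudo umount /mnt/nas
# Without lookupcache=none, the default (lookupcache=all) lets the kernel
# cache directory lookups, eliminating the per-file LOOKUP round trips.
sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
    <nas-mount-target>:/ /mnt/nas
```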

Tags: debugging, performance, big data, Arthas, Spark, NAS