
Understanding Spark SQL CacheManager: Caching Mechanism, Triggering, Uncaching, and Canonicalization

This article explains Spark SQL's CacheManager, how it stores cached query results using InMemoryRelation, the ways to trigger and release caches, the internal data structures like IndexedSeq and CachedData, and the role of canonicalization in determining cache reuse.

Big Data Technology & Architecture

Dec 23, 2022


CacheManager is the Spark SQL component that manages in‑memory caching of query results. It stores the cached data as byte buffers inside an InMemoryRelation and automatically substitutes that cached relation for any matching fragment of a later logical plan.

It is internal to Spark SQL and is shared across sessions via SharedState. To see what it is doing, set log4j.logger.org.apache.spark.sql.execution.CacheManager=ALL in conf/log4j.properties (this is the Log4j 1.x syntax; Spark 3.3 and later use Log4j 2 and conf/log4j2.properties). The log then contains messages such as "Asked to cache already cached data.", which is emitted when a query that is already cached is asked to be cached again.

How to trigger CacheManager:

Call Dataset.cache or Dataset.persist, or run the SQL command CACHE TABLE.

Note that Spark Core's RDD.cache and RDD.persist are unrelated to CacheManager.
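The three triggers listed above can be sketched as follows. This is a minimal illustration that assumes a local SparkSession; the object name, the view name small_ids and the sample data are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheTriggerSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-trigger-sketch")
      .getOrCreate()

    // Dataset.cache() registers the query with CacheManager using the default storage level.
    val evens = spark.range(0, 1000).filter("id % 2 = 0")
    evens.cache()

    // Dataset.persist() does the same but lets the storage level be chosen explicitly.
    val odds = spark.range(0, 1000).filter("id % 2 = 1")
    odds.persist(StorageLevel.MEMORY_ONLY)

    // SQL route: CACHE TABLE on a (hypothetical) temporary view.
    spark.range(0, 10).createOrReplaceTempView("small_ids")
    spark.sql("CACHE TABLE small_ids")

    // Inspect what was registered.
    println(evens.storageLevel)
    println(spark.catalog.isCached("small_ids"))

    // For contrast: an RDD cached through Spark Core does not go through CacheManager.
    val rdd = spark.sparkContext.parallelize(1 to 10).cache()
    println(rdd.getStorageLevel)

    spark.stop()
  }
}
```

Dataset.cache is lazy, so the cached buffers are only filled when an action first runs the query.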

How to uncache:

Call Dataset.unpersist.

Execute DROP TABLE or TRUNCATE TABLE commands.

Go through the Catalog API (implemented by CatalogImpl): uncacheTable, refreshTable, or dropping a temporary or global temporary view.
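A short sketch of these uncache paths, again assuming a local SparkSession and a hypothetical view named small_ids. The final clearCache call is not in the list above; it is the Catalog method that drops every cached entry at once.

```scala
import org.apache.spark.sql.SparkSession

object UncacheSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("uncache-sketch")
      .getOrCreate()

    // 1. Dataset.unpersist removes the entry registered by cache()/persist().
    val df = spark.range(0, 100).toDF("id")
    df.cache()
    df.count() // run an action so the cache is actually materialized
    df.unpersist()

    // 2. Catalog.uncacheTable (the SQL equivalent is UNCACHE TABLE small_ids).
    spark.range(0, 10).createOrReplaceTempView("small_ids")
    spark.sql("CACHE TABLE small_ids")
    spark.catalog.uncacheTable("small_ids")

    // 3. Dropping a temporary view also asks CacheManager to drop its cached data.
    spark.sql("CACHE TABLE small_ids")
    spark.catalog.dropTempView("small_ids")

    // Extra: remove everything that is still cached in this application.
    spark.catalog.clearCache()

    spark.stop()
  }
}
```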

Cache internal structure: CacheManager stores cached entries in an IndexedSeq[CachedData], where each CachedData holds a LogicalPlan and its corresponding InMemoryRelation. The default implementation of IndexedSeq is a Scala Vector, providing near‑constant‑time element access.

IndexedSeq: An immutable indexed sequence that supports fast random access and fast length lookup. It is defined by two abstract methods, apply (indexing) and length.

CachedData definition:

case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)
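To make the data structure concrete, here is a toy model of the registry, not Spark's actual code. ToyPlan, ToyRelation, ToyCachedData and ToyCacheManager are hypothetical stand-ins for LogicalPlan, InMemoryRelation, CachedData and CacheManager, and plain equality on a pre-normalized string stands in for Spark's plan comparison.

```scala
// Hypothetical stand-ins for LogicalPlan and InMemoryRelation (not Spark classes).
final case class ToyPlan(canonicalText: String)
final case class ToyRelation(tableName: Option[String], rowCount: Long)

// Mirrors the shape of CachedData: a plan paired with its cached representation.
final case class ToyCachedData(plan: ToyPlan, cachedRepresentation: ToyRelation)

final class ToyCacheManager {
  // An immutable indexed sequence; every update swaps in a new sequence.
  private var cachedData: IndexedSeq[ToyCachedData] = IndexedSeq.empty

  /** Linear scan for an entry whose plan matches the requested one. */
  def lookup(plan: ToyPlan): Option[ToyCachedData] =
    cachedData.find(_.plan == plan)

  /** Registers the plan; returns false and changes nothing if it is already cached. */
  def cache(plan: ToyPlan, relation: ToyRelation): Boolean =
    if (lookup(plan).nonEmpty) {
      println("Asked to cache already cached data.")
      false
    } else {
      cachedData = ToyCachedData(plan, relation) +: cachedData
      true
    }

  /** Drops every entry whose plan matches. */
  def uncache(plan: ToyPlan): Unit =
    cachedData = cachedData.filterNot(_.plan == plan)

  def size: Int = cachedData.length
}

object ToyCacheDemo {
  def main(args: Array[String]): Unit = {
    val manager = new ToyCacheManager
    val plan = ToyPlan("project(id) <- range(0, 10)")
    manager.cache(plan, ToyRelation(Some("small_ids"), 10L))
    manager.cache(plan, ToyRelation(Some("small_ids"), 10L)) // second call is a no-op
    println(manager.size)
    manager.uncache(plan)
    println(manager.size)
  }
}
```

Because lookups scan the whole sequence, the cost of a lookup grows with the number of cached entries; the sequence itself only needs cheap traversal and cheap replacement.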

InMemoryRelation configuration includes:

spark.sql.inMemoryColumnarStorage.compressed (default enabled)

spark.sql.inMemoryColumnarStorage.batchSize (default 10000)

Storage level (default MEMORY_AND_DISK)

Optimized physical query plan after analysis

Table name

Statistics of the analyzed query plan
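As a rough illustration of how the first three settings are supplied from user code, the sketch below sets the two columnar-storage properties on the session and passes the storage level to persist. The object name, the batch size of 5000 and the sample data are arbitrary choices for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemoryRelationConfigSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; the values are illustrative only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inmemory-relation-config-sketch")
      .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .config("spark.sql.inMemoryColumnarStorage.batchSize", "5000")
      .getOrCreate()

    val df = spark.range(0, 100000).toDF("id")

    // The storage level is chosen per cached Dataset rather than through a SQL property.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // run an action so the columnar batches are built

    // The physical plan should now read from the cached relation.
    df.explain()

    spark.stop()
  }
}
```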

Determining if a query is already cached: Spark compares the canonicalized form of two query plans using sameResult:

final def sameResult(other: PlanType): Boolean = this.canonicalized == other.canonicalized

The canonicalized method normalizes a plan by removing superficial differences (case sensitivity, expression order, ExprId, etc.) so that logically equivalent plans produce identical representations. Canonicalization is defined in QueryPlan.scala and involves rewriting expressions, normalizing the IDs of Alias and AttributeReference, and recursively canonicalizing child plans.

In Spark 3.3.0, 21 QueryPlan subclasses override doCanonicalize, including HiveTableScanExec, InMemoryTableScanExec, AdaptiveSparkPlanExec, and Join. These overrides mostly copy the plan and normalize selected attributes. The result is that CacheManager can recognize a query whose plan is logically identical to a cached one and reuse the cached data.
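The idea can be illustrated with a toy model in plain Scala. It is a sketch under simplifying assumptions, not Spark's implementation: Expr, Attr, Lit, Add and Project are hypothetical types, IDs are simply erased, and operands of the commutative Add are ordered by their string form.

```scala
import java.util.Locale

// Hypothetical miniature expression tree (not Spark classes).
sealed trait Expr
final case class Attr(name: String, exprId: Long) extends Expr
final case class Lit(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

object Canonicalize {
  /** Removes cosmetic differences: name case, expression IDs, operand order of Add. */
  def expr(e: Expr): Expr = e match {
    case Attr(name, _) =>
      // Toy simplification: erase the ID entirely and lower-case the name.
      Attr(name.toLowerCase(Locale.ROOT), 0L)
    case l: Lit => l
    case Add(l, r) =>
      // Addition is commutative, so put the operands in a deterministic order.
      val ordered = Seq(expr(l), expr(r)).sortBy(_.toString)
      Add(ordered(0), ordered(1))
  }
}

// Hypothetical miniature plan: a projection over a named table.
final case class Project(output: Seq[Expr], table: String) {
  lazy val canonicalized: Project =
    Project(output.map(Canonicalize.expr), table.toLowerCase(Locale.ROOT))

  // Same shape as the sameResult definition quoted above.
  def sameResult(other: Project): Boolean =
    this.canonicalized == other.canonicalized
}

object CanonicalizeDemo {
  def main(args: Array[String]): Unit = {
    val first  = Project(Seq(Add(Attr("price", 11L), Lit(1))), "Orders")
    val second = Project(Seq(Add(Lit(1), Attr("PRICE", 42L))), "orders")
    println(first == second)          // structural equality sees two different plans
    println(first.sameResult(second)) // comparison after canonicalization
  }
}
```

The two plans differ in attribute case, expression IDs and operand order, yet they canonicalize to the same value, which is the property that lets a cache lookup match a newly written query against an existing entry.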


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
