Understanding Spark SQL CacheManager: Caching Mechanism, Triggering, Uncaching, and Canonicalization
This article explains Spark SQL's CacheManager, how it stores cached query results using InMemoryRelation, the ways to trigger and release caches, the internal data structures like IndexedSeq and CachedData, and the role of canonicalization in determining cache reuse.
CacheManager is the component in Spark SQL that manages in‑memory caching of query results, using InMemoryRelation to store byte buffers and automatically replacing the logical plan with a cached version.
It is only usable inside Spark SQL and is shared across sessions via SharedState. To see what it does internally, set log4j.logger.org.apache.spark.sql.execution.CacheManager=ALL in conf/log4j.properties (Spark 3.3 and later use Log4j 2, which is configured through conf/log4j2.properties with a different syntax). CacheManager then logs messages such as "Asked to cache already cached data." when a cache request targets a plan that is already cached.
How to trigger CacheManager:
Use Spark SQL's cache or persist operators, or the SQL command CACHE TABLE.
Note that Spark Core's RDD cache / persist operators do not go through CacheManager.
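As a minimal sketch of the three triggers listed above, assuming a local SparkSession and Spark SQL on the classpath (the view name small_ids, the object name, and the sample data are illustrative, not from the original article):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TriggerCacheExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-trigger-sketch")
      .getOrCreate()

    // 1. Dataset.cache() registers the plan with CacheManager. It is lazy:
    //    the data is materialized by the first action that runs on it.
    val evens = spark.range(0, 100).filter("id % 2 = 0")
    evens.cache()
    evens.count()
    println(evens.storageLevel) // the level the cache entry was registered with

    // 2. Dataset.persist(level) does the same with an explicit storage level.
    val odds = spark.range(0, 100).filter("id % 2 = 1")
    odds.persist(StorageLevel.MEMORY_ONLY)

    // 3. The SQL command CACHE TABLE (eager unless the LAZY keyword is used).
    spark.range(0, 10).createOrReplaceTempView("small_ids")
    spark.sql("CACHE TABLE small_ids")
    println(spark.catalog.isCached("small_ids"))

    // By contrast, evens.rdd.cache() would be a Spark Core RDD cache and
    // would not create a CacheManager entry.
    spark.stop()
  }
}
```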
How to uncache:
Call Dataset.unpersist.
Execute DROP TABLE or TRUNCATE TABLE commands.
Use CatalogImpl to request uncache, refreshTable, or drop temporary/global views.
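The uncache paths can be exercised in a similar way. This is a sketch under the same assumptions (local SparkSession, illustrative view name small_ids); it goes through the public spark.catalog interface, which CatalogImpl implements:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UncacheExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("uncache-sketch")
      .getOrCreate()

    // Dataset.unpersist removes the entry that cache() registered.
    val df = spark.range(0, 100)
    df.cache()
    df.count()
    df.unpersist()
    assert(df.storageLevel == StorageLevel.NONE)

    // Catalog-level uncache of a cached temporary view.
    spark.range(0, 10).createOrReplaceTempView("small_ids")
    spark.catalog.cacheTable("small_ids")
    spark.catalog.uncacheTable("small_ids")
    assert(!spark.catalog.isCached("small_ids"))

    // Dropping a temporary view also asks CacheManager to uncache it.
    spark.catalog.cacheTable("small_ids")
    spark.catalog.dropTempView("small_ids")

    spark.stop()
  }
}
```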
Cache internal structure: CacheManager stores cached entries in an IndexedSeq[CachedData], where each CachedData holds a LogicalPlan and its corresponding InMemoryRelation. The default implementation of IndexedSeq is a Scala Vector, providing near‑constant‑time element access.
IndexedSeq: an immutable indexed sequence that supports fast random access and length lookup. Implementations supply two abstract methods: apply (indexing) and length.
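To make the data structure concrete, here is a toy model of the idea in plain Scala, with no Spark dependency. ToyPlan, ToyCachedData and ToyCacheManager are hypothetical stand-ins invented for this sketch; they are not Spark classes, and the whitespace/case normalization is only a placeholder for real plan canonicalization:

```scala
import scala.collection.immutable.IndexedSeq

// Hypothetical stand-in for a logical plan: just the query text.
final case class ToyPlan(sql: String) {
  // Placeholder normalization (assumption): collapse whitespace, ignore case.
  def canonicalized: String = sql.trim.toLowerCase.split("\\s+").mkString(" ")
  def sameResult(other: ToyPlan): Boolean = canonicalized == other.canonicalized
}

// Mirrors the shape of CachedData: a plan paired with its cached representation.
final case class ToyCachedData(plan: ToyPlan, cachedRepresentation: Vector[Int])

final class ToyCacheManager {
  // An immutable IndexedSeq held in a var; the empty instance is a Vector.
  private var cachedData: IndexedSeq[ToyCachedData] = IndexedSeq.empty

  // Returns true if a new entry was added, false if the plan was already cached.
  def cacheQuery(plan: ToyPlan, rows: Vector[Int]): Boolean =
    if (lookupCachedData(plan).nonEmpty) {
      println("Asked to cache already cached data.")
      false
    } else {
      cachedData = ToyCachedData(plan, rows) +: cachedData
      true
    }

  // Linear scan comparing canonical forms.
  def lookupCachedData(plan: ToyPlan): Option[ToyCachedData] =
    cachedData.find(_.plan.sameResult(plan))

  def uncacheQuery(plan: ToyPlan): Unit =
    cachedData = cachedData.filterNot(_.plan.sameResult(plan))

  def size: Int = cachedData.length
}
```

In this sketch, caching "SELECT * FROM t" and then "select  *  from T" leaves a single entry, because the second request matches the first one after normalization.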
CachedData definition:
case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)
InMemoryRelation configuration includes:
spark.sql.inMemoryColumnarStorage.compressed (default enabled)
spark.sql.inMemoryColumnarStorage.batchSize (default 10000)
Storage level (default MEMORY_AND_DISK)
Optimized physical query plan after analysis
Table name
Statistics of the analyzed query plan
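The two configuration keys and the storage level can be set explicitly, as in the sketch below. It assumes a local SparkSession; the values shown simply restate the defaults listed above, and the object name is illustrative. Printing queryExecution.withCachedData for a fresh Dataset built from the same query is one way to check whether the plan was swapped for its cached form:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemoryRelationConfigExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("inmemoryrelation-config-sketch")
      .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .config("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
      .getOrCreate()

    // Cache with the storage level named explicitly.
    val df = spark.range(0, 1000).toDF("id")
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // action that materializes the cached batches

    // A new Dataset built from an equivalent query; its plan after cache
    // substitution is expected to contain an InMemoryRelation node.
    val same = spark.range(0, 1000).toDF("id")
    println(same.queryExecution.withCachedData)

    df.unpersist()
    spark.stop()
  }
}
```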
Determining if a query is already cached: Spark compares the canonicalized form of two query plans using sameResult:
final def sameResult(other: PlanType): Boolean = this.canonicalized == other.canonicalized
The canonicalized method normalizes a plan by removing superficial differences (case sensitivity, expression order, ExprId, etc.) so that logically equivalent plans produce identical representations. Canonicalization is defined in QueryPlan.scala and involves rewriting expressions, normalizing Alias and AttributeReference IDs, and recursively canonicalizing child plans.
In Spark 3.3.0, 21 specific QueryPlan subclasses override doCanonicalize, including HiveTableScanExec, InMemoryTableScanExec, AdaptiveSparkPlanExec, Join, etc. These overrides focus on copying the plan and normalizing selected attributes, enabling CacheManager to recognize and reuse identical logical queries.
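The effect of ID normalization can be illustrated with a small, Spark-free model. Attr, Scan and Project below are hypothetical simplifications made up for this sketch; the approach (replace each attribute's ExprId with its ordinal position in the child's output and blank out names) follows the general idea described above, not Spark's actual implementation:

```scala
object CanonicalizationSketch {
  // Hypothetical, simplified plan nodes; these are not Spark classes.
  final case class Attr(name: String, exprId: Long)
  final case class Scan(table: String, output: Seq[Attr])

  final case class Project(refs: Seq[Attr], child: Scan) {
    // Replace session-specific ExprIds by the attribute's position in the
    // child's output, and erase names and table-name case (assumptions).
    def canonicalized: Project = {
      val ordinal: Map[Long, Int] = child.output.map(_.exprId).zipWithIndex.toMap
      val normRefs = refs.map(a => Attr("none", ordinal(a.exprId).toLong))
      val normOutput = child.output.indices.map(i => Attr("none", i.toLong))
      Project(normRefs, Scan(child.table.toLowerCase, normOutput))
    }

    def sameResult(other: Project): Boolean =
      this.canonicalized == other.canonicalized
  }

  // Same query built twice with different ExprIds and different casing.
  val p1 = Project(Seq(Attr("id", 7L)),
    Scan("people", Seq(Attr("id", 7L), Attr("name", 8L))))
  val p2 = Project(Seq(Attr("ID", 42L)),
    Scan("People", Seq(Attr("ID", 42L), Attr("NAME", 43L))))
  // A different query: it selects the second column instead of the first.
  val p3 = Project(Seq(Attr("name", 8L)),
    Scan("people", Seq(Attr("id", 7L), Attr("name", 8L))))

  def main(args: Array[String]): Unit = {
    println(p1 == p2)          // raw plans differ in IDs and casing
    println(p1.sameResult(p2)) // canonical forms match
    println(p1.sameResult(p3)) // different column, so no match
  }
}
```

Here p1 and p2 are unequal as raw values but compare equal after canonicalization, while p3 stays distinct because it refers to a different ordinal position.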
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.