How ByteDance Cut Spark History Server Storage by 90% and Boosted UI Speed
ByteDance’s Spark History Server was re‑engineered into a cloud‑native UIService that reduces storage usage by over 90%, cuts UI latency by up to 94%, and enables horizontal scaling, dramatically improving the user experience for large‑scale Spark jobs.
ByteDance’s Data Platform SparkSQL team rebuilt the Spark History Server (SHS) as a cloud‑native service called UIService, which reduces both storage consumption and access latency by more than 90% and is now the default history backend for the LakeHouse Analytics Service (LAS).
Background
The open‑source SHS is built on Spark event logs: EventLoggingListener serializes each SparkListenerEvent to JSON and appends it to a log file. FsHistoryProvider periodically scans the configured log directories, loads application metadata into memory, and replays each log through ReplayListenerBus to populate a KVStore for UI rendering.
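To make the replay cost concrete, here is a minimal Python sketch of the idea. It is not Spark's code: the `replay` function and the dictionary store are hypothetical stand-ins for the replay step and the KVStore, and the field names (`Event`, `Job ID`, `Job Result`) are meant to resemble Spark's JSON event-log format and should be treated as illustrative.

```python
import json


def replay(lines):
    """Rebuild job state by applying every logged event in order."""
    store = {}  # hypothetical stand-in for the KVStore: job id -> status
    for line in lines:
        event = json.loads(line)  # every line is parsed, even if the UI never uses it
        kind = event.get("Event")
        if kind == "SparkListenerJobStart":
            store[event["Job ID"]] = "RUNNING"
        elif kind == "SparkListenerJobEnd":
            store[event["Job ID"]] = event["Job Result"]["Result"]
    return store


sample_log = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Task ID": 7}}',
    '{"Event": "SparkListenerJobEnd", "Job ID": 0, "Job Result": {"Result": "JobSucceeded"}}',
]
print(replay(sample_log))  # {0: 'JobSucceeded'}
```

The work grows with the number of logged events rather than with the amount of state the UI finally shows, which is why very large logs translate into slow page loads.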
Pain Points
High storage overhead : Detailed event logs can reach tens of gigabytes per job; ByteDance’s 7‑day logs occupy ~3.2 PB on HDFS.
Slow replay latency : UI rendering may take minutes after job completion because the server must parse large logs.
Poor scalability : FsHistoryProvider is stateful; each restart requires full metadata reload, and scaling requires complex path routing.
Not cloud‑native : Deploying SHS per tenant in public clouds is costly and hard to isolate.
Solution – UIService
UIService stores only the UI‑relevant metadata in a new UIMetaStore, persisted as a UIMetaFile. The KVStore contents are serialized with Kryo (faster than JSON); the persisted entries are the wrapper classes behind AppStatusStore and SQLAppStatusStore, listed below.
# AppStatusStore
org.apache.spark.status.JobDataWrapper
org.apache.spark.status.ExecutorStageSummaryWrapper
org.apache.spark.status.ApplicationInfoWrapper
org.apache.spark.status.PoolData
org.apache.spark.status.ExecutorSummaryWrapper
org.apache.spark.status.StageDataWrapper
org.apache.spark.status.AppSummary
org.apache.spark.status.RDDOperationGraphWrapper
org.apache.spark.status.TaskDataWrapper
org.apache.spark.status.ApplicationEnvironmentInfoWrapper
# SQLAppStatusStore
org.apache.spark.sql.execution.ui.SQLExecutionUIData
org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper
The file format begins with a 4‑byte magic number "UI_S" followed by a sequence of class‑name length, class name, data length, and serialized data blocks.
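The layout described above can be sketched as a small length-prefixed reader and writer. This is an illustration only: the article does not give the width or byte order of the length fields, so 4-byte big-endian integers and UTF-8 class names are assumptions, the payloads are opaque bytes standing in for Kryo output, and `write_meta` / `read_meta` are hypothetical names.

```python
import io
import struct

MAGIC = b"UI_S"  # 4-byte magic number named in the article


def write_meta(stream, records):
    """Write (class_name, payload_bytes) pairs after the magic number."""
    stream.write(MAGIC)
    for class_name, payload in records:
        name = class_name.encode("utf-8")
        stream.write(struct.pack(">i", len(name)))     # class-name length (assumed 4-byte big-endian)
        stream.write(name)                             # class name
        stream.write(struct.pack(">i", len(payload)))  # data length (assumed 4-byte big-endian)
        stream.write(payload)                          # serialized data (Kryo bytes in the real system)


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated UIMetaFile")
    return data


def read_meta(stream):
    """Return the list of (class_name, payload_bytes) pairs in the stream."""
    if stream.read(4) != MAGIC:
        raise ValueError("not a UIMetaFile: bad magic number")
    records = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of file
        if len(header) != 4:
            raise ValueError("truncated UIMetaFile")
        (name_len,) = struct.unpack(">i", header)
        name = read_exact(stream, name_len).decode("utf-8")
        (data_len,) = struct.unpack(">i", read_exact(stream, 4))
        records.append((name, read_exact(stream, data_len)))
    return records


buf = io.BytesIO()
write_meta(buf, [("org.apache.spark.status.JobDataWrapper", b"\x01\x02\x03")])
buf.seek(0)
print(read_meta(buf))
```

Because each record carries its own class name and length, a reader can skip or dispatch on records without parsing their payloads.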
Key Components
UIMetaLoggingListener : Listens only for StageEnd and JobEnd events and writes a batch snapshot of UIMetaStore at those points, instead of streaming every event to the log.
UIMetaProvider : Replaces FsHistoryProvider; it reads UIMetaFile directly, eliminating path‑scanning and pre‑loading of all metadata.
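The batching behavior of the listener can be illustrated with a toy model. The class below is a hypothetical analogue, not the real UIMetaLoggingListener: event kinds are plain strings, the state is a dictionary, and the `sink` list stands in for the UIMetaFile writer.

```python
class SnapshotListener:
    """Toy analogue of a listener that persists state only on stage/job end."""

    FLUSH_EVENTS = {"StageEnd", "JobEnd"}

    def __init__(self, sink):
        self.state = {}   # in-memory UI metadata, updated on every event
        self.sink = sink  # receives snapshots; stands in for the file writer

    def on_event(self, kind, key, value):
        self.state[key] = value
        if kind in self.FLUSH_EVENTS:
            # One batch snapshot per flush point instead of one write per event.
            self.sink.append(dict(self.state))


sink = []
listener = SnapshotListener(sink)
listener.on_event("TaskStart", "task-1", "RUNNING")
listener.on_event("TaskEnd", "task-1", "SUCCESS")
listener.on_event("StageEnd", "stage-0", "COMPLETE")
listener.on_event("JobEnd", "job-0", "SUCCEEDED")
print(len(sink))  # 2 snapshots for 4 events
```

Task-level events still update the in-memory state, but they trigger no I/O on their own, which is where the write volume is saved.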
Optimizations
Avoid duplicate writes by tracking already‑serialized instances and persisting only changed data.
Filter out running‑task information; only completed task data is stored.
Provide a fallback to the original event‑log path if UIMetaFile is missing or corrupted, ensuring seamless migration.
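The three optimizations above can be sketched together. Everything here is an assumption made for illustration: change detection uses a SHA-256 digest of a JSON encoding (the article does not say how changed instances are tracked), entries carry a `status` field, and `IncrementalWriter` and `load_ui` are hypothetical names.

```python
import hashlib
import json


class IncrementalWriter:
    """Persist only completed entries whose content changed since the last flush."""

    def __init__(self):
        self.seen = {}     # key -> digest of the last persisted payload
        self.records = []  # stands in for records appended to the UIMetaFile

    def flush(self, entries):
        written = 0
        for key, value in entries.items():
            if value.get("status") == "RUNNING":
                continue  # running-task information is filtered out
            payload = json.dumps(value, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            if self.seen.get(key) == digest:
                continue  # already serialized and unchanged: skip the duplicate write
            self.seen[key] = digest
            self.records.append((key, payload))
            written += 1
        return written


def load_ui(app_id, load_meta, replay_event_log):
    """Prefer the metadata file; fall back to event-log replay if it is missing or corrupt."""
    try:
        return load_meta(app_id)
    except (FileNotFoundError, ValueError):
        return replay_event_log(app_id)


writer = IncrementalWriter()
tasks = {"task-1": {"status": "SUCCESS"}, "task-2": {"status": "RUNNING"}}
first = writer.flush(tasks)   # only task-1 is written
second = writer.flush(tasks)  # nothing changed, nothing written
tasks["task-2"] = {"status": "SUCCESS"}
third = writer.flush(tasks)   # task-2 completed, so it is written now
print(first, second, third)   # 1 0 1
```

In `load_ui`, the two loader arguments are injected callables, so the same fallback logic works regardless of how either path is actually implemented.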
Benefits
Storage Savings
Average storage reduced by 85% and total volume by 92.4%; 7‑day logs dropped from 3.2 PB to 350 TB, enabling retention up to 30 days.
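A quick back-of-the-envelope check shows why longer retention becomes affordable. The sketch assumes that storage grows linearly with the retention window and that 1 PB = 1000 TB; both are simplifications, not figures from the article.

```python
uimeta_7d_tb = 350      # 7-day UIMeta footprint reported in the article
eventlog_7d_tb = 3200   # 3.2 PB of 7-day event logs, assuming 1 PB = 1000 TB

per_day_tb = uimeta_7d_tb / 7    # assumed constant daily volume
uimeta_30d_tb = per_day_tb * 30  # projected 30-day footprint

print(per_day_tb, uimeta_30d_tb, uimeta_30d_tb < eventlog_7d_tb)
# 50.0 1500.0 True
```

Under these assumptions, 30 days of UIMeta data would still take less than half the space that 7 days of event logs used to occupy.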
Latency Reduction
Average UI response time fell from 7,217 ms to 1,108 ms (about 85%); the 90th, 95th, and 99th percentile latencies dropped by 84.6%, 90.8%, and 93.7% respectively.
Source       pct90      pct95      pct99       AVG
event log    15589ms    37022ms    104259ms    7217ms
UIMeta       2401ms     3410ms     6595ms      1108ms
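The relative reductions can be recomputed directly from the measurements above. The snippet uses only the latency figures reported in the article; the helper name `reduction_pct` is illustrative.

```python
event_log_ms = {"pct90": 15589, "pct95": 37022, "pct99": 104259, "AVG": 7217}
uimeta_ms = {"pct90": 2401, "pct95": 3410, "pct99": 6595, "AVG": 1108}


def reduction_pct(before, after):
    """Relative latency reduction in percent, rounded to one decimal place."""
    return round((1 - after / before) * 100, 1)


reductions = {k: reduction_pct(event_log_ms[k], uimeta_ms[k]) for k in event_log_ms}
print(reductions)
# {'pct90': 84.6, 'pct95': 90.8, 'pct99': 93.7, 'AVG': 84.6}
```

The gain is largest at the tail: the slowest page loads were dominated by replaying the largest logs.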
Architectural Gains
By removing directory scanning and pre‑loading, UI becomes available within seconds after job completion, and the service scales horizontally across tenants in LAS, providing cloud‑native isolation and on‑demand elasticity.
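One way to picture why this scales horizontally: if a provider loads a single application's metadata on demand and keeps no global index, any replica can answer any request. The sketch below is a hypothetical model, not the UIMetaProvider implementation; the small LRU cache and the `load_file` callable are assumptions added for illustration.

```python
from collections import OrderedDict


class OnDemandProvider:
    """Loads one application's UI metadata on request; no directory scan, no preload."""

    def __init__(self, load_file, capacity=2):
        self.load_file = load_file  # callable: app_id -> UI metadata
        self.cache = OrderedDict()  # small LRU cache, purely an optimization
        self.capacity = capacity

    def get_app_ui(self, app_id):
        if app_id in self.cache:
            self.cache.move_to_end(app_id)  # mark as most recently used
            return self.cache[app_id]
        meta = self.load_file(app_id)       # read only the requested application's file
        self.cache[app_id] = meta
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return meta


storage = {"app-1": {"jobs": 3}, "app-2": {"jobs": 1}}  # stands in for shared storage
loads = []


def load_file(app_id):
    loads.append(app_id)
    return storage[app_id]


replica_a = OnDemandProvider(load_file)
replica_b = OnDemandProvider(load_file)
replica_a.get_app_ui("app-1")
replica_a.get_app_ui("app-1")  # served from replica_a's cache
replica_b.get_app_ui("app-1")  # a fresh replica serves the same app with one read
print(loads)  # ['app-1', 'app-1']
```

Since a replica holds nothing that cannot be rebuilt from shared storage, adding or restarting one requires no metadata reload and no request routing by path.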
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
