
How ByteDance Cut Spark History Server Storage by 90% and Boosted UI Speed

ByteDance’s Spark History Server was re‑engineered into a cloud‑native UIService that reduces storage usage by over 90%, cuts UI latency by up to 94%, and enables horizontal scaling, dramatically improving the user experience for large‑scale Spark jobs.

ByteDance Data Platform

Mar 14, 2022


ByteDance’s Data Platform SparkSQL team rebuilt the Spark History Server (SHS) as a cloud‑native service called UIService. It cuts total storage consumption by more than 90% and tail access latency by up to 94%, and it is now the default history backend for the LakeHouse Analytics Service (LAS).

Background

The open‑source SHS relies on Spark event logs: EventLoggingListener serializes each SparkListenerEvent to JSON and appends it to the application’s log file. FsHistoryProvider periodically scans the configured log directories, loads application metadata into memory, and replays each log through ReplayListenerBus to populate a KVStore that the UI reads from.
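To make the replay cost concrete, here is a minimal Python sketch of the idea. It is not Spark's implementation: it assumes one JSON object per line with an `Event` field naming the event type, and the `replay` function and the plain dict standing in for the KVStore are illustrative names only.

```python
import json

def replay(lines):
    """Rebuild per-job UI state by scanning every event in the log."""
    store = {}  # stand-in for the KVStore: job id -> job record
    for line in lines:
        event = json.loads(line)  # every line is parsed, relevant or not
        kind = event.get("Event")
        if kind == "SparkListenerJobStart":
            store[event["Job ID"]] = {"status": "RUNNING"}
        elif kind == "SparkListenerJobEnd":
            store[event["Job ID"]]["status"] = event["Job Result"]["Result"]
    return store

# A tiny, simplified log; real logs carry many more fields and events.
log = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    '{"Event": "SparkListenerJobEnd", "Job ID": 0, "Job Result": {"Result": "JobSucceeded"}}',
]
print(replay(log))  # {0: {'status': 'JobSucceeded'}}
```

Every line is parsed whether or not it changes the final state, so replay time grows with log size rather than with the amount of data the UI eventually shows.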

Pain Points

High storage overhead: Detailed event logs can reach tens of gigabytes per job; ByteDance’s 7‑day logs occupy ~3.2 PB on HDFS.

Slow replay: The UI can take minutes to render after a job completes because the server must first parse the full log.

Poor scalability: FsHistoryProvider is stateful, so every restart triggers a full metadata reload, and scaling out requires complex path routing.

Not cloud‑native: Deploying a separate SHS per tenant in a public cloud is costly, and tenant isolation is hard to achieve.

Solution – UIService

UIService stores only the UI‑relevant metadata in a new UIMetaStore, persisted as a UIMetaFile. The file holds the KVStore entries that back AppStatusStore and SQLAppStatusStore, serialized with Kryo, which is faster than JSON. The serialized classes are:

# AppStatusStore
org.apache.spark.status.JobDataWrapper
org.apache.spark.status.ExecutorStageSummaryWrapper
org.apache.spark.status.ApplicationInfoWrapper
org.apache.spark.status.PoolData
org.apache.spark.status.ExecutorSummaryWrapper
org.apache.spark.status.StageDataWrapper
org.apache.spark.status.AppSummary
org.apache.spark.status.RDDOperationGraphWrapper
org.apache.spark.status.TaskDataWrapper
org.apache.spark.status.ApplicationEnvironmentInfoWrapper
# SQLAppStatusStore
org.apache.spark.sql.execution.ui.SQLExecutionUIData
org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper

The file format begins with a 4‑byte magic number "UI_S" followed by a sequence of class‑name length, class name, data length, and serialized data blocks.
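The layout described above can be sketched as a small writer and reader. This is an illustration under assumptions the article does not state: lengths are written as 4‑byte big‑endian integers, class names are UTF‑8, and payloads are treated as opaque bytes (in UIService they would be Kryo output). The function names are hypothetical.

```python
import io
import struct

MAGIC = b"UI_S"  # 4-byte magic number from the article

def write_meta(out, records):
    """Write (class_name, payload_bytes) pairs after the magic number."""
    out.write(MAGIC)
    for class_name, payload in records:
        name = class_name.encode("utf-8")
        out.write(struct.pack(">i", len(name)))     # class-name length (assumed int32)
        out.write(name)                             # class name
        out.write(struct.pack(">i", len(payload)))  # data length (assumed int32)
        out.write(payload)                          # serialized data

def read_exact(src, n):
    data = src.read(n)
    if len(data) != n:
        raise ValueError("truncated UIMetaFile")
    return data

def read_meta(src):
    """Return the list of (class_name, payload_bytes) pairs in the file."""
    if src.read(4) != MAGIC:
        raise ValueError("bad magic number")
    records = []
    while True:
        header = src.read(4)
        if not header:
            break  # clean end of file
        if len(header) != 4:
            raise ValueError("truncated UIMetaFile")
        (name_len,) = struct.unpack(">i", header)
        name = read_exact(src, name_len).decode("utf-8")
        (data_len,) = struct.unpack(">i", read_exact(src, 4))
        records.append((name, read_exact(src, data_len)))
    return records

buf = io.BytesIO()
write_meta(buf, [("org.apache.spark.status.AppSummary", b"\x01\x02")])
buf.seek(0)
print(read_meta(buf))
```

Because each block carries its own class name and length, a reader can skip or dispatch blocks without parsing their contents, and the magic number lets it reject files that are not UIMetaFiles before reading further.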

Key Components

UIMetaLoggingListener: Reacts only to stage‑end and job‑end events and writes a batch snapshot of the UIMetaStore at those points instead of streaming every event.

UIMetaProvider: Replaces FsHistoryProvider. It reads the UIMetaFile directly, which eliminates directory scanning and the pre‑loading of all metadata.
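The division of labor between the two components can be sketched roughly as follows. The class and function names, the event‑kind strings, and the dict used as a file system are all hypothetical; the sketch only shows the pattern of snapshotting on selected events and loading by direct lookup.

```python
class SnapshotListener:
    """Hypothetical listener: persists a snapshot only on stage-end and job-end."""
    SNAPSHOT_EVENTS = {"stageEnd", "jobEnd"}

    def __init__(self, store, persist):
        self.store = store      # in-memory UI state, updated elsewhere
        self.persist = persist  # callable that writes one snapshot
        self.snapshots_written = 0

    def on_event(self, kind):
        if kind not in self.SNAPSHOT_EVENTS:
            return False  # task-level events trigger no write
        self.persist(dict(self.store))
        self.snapshots_written += 1
        return True

def load_ui_meta(files, app_id):
    """Provider side: one direct lookup by application id, no directory scan."""
    return files.get(app_id)

files = {}  # stand-in for the storage holding one snapshot per application
store = {"job-0": "RUNNING"}
listener = SnapshotListener(store, lambda snap: files.__setitem__("app-1", snap))

for kind in ["taskStart", "taskEnd", "taskEnd", "stageEnd"]:
    listener.on_event(kind)
store["job-0"] = "SUCCEEDED"
listener.on_event("jobEnd")

print(listener.snapshots_written, load_ui_meta(files, "app-1"))
# 2 {'job-0': 'SUCCEEDED'}
```

Five events produce two writes here, and the provider never enumerates a directory: it resolves one application id to one file.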

Optimizations

Avoid duplicate writes by tracking already‑serialized instances and persisting only changed data.

Filter out running‑task information; only completed task data is stored.

Provide a fallback to the original event‑log path if UIMetaFile is missing or corrupted, ensuring seamless migration.
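A rough sketch of the three optimizations follows. The article does not say how changed data is detected, so the content digest used here is a swapped‑in technique chosen for illustration; likewise `IncrementalWriter`, `completed_only`, `load_history`, and the task dict shape are hypothetical.

```python
import hashlib

class IncrementalWriter:
    """Persist an entity only when its serialized bytes have changed."""
    def __init__(self):
        self.last_digest = {}  # key -> digest of the last payload written
        self.written = []      # stand-in for the output file

    def write(self, key, payload):
        digest = hashlib.sha256(payload).digest()
        if self.last_digest.get(key) == digest:
            return False  # unchanged since the last snapshot: skip
        self.last_digest[key] = digest
        self.written.append((key, payload))
        return True

def completed_only(tasks):
    """Drop tasks that are still running before they are persisted."""
    return [t for t in tasks if t["status"] != "RUNNING"]

def load_history(app_id, read_ui_meta, replay_event_log):
    """Prefer the UIMetaFile; fall back to event-log replay if it is unusable."""
    try:
        meta = read_ui_meta(app_id)
    except (OSError, ValueError):
        meta = None  # treat a missing or corrupted file the same way
    if meta is None:
        return replay_event_log(app_id)
    return meta

def corrupted_reader(app_id):
    raise ValueError("bad magic number")

writer = IncrementalWriter()
first = writer.write("job-0", b"state-a")
repeat = writer.write("job-0", b"state-a")
changed = writer.write("job-0", b"state-b")
tasks = [{"id": 1, "status": "SUCCESS"}, {"id": 2, "status": "RUNNING"}]
print(first, repeat, changed, completed_only(tasks))
print(load_history("app-1", corrupted_reader, lambda app_id: "replayed from event log"))
```

The fallback keeps old applications readable during migration: anything written before UIService was enabled still has an event log, so the slower path is used only when the faster one is unavailable.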

Benefits

Storage Savings

Average storage fell by 85% and total volume by 92.4%. The 7‑day footprint dropped from 3.2 PB to 350 TB, which makes retention of up to 30 days feasible.
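A back‑of‑the‑envelope check shows why longer retention becomes affordable. It assumes decimal units (1 PB = 1000 TB) and a steady daily volume that scales linearly with the retention window; neither assumption comes from the article.

```python
TB_PER_PB = 1000            # assumption: decimal units

before_tb = 3.2 * TB_PER_PB  # 7 days of event logs
after_tb = 350               # 7 days of UIMeta files

per_day_tb = after_tb / 7            # average daily volume after the change
thirty_day_tb = per_day_tb * 30      # projected 30-day footprint

print(per_day_tb, thirty_day_tb, thirty_day_tb < before_tb)
# 50.0 1500.0 True
```

Under these assumptions, 30 days of UIMeta data would take about 1.5 PB, still less than half of what 7 days of event logs used to occupy.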

Latency Reduction

UI response time improved by 35% on average; the 90th, 95th, and 99th percentile latencies dropped by 84.6%, 90.8%, and 93.7%, respectively.

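| Source    | pct90    | pct95    | pct99     | AVG     |
|-----------|----------|----------|-----------|---------|
| event log | 15589 ms | 37022 ms | 104259 ms | 7217 ms |
| UIMeta    | 2401 ms  | 3410 ms  | 6595 ms   | 1108 ms |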
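The percentile reductions quoted above can be reproduced from the measured latencies with the usual relative‑reduction formula, (1 − after / before) × 100. The snippet below is only a worked check of that arithmetic; the variable and function names are illustrative.

```python
# (event log latency, UIMeta latency) in milliseconds, from the measurements above
latency_ms = {
    "pct90": (15589, 2401),
    "pct95": (37022, 3410),
    "pct99": (104259, 6595),
}

def reduction_pct(before, after):
    """Relative reduction in percent, rounded to one decimal place."""
    return round((1 - after / before) * 100, 1)

reductions = {name: reduction_pct(*pair) for name, pair in latency_ms.items()}
print(reductions)
# {'pct90': 84.6, 'pct95': 90.8, 'pct99': 93.7}
```

The gain grows toward the tail: the slowest requests are the ones that previously had to replay the largest logs, so they benefit most from skipping replay.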

Architectural Gains

With directory scanning and pre‑loading removed, the UI becomes available within seconds of job completion. The service also scales horizontally across tenants in LAS, which provides cloud‑native isolation and on‑demand elasticity.


Written by

ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
