
How Top Internet Companies Scale Spark CI/CD Across Tens of Thousands of Nodes

This article details a practical, production‑grade Spark CI/CD workflow using GitLab and Jenkins, covering source management, multi‑branch release strategies, automated testing, gray‑release, hot‑fix handling, and rollback mechanisms for large‑scale deployments.

dbaplus Community

The author, a big‑data architect experienced with Kafka, Flume, Hadoop, and Spark, shares a distilled version of the CI/CD practices used by leading internet companies to manage Spark clusters with tens of thousands of nodes.

CI Overview

Continuous Integration (CI) means regularly merging tested code into the main branch. Its benefits are rapid error detection, less drift between branches, and support for fast iteration.

Spark CI Implementation

All code resides in private GitLab repositories: spark-src.git holds the source and spark-bin.git holds the built distributions. Developers submit Merge Requests (MRs) in GitLab, and each MR triggers a Jenkins build through a webhook. The build performs the following steps:

Compile all Spark modules.

Run unit tests.

Execute performance tests.

Fail the build if any test fails or performance degrades.

Jenkins reports the result back to GitLab; only successful builds allow the MR to be merged. At least two reviewers must approve before merging.
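The gating rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the BuildResult type, the function names, and the 5% slowdown threshold are all assumptions made for the example, since the article does not say how "performance degrades" is measured. Only the requirement of two approvals comes from the text.

```python
from dataclasses import dataclass

# Hypothetical threshold: a benchmark may be at most 5% slower than its baseline.
MAX_SLOWDOWN = 0.05
# From the article: at least two reviewers must approve before merging.
REQUIRED_APPROVALS = 2


@dataclass
class BuildResult:
    compiled: bool
    failed_tests: int
    # benchmark name -> (baseline_seconds, current_seconds)
    benchmarks: dict


def build_passes(result: BuildResult, max_slowdown: float = MAX_SLOWDOWN) -> bool:
    """Return True only if compilation, unit tests, and performance checks all pass."""
    if not result.compiled or result.failed_tests > 0:
        return False
    for baseline, current in result.benchmarks.values():
        if current > baseline * (1 + max_slowdown):
            return False  # treated as a performance regression
    return True


def can_merge(result: BuildResult, approvals: int) -> bool:
    """An MR is mergeable only with a passing build and enough approvals."""
    return build_passes(result) and approvals >= REQUIRED_APPROVALS
```

In a real setup Jenkins would report the build status to GitLab and GitLab would enforce the approval count; the sketch only shows how the two conditions combine.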

Spark CD (Continuous Delivery)

Delivery means making a new version available to QA or users promptly. Three release strategies are presented:

Solution 1 – Single Branch

Development occurs on spark-src.git/dev.

Every Monday, the latest code is packaged into spark-bin.git/dev/spark-{build#}. spark-prod points to the previous week’s release, providing a one‑week testing window.

Bug fixes and hot fixes are committed with messages containing bugfix or hotfix, which triggers an immediate build and updates the appropriate symbolic links.
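The weekly rotation and the commit-message trigger might look roughly like the sketch below. It rests on several assumptions: the link name spark-dev, the build names, and both function names are hypothetical, and letting hotfix win when a message contains both keywords is an arbitrary choice. The article only states that spark-prod trails by one week and that the keywords bugfix and hotfix trigger an immediate build.

```python
from typing import Optional


def weekly_release(links: dict, new_build: str) -> dict:
    """Monday release: the new build becomes dev, last week's dev build becomes prod."""
    return {"spark-dev": new_build, "spark-prod": links["spark-dev"]}


def classify_commit(message: str) -> Optional[str]:
    """Return 'hotfix' or 'bugfix' if the message contains that keyword, else None.

    A non-None result means the commit should trigger an immediate build
    instead of waiting for the next Monday release.
    """
    lowered = message.lower()
    if "hotfix" in lowered:
        return "hotfix"
    if "bugfix" in lowered:
        return "bugfix"
    return None


# Example: before Monday, dev is on build 41 and prod on build 40.
links = {"spark-dev": "spark-41", "spark-prod": "spark-40"}
links = weekly_release(links, "spark-42")
# Build 41 has now had a week of testing and is promoted to prod.
```

Which link an immediate build updates is left out on purpose, because the article does not spell out that mapping.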

Solution 2 – Two Branches

Both the source and binary repositories have separate dev and prod branches.

Weekly releases are built from dev and later fast‑forward merged into prod.

Bug fixes are applied on dev and hot fixes on prod, with Jenkins handling the builds and the symbolic-link updates.

Pros: transparent path switching, easier gray‑release, stable prod due to a full testing cycle.

Cons: potential merge conflicts during cherry‑pick or rebase, and a delay of up to two weeks for bug fixes to reach prod.
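To see why hot fixes on prod lead to cherry-picks or rebases, it helps to recall when a fast-forward merge is possible: only when the tip of the target branch is an ancestor of the tip of the source branch. The sketch below models that rule on a toy commit graph. The commit ids and function names are invented for illustration, and real repositories would rely on Git itself for this check.

```python
def is_ancestor(parents: dict, ancestor: str, commit: str) -> bool:
    """True if `ancestor` is reachable from `commit` by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def can_fast_forward(parents: dict, target_tip: str, source_tip: str) -> bool:
    """The target branch can fast-forward to the source branch only if
    the target's tip is already part of the source's history."""
    return is_ancestor(parents, target_tip, source_tip)


# Toy history: A <- B <- C on dev, plus a hot fix H committed on prod after B.
parents = {"B": ["A"], "C": ["B"], "H": ["B"]}

ff_without_hotfix = can_fast_forward(parents, "B", "C")  # prod at B, dev at C
ff_with_hotfix = can_fast_forward(parents, "H", "C")     # prod at H, dev at C
```

Without the hot fix, prod simply moves forward to C. Once H exists only on prod, the histories have diverged, so the hot fix has to be carried over to dev by cherry-pick or rebase before the next fast-forward merge, which is where conflicts can appear.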

Solution 3 – Multi‑Branch

Development on master, with weekly fast‑forward merges into dev and prod.

Bug‑fixes are committed to dev, hot‑fixes to prod, then rebased onto master.

Pros: clear separation of the dev, staging, and production code bases, with consistency ensured through rebase.

Cons: strict rebase requirements, conflict resolution risks, and the need for local rebasing before pushing.
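The effect of a rebase can be illustrated with a toy model in which each branch is a linear, oldest-first list of commit ids. This is a simplified sketch under that assumption, not how Git stores history, and the function names and commit ids are hypothetical. Replayed commits are marked with a trailing apostrophe because a rebase rewrites them into new commits.

```python
def rebase(branch: list, onto: list) -> list:
    """Replay the commits unique to `branch` on top of `onto`."""
    shared = 0
    while shared < min(len(branch), len(onto)) and branch[shared] == onto[shared]:
        shared += 1
    # Commits after the shared prefix are rewritten, hence the new ids.
    return onto + [commit + "'" for commit in branch[shared:]]


def is_fast_forward(old: list, new: list) -> bool:
    """True if `new` only adds commits on top of `old`."""
    return new[:len(old)] == old


master = ["A", "B", "C", "D"]
prod = ["A", "B", "H"]  # H is a hot fix committed directly on prod

rebased = rebase(prod, master)
```

Here rebased is ["A", "B", "C", "D", "H'"], so master can fast-forward to it and the histories line up again. The original prod history is not a prefix of the rebased one, which is one way to read the listed cons: anyone holding the old history has to rebase locally before pushing.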

Gray Release and Rollback

Only one dev and one prod version are maintained. Deployments use a symbolic link named spark that points to a specific spark-{build#} directory. Rolling back simply means repointing the link to an earlier build.
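Switching or rolling back the link can be sketched as follows. This is a minimal example assuming a POSIX file system. The helper name point_link, the temporary link name, and the build directory names in the usage note are illustrative rather than taken from the article. The new link is created under a temporary name and then renamed over the old one, so readers never observe a missing link.

```python
import os


def point_link(link_path: str, target: str) -> None:
    """Atomically repoint the symbolic link at `link_path` to `target`."""
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # clear a leftover from an interrupted switch
    os.symlink(target, tmp_path)
    # rename() replaces the existing link itself; it does not follow it.
    os.replace(tmp_path, link_path)
```

A release would call point_link("/opt/spark", "spark-42"), and a rollback would call point_link("/opt/spark", "spark-41"), assuming both build directories sit next to the link. Only the link changes, so the earlier build stays on disk for as long as rollback should remain possible.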

Continuous Deployment

After a release is pushed to spark-bin.git, a custom Git‑based deployment system (or similar) automatically deploys the artifact to staging or production environments, completing the CI/CD pipeline.

Key Takeaways

Automated testing and merge gating ensure only validated code reaches production.

Branching strategies balance simplicity, testing rigor, and gray‑release flexibility.

Rebase‑based integration keeps dev, prod, and master histories aligned but requires disciplined conflict handling.

Symbolic links provide an intuitive rollback mechanism.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
