
From Hadoop to StarRocks: Revamping a Government Procurement Data Platform

Facing massive data volumes, complex component dependencies, high TCO, and real-time processing limits, the 政采云 (Zhengcaiyun) platform replaced its Hadoop stack with StarRocks' minimalist, decoupled architecture, achieving lower costs, elastic scaling, faster queries, easier operations, and robust fault tolerance across diverse government procurement workloads.


Background and Motivation

In the era of digital transformation, data has become a core production asset for both government and enterprises. The 政采云 (Zhengcaiyun) platform, a leading government procurement cloud service, processes massive volumes of high-concurrency data daily. Hadoop originally powered the platform, offering low-cost batch processing for unstructured and semi-structured data.

Rapid business growth, increasingly complex analytics, and strict latency requirements exposed Hadoop’s limitations: cumbersome component management, high total cost of ownership (TCO), and an inability to meet real‑time demands.

Key Challenges of the Existing Hadoop Stack

Complex and heavyweight architecture: Managing HDFS, YARN, Hive, Spark, and other tightly coupled components required extensive version-compatibility work and intricate configuration tuning.

High TCO: Redundant hardware, extensive operational effort, and costly upgrades drove up both capital and operational expenses.

Real-time performance gap: Hadoop's batch-oriented design could not deliver the sub-second query latency that modern analytics requires.

Strong coupling and upgrade difficulty: Adding or removing nodes triggered costly data redistribution, limiting flexibility and preventing independent component upgrades.

Domestic adaptation hurdles: The shift to Chinese-made hardware and operating systems (信创, the domestic IT application innovation initiative) introduced compatibility and performance risks, especially after open-source support for CDH was discontinued.

Strategic Requirements

The technology stack must be independently controllable and fully compatible with domestic hardware and operating systems.

High performance and real‑time capabilities for massive data analysis.

Simplified architecture and lightweight operations to reduce maintenance burden.

Significant cost optimization, lowering both hardware procurement and ongoing operational expenses.

Elastic scalability to adjust resources on demand.

Choosing StarRocks

After extensive evaluation, the team selected StarRocks as the core engine. StarRocks follows an "extreme performance + extreme simplicity" philosophy, abandoning Hadoop's stacked component model in favor of a unified MPP architecture that integrates storage, metadata, SQL computation, and query optimization.

The commercial partner 镜舟科技 provides deep optimization for domestic hardware and operating systems, ensuring reliable deployment in the Chinese market.

StarRocks Architecture Overview

The new architecture reduces the stack to two core components:

Frontend (FE): Handles metadata management, query planning, client connections, and cluster coordination, using Raft replication for strong metadata consistency.

Compute Node (CN): Executes queries without persisting data locally, enabling true storage-compute separation.

HDFS, YARN, Hive, and other Hadoop services are removed. All data storage, metadata, and query execution are consolidated within StarRocks.
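
To make this consolidation concrete, below is a minimal sketch of querying the cluster through that single endpoint. StarRocks FE nodes speak the MySQL wire protocol (query port 9030 by default), so any stock MySQL driver works; the hostname, credentials, and table here are hypothetical placeholders.

```python
# Minimal sketch: query StarRocks through one FE endpoint with a plain
# MySQL driver. Host, credentials, and the table are hypothetical.
import pymysql

conn = pymysql.connect(
    host="starrocks-fe.example.internal",  # any FE node, or a load balancer in front
    port=9030,                             # default FE MySQL query port
    user="analyst",
    password="******",
    database="procurement",
)
try:
    with conn.cursor() as cur:
        # Planning, metadata lookup, and execution all happen inside
        # StarRocks; no Hive metastore or YARN queue is involved.
        cur.execute(
            "SELECT bid_date, COUNT(*) AS orders "
            "FROM purchase_orders "
            "GROUP BY bid_date ORDER BY bid_date DESC LIMIT 7"
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```

The same connection pattern serves every workload discussed below, from scheduled ETL to interactive dashboards.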

Operational Benefits

Simplified deployment: Managing a small set of FE instances and scaling CN nodes as needed dramatically reduces operational complexity.

High availability: Failed FE nodes are automatically taken over by their Raft replicas; failed CN nodes trigger automatic task rescheduling and cache reconstruction.

Cost efficiency: Object storage replaces HDFS for cold data, cutting storage costs by over 85%; elastic CN scaling eliminates idle resource waste, reducing overall IT spend by more than 30% (see the storage-volume sketch after this list).

Performance gains: StarRocks delivers higher throughput and lower latency on the same hardware, supporting both batch and real-time workloads.

Unified SQL interface: All scenarios (ETL, batch reporting, interactive analytics, and real-time dashboards) use a single standard SQL dialect, removing the need for multiple engines.
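
As a sketch of the cold-data tiering mentioned in the cost-efficiency point above: in StarRocks' shared-data mode, a storage volume maps tables onto object storage so that CN nodes keep only local caches. The bucket, endpoint, credentials, and table names below are hypothetical, and the exact DDL property keys should be verified against the StarRocks version in use.

```python
# Hypothetical sketch: register an S3-backed storage volume and create a
# table on it, so cold data lives in object storage instead of HDFS.
import pymysql

DDL_VOLUME = """
CREATE STORAGE VOLUME IF NOT EXISTS cold_s3
TYPE = S3
LOCATIONS = ('s3://zcy-cold-data/starrocks')
PROPERTIES (
    "aws.s3.region" = "cn-hangzhou",
    "aws.s3.endpoint" = "https://s3.example-cloud.com",
    "aws.s3.access_key" = "******",
    "aws.s3.secret_key" = "******"
)
"""

# Data persists in the volume; CN nodes only cache it, so scaling CNs in or
# out never triggers the data re-distribution that plagued the Hadoop stack.
DDL_TABLE = """
CREATE TABLE IF NOT EXISTS procurement.order_archive (
    order_id BIGINT,
    bid_date DATE,
    amount   DECIMAL(18, 2)
)
DUPLICATE KEY (order_id)
PARTITION BY date_trunc('month', bid_date)
DISTRIBUTED BY HASH(order_id)
PROPERTIES ("storage_volume" = "cold_s3")
"""

conn = pymysql.connect(host="starrocks-fe.example.internal", port=9030,
                       user="admin", password="******")
with conn.cursor() as cur:
    cur.execute(DDL_VOLUME)
    cur.execute(DDL_TABLE)
conn.close()
```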

Migration Process

The migration tackled several fronts:

Application adaptation: Adjusted downstream services to consume data via StarRocks SQL endpoints.

SQL migration and lineage: Developed an automatic conversion tool that rewrites roughly 90% of Spark SQL statements into StarRocks SQL; the remaining dialect differences were handled manually (the first sketch after this list illustrates the approach).

Data consistency verification: Built a verification framework comparing row counts, key aggregates, and MD5 checksums across thousands of tables to ensure parity between Hadoop and StarRocks results (see the second sketch after this list).

Internal big-data applications: Integrated the job development platform, data tagging, data quality, lineage, and custom components (HTools, the scheduler) with the new engine.
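
The first sketch below shows the kind of rule-based rewriting such a conversion tool performs. The two rules are hypothetical examples of Spark/StarRocks dialect gaps, not the actual tool's rule set.

```python
# Illustrative rule-based Spark-SQL -> StarRocks-SQL rewriter. The rules are
# hypothetical examples; a real tool would carry a much larger catalog.
import re

REWRITE_RULES = [
    # Spark writes INSERT OVERWRITE TABLE t; StarRocks omits the TABLE keyword.
    (re.compile(r"\bINSERT\s+OVERWRITE\s+TABLE\b", re.IGNORECASE),
     "INSERT OVERWRITE"),
    # Spark's nvl() maps to the equivalent ifnull() in StarRocks.
    (re.compile(r"\bnvl\s*\(", re.IGNORECASE), "ifnull("),
]

def rewrite(spark_sql: str) -> tuple[str, bool]:
    """Apply the rules; return (rewritten_sql, fully_converted).

    Statements still containing constructs with no direct equivalent are
    flagged for manual migration, mirroring the ~10% handled by hand.
    """
    out = spark_sql
    for pattern, replacement in REWRITE_RULES:
        out = pattern.sub(replacement, out)
    needs_review = "LATERAL VIEW" in out.upper()  # one example of a manual case
    return out, not needs_review

sql, ok = rewrite("INSERT OVERWRITE TABLE dw.ods_orders SELECT nvl(amt, 0) FROM src")
print(sql, "| auto-converted:", ok)
```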
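
The second sketch outlines the consistency checks: run the same row-count and aggregate queries on both engines and compare the results. Connection details and the table list are hypothetical; the per-row MD5 checksums the framework also compared are omitted for brevity.

```python
# Hypothetical sketch of cross-engine verification: Hive (legacy) vs.
# StarRocks, compared on row counts plus one key aggregate per table.
import pymysql
from pyhive import hive  # legacy side, via HiveServer2's Thrift endpoint

CHECKS = {
    # table -> aggregate expression expected to match on both sides
    "dw.purchase_orders": "SUM(amount)",
    "dw.suppliers": "COUNT(DISTINCT supplier_id)",
}

hive_conn = hive.connect(host="hive.example.internal", port=10000)
sr_conn = pymysql.connect(host="starrocks-fe.example.internal", port=9030,
                          user="verify", password="******")
hive_cur, sr_cur = hive_conn.cursor(), sr_conn.cursor()

for table, agg in CHECKS.items():
    sql = f"SELECT COUNT(*), {agg} FROM {table}"
    hive_cur.execute(sql)
    sr_cur.execute(sql)
    left, right = tuple(hive_cur.fetchone()), tuple(sr_cur.fetchone())
    status = "OK" if left == right else "MISMATCH"
    print(f"{table}: hive={left} starrocks={right} -> {status}")

hive_conn.close()
sr_conn.close()
```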

Results and Impact

Reliability: Automated health checks and self-healing mechanisms reduced manual interventions.

Storage cost reduction: Cold-data tiering to object storage lowered total storage expense by more than 85%.

Compute cost reduction: Dynamic CN scaling eliminated over-provisioned resources, improving hardware utilization.

Overall TCO: Combined hardware, software, and labor savings cut total IT cost by more than 30%.

Operational simplicity: Monitoring focus shifted from dozens of Hadoop services to a handful of FE/CN processes, cutting maintenance time dramatically.

Team productivity: Engineers moved from firefighting infrastructure to platform optimization and business-value projects.

Future Directions

The platform plans to deepen real‑time analytics, explore intelligent materialized view automation, further integrate lakehouse concepts, and enhance elastic resource management to continue driving cost‑effective, high‑performance data services.

Tags: cloud native, StarRocks, cost optimization, Data Warehouse, Hadoop migration
Written by StarRocks

StarRocks is an open-source project under the Linux Foundation, focused on building a high-performance, scalable analytical database that enables enterprises to create an efficient, unified lakehouse paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
