How Baiguang Tech Cut Data Platform Costs by Nearly 70% with Alibaba Cloud EMR
Baiguang Technology rebuilt its data platform on Alibaba Cloud EMR, combining OSS storage, the open‑source ecosystem, and custom elastic scheduling to handle large, complex data workloads. The result: cluster operations more than 70% faster, compute costs per CU‑hour 68% lower, and measurable business performance gains.
Background
Baiguang Technology, founded in 2016, is a data‑driven market research and information services company serving over 200,000 enterprises across more than ten industries. Its data products require processing massive, diverse, and complex data sets, which drives rapidly growing compute demand.
The data engineering team treats technical methodology as an economics problem: deliver the widest range of analytical scenarios at the lowest possible cost.
Why Choose Alibaba Cloud EMR
Key advantages of Alibaba Cloud EMR over other platforms include:
High‑availability OSS storage: Enables a lakehouse architecture that supports diverse data ingestion and complex downstream processing.
Out‑of‑the‑box open‑source ecosystem: Includes Spark, Hadoop, Iceberg/Hudi/Delta, Paimon/Flink, Trino/Presto, etc., ready for production without redeployment.
Highly customizable runtime: Users can tune parameters or develop deeper customizations within the cluster.
Broad Data Lake Formation (DLF) support: Provides a performant catalog with fine‑grained permission control.
Flexible elastic scheduling: Rich configuration enables cost‑effective scaling; managed elastic policies are also available.
Comprehensive service support: Alibaba Cloud offers professional assistance for optimization and issue resolution.
Technical Solution Design
The rebuilt platform targets data engineers, analysts, and scientists, covering data ingestion, cleaning, aggregation, analysis, and delivery. Key stages:
Data Ingestion: In‑house tools write data to OSS periodically, which simplifies pipeline integration.
Data Cleaning: Spark jobs process raw data into Iceberg tables registered in the catalog; the jobs are scheduled via Airflow and run on EMR clusters.
Aggregation & Analysis: Analysts and data scientists run notebooks on EMR; the cluster’s elasticity absorbs varying workloads.
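The stages above form a simple dependency chain. The following is a minimal sketch of how such a chain can be ordered before execution, using only the Python standard library. The stage names are hypothetical labels invented for this illustration, and the code does not represent Baiguang's Airflow configuration; it only shows the ordering idea that an orchestrator applies.

```python
from graphlib import TopologicalSorter

# ASSUMPTION: hypothetical stage names. Each stage maps to the set of
# stages that must finish before it can start.
PIPELINE = {
    "ingest_to_oss": set(),
    "clean_with_spark_iceberg": {"ingest_to_oss"},
    "aggregate_and_analyze": {"clean_with_spark_iceberg"},
    "deliver": {"aggregate_and_analyze"},
}


def execution_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(execution_order(PIPELINE), start=1):
        print(f"{step}. {stage}")
```

Because each stage depends only on the one before it, the order is fully determined: ingestion, cleaning, aggregation and analysis, then delivery.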
Core Component Practices
DLF on Iceberg: The team identified performance regressions on large Iceberg tables, reported them, and resolved them jointly with the EMR product team, enabling high‑performance integration.
EMR Elastic Scheduling: The team co‑designed a tiered, cost‑effective scheduling model that raised cluster utilization from ~45% to ~70%.
OLAP Solution: After evaluating AWS Athena, the team selected EMR‑based Trino on Alibaba Cloud’s Yitian ARM‑based ECS instances, achieving a cost reduction of more than 20% while maintaining SQL compatibility.
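The article does not describe the internals of the tiered scheduling model mentioned under EMR Elastic Scheduling. As a rough illustration of what "tiered" can mean, the sketch below fills node tiers in priority order until the requested compute is covered at a target utilization. The tier names, node limits, `cu_per_node`, and the `plan_nodes` function are all assumptions made for this example; none of them come from Baiguang or from the EMR API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_nodes: int


# ASSUMPTION: hypothetical tiers, listed in the order they should be filled
# (baseline capacity first, then cheaper elastic capacity, then a fallback).
TIERS = (
    Tier("core_baseline", 4),
    Tier("task_preemptible", 20),
    Tier("task_pay_as_you_go", 20),
)


def plan_nodes(demand_cu, cu_per_node=16, target_utilization=0.70, tiers=TIERS):
    """Spread the nodes needed for `demand_cu` across tiers in priority order."""
    needed = math.ceil(demand_cu / (cu_per_node * target_utilization))
    plan = {}
    for tier in tiers:
        take = min(needed, tier.max_nodes)
        plan[tier.name] = take
        needed -= take
    if needed > 0:
        raise ValueError("demand exceeds the combined capacity of all tiers")
    return plan


def capacity_saving(util_before, util_after):
    """Fraction of provisioned capacity saved for the same useful work."""
    return 1 - util_before / util_after


if __name__ == "__main__":
    print(plan_nodes(100))  # light load: baseline plus a few preemptible nodes
    print(plan_nodes(300))  # heavy load: spills into the pay-as-you-go tier
    print(f"{capacity_saving(0.45, 0.70):.0%}")
```

The last function puts the reported utilization gain in perspective: if the same useful work runs at ~70% utilization instead of ~45%, the provisioned capacity needed for it falls by roughly 36%.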
Performance Improvements
After the rebuild, EMR cluster operations became markedly faster:
Cluster start time reduced from >13 minutes to ~3 minutes (a reduction of more than 70%).
Scale‑out time cut from >10 minutes to ~2 minutes.
Scale‑in time cut from >5 minutes to ~2 minutes.
Core job execution time dropped from >45 minutes to ~15 minutes.
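To compare these figures on a common scale, the sketch below converts each before/after pair into a percentage reduction and a speedup factor. It treats the quoted bounds (">13", "~3", and so on) as exact values, which is an assumption; the real reductions may differ slightly.

```python
# Before/after timings in minutes, taken from the list above.
# ASSUMPTION: the ">" and "~" bounds are treated as exact values.
TIMINGS_MIN = {
    "cluster start": (13, 3),
    "scale-out": (10, 2),
    "scale-in": (5, 2),
    "core job": (45, 15),
}


def reduction_pct(before, after):
    """Percentage of time saved going from `before` to `after`."""
    return (before - after) / before * 100


def speedup(before, after):
    """How many times faster the operation became."""
    return before / after


if __name__ == "__main__":
    for name, (before, after) in TIMINGS_MIN.items():
        print(
            f"{name}: {reduction_pct(before, after):.0f}% less time, "
            f"{speedup(before, after):.1f}x faster"
        )
```

Under that assumption, cluster start time drops by about 77%, scale‑out by 80%, scale‑in by 60%, and core job execution by about 67%.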
Cost Improvements
Compute cost per CU‑hour fell from 0.72 CNY to 0.23 CNY, a 68% reduction, and monthly EMR expenses decreased by over 50%.
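The 68% figure can be checked directly from the two unit prices. The sketch below does that, and also shows what the unit prices would mean for a monthly bill. The usage of 100,000 CU‑hours per month is a hypothetical number chosen only to make the arithmetic concrete, and `implied_usage_ratio` assumes the bill is purely CU‑hours multiplied by unit price, which the article does not state.

```python
# Unit prices quoted in the article (CNY per CU-hour).
PRICE_BEFORE = 0.72
PRICE_AFTER = 0.23

# ASSUMPTION: hypothetical monthly usage, not a figure from the article.
CU_HOURS_PER_MONTH = 100_000


def reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def monthly_compute_cost(cu_hours, price_per_cu_hour):
    """Monthly compute cost in CNY for a given usage and unit price."""
    return cu_hours * price_per_cu_hour


def implied_usage_ratio(bill_reduction, price_before, price_after):
    """Usage after/before implied by a bill reduction, if the bill were
    purely CU-hours x unit price (an assumption)."""
    return (1 - bill_reduction) * price_before / price_after


if __name__ == "__main__":
    print(f"unit price reduction: {reduction(PRICE_BEFORE, PRICE_AFTER):.0%}")
    print(f"before: {monthly_compute_cost(CU_HOURS_PER_MONTH, PRICE_BEFORE):,.0f} CNY")
    print(f"after:  {monthly_compute_cost(CU_HOURS_PER_MONTH, PRICE_AFTER):,.0f} CNY")
    print(f"usage ratio at a 50% bill cut: "
          f"{implied_usage_ratio(0.50, PRICE_BEFORE, PRICE_AFTER):.2f}")
```

One possible reading of the gap between the 68% unit‑price cut and the "over 50%" drop in monthly spend is that the team consumed more CU‑hours after the rebuild. Under the pure CU‑hours assumption, a bill cut of exactly 50% would correspond to roughly 1.57 times the earlier usage; the article gives no usage figures to confirm this.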
Business Benefits
Data response time improved from hour‑level to minute‑level, accelerating data production.
Faster data delivery enables tighter analyst collaboration.
Higher compute performance supports deeper data exploration.
Overall efficiency gains free up capacity for business growth.
Conclusion and Outlook
Baiguang’s CTO praised the EMR‑based data lake for meeting business needs while boosting efficiency and cutting costs, calling the partnership a success. Future plans include exploring additional Alibaba Cloud solutions such as Hologres and EMR Serverless Spark to extend the platform’s elastic computing scenarios.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.