
Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

DataFunSummit

Introduction: In this talk, Zhou Xiang, an R&D engineer at Guanyuan Data, introduces the Guanyuan self‑service analytics product and its growing role for business users.

Product overview: Features include form filling, data ingestion, dashboards, portals, ETL, lightweight apps, visual analysis, and complex reports, with an emphasis on visual analysis and smart ETL for business users.

Architecture: The system uses Apache Spark as its core compute engine, with a control tower for task dispatch and Delta Lake for storage. It supports several deployment modes (single‑machine, SaaS, private, cloud) via Docker/K8s.
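As a minimal sketch of how one code path could serve several deployment modes, the function below builds Spark settings per mode. The function name `build_spark_conf`, the mode names, the API-server address, and the image name are hypothetical and are not taken from the talk; the configuration keys are standard Spark and Delta Lake settings.

```python
from typing import Dict

# Standard Delta Lake session settings, as documented by the Delta Lake project.
DELTA_CONF: Dict[str, str] = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_spark_conf(
    mode: str,
    k8s_api: str = "https://k8s.example.internal:6443",  # hypothetical address
    image: str = "example/spark:latest",                 # hypothetical image
) -> Dict[str, str]:
    """Return Spark settings for one deployment mode (mode names are assumptions)."""
    conf = dict(DELTA_CONF)
    if mode == "single-machine":
        # Run driver and executors inside one JVM, using all local cores.
        conf["spark.master"] = "local[*]"
    elif mode == "k8s":
        # Let Spark request executor pods from a Kubernetes API server.
        conf["spark.master"] = f"k8s://{k8s_api}"
        conf["spark.kubernetes.container.image"] = image
        # Scale executors with load; shuffle tracking avoids the need for
        # an external shuffle service on Kubernetes.
        conf["spark.dynamicAllocation.enabled"] = "true"
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    else:
        raise ValueError(f"unknown deployment mode: {mode}")
    return conf
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`, so the rest of the application does not need to know which mode it runs in.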

Challenges: The team had to address flexible deployment, multi‑tenant resource isolation, high‑performance low‑latency queries, Spark stability, optimizer overhead, join memory usage, shuffle resource pressure, task cancellation, and the overall query experience.

Solutions: Containerized deployment, Spark‑based architecture, storage‑compute separation, support for multiple storage backends, dynamic resource isolation, engine segregation for slow queries, optimizer rule tuning, query validation, shuffle cleanup, and monitoring.
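Engine segregation and per-tenant isolation can be illustrated with a small routing sketch. Everything here is an assumption made for illustration: the `Query` type, the row-count cost estimate, the threshold, and the engine and pool names are hypothetical, and the summary does not describe how Guanyuan actually makes this decision.

```python
from dataclasses import dataclass

# Assumed cutoff: queries estimated to scan at least this many rows count as slow.
SLOW_QUERY_ROWS = 50_000_000


@dataclass(frozen=True)
class Query:
    tenant: str
    estimated_rows: int  # hypothetical estimate from a query validation step


def choose_engine(query: Query, slow_threshold: int = SLOW_QUERY_ROWS) -> str:
    """Send expensive queries to a separate engine so they cannot starve fast ones."""
    if query.estimated_rows >= slow_threshold:
        return "slow-engine"
    return "fast-engine"


def scheduler_pool(query: Query) -> str:
    """Name one scheduler pool per tenant to keep tenants from crowding each other."""
    return f"tenant-{query.tenant}"
```

In a PySpark application running with `spark.scheduler.mode` set to `FAIR`, the pool name could be applied per thread with `SparkContext.setLocalProperty("spark.scheduler.pool", name)`. Optimizer rule tuning is a separate lever: Spark exposes `spark.sql.optimizer.excludedRules` for disabling selected rules, though the summary does not say which rules were tuned.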

Performance: The platform serves up to 30 000 monthly active users, answers 90 percent of queries in under 2 seconds, processes over 300 000 tasks per day, and runs on clusters of up to 20 000 cores.
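To put these figures in perspective, the sketch below converts the daily task count into an average rate and shows one way a "90 percent under 2 seconds" target could be checked. The helper names and the sample latencies are illustrative assumptions, not measurements from the platform.

```python
from typing import Sequence

SECONDS_PER_DAY = 24 * 60 * 60


def average_tasks_per_second(daily_tasks: int) -> float:
    """Average rate implied by a daily task count (ignores peaks and idle hours)."""
    return daily_tasks / SECONDS_PER_DAY


def fraction_under(latencies_s: Sequence[float], threshold_s: float = 2.0) -> float:
    """Share of queries that finished in under threshold_s seconds."""
    if not latencies_s:
        raise ValueError("no latencies recorded")
    return sum(1 for t in latencies_s if t < threshold_s) / len(latencies_s)


# 300 000 tasks per day works out to roughly 3.47 tasks per second on average.
rate = round(average_tasks_per_second(300_000), 2)

# Made-up sample: 4 of 5 queries are under 2 s, so the 90 percent target is missed.
meets_target = fraction_under([0.3, 0.8, 1.1, 1.9, 2.5]) >= 0.9
```

Because the source says "over 300 000", the computed rate is a lower bound on the average, and real load will be burstier than a flat average suggests.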

Future outlook: Plans include more cloud‑native solutions, integration with other engines such as Databricks and ClickHouse, and continued contribution to the open‑source community.

Tags: performance optimization, Big Data, data-platform, Resource Scheduling, Apache Spark, self‑service analytics