Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

DataFunSummit

Dec 10, 2022

Introduction: In this talk, Zhou Xiang, an R&D engineer at Guanyuan Data, introduces the Guanyuan self‑service analytics product and its growing role for business users.

Product overview: Features include form filling, data ingestion, dashboards, portals, ETL, lightweight apps, visual analysis, and complex reports, with an emphasis on visual analysis and smart ETL for business users.

Architecture: The system uses Apache Spark as its core compute engine, a control tower for task dispatch, and Delta Lake for storage. It supports several deployment modes (single‑machine, SaaS, private, cloud) via Docker/K8s.
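
As a rough illustration of how one codebase might serve several deployment modes, the sketch below builds a Spark configuration per mode. The function name, the mode strings, the namespace scheme, and the Kubernetes API address are assumptions made for this example and are not taken from the talk; the Spark and Delta Lake property names themselves are the documented ones.

```python
from typing import Dict


def spark_conf_for(mode: str, tenant: str) -> Dict[str, str]:
    """Build illustrative Spark settings for one deployment mode.

    `mode` and `tenant` are hypothetical inputs; the talk does not describe
    how Guanyuan actually configures its clusters.
    """
    conf = {
        # Documented Delta Lake session extension and catalog classes.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.app.name": f"analytics-{tenant}",
    }
    if mode == "single-machine":
        # Run driver and executors inside one JVM, using all local cores.
        conf["spark.master"] = "local[*]"
    elif mode in ("saas", "private", "cloud"):
        # Assumed in-cluster Kubernetes API address; replace with a real one.
        conf["spark.master"] = "k8s://https://kubernetes.default.svc"
        # Assumed convention: one namespace per tenant.
        conf["spark.kubernetes.namespace"] = f"tenant-{tenant}"
        # Let Spark grow and shrink the executor count with load.
        conf["spark.dynamicAllocation.enabled"] = "true"
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    else:
        raise ValueError(f"unknown deployment mode: {mode}")
    return conf
```

The resulting dictionary could be passed to `spark-submit` as `--conf key=value` pairs or applied to a `SparkSession` builder; keeping the mode-specific part in one place is a design choice for the example, not a description of the product.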

Challenges: Flexible deployment, multi‑tenant resource isolation, high‑performance low‑latency queries, Spark stability, optimizer overhead, join memory usage, shuffle resource pressure, task cancellation, and overall query experience.

Solutions: Containerized deployment, Spark‑based architecture, storage‑compute separation, support for multiple storage backends, dynamic resource isolation, engine segregation for slow queries, optimizer rule tuning, query validation, shuffle cleanup, and monitoring.
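
Three of the ideas above (query validation, per-tenant resource isolation, and a separate engine for slow queries) can be sketched as a small dispatcher. This is a minimal sketch under stated assumptions: the `Query` and `Router` classes, the thresholds, and the engine names are hypothetical and only illustrate the routing idea, not the control tower's actual logic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Query:
    tenant: str
    estimated_rows: int  # assumed to come from table statistics
    join_count: int


@dataclass
class Router:
    # All thresholds are illustrative values, not figures from the talk.
    slow_row_threshold: int = 10_000_000
    max_joins: int = 5
    tenant_limit: int = 4
    running: Dict[str, int] = field(default_factory=dict)

    def validate(self, q: Query) -> None:
        """Reject queries likely to exhaust join memory before they run."""
        if q.join_count > self.max_joins:
            raise ValueError(f"too many joins: {q.join_count}")

    def route(self, q: Query) -> str:
        """Return 'queued', 'slow-engine', or 'fast-engine' for a query."""
        self.validate(q)
        # Isolation: cap concurrent queries per tenant.
        if self.running.get(q.tenant, 0) >= self.tenant_limit:
            return "queued"
        self.running[q.tenant] = self.running.get(q.tenant, 0) + 1
        # Segregation: heavy queries go to a separate engine so they
        # cannot delay interactive ones.
        if q.estimated_rows > self.slow_row_threshold:
            return "slow-engine"
        return "fast-engine"

    def finish(self, q: Query) -> None:
        """Release the tenant slot held by a completed or cancelled query."""
        count = self.running.get(q.tenant, 0)
        if count > 0:
            self.running[q.tenant] = count - 1
```

For example, `Router(tenant_limit=1)` admits one query per tenant at a time and reports `"queued"` for the next until `finish` is called. For cancellation on the Spark side, `SparkContext.setJobGroup` and `SparkContext.cancelJobGroup` are the standard calls for tagging and stopping the jobs that belong to one query.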

Performance: The platform serves up to 30 000 monthly active users, maintains 90 percent of queries under 2 seconds, processes over 300 000 daily tasks, and runs on clusters up to 20 000 cores.

Future outlook: Plans include more cloud‑native solutions, integration with other engines such as Databricks and ClickHouse, and continued contribution to the open‑source community.
