
Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking

The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.

DataFunTalk

May 26, 2024

FireSpark (Sparkle Thinking) saw the data demands of its online education business growing and built Athena Data Factory, a one‑stop platform that unifies data integration, development, analysis, and services for roles such as operations, business analysis (BA), product, and engineering.

Background : Rapid data growth exposed shortcomings in data openness, system reliability, and ETL efficiency, prompting the need for a more intelligent data development solution.

Product Overview : Athena offers offline development, real‑time development, operations center, self‑service data extraction, data map, and console modules, providing end‑to‑end capabilities from task creation to data delivery.

Architecture : The platform is layered into an interaction layer (Vue + Monaco), service layer (micro‑services, API, permission management), engine layer (Hive, Spark, Flink, Presto, supporting multi‑engine execution), and component layer (Airflow, Tencent Cloud SCF, etc.), enabling flexible engine selection and multi‑tenant support.
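The flexible engine selection described above can be pictured with a small routing function. This is only a minimal sketch, not Athena's actual implementation: the `Task` fields, the three task modes, the default mapping, and the tenant-scoped queue naming are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Engine(Enum):
    """Engines named in the article's engine layer."""
    HIVE = "hive"
    SPARK = "spark"
    FLINK = "flink"
    PRESTO = "presto"


@dataclass(frozen=True)
class Task:
    name: str
    tenant: str
    mode: str                        # hypothetical: "batch", "streaming", or "adhoc"
    engine: Optional[Engine] = None  # explicit user choice, if any


# Hypothetical defaults; the article does not say how Athena maps tasks to engines.
DEFAULT_ENGINE = {
    "batch": Engine.HIVE,
    "streaming": Engine.FLINK,
    "adhoc": Engine.PRESTO,
}


def select_engine(task: Task) -> Engine:
    """Honor an explicit engine choice, otherwise fall back to the mode default."""
    if task.engine is not None:
        return task.engine
    if task.mode not in DEFAULT_ENGINE:
        raise ValueError(f"unknown task mode: {task.mode!r}")
    return DEFAULT_ENGINE[task.mode]


def queue_for(task: Task) -> str:
    """Build a tenant-scoped queue name so tenants do not share a submission queue."""
    return f"{task.tenant}.{select_engine(task).value}"


if __name__ == "__main__":
    nightly = Task(name="daily_orders", tenant="ops", mode="batch")
    tuned = Task(name="daily_orders_fast", tenant="ops", mode="batch", engine=Engine.SPARK)
    print(queue_for(nightly))
    print(queue_for(tuned))
```

Keeping the user's explicit choice ahead of the default is one way to let non-specialist users ignore the engine question while still allowing engineers to override it per task.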

Key Technologies & Innovations : Offline ETL leverages Hive, Doris, Python, and API generation; real‑time processing uses Spark Streaming, Flink, and Iceberg; scheduling relies on Airflow 2.0 with fine‑grained monitoring and alerting; monitoring covers task‑level, table‑level, and field‑level checks; integration tools include Sqoop, DataX, and custom SeaTunnel replacements.
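To make the three monitoring levels concrete, here is a rough sketch of what a task‑level, a table‑level, and a field‑level check might each look like. The function names, thresholds, and alert strings are assumptions for illustration only; the article does not describe Athena's actual rules or how they plug into Airflow.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def check_task(status: str, runtime_s: float, max_runtime_s: float) -> List[str]:
    """Task level: did the run succeed, and did it finish within its time budget?"""
    alerts = []
    if status != "success":
        alerts.append(f"task finished with status {status!r}")
    if runtime_s > max_runtime_s:
        alerts.append(f"runtime {runtime_s}s exceeds limit {max_runtime_s}s")
    return alerts


def check_table(rows: List[Row], min_rows: int) -> List[str]:
    """Table level: does the output table hold at least the expected row count?"""
    if len(rows) < min_rows:
        return [f"row count {len(rows)} is below minimum {min_rows}"]
    return []


def check_field(rows: List[Row], field: str, max_null_ratio: float) -> List[str]:
    """Field level: is the share of missing values in one column acceptable?"""
    if not rows:
        return []  # an empty table is reported by the table-level check
    nulls = sum(1 for row in rows if row.get(field) is None)
    ratio = nulls / len(rows)
    if ratio > max_null_ratio:
        return [f"field {field!r} null ratio {ratio:.2f} exceeds {max_null_ratio:.2f}"]
    return []


if __name__ == "__main__":
    # Made-up sample data, not taken from the article.
    sample = [
        {"user_id": 1, "score": 90},
        {"user_id": 2, "score": None},
        {"user_id": None, "score": 70},
        {"user_id": 4, "score": 80},
    ]
    alerts = (
        check_task("success", runtime_s=30, max_runtime_s=60)
        + check_table(sample, min_rows=3)
        + check_field(sample, "user_id", max_null_ratio=0.1)
    )
    for alert in alerts:
        print(alert)
```

Each check returns a list of alert messages rather than raising, so a scheduler callback could collect results from all three levels before deciding whether to notify anyone.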

Implementation & Operations : Launched in late 2021, the platform evolved through multiple versions, adding Doris and SparkSQL engines, integrating real‑time modules, and migrating Hadoop to EMR and later to COS. Operational models shifted from project‑based groups to a centralized “Big Data Service Center” with rotating on‑call staff and automated reminders.

Results & Benefits : Daily offline task instances reached ~20,000, serving ~470 internal users (MAU ~130). Over 92% of new tasks are created by non‑big‑data roles, with 2,700+ tasks added in 2023 alone. ETL efficiency improved: average runtime fell by 48% while the task count grew 107% year over year. Cost per task decreased by 29% after migrating to serverless compute.
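Percentage changes like these are easy to misread, so it can help to convert them into multipliers. The short calculation below uses only the figures quoted above; the helper name `multiplier` is ours, and reading the 107% figure as "roughly 2.07 times the previous year's count" is an interpretation, since the article does not publish the underlying baselines.

```python
def multiplier(pct_change: float) -> float:
    """Convert a percentage change (e.g. -48 or +107) into a before/after factor."""
    return 1 + pct_change / 100


runtime_factor = multiplier(-48)  # average runtime relative to the previous level
task_factor = multiplier(107)     # task count relative to the previous year
cost_factor = multiplier(-29)     # cost per task relative to the pre-migration level
mau_share = 130 / 470             # monthly active users as a share of all users

if __name__ == "__main__":
    print(f"average runtime: {runtime_factor:.2f}x of the previous level")
    print(f"task count:      {task_factor:.2f}x year over year")
    print(f"cost per task:   {cost_factor:.2f}x of the previous level")
    print(f"MAU share:       {mau_share:.1%} of internal users")
```

In other words, a 107% increase means the task count slightly more than doubled, while a 48% runtime reduction means tasks take roughly half as long on average.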

Lessons Learned & Future Plans : Emphasis on user experience, appropriate technology stack selection, and scalable architecture proved critical. Future work includes AI‑assisted debugging, enhanced observability, containerization of core services, and further cloud‑native optimizations.

Conclusion : Athena Data Factory has significantly boosted data management efficiency and value extraction for Sparkle Thinking, positioning the company for continued data‑driven innovation in online education.
