Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking
The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.
FireSpark (Sparkle Thinking) recognized the growing data demands of its online education business and created Athena Data Factory, a one‑stop platform that unifies data integration, development, analysis, and services to support diverse roles such as operations, BA, product, and engineering.
Background : Rapid data growth exposed shortcomings in data openness, system reliability, and ETL efficiency, prompting the need for a more intelligent data development solution.
Product Overview : Athena offers offline development, real‑time development, operations center, self‑service data extraction, data map, and console modules, providing end‑to‑end capabilities from task creation to data delivery.
Architecture : The platform is layered into an interaction layer (Vue + Monaco), service layer (micro‑services, API, permission management), engine layer (Hive, Spark, Flink, Presto, supporting multi‑engine execution), and component layer (Airflow, Tencent Cloud SCF, etc.), enabling flexible engine selection and multi‑tenant support.
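The flexible engine selection described above can be sketched as a small dispatch registry. This is a minimal illustration, not Athena's actual service code: the `Task` class, the `ENGINES` registry, and the `submit_*` functions are hypothetical names, and the submit functions only return a label instead of contacting a real Hive or Presto cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Task:
    """A unit of work as the service layer might hand it to the engine layer."""
    name: str
    engine: str  # e.g. "hive", "spark", "flink", "presto"
    sql: str


# Hypothetical registry: engine name -> function that submits the task.
ENGINES: Dict[str, Callable[[Task], str]] = {}


def register(engine: str):
    """Decorator that adds a submit function to the registry."""
    def decorator(fn: Callable[[Task], str]) -> Callable[[Task], str]:
        ENGINES[engine] = fn
        return fn
    return decorator


@register("hive")
def submit_hive(task: Task) -> str:
    # Placeholder: a real implementation would call the Hive client here.
    return f"hive:{task.name}"


@register("presto")
def submit_presto(task: Task) -> str:
    # Placeholder: a real implementation would call the Presto client here.
    return f"presto:{task.name}"


def submit(task: Task) -> str:
    """Route the task to the engine it names, case-insensitively."""
    try:
        runner = ENGINES[task.engine.lower()]
    except KeyError:
        raise ValueError(f"unsupported engine: {task.engine!r}") from None
    return runner(task)
```

With a registry like this, adding an engine means registering one more function; the calling code in the service layer does not change.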
Key Technologies & Innovations : Offline ETL leverages Hive, Doris, Python, and API generation; real‑time processing uses Spark Streaming, Flink, and Iceberg; scheduling relies on Airflow 2.0 with fine‑grained monitoring and alerting; monitoring covers task‑level, table‑level, and field‑level checks; integration tools include Sqoop, DataX, and custom SeaTunnel replacements.
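To make the table-level and field-level checks concrete, here is a minimal sketch in plain Python. It assumes rows are available as dictionaries and uses two illustrative rules (a minimum row count and a maximum null rate per field); the function names and thresholds are hypothetical, and the source does not say which rules Athena actually applies.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def check_row_count(rows: List[Row], min_rows: int) -> Optional[str]:
    """Table-level check: alert when the table has too few rows."""
    if len(rows) < min_rows:
        return f"row count {len(rows)} below minimum {min_rows}"
    return None


def check_null_rate(rows: List[Row], field: str, max_null_rate: float) -> Optional[str]:
    """Field-level check: alert when too many values of a field are null."""
    if not rows:
        return None  # an empty table is reported by the table-level check
    nulls = sum(1 for row in rows if row.get(field) is None)
    rate = nulls / len(rows)
    if rate > max_null_rate:
        return f"{field}: null rate {rate:.2f} exceeds {max_null_rate:.2f}"
    return None


def run_checks(rows: List[Row], min_rows: int, null_limits: Dict[str, float]) -> List[str]:
    """Run the table-level check, then one field-level check per configured field."""
    results = [check_row_count(rows, min_rows)]
    results += [check_null_rate(rows, f, limit) for f, limit in null_limits.items()]
    return [message for message in results if message is not None]
```

A scheduler task could call `run_checks` after each load and forward any returned messages to the alerting channel.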
Implementation & Operations : Launched in late 2021, the platform evolved through multiple versions, adding Doris and SparkSQL engines, integrating real‑time modules, and migrating Hadoop to EMR and later to COS. Operational models shifted from project‑based groups to a centralized “Big Data Service Center” with rotating on‑call staff and automated reminders.
Results & Benefits : Daily offline task instances reached ~20,000, serving ~470 internal users (MAU ~130). Over 92% of new tasks are created by non‑big‑data roles, with 2,700+ tasks added in 2023 alone. ETL efficiency improved: average task runtime fell by 48%, while the task count grew 107% year over year. Cost per task decreased by 29% after migrating to serverless compute.
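The percentages above are easier to compare once converted to multipliers. The short calculation below uses only the figures quoted in this summary; the helper names are illustrative, and the monthly-active share assumes the ~470 and ~130 figures refer to the same period.

```python
def growth_multiplier(pct_increase: float) -> float:
    """A 107% increase means the new value is 2.07 times the old one."""
    return 1 + pct_increase / 100


def reduction_multiplier(pct_decrease: float) -> float:
    """A 48% reduction means the new value is 0.52 times the old one."""
    return 1 - pct_decrease / 100


task_count_factor = growth_multiplier(107)   # task count, year over year
runtime_factor = reduction_multiplier(48)    # average task runtime
cost_factor = reduction_multiplier(29)       # cost per task after serverless migration
monthly_active_share = 130 / 470             # approximate MAU / total internal users

print(f"task count: x{task_count_factor:.2f}")
print(f"avg runtime: x{runtime_factor:.2f}")
print(f"cost per task: x{cost_factor:.2f}")
print(f"monthly active share: {monthly_active_share:.0%}")
```

Read this way, the task count slightly more than doubled while the average task took roughly half as long, and a bit over a quarter of registered internal users were active in a given month.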
Lessons Learned & Future Plans : Emphasis on user experience, appropriate technology stack selection, and scalable architecture proved critical. Future work includes AI‑assisted debugging, enhanced observability, containerization of core services, and further cloud‑native optimizations.
Conclusion : Athena Data Factory has significantly boosted data management efficiency and value extraction for Sparkle Thinking, positioning the company for continued data‑driven innovation in online education.
DataFunTalk