Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a one‑stop data development and governance platform that combines data integration, development, analysis, and services. It offers offline, real‑time, and AI pipelines on a modular architecture, with extensive monitoring and cost optimisation, and serves thousands of users across the company.

Big Data Technology & Architecture

May 27, 2024

Background – Spark Thinking, an online education leader, faced explosive data growth and limitations in data access, reliability, and ETL efficiency, prompting the need for a unified, high‑performance data development and governance solution.

Product Overview – Athena Data Factory provides a full‑stack platform covering offline development, real‑time development, operations, self‑service data extraction, data mapping, and console management, enabling users from operations, BA, product, R&D, and finance to develop, schedule, monitor, and consume data with minimal friction.

Key Modules – The platform includes an offline development module (HiveSQL, DorisSQL, Python, data sync), a real‑time development module (Spark, Flink, Flink SQL), an operations center (dashboards, task rerun, lineage), a self‑service extraction module (Hive, Spark, Presto APIs), a data‑map module (catalog, lineage, metadata), and a console module (project, source, permission, queue, sensitive‑data management).
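To make the link between lineage and task rerun concrete, here is a minimal sketch of how an operations center might find every task affected by a rerun. It is not Athena's implementation: the `LINEAGE` table, the task names, and the `downstream_tasks` helper are all hypothetical, and the lineage is assumed to be stored as a simple adjacency map.

```python
from collections import deque

# Hypothetical lineage: each task maps to the tasks that read its output.
LINEAGE = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_daily_sales", "dws_user_orders"],
    "dws_daily_sales": ["ads_sales_report"],
    "dws_user_orders": ["ads_sales_report"],
}


def downstream_tasks(lineage, start):
    """Return every task reachable from `start`, in breadth-first order.

    `start` itself is excluded, and each task appears once even when it
    has several upstream parents.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        task = queue.popleft()
        for child in lineage.get(task, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # Tasks that would need attention after rerunning dwd_orders.
    print(downstream_tasks(LINEAGE, "dwd_orders"))
```

Breadth-first traversal only identifies the affected set. A real scheduler would also have to run those tasks in dependency order, which this sketch does not attempt.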

Architecture – Four‑layer design: Interaction layer (Vue + Monaco), Service layer (micro‑services, API, permission, resource management), Engine layer (Hive, Spark, Presto, Airflow, multi‑tenant support), and Component layer (Airflow, Tencent Cloud SCF, etc.), ensuring scalability, extensibility, and cloud‑native deployment.
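The hand-off between the Service layer and the Engine layer can be pictured as a small routing step. The sketch below is illustrative only: the `Task` type, the `ENGINE_BY_TASK_TYPE` mapping, and the per-tenant queue naming are assumptions made for this example, not details taken from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    task_type: str  # e.g. "HiveSQL"; the values used here are assumed
    tenant: str     # project or team that owns the task


# Hypothetical mapping from task type to execution engine.
ENGINE_BY_TASK_TYPE = {
    "hivesql": "hive",
    "sparksql": "spark",
    "presto": "presto",
}


def route(task):
    """Pick an engine and a per-tenant queue for a task (illustrative)."""
    engine = ENGINE_BY_TASK_TYPE.get(task.task_type.lower())
    if engine is None:
        raise ValueError(f"no engine registered for task type {task.task_type!r}")
    # Multi-tenant support is modelled here as one queue per tenant and engine.
    return {"engine": engine, "queue": f"{task.tenant}.{engine}"}


if __name__ == "__main__":
    print(route(Task(name="daily_sales", task_type="HiveSQL", tenant="finance")))
```

Keeping the mapping in one table is one way a layered design stays extensible: adding an engine means adding an entry, not changing the callers in the Service layer.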

Use Cases & Scenarios – Supports T+x and H+x batch processing (day‑level and hour‑level delays), minute‑level analytics, real‑time streaming (Spark Streaming, Flink, Iceberg), and AI pipelines (data preprocessing, model training, inference), covering a wide range of business needs.
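As a small worked example of the batch cadence, the sketch below computes which partition a scheduled run should process. It assumes the common convention that a T+x job handles data from x days before the run and an H+x job handles data from x hours before it. The function names and partition formats are hypothetical.

```python
from datetime import datetime, timedelta


def daily_partition(run_time, x=1):
    """Partition date for a T+x job: the day x days before the run."""
    return (run_time - timedelta(days=x)).strftime("%Y-%m-%d")


def hourly_partition(run_time, x=1):
    """Partition hour for an H+x job: the hour x hours before the run."""
    return (run_time - timedelta(hours=x)).strftime("%Y-%m-%d %H")


if __name__ == "__main__":
    run = datetime(2024, 5, 27, 0, 30)
    print(daily_partition(run))   # T+1 partition for this run
    print(hourly_partition(run))  # H+1 partition for this run
```

For a run at 00:30 on 27 May 2024, the T+1 job reads the 26 May partition and the H+1 job reads the 23:00 hour of 26 May, so the hourly case crosses the day boundary correctly.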

Implementation & Operations – The platform launched in late 2021 and iteratively added Doris, SparkSQL, API generation, the real‑time module, and AI support. The team also migrated Hadoop to EMR, introduced versioned GitLab integration, and refined the monitoring, alerting, and cost‑control mechanisms.

Results & Benefits – About 20,000 offline tasks run daily; the platform has 470 internal users (130 monthly active); 92% of new tasks come from non‑big‑data roles; 2,700+ tasks were created in 2023; task runtime fell by more than 30% and costs by 29%; and users report significant productivity gains (3+ hours saved per task).

Future Outlook – Plans include AI‑assisted debugging, enhanced observability, containerised core services, and further cloud‑native optimisation to sustain growth and innovation.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
