Design and Implementation of a Machine Learning Data Platform at Getui

This article describes Getui's end‑to‑end machine‑learning data platform. It covers the business use cases, the full ML workflow from data ingestion and feature engineering through model training, deployment, and monitoring, and the practical tools and solutions adopted to address common challenges in large‑scale AI projects.

Architecture Digest

Jul 29, 2018

Machine learning has become a core technology for many internet products, and Getui, a smart big‑data service provider, has built a comprehensive data platform to support various ML‑driven services such as intelligent push, ad targeting, crowd flow prediction, fraud detection, personalized recommendation, and churn prediction.

Background: The platform supports multiple business scenarios that rely on ML models, including user tagging, ad audience segmentation, and other predictive tasks.

ML Process: Raw data undergoes ETL and is stored in a data warehouse. During feature engineering, sample data is matched with internal data, and features are extracted and derived. An algorithm (e.g., logistic regression, RNN) is then selected to train a model, which is finally applied to the full dataset to generate predictions.
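
The process above can be illustrated with a minimal sketch. It assumes a tiny in-memory feature table in place of the data warehouse, and the IDs, column names (`active_days`, `app_count`), and hyperparameters are all hypothetical; the logistic regression is a plain gradient-descent version written for illustration, not the platform's training code.

```python
import math

# Hypothetical labeled samples: (user_id, label). Values are illustrative only.
samples = [("u1", 1), ("u2", 0), ("u3", 1), ("u4", 0)]

# Hypothetical internal feature table keyed by the same ID (stands in for warehouse data).
internal_features = {
    "u1": {"active_days": 25, "app_count": 40},
    "u2": {"active_days": 3, "app_count": 12},
    "u3": {"active_days": 20, "app_count": 35},
    "u4": {"active_days": 5, "app_count": 10},
    "u5": {"active_days": 22, "app_count": 38},  # unlabeled user
    "u6": {"active_days": 2, "app_count": 8},    # unlabeled user
}

def derive(raw):
    """Scale the raw columns and add one derived ratio feature."""
    return [raw["active_days"] / 30.0,
            raw["app_count"] / 50.0,
            raw["active_days"] / raw["app_count"]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the log loss; returns (weights, bias)."""
    w, b = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# 1. Match samples with internal data and derive features.
matched = [(derive(internal_features[uid]), label)
           for uid, label in samples if uid in internal_features]
X = [features for features, _ in matched]
y = [label for _, label in matched]

# 2. Train the model on the matched samples.
weights, bias = train_logistic_regression(X, y)

# 3. Apply the model to the full dataset, including unlabeled users.
def predict(uid):
    x = derive(internal_features[uid])
    return sigmoid(sum(wi * xi for wi, xi in zip(weights, x)) + bias)

scores = {uid: predict(uid) for uid in internal_features}
```

In this toy data the unlabeled user `u5` resembles the positive samples and `u6` the negative ones, so their scores fall on opposite sides of 0.5.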

The standard workflow includes problem definition, data import, filtering, correlation analysis, extensive feature engineering (about 80% of effort), model training, validation, and production deployment.
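
As one illustration of the correlation-analysis step, the sketch below filters candidate features by their Pearson correlation with the label. The feature names, the data, and the 0.3 threshold are assumptions made for this example; the article does not say which correlation measure or cutoff Getui uses.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)

def select_features(columns, labels, threshold=0.3):
    """Keep features whose absolute correlation with the label meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]

# Hypothetical candidate features and binary labels.
labels = [1, 0, 1, 0, 1, 0]
columns = {
    "active_days": [25, 3, 20, 5, 22, 2],  # tracks the label closely
    "constant_flag": [1, 1, 1, 1, 1, 1],   # carries no information
    "signup_batch": [1, 1, 2, 2, 3, 3],    # uncorrelated with the label here
}

selected = select_features(columns, labels)
```

A filter like this only catches linear, single-feature relationships, so it is a first screening pass rather than a substitute for the feature engineering that follows.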

Common Challenges: Scaling to big‑data volumes, long data extraction times, inconsistent tooling across teams, limited engineering skills among algorithm engineers, high data exploration costs, and lack of shared feature assets.

Solution Overview: Getui's platform aims to standardize modeling processes, provide an end‑to‑end solution from development to online serving, enable feature data sharing across teams, support multi‑tenant usage, and ensure data security.

The solution consists of two parts: a modeling (experiment) platform for algorithm engineers and a production environment that runs data pipelines and integrates with the modeling platform.

Modeling Platform Features: IDE (JupyterHub) for interactive analysis, feature data management, ID‑mapping service, fast data extraction APIs, reusable code libraries, model packaging and versioning, and real‑time monitoring of model performance.
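
The article does not describe how the ID‑mapping service works internally. One common way to build such a service is to merge identifiers that have been observed together into a single unified ID, which the sketch below does with a union-find structure. The `IdMapper` class, the identifier types, and the `uid:` format are hypothetical.

```python
class IdMapper:
    """Groups identifiers observed together and exposes one unified ID per group."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Register unseen keys as their own group, then walk up to the root.
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            self._parent[key] = self._parent[self._parent[key]]  # path halving
            key = self._parent[key]
        return key

    def link(self, a, b):
        """Record that identifiers a and b belong to the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Use the smaller key as root so the result does not depend on call order.
            self._parent[max(root_a, root_b)] = min(root_a, root_b)

    def unified_id(self, key):
        """Return a stable string ID for the group that contains key."""
        return "uid:" + ":".join(self._find(key))

# Identifiers are (id_type, value) tuples; the types here are illustrative.
mapper = IdMapper()
mapper.link(("device", "d1"), ("cookie", "c9"))
mapper.link(("cookie", "c9"), ("account", "a7"))

same_entity = mapper.unified_id(("device", "d1")) == mapper.unified_id(("account", "a7"))
other_entity = mapper.unified_id(("device", "d2"))
```

With a mapping like this, a sample file keyed by one identifier type can be joined against feature tables keyed by another.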

Production Environment: Handles real‑time and offline feature data, stores offline features in Hive, and offers online prediction APIs or batch‑processed results for downstream services.

Practical Implementation Details: Jupyter is chosen as the primary IDE, extended with JupyterHub for multi‑tenant support; TensorFlowOnSpark is used for distributed training; Sparkmagic enables Spark code execution from notebooks; Git is integrated for version control; templates and plugins streamline pipeline creation; custom utilities provide ID mapping, data extraction, visualization, and conversion of notebook code to Azkaban flows.
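
Getui's notebook-to-Azkaban utility is not shown in the article, so the following is only a rough sketch of the idea. It assumes the standard notebook JSON layout (a `cells` list with `cell_type` and `source`) and Azkaban's classic `.job` properties format (`type`, `command`, `dependencies`). The function names and file names are hypothetical, and IPython magic lines are simply dropped, which would not be adequate for Sparkmagic cells that run remotely.

```python
import json

def notebook_to_script(notebook_json):
    """Concatenate the code cells of a notebook into one plain Python script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        code = "".join(source) if isinstance(source, list) else source
        # Simplification: drop IPython magics and shell escapes line by line.
        kept = [line for line in code.splitlines()
                if not line.lstrip().startswith(("%", "!"))]
        if any(line.strip() for line in kept):
            chunks.append("\n".join(kept))
    return "\n\n".join(chunks) + "\n"

def azkaban_job(script_name, dependencies=()):
    """Render an Azkaban command job that runs the script after its dependencies."""
    lines = ["type=command", f"command=python {script_name}"]
    if dependencies:
        lines.append("dependencies=" + ",".join(dependencies))
    return "\n".join(lines) + "\n"

# A tiny hypothetical notebook with one markdown cell and two code cells.
example_notebook = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Train model"]},
        {"cell_type": "code", "source": ["%matplotlib inline\n", "x = 1\n"]},
        {"cell_type": "code", "source": "print(x)"},
    ],
})

flow_files = {
    "train.py": notebook_to_script(example_notebook),
    "extract.job": azkaban_job("extract.py"),
    "train.job": azkaban_job("train.py", dependencies=["extract"]),
}
```

Keeping each notebook as a single job preserves the variables shared between cells; splitting cells into separate jobs would require passing state between processes.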

Model deployment follows a standardized framework, supporting PMML or TensorFlow protobuf formats, allowing analysts to use a common prediction library and decouple modeling from system development.
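
The decoupling described above can be sketched as a small facade: callers ask for a model by format and get back an object with a single `predict` method. Everything here is hypothetical, including the `Predictor` interface, the loader registry, and the `linear` stand-in format. Real `pmml` or TensorFlow loaders would wrap an actual evaluator and are deliberately not shown.

```python
from abc import ABC, abstractmethod

class Predictor(ABC):
    """Common interface that serving code depends on, regardless of model format."""

    @abstractmethod
    def predict(self, features: dict) -> float:
        ...

_LOADERS = {}

def register_format(name):
    """Decorator that registers a loader function for a model format."""
    def decorator(loader):
        _LOADERS[name] = loader
        return loader
    return decorator

def load_model(model_format, artifact) -> Predictor:
    """Build a Predictor for the given format, or fail for unknown formats."""
    try:
        loader = _LOADERS[model_format]
    except KeyError:
        raise ValueError(f"unsupported model format: {model_format}") from None
    return loader(artifact)

class LinearPredictor(Predictor):
    """Stand-in backend used only to make the sketch runnable."""

    def __init__(self, weights, bias):
        self._weights = weights
        self._bias = bias

    def predict(self, features):
        # Missing features count as 0.0.
        return self._bias + sum(weight * features.get(name, 0.0)
                                for name, weight in self._weights.items())

@register_format("linear")
def _load_linear(artifact):
    return LinearPredictor(artifact["weights"], artifact["bias"])

model = load_model("linear", {"weights": {"active_days": 2.0}, "bias": 1.0})
score = model.predict({"active_days": 3.0})
```

Because the serving system only sees `load_model` and `predict`, analysts can change the model or its format without changes to the calling code.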

Key lessons include limitations of TensorFlowOnSpark resource allocation, Jupyter compatibility issues, performance bottlenecks of PMML, the need for diagnostic tools for Spark/Hive, treating models and feature libraries as assets with lifecycle management, and balancing technology stack complexity with operational overhead.
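
To make the lifecycle-management lesson concrete, here is a minimal in-memory sketch of a model registry with versions and stages. The stage names, the `ModelRegistry` class, and the checksum scheme are assumptions for illustration; the article does not describe Getui's actual asset-management design.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str
    version: int
    checksum: str           # SHA-256 of the serialized model artifact
    stage: str = "staging"  # one of: staging, production, archived

class ModelRegistry:
    """Tracks model versions and which one currently serves production."""

    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion, oldest first

    def register(self, name, artifact: bytes) -> ModelVersion:
        """Store a new version in the staging stage and return its record."""
        versions = self._versions.setdefault(name, [])
        record = ModelVersion(name, len(versions) + 1,
                              hashlib.sha256(artifact).hexdigest())
        versions.append(record)
        return record

    def promote(self, name, version) -> ModelVersion:
        """Move one version to production and archive the previous one."""
        versions = self._versions[name]
        if not 1 <= version <= len(versions):
            raise ValueError(f"unknown version {version} for model {name}")
        for record in versions:
            if record.stage == "production":
                record.stage = "archived"
        target = versions[version - 1]
        target.stage = "production"
        return target

    def production(self, name):
        """Return the production version of a model, or None if there is none."""
        return next((record for record in self._versions.get(name, [])
                     if record.stage == "production"), None)

registry = ModelRegistry()
first = registry.register("churn", b"model-bytes-v1")
second = registry.register("churn", b"model-bytes-v2")
registry.promote("churn", 1)
registry.promote("churn", 2)
```

After the second promotion, version 1 is archived and version 2 serves production, so the registry records which artifact was live at each point.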

Source: Zhihu article

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
