
Magic Mirror: A Visual Data‑Intelligence Platform for Low‑Code Machine Learning

Magic Mirror is a big‑data‑based visual analytics platform that lowers the barrier to machine learning for non‑experts while accelerating expert workflows through a visual UI, modular algorithms, distributed feature generation, and automated binary‑classification modeling.

58 Tech

Background

Magic Mirror is a visual data‑intelligence platform built by the Data Product R&D team on a big‑data platform. Traditional machine‑learning modeling presents a high entry barrier for non‑data‑science professionals because the concepts are abstract, the workflow is complex, and strong mathematics and programming skills (Python, R, Java, Scala) are required.

Goal

The platform distinguishes two user groups: non‑expert users (business, operations, data‑product staff) who are sensitive to business rules but lack engineering or algorithm expertise, and expert users who have strong modeling and engineering skills but weaker business understanding. Magic Mirror provides a visual interface, rich algorithm components, easy hyper‑parameter tuning, and detailed evaluation reports to lower the barrier for non‑experts and speed up the workflow for experts.

Overall Architecture

1. User Layer – Integrated with the company’s SSO and BSP, using OA accounts as a unified login.

2. Security Layer – Leverages the big‑data platform’s multi‑tenant system; commands are executed with Hadoop accounts derived from the OA account, granting access to authorized Hive tables, HDFS paths, and resource queues.

3. Resource Layer – Data and results are stored in Hive; model files reside in HDFS. The compute engine is Spark, with most preprocessing and statistical logic written in Scala. Algorithms are mainly from Spark MLlib, with third‑party integrations such as XGBoost, LightGBM, and FM.

4. Logic Layer – Covers six categories (data source/target, preprocessing, statistical analysis, feature engineering, machine learning, tools) comprising about 70 components.

5. Application Layer – Provides project and experiment management, fine‑grained permission control, data integration with DP for direct Hive access, and model management (binary‑classification focus) with comparison and publishing features.

6. Service Layer – Supports offline scheduling (periodic batch prediction) and online prediction via HTTP APIs after model publishing.

Scheduling Dependencies

Dependencies between components are resolved with a topological‑sort‑style scheme: the dependency graph is stored as a two‑dimensional matrix in which row *i* records the unfinished dependencies of task *i*. A task whose row is all zeros has no unmet dependencies and can be executed next; when a task finishes, its column is cleared. This keeps the per‑step check cheap compared with re‑traversing the full graph.
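The matrix‑based scheme described above can be sketched as follows. This is an illustrative reconstruction, not the platform's actual code; the function name `schedule` and the edge format are assumptions.

```python
# Sketch of the dependency-resolution scheme: the component DAG is a 2D
# matrix where matrix[i][j] == 1 means task i still depends on task j.
# A task whose row is all zeros is runnable; finishing a task clears
# its column.

def schedule(n, edges):
    """Return an execution order for n tasks given (dependency, dependent) edges."""
    matrix = [[0] * n for _ in range(n)]
    for dep, task in edges:          # task depends on dep
        matrix[task][dep] = 1

    done, order = set(), []
    while len(order) < n:
        # all-zero row => every dependency of this task has finished
        runnable = [t for t in range(n)
                    if t not in done and not any(matrix[t])]
        if not runnable:
            raise ValueError("cycle detected in component graph")
        for t in runnable:
            order.append(t)
            done.add(t)
            for row in matrix:       # clear the finished task's column
                row[t] = 0
    return order
```

For a diamond‑shaped graph (task 0 feeding tasks 1 and 2, both feeding task 3), `schedule(4, [(0, 1), (0, 2), (1, 3), (2, 3)])` yields an order in which 0 runs first and 3 runs last.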

Feature Generation

Feature engineering consumes the majority of data‑mining time. Magic Mirror integrates third‑party libraries such as FeatureTools for automatic feature generation and distributes the workload across the cluster using Spark. The process includes:

1. Data Definition – Define a main feature table and related sub‑tables linked by key fields.

2. Data Splitting – Partition the main table and its linked sub‑tables by primary key and store the splits in HDFS.

3. Distributed Execution – Spark generates a sequence of identifiers; each identifier is passed to a Python function that runs the feature‑generation logic in a separate Python process.

4. Result Aggregation – The Spark driver collects results from all Python processes and writes the final output.

The distributed design leverages cluster resources to accelerate feature generation while retaining the flexibility of Python libraries.
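The split/dispatch/aggregate pattern in the four steps above can be sketched in pure Python. This is an illustrative sketch, not the platform's code: the helper names (`split_by_key`, `gen_features`, `run`) and the toy transaction schema are assumptions, and the Spark dispatch is indicated only in comments, since on the cluster each partition key would be mapped to a separate Python process running the FeatureTools logic.

```python
# Illustrative sketch of distributed feature generation:
# 1) partition the main table's rows by primary key,
# 2) hand each partition to a Python feature function
#    (on Spark this would be a task launching a Python process),
# 3) let the driver aggregate the per-partition results.

from collections import defaultdict

def split_by_key(rows, key):
    """Step 2 (data splitting): group rows by the given key field."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

def gen_features(user_id, rows):
    """Toy per-partition feature function; FeatureTools would run here."""
    amounts = [r["amount"] for r in rows]
    return {"user_id": user_id,
            "txn_count": len(amounts),
            "txn_sum": sum(amounts)}

def run(rows):
    parts = split_by_key(rows, "user_id")
    # On Spark, roughly: sc.parallelize(list(parts)).map(
    #     lambda k: gen_features(k, load_partition_from_hdfs(k))).collect()
    return [gen_features(k, v) for k, v in sorted(parts.items())]
```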

Automated Modeling (Binary Classification)

Inspired by platforms such as Alibaba PAI, Fourth Paradigm, H2O, TransmogrifAI, and DataRobot, Magic Mirror offers an end‑to‑end automated modeling pipeline for non‑expert users:

1. Data Preprocessing – Statistics on feature dimensions, removal of features with >90% missing values, and split into training (60%), validation (20%), and test (20%) sets.

2. Feature Engineering – Numeric and categorical features are handled separately, with distinct encodings prepared for tree‑based and non‑tree‑based models.

3. Model Training – Four algorithms (Random Forest, GBDT, Logistic Regression, XGBoost) with default hyper‑parameter grids are trained in parallel on Spark using the 60% training set and 20% validation set.

4. Evaluation Report – Candidate models are ranked by their evaluation metrics; the best model's hyper‑parameters are used to retrain on the combined training and validation data (80% of the total), with final evaluation on the held‑out 20% test set.
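The preprocessing and splitting stage of the pipeline above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the 90%/60‑20‑20 thresholds mirror the text, but everything else (row format, `None` as the missing marker) is illustrative.

```python
# Sketch of step 1 of the automated pipeline: drop features missing in
# more than 90% of rows, then split the data 60/20/20 into
# train/validation/test sets.

import random

def drop_sparse_features(rows, threshold=0.9):
    """Keep only features whose missing-value ratio is <= threshold."""
    n = len(rows)
    keep = [f for f in rows[0]
            if sum(1 for r in rows if r[f] is None) / n <= threshold]
    return [{f: r[f] for f in keep} for r in rows]

def split_60_20_20(rows, seed=42):
    """Shuffle deterministically, then cut at the 60% and 80% marks."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    a, b = int(0.6 * len(rows)), int(0.8 * len(rows))
    return rows[:a], rows[a:b], rows[b:]
```

In the real pipeline the 60% and 20% slices feed the parallel Spark training runs, and the final 20% is touched only once, for the last evaluation, which keeps the reported test metric honest.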

Outlook

Future work includes integration with cloud‑window scheduling, support for high‑dimensional features, Python‑based model extensions, and online prediction services.

Tags: Big Data · machine learning · feature engineering · Spark · visual analytics · automated modeling
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.