How Alibaba Turns Big Data into ‘Data New Energy’ with Automated Tagging and Distributed Knowledge Graphs

Alibaba's senior algorithm expert Yang Hongxia explains how the company fuses massive, heterogeneous data sources into a unified platform, builds automated tag‑production pipelines and large‑scale distributed knowledge graphs, and applies these technologies to drive smarter business decisions and AI‑enabled services.

Alibaba Cloud Developer

What is “Data New Energy”?

Yang Hongxia defines “data new energy” as big data itself, and stresses that data can be put to real use only when technology, data, and algorithms are combined.

Challenges of Data Fusion at Alibaba

Unlike Google, whose data are relatively homogeneous (search, maps), or Facebook, whose data are mostly social behavior, Alibaba draws on a wide range of sources: e‑commerce, advertising, logistics, finance, entertainment, travel, and more. The key challenge is how to integrate these diverse datasets effectively.

Technical Framework for Data Integration

The integration pipeline starts with robust data collection and storage on Alibaba Cloud, followed by thorough data cleaning so that models do not "learn trash from trash" (garbage in, garbage out). Clean data are then transformed into feature layers (static account features, e‑commerce behavior, device attributes) and fed into model layers that include rule‑based models, anomaly detection, supervised and unsupervised learning, graph‑based algorithms, and real‑time anti‑fraud models. An evaluation layer validates the usefulness of the processed data before they reach the application layer.
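The layering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Alibaba's implementation: the record fields, the rule, the z-score threshold, and the precision gate are all hypothetical, and only two of the model types (a rule and a simple anomaly detector) are shown.

```python
from statistics import mean, pstdev

def clean(records):
    """Cleaning layer: drop records with no user id or a negative order count."""
    return [r for r in records if r.get("user_id") and r.get("orders", -1) >= 0]

def featurize(record):
    """Feature layer: static account, e-commerce behavior, and device features."""
    return {
        "user_id": record["user_id"],
        "account_age_days": record.get("account_age_days", 0),
        "orders": record["orders"],
        "devices": len(set(record.get("device_ids", []))),
    }

def rule_model(f):
    """Rule-based model (hypothetical rule): many devices on a very new account."""
    return f["devices"] >= 3 and f["account_age_days"] < 7

def anomaly_model(features, key="orders", threshold=2.0):
    """Anomaly detection: flag users whose z-score on `key` exceeds `threshold`."""
    values = [f[key] for f in features]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {f["user_id"] for f in features if (f[key] - mu) / sigma > threshold}

def evaluate(flagged, known_bad):
    """Evaluation layer: precision of the flagged set against labeled cases."""
    return len(flagged & known_bad) / len(flagged) if flagged else 0.0

def run_pipeline(records, known_bad, min_precision=0.5):
    features = [featurize(r) for r in clean(records)]
    flagged = {f["user_id"] for f in features if rule_model(f)}
    flagged |= anomaly_model(features)
    precision = evaluate(flagged, known_bad)
    # Results reach the application layer only if they pass the evaluation gate.
    return (flagged if precision >= min_precision else set()), precision

# Made-up sample data for illustration only.
records = [
    {"user_id": "u1", "account_age_days": 400, "orders": 3, "device_ids": ["a"]},
    {"user_id": "u2", "account_age_days": 2, "orders": 1, "device_ids": ["a", "b", "c"]},
    {"user_id": "u3", "account_age_days": 100, "orders": 2, "device_ids": ["d"]},
    {"user_id": "u4", "account_age_days": 50, "orders": 4, "device_ids": ["e"]},
    {"user_id": "u5", "account_age_days": 30, "orders": 3, "device_ids": ["f"]},
    {"user_id": "u6", "account_age_days": 200, "orders": 100, "device_ids": ["g"]},
    {"user_id": None, "orders": 5},   # removed by the cleaning layer
    {"user_id": "u7", "orders": -1},  # removed by the cleaning layer
]
flagged, precision = run_pipeline(records, known_bad={"u2", "u6"})
print(sorted(flagged), precision)
```

In this toy run the rule catches `u2`, the anomaly detector catches `u6`, and the evaluation gate lets both through because they match the labeled cases.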

Automated Tag Production

To quickly generate user tags, Alibaba built a “tag factory” that ingests seed users (similar to Facebook’s look‑alike audiences), extracts the most important features from billions of candidates, and uses those features to tag matching users in the remaining population. The system prioritizes fast response, high throughput, and sufficient data quality. It delivers cost‑effective, high‑quality tags within hours, continuously incorporates business feedback for optimization, and ensures data security.
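The seed-expansion idea can be illustrated with a small sketch. The article does not say how features are ranked, so this example assumes a simple lift score (how much more common a feature is among seed users than in the whole population); the user ids, feature names, and thresholds are hypothetical.

```python
from collections import Counter

def top_features(users, seeds, k=2):
    """Rank features by lift: share among seeds divided by share in population."""
    pop = Counter(f for feats in users.values() for f in feats)
    seed = Counter(f for u in seeds for f in users[u])
    n_pop, n_seed = len(users), len(seeds)
    lift = {f: (seed[f] / n_seed) / (pop[f] / n_pop) for f in seed}
    # Highest lift first; ties broken by feature name for determinism.
    return sorted(lift, key=lambda f: (-lift[f], f))[:k]

def expand_tag(users, seeds, k=2, min_overlap=2):
    """Tag non-seed users who share at least `min_overlap` of the top features."""
    selected = set(top_features(users, seeds, k))
    return {
        u for u, feats in users.items()
        if u not in seeds and len(selected & feats) >= min_overlap
    }

# Made-up users, each described by a set of binary features.
users = {
    "s1": {"buys_diapers", "views_strollers", "mobile"},
    "s2": {"buys_diapers", "views_strollers", "mobile"},
    "u1": {"buys_diapers", "views_strollers", "mobile"},
    "u2": {"views_strollers", "mobile"},
    "u3": {"buys_games", "desktop"},
    "u4": {"buys_games", "mobile"},
}
seeds = {"s1", "s2"}

print(top_features(users, seeds))
print(expand_tag(users, seeds))
```

Here `mobile` is common among seeds but also common everywhere, so its lift is low and it is not selected; only `u1`, who shares both high-lift features, receives the tag. A production system would replace the lift score with a learned model and add the feedback loop described above.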

Large‑Scale Distributed Knowledge Graph

The knowledge graph abstracts heterogeneous tables into a graph structure, enabling efficient labeling of millions of tables. Only the most frequently accessed tables are labeled precisely; the remaining 90% are labeled by inference. With this approach, labeling accuracy improves from ~55% for rule‑based labeling to ~88%.
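One way to picture "label a few tables precisely, infer the rest" is label propagation over a graph of tables. The sketch below is an assumption for illustration: the article does not name the inference method, and the edges (read here as lineage or shared-column links), table names, and labels are hypothetical.

```python
from collections import Counter

def propagate_labels(edges, seed_labels, max_rounds=10):
    """Spread labels from hand-labeled tables to neighbors by majority vote."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for node in sorted(neighbors):
            if node in labels:
                continue  # never overwrite an existing label
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                # Most votes wins; ties broken alphabetically for determinism.
                updates[node] = min(votes, key=lambda lab: (-votes[lab], lab))
        if not updates:
            break  # nothing left that can be inferred
        labels.update(updates)
    return labels

# Made-up graph: an edge means two tables are related.
edges = [
    ("orders", "orders_daily"),
    ("orders_daily", "gmv_report"),
    ("shipments", "shipments_daily"),
    ("shipments_daily", "delivery_sla"),
]
seed_labels = {"orders": "e-commerce", "shipments": "logistics"}

labels = propagate_labels(edges, seed_labels)
for table in sorted(labels):
    print(table, "->", labels[table])
```

Two hand-labeled tables are enough to label all six in this toy graph: direct neighbors are labeled in the first round and their neighbors in the second.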

Applications include data asset management—identifying table ownership—and the “Data Map” product, which returns the most relevant table (and eventually the exact SQL query) for a user’s request.
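As a rough illustration of what a "most relevant table" lookup involves, the sketch below ranks tables by word overlap (Jaccard similarity) between a request and each table's name and description. This is a deliberately simple stand-in, not how Data Map works; the catalog entries and table names are hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and treat underscores as separators."""
    return set(text.lower().replace("_", " ").split())

def search_tables(query, catalog, top_n=1):
    """Return the names of the `top_n` tables whose metadata best match the query."""
    q = tokenize(query)
    scored = []
    for name, description in catalog.items():
        doc = tokenize(name) | tokenize(description)
        union = q | doc
        score = len(q & doc) / len(union) if union else 0.0
        scored.append((score, name))
    # Best score first; ties broken by table name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for score, name in scored[:top_n] if score > 0]

# Made-up catalog: table name -> short description.
catalog = {
    "dwd_trade_orders": "daily order transactions per buyer",
    "dwd_logistics_shipments": "package shipments and delivery status",
    "dim_users": "user account profile attributes",
}

print(search_tables("delivery status of shipments", catalog))
```

A knowledge graph improves on this kind of keyword matching because the inferred table labels and relationships can be used as extra signals when ranking.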

Key Recommendations for Successful Machine‑Learning Deployment

Ensure access to massive, diverse datasets; single‑source data limit model impact.

Leverage a reliable compute platform such as Alibaba Cloud.

Develop algorithms that are generalizable and extensible across projects to reduce development cost.
