How Shopee Builds an E‑Commerce Knowledge Graph and Leverages Large Models

This article presents Shopee's approach to constructing an e‑commerce knowledge graph: the challenges of heterogeneous data, multi‑language handling, and entity disambiguation, and how deep learning and large language models are integrated to improve product matching, recommendation, and operational efficiency.

NewBeeNLP

Overview

For any e‑commerce platform, products are the core link between buyers and sellers. Shopee identifies three key problems: matching buyer intent with seller information, unifying diverse seller expressions, and aligning products across multiple markets and languages.

Why Knowledge Graphs?

Deep learning and large models excel at many tasks but suffer from poor interpretability, high data and compute requirements, and hallucination. Knowledge graphs, composed of entities, relations, and attributes, offer stronger interpretability and structured representation, though they are harder to build and have limited generalization.

Benefits of a Unified Product Knowledge Graph

Improves buyer experience through cross‑product comparison and multi‑dimensional attribute extraction.

Helps sellers deduplicate listings, optimize quality, and receive market‑specific recommendations.

Reduces platform operational costs by aggregating product management, enabling efficient cross‑market analysis and category expansion.

Construction Challenges

Shopee faces four major difficulties:

Multi‑source, heterogeneous information (e.g., varied expressions like "ready stock" across regions).

Inconsistent data quality (misspellings, missing or redundant fields).

Domain‑specific knowledge requirements (e.g., interpreting "50CC" for motorcycles).

Massive scale (billions of items across eight markets and six languages).

Basic Framework

The pipeline follows three steps: information extraction, knowledge fusion, and knowledge processing. Data sources include product detail pages, structured attributes, free‑text descriptions, and buyer reviews.
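The three-step pipeline can be sketched as plain data transformations. The record types, field names, and per-step logic below are illustrative assumptions for the sketch, not Shopee's actual schema or implementation:

```python
from dataclasses import dataclass, field

# Hypothetical record types; field names are illustrative only.
@dataclass
class RawListing:
    title: str
    attributes: dict   # structured attributes from the detail page
    description: str   # free-text description
    reviews: list      # buyer reviews

@dataclass
class KnowledgeRecord:
    entities: list = field(default_factory=list)

def extract(listing: RawListing) -> KnowledgeRecord:
    # Information extraction: here, just lift structured attributes
    # into (attribute, value) candidate entities.
    rec = KnowledgeRecord()
    rec.entities = [(k, v) for k, v in listing.attributes.items()]
    return rec

def fuse(records: list) -> KnowledgeRecord:
    # Knowledge fusion: deduplicate identical candidates across listings.
    merged, seen = KnowledgeRecord(), set()
    for rec in records:
        for ent in rec.entities:
            if ent not in seen:
                seen.add(ent)
                merged.entities.append(ent)
    return merged

def process(record: KnowledgeRecord) -> KnowledgeRecord:
    # Knowledge processing: reasoning/completion would go here;
    # this placeholder passes the record through unchanged.
    return record
```

In practice each step is a substantial subsystem (covered in the sections below); the sketch only fixes the data flow between them.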

Information Extraction

Before extraction, Shopee defines an ontology with two layers: a foundational layer (categories L1‑L5, key and sales attributes) and a combinatorial layer (scenes, tags, standard products). Extraction challenges include noisy images, ambiguous text, and multilingual content. Solutions involve:

Text quality assessment using rule‑based heuristics and multi‑task models that jointly classify titles and extract keywords.

Image quality scoring based on resolution, presence of multiple entities, text overlay, and background clutter.

Cross‑modal verification using multimodal models such as ALBEF and BLIP, fine‑tuned for e‑commerce categories.
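As a concrete illustration of the quality-scoring ideas above, here is a toy rule-based image score combining the signals mentioned (resolution, multiple entities, text overlay, background clutter). The thresholds and weights are invented for this sketch and are not Shopee's:

```python
def image_quality_score(width: int, height: int,
                        num_detected_objects: int,
                        has_text_overlay: bool,
                        background_clutter: float) -> float:
    """Toy score in [0, 1]; higher is better. Weights are illustrative."""
    score = 1.0
    if width * height < 500 * 500:   # penalize low resolution
        score -= 0.3
    if num_detected_objects > 1:     # multiple entities in one photo
        score -= 0.2
    if has_text_overlay:             # promotional text pasted on the image
        score -= 0.2
    score -= 0.3 * min(background_clutter, 1.0)  # clutter in [0, 1]
    return max(score, 0.0)
```

A clean, high-resolution single-product photo scores 1.0; a low-resolution, cluttered image with overlaid text drops well below the cutoff a real system might use.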

Knowledge Fusion

Entity disambiguation is tackled for both categories and attributes. Techniques include edit‑distance and semantic similarity for misspellings, multilingual embeddings for cross‑language alignment, and synonym models based on LaBSE. Attribute value standardization weighs popularity, perplexity, and expressive power when selecting canonical terms.
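A minimal sketch of the misspelling side of disambiguation: a normalized string similarity (Python's difflib as a cheap stand-in for edit distance) combined with a popularity weight to pick the canonical term. The threshold and scoring rule are assumptions, not the described system:

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; a stand-in for normalized edit distance.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def canonicalize(value: str, candidates: dict, threshold: float = 0.8) -> str:
    """Map a noisy attribute value to a canonical term.
    `candidates` maps canonical term -> popularity count (illustrative)."""
    best, best_score = value, 0.0
    for term, popularity in candidates.items():
        sim = edit_similarity(value, term)
        if sim >= threshold:
            # Trade off string similarity against how common the term is.
            score = sim * popularity
            if score > best_score:
                best, best_score = term, score
    return best  # unchanged if nothing is close enough
```

For example, a misspelled brand like "Samsnug" maps to "Samsung", while a value with no close canonical match is left as-is for human review.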

Knowledge Processing

Two complementary tasks are performed:

Reasoning and inconsistency detection: using rule‑based association mining and knowledge‑graph embedding inference to validate extracted facts (e.g., brand‑model consistency).

Graph completion via inductive and analogical reasoning, extending relations to unseen entities.
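Knowledge-graph embedding inference of the TransE family scores a triple by how close head + relation lands to tail, so implausible facts (e.g. a brand paired with another brand's model) get a large distance. A toy version with hand-made 2-D vectors; real embeddings are learned from the graph, and these entities and numbers are invented for the sketch:

```python
import math

# Toy 2-D embeddings; in a real system these are trained so that
# head + relation ≈ tail for facts observed in the graph.
entity_vec = {
    "Honda":  (1.0, 0.0),
    "CBR150": (1.5, 1.0),
    "Dell":   (5.0, 5.0),
}
relation_vec = {"has_model": (0.5, 1.0)}

def transe_score(head: str, relation: str, tail: str) -> float:
    """TransE-style plausibility: lower distance = more plausible triple."""
    h, r, t = entity_vec[head], relation_vec[relation], entity_vec[tail]
    return math.dist((h[0] + r[0], h[1] + r[1]), t)
```

Here ("Honda", has_model, "CBR150") scores near zero, while ("Honda", has_model, "Dell") scores high, flagging a brand–model inconsistency for review.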

Applications

Knowledge graphs are integrated into search (enhancing query understanding and recall), recommendation (providing fine‑grained category signals), and operations (offering sellers data‑driven quality suggestions, supporting assortment planning, and enabling multi‑modal product generation). Large language models are also explored for AI assistants, automated product description generation, and multimodal content creation.

Knowledge Graph & Large Model Integration

Three integration patterns are discussed: feeding the graph into LLMs as context, using LLMs to enrich the graph, and joint training. The main challenge is rapid product turnover; mitigations under exploration include real‑time graph updates served through Retrieval‑Augmented Generation (RAG) and inference‑speed optimization via model quantization and selective layer execution.
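A minimal sketch of the RAG pattern for grounding LLM answers in the current graph: retrieve a handful of relevant facts, then splice them into the prompt. The keyword-overlap retriever is a stand-in for real vector search, and all names and strings are illustrative:

```python
def retrieve_facts(query: str, graph_facts: list, k: int = 3) -> list:
    """Naive retrieval by keyword overlap; production systems would use
    vector search over an index kept in sync with the graph."""
    q_tokens = set(query.lower().split())
    scored = sorted(graph_facts,
                    key=lambda f: -len(q_tokens & set(f.lower().split())))
    return scored[:k]

def build_prompt(query: str, graph_facts: list) -> str:
    """Assemble an LLM prompt with retrieved facts as grounding context."""
    facts = retrieve_facts(query, graph_facts)
    context = "\n".join(f"- {f}" for f in facts)
    return ("Answer using only the product facts below.\n"
            f"Facts:\n{context}\n"
            f"Question: {query}")
```

Because only the top-k retrieved facts enter the prompt, the graph can be updated continuously without retraining the model, which is the point of the RAG mitigation for product turnover.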

Q&A Highlights

Quality scoring can be performed with language models if the scoring criteria are well defined.

Entity disambiguation benefits from domain knowledge, logistics consistency checks, and contextual cues.

Few‑shot prompting and chain‑of‑thought reasoning help LLMs extract knowledge from limited data.

No single off‑the‑shelf tool exists for bidirectional graph‑LLM pipelines; research literature should be consulted.

Directly linking a massive product catalog to an LLM is impractical; selective input strategies are required.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
