How Xianyu Extracts Second‑Hand Product Attributes with Albert‑Tiny and StructBert

This article analyzes Xianyu's second‑hand attribute extraction pipeline, detailing the challenges of sparse product data, the decomposition into NER and classification tasks, the use of Albert‑Tiny, StructBert and regex methods, deployment strategies, evaluation results, and future directions.

Xianyu Technology

Background

Xianyu is a C2X marketplace where sellers post items with minimal structured data, so second‑hand listings often lack sufficient product attributes. Attributes specific to second‑hand goods include usage count, purchase channel, packaging completeness, and category‑specific traits such as expiration dates for cosmetics or screen condition for phones.

Problems and Challenges

Second‑hand attribute extraction is an Information Extraction task that can be split into Named Entity Recognition (NER) and text classification. The main challenges are:

Each product category requires a distinct set of attributes, so separate models are needed.

Supervised learning with BERT‑family models demands extensive labeling, which prolongs development cycles.

Solution Overview

The approach combines three methods according to attribute variability:

Fixed or template‑based sentences: use CRF, BiLSTM‑CRF, BERT, or BERT+CRF for NER.

Keywords with limited variations: apply regular expressions and rule‑based methods.

Highly variable expressions: employ BERT‑family models.

Methodology

Annotation Phase: Use Alibaba’s AliNLP e‑commerce NER model to pre‑label data, then manually refine BIO tags for NER attributes and add classification labels on top of the tokenization results.
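To make the BIO scheme mentioned above concrete, here is a minimal sketch that converts annotated spans into per-token BIO tags. The label names (BRAND, SIZE, USAGE), the English tokens, and the helper `spans_to_bio` are illustrative assumptions; the article does not describe the actual tag set or tooling.

```python
from typing import List, Tuple

# (start, end_exclusive, label) over token indices; the labels are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level entity spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):     # remaining tokens of the entity
            tags[i] = f"I-{label}"
    return tags


tokens = ["Nike", "jacket", "size", "L", "worn", "twice"]
spans = [(0, 1, "BRAND"), (3, 4, "SIZE"), (4, 6, "USAGE")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

For Chinese listings the tokens would more likely be characters or the segments produced by the pre-labeling model, but the tagging logic is the same.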

Model Training Phase:

Albert‑Tiny: lightweight model for real‑time inference; optional CRF layer for NER.

StructBert‑Base: higher accuracy for offline T+1 tasks; optional CRF layer for NER.

Regular Expressions: fastest method, excels on fixed‑pattern attributes but requires domain knowledge.
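As a rough illustration of the regular-expression route, the sketch below pulls a usage count and a packaging flag out of free text. The patterns, the attribute names, and the English phrasing are all assumptions made for the example; production rules for Chinese listings would look different and need domain knowledge, as noted above.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for English listings; not the rules used in production.
USAGE_RE = re.compile(r"\b(?:used|worn)\s+(\d+)\s+times?\b", re.IGNORECASE)
NEVER_USED_RE = re.compile(r"\b(?:never|not)\s+(?:used|worn|opened)\b", re.IGNORECASE)
BOX_RE = re.compile(
    r"\b(with|without|no)\s+(?:original\s+)?(?:box|packaging)\b", re.IGNORECASE
)


def extract_attributes(text: str) -> Dict[str, Optional[object]]:
    """Return usage_count (int) and has_packaging (bool), or None if not found."""
    attrs: Dict[str, Optional[object]] = {"usage_count": None, "has_packaging": None}

    match = USAGE_RE.search(text)
    if match:
        attrs["usage_count"] = int(match.group(1))
    elif NEVER_USED_RE.search(text):
        attrs["usage_count"] = 0

    match = BOX_RE.search(text)
    if match:
        # "with" means packaging is present; "without" / "no" means it is missing.
        attrs["has_packaging"] = match.group(1).lower() == "with"
    return attrs


example = extract_attributes("Worn 3 times, comes with original box")
print(example)
```

Anything the patterns miss comes back as `None`, which leaves room for a model-based extractor to fill the gap.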

Rule‑Based Post‑Processing : Normalize NER outputs (e.g., map "175/88A" to "L" size) and resolve conflicts (e.g., downgrade "brand new" to "99% new" when usage count > 0).
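The two post-processing rules above can be sketched as follows. The attribute keys (`size`, `condition`, `usage_count`) and the `SIZE_ALIASES` table are hypothetical; only the "175/88A" to "L" mapping and the "brand new" downgrade come from the text.

```python
from typing import Any, Dict

# Only the pair mentioned in the text; a real table would be far larger.
SIZE_ALIASES = {"175/88A": "L"}


def postprocess(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize extracted values and resolve conflicts between attributes."""
    result = dict(attrs)  # do not mutate the caller's dict

    # Normalization: map a raw size string to a canonical size when known.
    size = result.get("size")
    if size is not None:
        result["size"] = SIZE_ALIASES.get(size.strip().upper(), size)

    # Conflict resolution: an item that has been used cannot be "brand new".
    usage = result.get("usage_count")
    if result.get("condition") == "brand new" and usage is not None and usage > 0:
        result["condition"] = "99% new"
    return result


raw = {"size": "175/88A", "condition": "brand new", "usage_count": 2}
cleaned = postprocess(raw)
print(cleaned)
```

Keeping these rules outside the models means they can be adjusted per category without retraining.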

Model Details

Albert‑Tiny reduces parameters via factorized embedding parameterization and cross‑layer parameter sharing, reportedly achieving roughly ten‑fold faster inference than BERT‑base with comparable accuracy. Source code: https://github.com/brightmart/albert_zh.
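A quick back-of-the-envelope calculation shows why these two techniques shrink the model. Factorization replaces one V×H embedding matrix with a V×E matrix followed by an E×H projection, and cross-layer sharing stores one set of encoder weights instead of one per layer. The sizes below (vocabulary 21128, hidden size 312, embedding size 128) and the per-layer figure are assumptions chosen for illustration, not values stated in the article.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Parameter count of the token embedding, with optional factorization."""
    if embedding_size is None:
        return vocab_size * hidden_size                      # V x H
    # V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(per_layer_params: int, num_layers: int, shared: bool) -> int:
    """Parameter count of the encoder stack with or without weight sharing."""
    return per_layer_params if shared else per_layer_params * num_layers


# Illustrative sizes only (assumed, not taken from the article).
VOCAB, HIDDEN, EMBED = 21128, 312, 128

full = embedding_params(VOCAB, HIDDEN)
factorized = embedding_params(VOCAB, HIDDEN, EMBED)
print(f"unfactorized embedding: {full:,} parameters")
print(f"factorized embedding:   {factorized:,} parameters")

# With sharing, a 4-layer stack stores one layer's worth of weights.
print(encoder_params(1_000_000, 4, shared=False), "vs",
      encoder_params(1_000_000, 4, shared=True))
```

Note that sharing reduces stored parameters, not the number of layers executed, so the inference speedup comes mainly from the smaller hidden size and shallower stack.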

StructBert adds two pre‑training objectives to BERT:

Word Structural Objective – shuffles trigrams and forces reconstruction.

Sentence Structural Objective – predicts whether a sentence is the next, previous, or a random sentence.

These objectives improve downstream performance; StructBert ranked among the top entries on the GLUE leaderboard at the time of its publication.
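The sketch below shows how training examples for the two objectives could be constructed. It is a simplification under stated assumptions: it shuffles a single trigram per sequence, the label encoding (0 = next, 1 = previous, 2 = random) is arbitrary, and the helper names are hypothetical. It illustrates the data construction only, not the model or loss.

```python
import random
from typing import List, Tuple


def shuffle_trigram(tokens: List[str],
                    rng: random.Random) -> Tuple[List[str], int, List[str]]:
    """Word structural objective: shuffle one trigram; the target is its original order."""
    if len(tokens) < 3:
        raise ValueError("need at least three tokens")
    start = rng.randrange(len(tokens) - 2)
    original = tokens[start:start + 3]
    shuffled = original[:]
    rng.shuffle(shuffled)  # may occasionally leave the order unchanged
    corrupted = tokens[:start] + shuffled + tokens[start + 3:]
    return corrupted, start, original


def make_sentence_pair(doc: List[str], idx: int, other_doc: List[str],
                       rng: random.Random) -> Tuple[str, str, int]:
    """Sentence structural objective: pair a sentence with its next, previous, or a random one."""
    if not (1 <= idx <= len(doc) - 2):
        raise ValueError("idx must have both a previous and a next sentence")
    label = rng.randrange(3)  # 0 = next, 1 = previous, 2 = random (assumed encoding)
    if label == 0:
        second = doc[idx + 1]
    elif label == 1:
        second = doc[idx - 1]
    else:
        second = rng.choice(other_doc)
    return doc[idx], second, label


rng = random.Random(0)
tokens = ["the", "phone", "screen", "has", "no", "scratches"]
corrupted, start, original = shuffle_trigram(tokens, rng)
print(corrupted, "-> restore", original, "at position", start)
print(make_sentence_pair(["s0", "s1", "s2"], 1, ["x0", "x1"], rng))
```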

Deployment

Offline T+1 scenario: Deploy via MaxCompute (ODPS) using Python UDFs; model files are uploaded as resources.

Online real‑time scenario: Deploy with PAI‑EAS in a distributed setting, interfacing through iGraph and TPP for data exchange.

Evaluation

For each category and attribute, a sampled set of items is manually reviewed. Accuracy, precision, and recall are computed by comparing model predictions with human labels. Reported results exceed 98% accuracy across major categories, meeting production thresholds.
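One way to compute these metrics for a single attribute is sketched below. The definitions are assumptions, since the article does not spell them out: `None` means "no value", accuracy counts exact agreement over all sampled items, precision is measured over items where the model produced a value, and recall over items where the reviewer assigned one.

```python
from typing import Dict, List, Optional


def evaluate(preds: List[Optional[str]], golds: List[Optional[str]]) -> Dict[str, float]:
    """Compare model predictions with human labels for one attribute."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")

    total = len(golds)
    agree = sum(p == g for p, g in zip(preds, golds))
    predicted = sum(p is not None for p in preds)   # items the model filled in
    labeled = sum(g is not None for g in golds)     # items the reviewer filled in
    true_pos = sum(p is not None and p == g for p, g in zip(preds, golds))

    return {
        "accuracy": agree / total if total else 0.0,
        "precision": true_pos / predicted if predicted else 0.0,
        "recall": true_pos / labeled if labeled else 0.0,
    }


# Toy sample of four reviewed items for a hypothetical "size" attribute.
preds = ["L", None, "M", "S"]
golds = ["L", "XL", "M", None]
metrics = evaluate(preds, golds)
print(metrics)
```

Running this per category and attribute yields a table that can be checked against the production threshold.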

Results and Applications

High‑precision extraction enables downstream use cases such as pricing optimization, chatbot interactions, high‑quality product pool discovery, and search/recommendation enhancements.

References

https://arxiv.org/abs/1909.11942 – Albert paper

https://arxiv.org/abs/1908.04577 – StructBert paper

https://github.com/brightmart/albert_zh – Albert_zh source code

https://gluebenchmark.com/leaderboard – GLUE leaderboard
