Bidding Document Classification and Entity Extraction Using BERT-based Models
This article describes how 58.com built an end‑to‑end bidding service that crawls tender documents, classifies them with BERT‑based models (softmax, sigmoid and ensemble variants), and extracts key entities using BERT‑CRF and reading‑comprehension techniques. After deployment, overall document accuracy rose from 83% to 90%, classification accuracy from 85% to 93%, and entity extraction accuracy from about 70% to over 97%.
58.com, China's largest life‑service platform, integrated nationwide bidding resources into its merchant app to give merchants timely and relevant tender opportunities. The project required both classifying tender documents into 30+ service categories and extracting critical information such as names, phone numbers, companies, and dates.
The pipeline starts with a crawler that pushes raw tender data to the backend, where non‑document texts are filtered out. Valid documents are first classified, then passed to a named‑entity‑recognition (NER) module that extracts key fields before returning the enriched data to the app.
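The flow described above (filter, classify, extract, return) can be sketched roughly as follows. This is only an illustration: the keyword filter, `toy_classify`, `toy_extract`, and the `EnrichedTender` type are hypothetical stand-ins, not the components 58.com actually runs.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class EnrichedTender:
    text: str
    categories: List[str] = field(default_factory=list)
    entities: Dict[str, str] = field(default_factory=dict)


def looks_like_tender(text: str) -> bool:
    """Hypothetical filter for non-document texts: a length and keyword check."""
    lowered = text.lower()
    return len(text) >= 30 and ("tender" in lowered or "bid" in lowered)


def toy_classify(text: str) -> List[str]:
    """Stand-in for the BERT classifier: keyword lookup, else 'other'."""
    keywords = {"cleaning": "cleaning", "moving": "moving"}
    lowered = text.lower()
    found = [label for word, label in keywords.items() if word in lowered]
    return found or ["other"]


PHONE = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")


def toy_extract(text: str) -> Dict[str, str]:
    """Stand-in for the NER module: pulls one phone-like string, if any."""
    match = PHONE.search(text)
    return {"phone": match.group()} if match else {}


def run_pipeline(
    raw_texts: Iterable[str],
    classify: Callable[[str], List[str]] = toy_classify,
    extract: Callable[[str], Dict[str, str]] = toy_extract,
) -> List[EnrichedTender]:
    """Drop non-documents, then classify and extract entities for the rest."""
    enriched = []
    for text in raw_texts:
        if not looks_like_tender(text):
            continue
        enriched.append(EnrichedTender(text, classify(text), extract(text)))
    return enriched
```

Passing the classifier and extractor in as callables keeps the orchestration unchanged when a rule-based component is swapped for a model.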
For classification, an initial rule‑based system achieved about 85% accuracy but suffered from low recall. The team replaced it with BERT‑based models. A 7‑class model (six business categories plus "other") was trained first with a softmax output, then with a sigmoid output so that a document belonging to several categories can receive more than one label. An ensemble of three models, each trained on a random 80% subset of the data, further improved performance. In a self‑evaluation on 2,000 sampled documents, the three‑model sigmoid ensemble reached 98.1% accuracy, 98.82% recall and 96.32% precision.
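To make the difference between the two output layers concrete, here is a small sketch in plain Python. The label names, the 0.5 threshold, and the choice to average per-label probabilities across the three models are all assumptions for illustration; the article does not say how the ensemble combines its members.

```python
import math
import random
from typing import List, Sequence

# Hypothetical label names: six business categories plus "other".
LABELS = [f"category_{i}" for i in range(1, 7)] + ["other"]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_single_label(logits: Sequence[float]) -> str:
    """Softmax head: probabilities compete, so exactly one label wins."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


def predict_multi_label(logits: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Sigmoid head: each label is scored independently against a threshold."""
    return [label for label, x in zip(LABELS, logits) if sigmoid(x) >= threshold]


def ensemble_multi_label(
    per_model_logits: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[str]:
    """Assumed combination rule: average the per-label sigmoid probabilities."""
    n_models = len(per_model_logits)
    averaged = [
        sum(sigmoid(model[i]) for model in per_model_logits) / n_models
        for i in range(len(LABELS))
    ]
    return [label for label, p in zip(LABELS, averaged) if p >= threshold]


def sample_training_subset(n_examples: int, fraction: float = 0.8, seed: int = 0) -> List[int]:
    """Pick a random subset of example indices for one ensemble member."""
    return random.Random(seed).sample(range(n_examples), int(n_examples * fraction))
```

With a softmax head a document that fits two categories still gets a single label, while the sigmoid head can return both. Averaging lets one over-confident member be outvoted by the other two.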
Entity extraction also moved from rule‑based methods (~70% accuracy) to deep learning. A BERT‑CRF model was first employed, followed by a BERT reading‑comprehension (MRC) model that treats the target name as a question and extracts the corresponding phone number from a surrounding paragraph. These approaches raised entity extraction accuracy from 70% to over 97% and reduced inference time.
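The reading-comprehension framing can be illustrated with the span-selection step that typically follows a BERT MRC model: the name is the question, the paragraph is the context, and the answer is the token span with the highest combined start and end score. The logits below are made-up numbers standing in for model output, and `build_mrc_input`, `best_span`, and the `max_len` cap are hypothetical helpers rather than the article's code.

```python
from typing import List, Sequence, Tuple


def build_mrc_input(name: str, paragraph_tokens: Sequence[str]) -> List[str]:
    """Lay out the pair the way BERT expects: question first, then context."""
    return ["[CLS]", name, "[SEP]", *paragraph_tokens, "[SEP]"]


def best_span(
    start_logits: Sequence[float], end_logits: Sequence[float], max_len: int = 8
) -> Tuple[int, int]:
    """Return (start, end) maximizing start + end score, with start <= end."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(start + max_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


def extract_answer(
    tokens: Sequence[str], start_logits: Sequence[float], end_logits: Sequence[float]
) -> str:
    start, end = best_span(start_logits, end_logits)
    return "".join(tokens[start : end + 1])


# Context tokens only; a real model would also score the question tokens.
context = ["Contact", "Zhang", "Wei", ":", "010", "-", "5555", "-", "0100"]
start_scores = [0.1, 0.3, 0.2, 0.0, 4.0, 0.1, 0.5, 0.1, 0.2]
end_scores = [0.0, 0.1, 0.4, 0.1, 0.3, 0.0, 0.6, 0.1, 3.5]
phone = extract_answer(context, start_scores, end_scores)
```

Because the question names a specific person, the same paragraph can be queried once per name, which is what lets this framing pair each phone number with its owner.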
After deployment, the classification accuracy rose from 85% to 93% and overall document accuracy from 83% to 90%, while entity extraction accuracy exceeded 97%. The system now processes more than 4,000 tender documents daily, serving around 2,300 active users. Future work includes expanding to 30 categories, reducing model latency with lightweight architectures, and further automating rule‑to‑model transitions.
| Model | Accuracy (self‑eval) | Recall (self‑eval) | Precision (self‑eval) |
| --- | --- | --- | --- |
| BERT softmax (single model) | 93.58% | 98.63% | 87.0% |
| BERT sigmoid (single model) | 93.9% | 98.96% | 89.17% |
| BERT sigmoid (3 models) | 98.1% | 98.82% | 96.32% |
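For readers who want to reproduce this kind of comparison on their own labeled sample, the three reported metrics can be computed from confusion counts as sketched below. The article does not state how the counts were defined across its seven classes, so the binary framing and the example counts here are assumptions, not the team's evaluation data.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision and recall from binary confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts purely to exercise the function.
example = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

Read this way, high recall with lower precision means few relevant tenders are missed but more irrelevant ones are let through.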
References include the original BERT paper, Transformer architecture, and recent NER models.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.