
Bidding Document Classification and Entity Extraction Using BERT-based Models

This article describes how 58.com built an end‑to‑end bidding service that crawls tender documents, classifies them into multiple categories with BERT‑based models (softmax, sigmoid and ensemble variants), and extracts key entities using BERT‑CRF and reading‑comprehension techniques. After deployment, overall document accuracy exceeded 90%, with clear gains in recall and precision over the earlier rule‑based system.

58 Tech

58.com, China's largest local life‑services platform, integrated nationwide bidding resources into its merchant app to give merchants timely and relevant tender opportunities. The project required two things: classifying tender documents into 30+ service categories, and extracting critical information such as contact names, phone numbers, companies, and dates.

The pipeline starts with a crawler that pushes raw tender data to the backend, where non‑document texts are filtered out. Valid documents are first classified, then passed to a named‑entity‑recognition (NER) module that extracts key fields before returning the enriched data to the app.
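The flow above can be sketched as a small function that filters, classifies, and then extracts, in that order. This is a minimal illustration, not the production service: `TenderDocument`, `is_tender_document`, and `process` are hypothetical names, the keyword filter is a placeholder, and the classifier and extractor are passed in as plain callables standing in for the models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TenderDocument:
    """Enriched record returned to the app (hypothetical shape)."""
    text: str
    categories: List[str] = field(default_factory=list)
    entities: Dict[str, str] = field(default_factory=dict)


def is_tender_document(text: str) -> bool:
    # Placeholder filter for illustration only; the article does not
    # describe how non-document text is actually detected.
    lowered = text.lower()
    return any(keyword in lowered for keyword in ("tender", "bid"))


def process(
    raw_texts: List[str],
    classify: Callable[[str], List[str]],
    extract_entities: Callable[[str], Dict[str, str]],
) -> List[TenderDocument]:
    """Filter crawled text, classify valid documents, then extract entities."""
    results: List[TenderDocument] = []
    for text in raw_texts:
        if not is_tender_document(text):
            continue  # dropped before any model is called
        doc = TenderDocument(text=text)
        doc.categories = classify(text)          # stand-in for the classifier
        doc.entities = extract_entities(text)    # stand-in for the NER module
        results.append(doc)
    return results
```

Passing the models in as callables keeps the ordering of the stages explicit and lets a rule‑based component be swapped for a model without touching the pipeline itself.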

For classification, an initial rule‑based system achieved about 85% accuracy but suffered from low recall. The team replaced it with BERT‑based models. A 7‑class model (six business categories plus "other") was trained first with a softmax output, which assigns exactly one category per document, and then with a sigmoid output so that a document can carry several labels at once. An ensemble of three models, each trained on a random 80% subset of the data, improved performance further. Evaluation on 2,000 sampled documents showed accuracy of 98.1% and recall above 98% for the ensemble.
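The difference between the softmax and sigmoid heads, and one way an ensemble could combine its members, can be shown on raw logits. This is a sketch under assumptions: the label names are placeholders, the 0.5 threshold is illustrative, and averaging the per‑label probabilities of the three models is only one plausible combination rule, since the article does not say how the ensemble votes.

```python
import math
from typing import List, Sequence

# Placeholder names for the six business categories plus "other".
LABELS = ["cat_a", "cat_b", "cat_c", "cat_d", "cat_e", "cat_f", "other"]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    # Numerically stable form for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_single_label(logits: Sequence[float]) -> str:
    """Softmax head: probabilities compete, so exactly one label wins."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


def predict_multi_label(logits: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Sigmoid head: each label is scored independently against a threshold."""
    return [label for label, x in zip(LABELS, logits) if sigmoid(x) >= threshold]


def predict_ensemble(per_model_logits: Sequence[Sequence[float]],
                     threshold: float = 0.5) -> List[str]:
    """Assumed rule: average each label's sigmoid probability across models."""
    n_models = len(per_model_logits)
    averaged = [
        sum(sigmoid(model[i]) for model in per_model_logits) / n_models
        for i in range(len(LABELS))
    ]
    return [label for label, p in zip(LABELS, averaged) if p >= threshold]
```

With a softmax head a document that belongs to two categories can only be assigned one of them; the sigmoid head returns both whenever each clears the threshold on its own.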

Entity extraction also moved from rule‑based methods (~70% accuracy) to deep learning. A BERT‑CRF model was first employed, followed by a BERT reading‑comprehension (MRC) model that treats the target name as a question and extracts the corresponding phone number from a surrounding paragraph. These approaches raised entity extraction accuracy from 70% to over 97% and reduced inference time.
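The reading‑comprehension framing can be illustrated with the span‑selection step that follows the model. In this sketch the contact name is the question, the paragraph is the context, and a scorer returns one start score and one end score per context token; the answer is the span with the highest combined score. The names `best_span`, `answer_phone`, and `score_fn` are hypothetical, and `score_fn` is a stand‑in for the BERT model, whose real interface the article does not describe.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (question, context tokens) to per-token start and end scores.
Scorer = Callable[[str, List[str]], Tuple[Sequence[float], Sequence[float]]]


def best_span(start_scores: Sequence[float],
              end_scores: Sequence[float],
              max_len: int = 8) -> Tuple[int, int]:
    """Return (start, end) maximising start + end score, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


def answer_phone(name: str, context_tokens: List[str], score_fn: Scorer) -> str:
    """Ask 'whose phone number?' by using the contact name as the question."""
    start_scores, end_scores = score_fn(name, context_tokens)
    start, end = best_span(start_scores, end_scores)
    return " ".join(context_tokens[start:end + 1])
```

Because the question changes with each name, the same paragraph can be queried once per contact, which is what lets this formulation pair each phone number with the right person when a document lists several.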

After deployment, classification accuracy rose from 85% to 93% and overall document accuracy from 83% to 90%, while entity extraction accuracy exceeded 97%. The system now processes more than 4,000 tender documents daily and serves around 2,300 active users. Future work includes expanding to 30 categories, reducing model latency with lightweight architectures, and further automating rule‑to‑model transitions.

| model | acc (self‑eval) | recall (self‑eval) | precision (self‑eval) |
| --- | --- | --- | --- |
| BERT softmax (single model) | 93.58% | 98.63% | 87.0% |
| BERT sigmoid (single model) | 93.9% | 98.96% | 89.17% |
| BERT sigmoid (3 models) | 98.1% | 98.82% | 96.32% |
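The article does not state how accuracy, recall, and precision were computed for the multi‑label classifiers. One common convention, shown here purely as an assumption, is exact‑match accuracy per document together with micro‑averaged precision and recall over individual labels. The function name `multilabel_metrics` and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def multilabel_metrics(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Exact-match accuracy plus micro-averaged precision and recall."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))   # labels correctly predicted
    false_pos = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    false_neg = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    exact = sum(1 for g, p in zip(gold, pred) if g == p)
    return {
        "accuracy": exact / len(gold) if gold else 0.0,
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }
```

Under this convention a model that over‑predicts labels keeps high recall but loses precision, which matches the pattern of the single‑model rows, though the actual evaluation protocol may differ.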

References include the original BERT paper, Transformer architecture, and recent NER models.

Tags: machine learning, deep learning, NLP, BERT, entity extraction, document classification, tender