qa_match V1.3: Lightweight Deep Learning QA Matching Tool with Semi‑Automatic Knowledge‑Base Mining and Transformer‑Enhanced Pre‑training
The qa_match open‑source tool from 58 Tongcheng, now at version 1.3, introduces semi‑automatic knowledge‑base mining for cold‑start and online scenarios and upgrades its Simple Pre‑trained Model (SPTM) with Transformer‑based feature representation to improve question‑answer matching performance.
qa_match is a lightweight deep‑learning based question‑answer matching tool released by 58 Tongcheng; version 1.0 launched on March 9 2020, v1.1 in June 2020, and the latest v1.3 in December 2020. The project is open‑source on GitHub (https://github.com/wuba/qa_match) under the Apache License 2.0.
Version 1.3 adds two major features: (1) a semi‑automatic knowledge‑base mining workflow that supports both cold‑start knowledge acquisition and post‑deployment question expansion, and (2) an enhanced Simple Pre‑trained Model (SPTM) that incorporates Transformer‑based feature representations.
The upgrade addresses limitations of earlier releases, which relied on a single‑layer knowledge base and a Bi‑LSTM pre‑training model. By introducing knowledge‑base mining and a Transformer‑augmented SPTM, the system achieves better downstream QA performance.
The semi‑automatic knowledge‑base mining module builds on the existing qa_match pipeline, using the Deep Embedding Clustering (DEC) algorithm (based on SPTM embeddings) to discover standard and expanded questions. In cold‑start, it creates an initial knowledge base from unlabeled data via DEC; after deployment, it continuously extracts new utterances using custom cluster centers.
DEC, originally presented at ICML 2016, jointly learns feature representations and cluster assignments. The implementation replaces the original auto‑encoder with SPTM embeddings and allows custom cluster centers to provide supervised signals, reducing randomness in clustering.
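The core of DEC is a soft cluster assignment via a Student's t‑distribution plus a sharpened target distribution that the model is trained to match. The following is a minimal NumPy sketch of those two steps (Xie et al., ICML 2016); the shapes and random data are illustrative stand‑ins, since in qa_match the embeddings would come from SPTM rather than from random vectors:

```python
# Minimal sketch of DEC's soft assignment and target distribution.
# Variable names and data here are illustrative, not qa_match internals.
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's t similarity q_ij between embedding i and cluster center j."""
    # squared distances, shape (n_samples, n_clusters)
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets p_ij that emphasize high-confidence assignments."""
    w = q ** 2 / q.sum(axis=0)          # square, normalize by cluster frequency
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))             # stand-in for SPTM sentence embeddings
centers = rng.normal(size=(2, 4))       # cluster centers (may be custom/seeded)
q = soft_assign(z, centers)
p = target_distribution(q)
# Training would minimize KL(p || q) w.r.t. both embeddings and centers.
```

Minimizing KL(p || q) pulls embeddings toward their most likely centers, which is what lets supplied custom centers act as a weak supervised signal.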
Cold‑start mining scenario: when a new business line integrates automatic QA, its historical unlabeled utterances are clustered with DEC to generate candidate standard questions and expanded utterances, forming an initial knowledge base. [Figure: cold‑start knowledge‑base mining workflow]
Post‑deployment mining scenario: once the QA matching model is online, the existing knowledge base is enriched by clustering new user queries with DEC, using the embeddings of existing standard questions as custom cluster centers, thereby expanding coverage and improving precision and recall. [Figure: post‑deployment knowledge‑base mining workflow]
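To illustrate the custom‑center idea in the post‑deployment scenario, the sketch below seeds a clustering run with the embeddings of existing standard questions so that new queries gather around known intents. This is a simplified analogy using scikit‑learn's KMeans `init` parameter in place of DEC; the data and intent names are invented:

```python
# Hypothetical sketch: seed clustering with embeddings of existing
# standard questions so new user queries attach to known intents.
# KMeans stands in for DEC; the 2-D "embeddings" are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
known_centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # embeddings of two standard questions
new_queries = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),              # queries near intent A
    rng.normal(5.0, 0.3, size=(20, 2)),              # queries near intent B
])

# init=known_centers makes cluster identities line up with existing intents
km = KMeans(n_clusters=2, init=known_centers, n_init=1).fit(new_queries)
labels = km.labels_
# Queries assigned to a center become candidate expanded questions for that intent.
```

Seeding with known centers is what reduces the run‑to‑run randomness of clustering that the article mentions.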
Evaluation of the clustering algorithm uses both external and internal metrics; internal quality is measured by the silhouette coefficient. Results on datasets of 10K (1w), 100K (10w), and 1M (100w) utterances are shown in the table below:

| Dataset size | Model | Silhouette coefficient | Clustering runtime | Inference time |
|---|---|---|---|---|
| 10K (1w) | DEC | 0.7962 | 30 min | 52 s |
| 100K (10w) | DEC | 0.9302 | 3 h 5 min | 5 min 55 s |
| 1M (100w) | DEC | 0.849 | 11 h 30 min | 15 min 28 s |
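For reference, the silhouette coefficient used as the internal metric above can be computed with scikit‑learn; this toy example uses synthetic, well‑separated data rather than the article's datasets:

```python
# Illustrative silhouette-coefficient computation on synthetic clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(30, 2)),   # tight cluster around the origin
    rng.normal(3.0, 0.2, size=(30, 2)),   # tight cluster around (3, 3)
])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
score = silhouette_score(X, labels)       # in [-1, 1]; higher = better separation
```

Values near 1 indicate compact, well‑separated clusters, which is why the 0.79–0.93 scores in the table indicate good clustering quality.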
The upgraded SPTM incorporates a shared‑parameter Transformer encoder. Input consists of Word‑Aware token embeddings and Position‑Aware embeddings. The shared Transformer encoder uses multi‑head attention and feed‑forward layers with residual connections, enhancing representation capacity while keeping parameter count low.
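The shared‑parameter design described above can be sketched as a single set of attention and feed‑forward weights reused for every encoder pass. The NumPy toy below shows the shape of that computation (token plus position embeddings, multi‑head attention, FFN, residual connections, one weight set applied L times); all dimensions and initializations are illustrative, not SPTM's actual configuration:

```python
# Toy shared-parameter Transformer encoder: ONE set of weights (Wq..W2)
# is reused for every layer pass, keeping the parameter count low.
import numpy as np

rng = np.random.default_rng(0)
d, h, seq, layers = 16, 4, 5, 3          # model dim, heads, seq length, shared passes
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
W1, W2 = rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d))

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x):
    """Multi-head self-attention over a (seq, d) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    dh = d // h
    for i in range(h):                   # split the model dim into h heads
        s = slice(i * dh, (i + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)    # softmax over key positions
        out[:, s] = w @ v[:, s]
    return out @ Wo

def encoder(token_emb, pos_emb):
    x = token_emb + pos_emb              # Word-Aware + Position-Aware inputs
    for _ in range(layers):              # the SAME weights are reused each pass
        x = layer_norm(x + attention(x))                 # residual + norm
        x = layer_norm(x + np.maximum(0, x @ W1) @ W2)   # FFN with residual
    return x

out = encoder(rng.normal(size=(seq, d)), rng.normal(size=(seq, d)))
```

Sharing one weight set across passes trades some capacity for a much smaller model, which fits SPTM's goal of lightweight pre‑training.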
Pre‑training time for the Transformer‑based SPTM is summarized in the following figure: [Figure: Transformer‑based SPTM pre‑training time]
Future plans include releasing a TensorFlow 2.4‑compatible version of qa_match and exploring a PyTorch implementation.
Contributions are welcome via GitHub pull requests or issues (https://github.com/wuba/qa_match.git) and by emailing ailab‑[email protected].
Authors: Lv Yuan‑yuan, Wang Yong, and He Rui, senior algorithm engineers and architects at the 58 Tongcheng AI Lab, responsible for intelligent‑QA research and development.
References:
[1] https://github.com/wuba/qa_match#基于一层结构知识库的自动问答
[2] https://github.com/wuba/qa_match/tree/v1.1#基于sptm模型的自动问答
[3] Xie, Junyuan, Ross Girshick, and Ali Farhadi. "Unsupervised Deep Embedding for Clustering Analysis." ICML 2016.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.