Reverse Dictionary Made Easy: Harness WantWords and Hugging Face for Quick NLP Model Integration
This article introduces the open‑source WantWords reverse‑dictionary project and explains its token‑based processing pipeline. It then walks through the Python implementation and model invocation with Hugging Face’s Transformers, briefly reviews NLP’s historical evolution, and shows how front‑end developers can quickly integrate NLP models into products.
Introduction
The open‑source project WantWords is a reverse dictionary: instead of returning a definition for a word, it maps a description to likely target words, helping solve the “tip‑of‑the‑tongue” problem. Given a description like “a person who loves books,” for example, a reverse dictionary suggests candidates such as “bibliophile.”
Overall Idea
Input text is first tokenized. If the result is a single word, the system looks up related words directly in a pre‑trained embedding matrix and boosts the scores of known synonyms. If there are multiple tokens, the description is encoded with BERT, a multi‑channel reverse‑dictionary model computes a relevance score for every candidate word, and the top‑N indices are mapped back to words via a dictionary. A condensed sketch of this dispatch logic follows.
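To make the flow concrete before diving into the implementation, here is a minimal sketch of that dispatch logic. It assumes the objects initialized in the next section (lac, tokenizer_Ch, model, word2index, index2word), and score_with_model is a hypothetical placeholder for the full multi‑channel call shown later:
import torch

def reverse_lookup(description, top_n=100):
    # Segment the description; thulac yields (word, POS-tag) pairs
    tokens = [w for w, p in lac.cut(description)]
    if len(tokens) == 1:
        # Single-word query: rank the whole vocabulary by embedding similarity
        query_vec = model.embedding.weight.data[word2index[tokens[0]]]
        scores = model.embedding.weight.data.mv(query_vec)
    else:
        # Multi-word query: BERT-encode the description and let the
        # multi-channel model score every candidate word
        ids = tokenizer_Ch.encode('[CLS] ' + description)
        scores = score_with_model(ids)  # hypothetical placeholder
    _, indices = torch.sort(scores, descending=True)
    return [index2word[i] for i in indices[:top_n].tolist()]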
Implementation Overview
Key initialization steps include loading a Chinese word‑segmentation tool, a BERT tokenizer, synonym tables, and the bidirectional language model.
# Initialize text segmentation tool
lac = thulac.thulac()
# Load the BERT tokenizer
tokenizer_Ch = BertTokenizer.from_pretrained('bert-base-chinese')
# Load synonym and definition tables
word2index, index2word, (wd_C, wd_sems, wd_POSs, wd_charas), mask_ = load_data()
# Add synonym sets for single‑word queries
index2synset = [[] for i in range(len(word2index))]
for line in open(BASE_DIR + 'word2synset_synset.txt').readlines():
    wd = line.split()[0]
    synset = line.split()[1:]
    for syn in synset:
        index2synset[word2index[wd]].append(word2index[syn])
# Load the bidirectional language model
MODEL_FILE = BASE_DIR + 'Zh.model'
model = torch.load(MODEL_FILE, map_location=lambda storage, loc: storage)
model.eval()

Tokenization of the description:
# Tokenization
import thulac
lac = thulac.thulac()
fenci = lac.cut(description)
# thulac returns (word, POS-tag) pairs; keep only the words
def_words = [w for w, p in fenci]

Single‑word path (score computation using the embedding matrix):
# Find related words via embedding similarity and boost synonym scores
def_word_idx = [word2index[def_words[0]]]  # assumed lookup of the single query word's index
score = model.embedding.weight.data.mv(model.embedding.weight.data[def_word_idx[0]])
if RD_mode == 'CC':
    # Exclude the query word itself, then double the scores of its synonyms
    score[def_word_idx[0]] = -10.
    score[np.array(index2synset[def_word_idx[0]])] *= 2
sc, indices = torch.sort(score, descending=True)
# Top-500 predictions
predicted = indices[:NUM_RESPONSE].detach().cpu().numpy()
score = sc[:NUM_RESPONSE].detach().numpy()

Multi‑word path (BERT encoding and model call):
defi = '[CLS] ' + description
# Encode the input and truncate to 80 tokens (with recent transformers
# versions, pass add_special_tokens=False to avoid duplicating [CLS]/[SEP])
def_word_idx = tokenizer_Ch.encode(defi)[:80]
def_word_idx.extend(tokenizer_Ch.encode('[SEP]'))
# Convert to PyTorch tensor
definition_words_t = torch.tensor(np.array(def_word_idx), dtype=torch.int64, device=device)
# Model inference: the multi-channel model scores every candidate word using
# the BERT-encoded description plus word, sememe, POS, and character channels
# (words_t, mask_s, mask_c, and MODE come from the initialization step)
score = model('test', x=definition_words_t, w=words_t, ws=wd_sems, wP=wd_POSs, wc=wd_charas, wC=wd_C, msk_s=mask_s, msk_c=mask_c, mode=MODE)
sc, indices = torch.sort(score, descending=True)
# Top‑500 predictions
predicted = indices[0, :NUM_RESPONSE].detach().cpu().numpy()

Result conversion back to words:
# Convert indices to words using the dictionary
res = index2word[predicted]

NLP Overview
Historical development:
1950‑1970: Rule‑based methods.
1970‑early 2000s: Statistical approaches replaced rules as corpora grew.
2008‑2018: Introduction of deep learning (RNN, LSTM, GRU) and word‑embedding breakthroughs.
Since 2017: Transformer architecture (Google) and BERT (2018) dominate NLP benchmarks.
Current Research Directions
Two main branches:
Natural Language Understanding (NLU)
Natural Language Generation (NLG)
Common tasks (eleven tasks across four categories; a pipeline example for each category follows the list):
Sequence labeling – tokenization, POS tagging, NER, semantic labeling.
Classification – text classification, sentiment analysis.
Sentence relationship – entailment, QA, natural language inference.
Generative tasks – machine translation, summarization.
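Each of these categories maps onto a ready‑made task in the transformers pipeline API. The sketch below pairs one example pipeline with each category; the default checkpoints are English models downloaded on first use, so substitute task‑appropriate checkpoints for Chinese:
from transformers import pipeline

# One illustrative pipeline per task category
ner = pipeline('ner')                        # sequence labeling (NER)
classifier = pipeline('sentiment-analysis')  # classification
qa = pipeline('question-answering')          # sentence relationships (QA)
summarizer = pipeline('summarization')       # generative tasks

print(classifier('Reverse dictionaries make word lookup effortless.'))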
Practical Model Invocation
Hugging Face hosts thousands of pre‑trained models. The transformers library lets you load and use them easily.
Method 1: Direct PyTorch usage
import numpy as np
import torch
from transformers import BertTokenizer, BertForMaskedLM
samples = ['[CLS] 诸葛[MASK]是三国时期人物[SEP]']  # "Zhuge [MASK] was a figure of the Three Kingdoms period"
mask_index = 3  # position of [MASK] after tokenization: ['[CLS]', '诸', '葛', '[MASK]', ...]
# ---- step1: token processing ----
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
tokenized_text = [tokenizer.tokenize(i) for i in samples]
input_ids = [tokenizer.convert_tokens_to_ids(i) for i in tokenized_text]
input_ids = torch.tensor(input_ids)
# ---- step2: model call ----
model = BertForMaskedLM.from_pretrained('bert-base-chinese')
model.eval()
outputs = model(input_ids)
# ---- step3: result conversion ----
sample = outputs.logits[0].detach().numpy()
# Top-20 candidate token ids for the masked position
pred = np.argsort(-sample[mask_index], axis=0)[:20]
print(tokenizer.convert_ids_to_tokens(pred))

Method 2: Using the pipeline API
The pipeline API wraps tokenization, inference, and decoding into a single call:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-chinese')
print(unmasker('巴黎是[MASK]国的首都。'))  # "Paris is the capital of [MASK] country." (expects 法, France)
Each entry in the returned list is a dict holding the completed sequence, a confidence score, and the predicted token.

API Development
After the model works, you can expose it via a simple HTTP API using Flask or Django, enabling front‑end applications to call the NLP service directly.
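As a minimal sketch of that idea, the Flask app below wraps the fill‑mask pipeline from Method 2 behind a single endpoint; the route name and JSON payload shape are illustrative choices, not a fixed convention:
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# Load the model once at startup rather than on every request
unmasker = pipeline('fill-mask', model='bert-base-chinese')

@app.route('/fill-mask', methods=['POST'])
def fill_mask():
    # Expects a JSON body like {"text": "巴黎是[MASK]国的首都。"}
    text = request.get_json()['text']
    return jsonify(unmasker(text))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
A front‑end can then POST text to this endpoint and render the returned candidates without knowing anything about the model underneath.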
Conclusion
By the end of this guide you should understand what modern NLP can and cannot do, know how to quickly call a pre‑trained model, and be able to leverage your front‑end expertise to build intelligent, user‑centric products.