Reverse Dictionary Made Easy: Harness WantWords and Hugging Face for Quick NLP Model Integration
This article introduces the open‑source WantWords reverse‑dictionary project and explains its token‑based processing pipeline. It then walks through the Python implementation and model invocation with Hugging Face’s Transformers, briefly reviews NLP’s historical evolution, and shows how front‑end developers can quickly integrate NLP models into products.
Introduction
The open‑source project WantWords is a reverse dictionary: instead of returning a definition for a word, it maps a description to likely target words, helping solve the “tip‑of‑the‑tongue” problem. Given a description like “a person who loves books,” for example, a reverse dictionary suggests candidates such as “bibliophile.”
Overall Idea
Input text is first tokenized. If the result is a single word, the system looks up related words directly in a pre‑trained embedding matrix and boosts the scores of known synonyms. If there are multiple tokens, the description is encoded with BERT, a multi‑channel reverse‑dictionary model computes a relevance score for every candidate word, and the top‑N indices are mapped back to words via a dictionary. A condensed sketch of this dispatch logic follows.
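To make the flow concrete before diving into the implementation, here is a minimal sketch of that dispatch logic. It assumes the objects initialized in the next section (lac, tokenizer_Ch, model, word2index, index2word), and score_with_model is a hypothetical placeholder for the full multi‑channel call shown later:
import torch

def reverse_lookup(description, top_n=100):
    # Segment the description; thulac yields (word, POS-tag) pairs
    tokens = [w for w, p in lac.cut(description)]
    if len(tokens) == 1:
        # Single-word query: rank the whole vocabulary by embedding similarity
        query_vec = model.embedding.weight.data[word2index[tokens[0]]]
        scores = model.embedding.weight.data.mv(query_vec)
    else:
        # Multi-word query: BERT-encode the description and let the
        # multi-channel model score every candidate word
        ids = tokenizer_Ch.encode('[CLS] ' + description)
        scores = score_with_model(ids)  # hypothetical placeholder
    _, indices = torch.sort(scores, descending=True)
    return [index2word[i] for i in indices[:top_n].tolist()]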
Implementation Overview
Key initialization steps include loading a Chinese word‑segmentation tool, a BERT tokenizer, synonym tables, and the bidirectional language model.
# Initialize text segmentation tool
lac = thulac.thulac()
# Load the BERT tokenizer
tokenizer_Ch = BertTokenizer.from_pretrained('bert-base-chinese')
# Load synonym and definition tables
word2index, index2word, (wd_C, wd_sems, wd_POSs, wd_charas), mask_ = load_data()
# Add synonym sets for single‑word queries
index2synset = [[] for i in range(len(word2index))]
for line in open(BASE_DIR + 'word2synset_synset.txt').readlines():
    wd = line.split()[0]
    synset = line.split()[1:]
    for syn in synset:
        index2synset[word2index[wd]].append(word2index[syn])
# Load the bidirectional language model
MODEL_FILE = BASE_DIR + 'Zh.model'
model = torch.load(MODEL_FILE, map_location=lambda storage, loc: storage)
model.eval()

Tokenization of the description:
# Tokenization
import thulac
lac = thulac.thulac()
fenci = lac.cut(description)
# thulac returns (word, POS-tag) pairs; keep only the words
def_words = [w for w, p in fenci]

Single‑word path (score computation using the embedding matrix):
# Find related words via embedding similarity and boost synonym scores
def_word_idx = [word2index[def_words[0]]]  # assumed lookup of the single query word's index
score = model.embedding.weight.data.mv(model.embedding.weight.data[def_word_idx[0]])
if RD_mode == 'CC':
    # Exclude the query word itself, then double the scores of its synonyms
    score[def_word_idx[0]] = -10.
    score[np.array(index2synset[def_word_idx[0]])] *= 2
sc, indices = torch.sort(score, descending=True)
# Top-500 predictions
predicted = indices[:NUM_RESPONSE].detach().cpu().numpy()
score = sc[:NUM_RESPONSE].detach().numpy()

Multi‑word path (BERT encoding and model call):
defi = '[CLS] ' + description
# Encode the input and truncate to 80 tokens (with recent transformers
# versions, pass add_special_tokens=False to avoid duplicating [CLS]/[SEP])
def_word_idx = tokenizer_Ch.encode(defi)[:80]
def_word_idx.extend(tokenizer_Ch.encode('[SEP]'))
# Convert to PyTorch tensor
definition_words_t = torch.tensor(np.array(def_word_idx), dtype=torch.int64, device=device)
# Model inference: the multi-channel model scores every candidate word using
# the BERT-encoded description plus word, sememe, POS, and character channels
# (words_t, mask_s, mask_c, and MODE come from the initialization step)
score = model('test', x=definition_words_t, w=words_t, ws=wd_sems, wP=wd_POSs, wc=wd_charas, wC=wd_C, msk_s=mask_s, msk_c=mask_c, mode=MODE)
sc, indices = torch.sort(score, descending=True)
# Top‑500 predictions
predicted = indices[0, :NUM_RESPONSE].detach().cpu().numpy()

Result conversion back to words:
# Convert indices to words using the dictionary
res = index2word[predicted]

NLP Overview
Historical development:
1950‑1970: Rule‑based methods.
1970‑early 2000s: Statistical approaches replaced rules as corpora grew.
2008‑2018: Introduction of deep learning (RNN, LSTM, GRU) and word‑embedding breakthroughs.
Since 2017: Transformer architecture (Google) and BERT (2018) dominate NLP benchmarks.
Current Research Directions
Two main branches:
Natural Language Understanding (NLU)
Natural Language Generation (NLG)
Common tasks (eleven tasks across four categories; a pipeline example for each category follows the list):
Sequence labeling – tokenization, POS tagging, NER, semantic labeling.
Classification – text classification, sentiment analysis.
Sentence relationship – entailment, QA, natural language inference.
Generative tasks – machine translation, summarization.
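Each of these categories maps onto a ready‑made task in the transformers pipeline API. The sketch below pairs one example pipeline with each category; the default checkpoints are English models downloaded on first use, so substitute task‑appropriate checkpoints for Chinese:
from transformers import pipeline

# One illustrative pipeline per task category
ner = pipeline('ner')                        # sequence labeling (NER)
classifier = pipeline('sentiment-analysis')  # classification
qa = pipeline('question-answering')          # sentence relationships (QA)
summarizer = pipeline('summarization')       # generative tasks

print(classifier('Reverse dictionaries make word lookup effortless.'))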
Practical Model Invocation
Hugging Face hosts thousands of pre‑trained models. The transformers library lets you load and use them easily.
Method 1: Direct PyTorch usage
import numpy as np
import torch
from transformers import BertTokenizer, BertForMaskedLM
samples = ['[CLS] 诸葛[MASK]是三国时期人物[SEP]']  # "Zhuge [MASK] was a figure of the Three Kingdoms period"
mask_index = 3  # position of [MASK] after tokenization: ['[CLS]', '诸', '葛', '[MASK]', ...]
# ---- step1: token processing ----
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
tokenized_text = [tokenizer.tokenize(i) for i in samples]
input_ids = [tokenizer.convert_tokens_to_ids(i) for i in tokenized_text]
input_ids = torch.tensor(input_ids)
# ---- step2: model call ----
model = BertForMaskedLM.from_pretrained('bert-base-chinese')
model.eval()
outputs = model(input_ids)
# ---- step3: result conversion ----
sample = outputs.logits[0].detach().numpy()
# Top-20 candidate token ids for the masked position
pred = np.argsort(-sample[mask_index], axis=0)[:20]
print(tokenizer.convert_ids_to_tokens(pred))

Method 2: Using the pipeline API
The pipeline API wraps tokenization, inference, and decoding into a single call:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-chinese')
print(unmasker('巴黎是[MASK]国的首都。'))  # "Paris is the capital of [MASK] country." (expects 法, France)
Each entry in the returned list is a dict holding the completed sequence, a confidence score, and the predicted token.

API Development
After the model works, you can expose it via a simple HTTP API using Flask or Django, enabling front‑end applications to call the NLP service directly.
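As a minimal sketch of that idea, the Flask app below wraps the fill‑mask pipeline from Method 2 behind a single endpoint; the route name and JSON payload shape are illustrative choices, not a fixed convention:
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# Load the model once at startup rather than on every request
unmasker = pipeline('fill-mask', model='bert-base-chinese')

@app.route('/fill-mask', methods=['POST'])
def fill_mask():
    # Expects a JSON body like {"text": "巴黎是[MASK]国的首都。"}
    text = request.get_json()['text']
    return jsonify(unmasker(text))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
A front‑end can then POST text to this endpoint and render the returned candidates without knowing anything about the model underneath.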
Conclusion
By the end of this guide you should understand what modern NLP can and cannot do, know how to quickly call a pre‑trained model, and be able to leverage your front‑end expertise to build intelligent, user‑centric products.