How to Boost Text Analysis Accuracy on a 2‑Billion‑Word Corpus

This article explains practical techniques for improving NLP model accuracy on massive corpora, covering challenges of multi‑field text, word‑embedding choices, a fasttext‑based regression demo with book‑review data, feature engineering tricks, and a comparison with tf‑idf + LASSO.

Natural language processing tasks such as text classification, topic modeling, and sentiment analysis are core to many products at Soroco. This article examines how to raise model accuracy when training on corpora ranging in size from 200 million to 2 billion words.

Challenges of Multi‑Field Text

Emails illustrate the typical difficulties: they contain distinct fields (subject, sender, body, attachments) whose importance may differ. Folding all fields into a single block of text weights them equally, which can drown out strong signals such as a positive word in the subject line. Paraphrase also matters: near-synonymous greetings (e.g., “Congratulations to the Soroco Team” vs. “Kudos to the Soroco Team”) should map to similar representations, yet different vectorization methods (fasttext, word2vec, tf‑idf) behave differently on the same data. One simple remedy is to vectorize each field separately and weight the field vectors, as sketched below.
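
A minimal sketch of the per-field idea, assuming a generic fasttext model file; the weights and the embed_email helper are hypothetical, not from the article:

import fasttext
import numpy as np

# Hypothetical weights: the subject is assumed to carry more signal than the
# body; the values here are illustrative, not tuned.
FIELD_WEIGHTS = {"subject": 2.0, "sender": 0.5, "body": 1.0}

model = fasttext.load_model("cc.en.300.bin")  # path is illustrative

def embed_email(email):
    """Vectorize each field separately, then combine with per-field weights."""
    parts = [
        weight * model.get_sentence_vector(email.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    ]
    return np.sum(parts, axis=0) / sum(FIELD_WEIGHTS.values())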

Choosing Word Embeddings

The first decision is which embedding to use. The article compares word2vec (one vector per token) with fasttext (character n‑gram based, handling out‑of‑vocabulary words and minor spelling variations). Facebook Research provides pre‑trained fasttext models for 157 languages, trained on Common Crawl and Wikipedia dumps.
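
As a quick illustration of the out-of-vocabulary behavior, one of these pre-trained models can be fetched with the fasttext.util helper; the misspelled query below is an illustrative assumption, not an example from the article:

import fasttext
import fasttext.util
import numpy as np

# Download the pre-trained English model (cc.en.300.bin, Common Crawl + Wikipedia).
fasttext.util.download_model("en", if_exists="ignore")
model = fasttext.load_model("cc.en.300.bin")

# fasttext composes vectors from character n-grams, so a misspelling that
# shares most n-grams with the correct word still gets a similar vector.
v_word = model.get_word_vector("congratulations")
v_typo = model.get_word_vector("congratulatoins")  # out-of-vocabulary token

cosine = (v_word @ v_typo) / (np.linalg.norm(v_word) * np.linalg.norm(v_typo))
print(f"cosine similarity: {cosine:.3f}")  # high despite the typo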

Demonstration: Regression on Book‑Review Data

A regression task is defined: given a reviewer’s past ratings, review texts, and book descriptions, predict the bias (personal rating minus the community’s average rating) they would assign to a new, unseen book. The dataset consists of all reviews by a prolific Goodreads user. Each example pairs two books – a “reviewed” book and an “unreviewed” book – each with its description, review text, personal rating, and average rating.

from unicodedata import category

def string_map(sentence):
    """Lower-case, replace non-alphanumeric characters with spaces, drop numeric tokens."""
    if not sentence:
        return []
    sentence = sentence.lower()
    # Collect characters whose Unicode category is neither letter ("L") nor number ("N").
    to_replace = set()
    for ch in set(sentence):
        if category(ch)[0] not in ("L", "N"):
            to_replace.add(ch)
    for ch in to_replace:
        sentence = sentence.replace(ch, " ")
    # Split on whitespace and discard purely numeric tokens.
    return [token for token in sentence.split() if not token.isnumeric()]
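
A quick check of the tokenizer on an illustrative input:

print(string_map("Rated 5/5 - a must-read!"))
# ['rated', 'a', 'must', 'read']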

Data are split into training (10 000 pairs) and validation (5 000 pairs). Fasttext provides sentence vectors: the reviewed book contributes a description vector and a review-text vector, while the unreviewed book contributes a description vector only (its review does not exist yet), and the three vectors are concatenated and fed to a RandomForestRegressor. The target is the signed difference between personal and average ratings for the reviewed book minus the same difference for the unreviewed book.

import pickle

import fasttext
import numpy as np
from sklearn.ensemble import RandomForestRegressor

TRAINING_DATA_SIZE = 10000
VALIDATION_DATA_SIZE = 5000

model = fasttext.load_model("goodreads.bin")
model_dimension = model.get_dimension()

with open("train.pkl", "rb") as f:
    train_data = pickle.load(f)

# Attach a (2 * dim) vector to every review: description vector + review vector.
for line in train_data:
    desc_vec = model.get_sentence_vector(" ".join(string_map(line["book description"])))
    review_vec = model.get_sentence_vector(" ".join(string_map(line["review text"])))
    line["word vector"] = np.concatenate([desc_vec, review_vec])

# Hold out the last 10% of reviews for validation.
validation_data = train_data[round(0.9 * len(train_data)):]
train_data = train_data[: round(0.9 * len(train_data))]

np.random.seed(1)
# Features per pair: the reviewed book's description + review vectors (2 * dim)
# followed by the unreviewed book's description vector only (dim), 3 * dim total.
training_data_x = np.empty((TRAINING_DATA_SIZE, model_dimension * 3))
training_data_y = np.empty((TRAINING_DATA_SIZE,))
for i in range(TRAINING_DATA_SIZE):
    pt1, pt2 = np.random.choice(train_data, 2, replace=False)
    training_data_x[i] = np.concatenate(
        [pt1["word vector"], pt2["word vector"][:model_dimension]]
    )
    # Target: how much more the reviewer deviates from the average rating on
    # the reviewed book than on the unreviewed one.
    training_data_y[i] = (
        int(pt1["personal rating"]) - float(pt1["average rating"])
    ) - (int(pt2["personal rating"]) - float(pt2["average rating"]))

regressor = RandomForestRegressor(random_state=1)
regressor.fit(training_data_x, training_data_y)

with open("regressor.pkl", "wb") as f:
    pickle.dump(regressor, f)
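
The validation pairs can be featurized the same way and scored with the regressor’s score method, which returns R². A minimal sketch of that evaluation, assuming validation pairs are formed exactly like training pairs:

validation_x = np.empty((VALIDATION_DATA_SIZE, model_dimension * 3))
validation_y = np.empty((VALIDATION_DATA_SIZE,))
for i in range(VALIDATION_DATA_SIZE):
    pt1, pt2 = np.random.choice(validation_data, 2, replace=False)
    validation_x[i] = np.concatenate(
        [pt1["word vector"], pt2["word vector"][:model_dimension]]
    )
    validation_y[i] = (
        int(pt1["personal rating"]) - float(pt1["average rating"])
    ) - (int(pt2["personal rating"]) - float(pt2["average rating"]))

print("validation R²:", regressor.score(validation_x, validation_y))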

The model’s fit is measured with the R² score on the validation set, as in the evaluation sketch above. To gauge how much of that fit comes from the embedding itself, a baseline “random embedding” is introduced, in which each word is assigned a fixed random vector drawn from a multivariate normal distribution with the same dimensionality as the fasttext model.

from collections import defaultdict
import numpy as np

class RandomEmbedding:
    """Stand-in for the fasttext model: each word gets a random but fixed vector."""
    def __init__(self, dim=300, seed=None) -> None:
        self.dim = dim
        self.rng = np.random.default_rng(seed=seed)
        # defaultdict draws a fresh vector on first lookup, then reuses it.
        self.mapping = defaultdict(lambda: self.rng.normal(0, 1, dim).astype(np.float32))
    def get_dimension(self):
        return self.dim
    def get_word_vector(self, word):
        return self.mapping[word]
    def get_sentence_vector(self, sentence):
        words = sentence.split()
        if not words:  # guard against empty input
            return np.zeros(self.dim, dtype=np.float32)
        return sum(self.mapping[w] for w in words) / len(words)

Feature engineering improves R² further: stop words are removed, the mean of the remaining word vectors is computed, and a “bounding-box” feature (the per-dimension minimum and maximum across the sentence’s words) is concatenated, tripling the per-sentence feature count from d to 3d values. Although the larger feature space raises the risk of over-fitting, validation performance increases.
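
A minimal sketch of this featurization; the stop-word list is illustrative, and model can be the fasttext model or the RandomEmbedding above:

import numpy as np

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # illustrative

def sentence_features(model, tokens):
    """Mean vector plus per-dimension min/max (the "bounding box") of a sentence."""
    vectors = np.stack(
        [model.get_word_vector(t) for t in tokens if t not in STOP_WORDS]
    )
    return np.concatenate(
        [vectors.mean(axis=0), vectors.min(axis=0), vectors.max(axis=0)]
    )

features = sentence_features(model, string_map("An unexpectedly moving story"))
print(features.shape)  # (3 * model_dimension,)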

Comparison with TF‑IDF + LASSO

For contrast, the same regression task is solved with tf‑idf vectors and LASSO regression. Tf‑idf features are sparse and very high‑dimensional, which makes tree ensembles such as the Random Forest inefficient on them, so a linear model with L1 regularization is the natural pairing; even so, the resulting R² is substantially lower than with the embedding‑based approach.
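
A minimal sketch of that baseline, reusing string_map and the training pairs from above; the regularization strength is an illustrative guess:

import numpy as np
from scipy.sparse import hstack, vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso

def clean(text):
    return " ".join(string_map(text))

# Fit the tf-idf vocabulary on all training descriptions and review texts.
vectorizer = TfidfVectorizer()
vectorizer.fit([clean(l["book description"]) for l in train_data]
               + [clean(l["review text"]) for l in train_data])

def pair_features(pt1, pt2):
    # Same layout as the embedding features: reviewed book's description and
    # review, then the unreviewed book's description.
    return hstack([
        vectorizer.transform([clean(pt1["book description"])]),
        vectorizer.transform([clean(pt1["review text"])]),
        vectorizer.transform([clean(pt2["book description"])]),
    ])

np.random.seed(1)
pairs = [np.random.choice(train_data, 2, replace=False) for _ in range(TRAINING_DATA_SIZE)]
X = vstack([pair_features(p1, p2) for p1, p2 in pairs])
y = np.array([
    (int(p1["personal rating"]) - float(p1["average rating"]))
    - (int(p2["personal rating"]) - float(p2["average rating"]))
    for p1, p2 in pairs
])

lasso = Lasso(alpha=0.001).fit(X, y)  # alpha is illustrative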

Conclusion

The article demonstrates that word‑embedding techniques, especially fasttext with character n‑grams, provide richer semantic representations that improve model fit on large‑scale text data. Adding bounding‑box features further boosts performance, while traditional tf‑idf + LASSO remains less effective for this regression scenario.

Tags: Python, regression, NLP, text classification, word embeddings, fasttext, word2vec
Written by Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy, making life better!