How to Convert Text into Tensors: One‑Hot, Word2Vec, FastText & Visualization

This tutorial explains how to transform raw text into tensor representations using one‑hot encoding, Word2Vec, and FastText, provides step‑by‑step code examples, discusses their advantages and drawbacks, and shows how to visualize embeddings with TensorBoard.


1. What is Text Tensor Representation?

Text tensor representation converts textual data into a tensor (usually a matrix) so that each word becomes a vector and the sequence of vectors forms a matrix representing the whole text.

['人生','该','如何','起头'] => [[1.32,4.32,0.32,5.2],
                         [3.1,5.43,0.34,3.2],
                         [3.21,5.32,2,4.32],
                         [2.54,7.32,5.12,9.54]]
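The mapping above can be sketched with a plain lookup table. The English tokens and vector values below are made-up for illustration, not trained embeddings:

```python
# A toy word-to-vector lookup table (values are illustrative, not trained)
lookup = {
    "life":  [1.32, 4.32, 0.32, 5.2],
    "how":   [3.1, 5.43, 0.34, 3.2],
    "to":    [3.21, 5.32, 2.0, 4.32],
    "begin": [2.54, 7.32, 5.12, 9.54],
}

def text_to_tensor(tokens, table):
    """Stack each token's vector into a matrix (one row per token)."""
    return [table[tok] for tok in tokens]

matrix = text_to_tensor(["life", "how", "to", "begin"], lookup)
print(matrix)  # 4 tokens x 4 dimensions
```

Real systems learn the table values and back the matrix with an efficient tensor type (e.g. a NumPy array or PyTorch tensor), but the word-to-row mapping is the same idea.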

2. Why Use Tensor Representations?

Representing text as tensors enables computers to process and understand natural language for downstream analysis and tasks such as classification, similarity search, and generation.

3. Common Methods

3.1 One‑Hot Encoding

Each word is encoded as a vector of length *n* (the vocabulary size) with a single 1 at the word’s index and 0 elsewhere.

['改变','要','如何','起手'] => [[1,0,0,0],
                         [0,1,0,0],
                         [0,0,1,0],
                         [0,0,0,1]]

Implementation Example

import joblib  # sklearn.externals.joblib was removed in modern scikit-learn; use joblib directly
from keras.preprocessing.text import Tokenizer

vocab = {"周杰伦", "陈奕迅", "王力宏", "李宗盛", "吴亦凡", "鹿晗"}

# Fit a word-to-index mapping over the vocabulary
t = Tokenizer(num_words=None, char_level=False)
t.fit_on_texts(vocab)

for token in vocab:
    zero_list = [0] * len(vocab)
    # Tokenizer indices start at 1, so shift to a 0-based position
    token_index = t.texts_to_sequences([token])[0][0] - 1
    zero_list[token_index] = 1
    print(token, "one-hot encoding:", zero_list)

# Persist the fitted tokenizer so the same mapping can be reused later
tokenizer_path = "./Tokenizer"
joblib.dump(t, tokenizer_path)

Pros: Simple and easy to implement.

Cons: Cannot capture relationships between words; high dimensionality for large vocabularies leads to high memory usage.

3.2 Word2Vec

Word2Vec is an unsupervised learning technique that produces dense word vectors by training a shallow neural network. It offers two training architectures:

3.2.1 CBOW (Continuous Bag of Words)

Predicts a target word from its surrounding context words.

3.2.2 Skip‑Gram

Predicts surrounding context words from a given target word.
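The difference between the two architectures is easiest to see in how training pairs are built from a sliding window. A minimal pure-Python sketch (the helper names are hypothetical; toolkits such as gensim construct these pairs internally):

```python
def cbow_pairs(tokens, window=2):
    """CBOW: (context words) -> target word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram: target word -> each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

sent = ["the", "quick", "brown", "fox", "jumps"]
print(cbow_pairs(sent)[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs(sent)[:2])  # [('the', 'quick'), ('the', 'brown')]
```

CBOW averages the context to predict one word per window position, which trains faster; skip-gram produces many pairs per position, which tends to work better for rare words.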

3.3 Word Embedding

Word embedding is the general term for mapping words into dense, real-valued vectors (typically far lower-dimensional than the vocabulary size). In a narrow sense, it refers to the embedding layer of a neural network, which learns an embedding matrix during training.
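As a concrete illustration of the narrow sense, PyTorch's nn.Embedding is exactly such a trainable lookup layer. A minimal sketch (the vocabulary size and dimension here are arbitrary choices, not values from the article):

```python
import torch
import torch.nn as nn

# An embedding matrix: 10-word vocabulary, 4-dimensional vectors,
# randomly initialized and updated during training like any other weight.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Indices of a 3-word sentence -> a 3x4 matrix of dense vectors
word_ids = torch.tensor([2, 5, 7])
vectors = embedding(word_ids)
print(vectors.shape)  # torch.Size([3, 4])
```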

4. Training Word Vectors with FastText

4.1 Obtain Training Data

wget -c http://mattmahoney.net/dc/enwik9.zip -P data
unzip data/enwik9.zip -d data

4.2 Train the Model

import fasttext
model = fasttext.train_unsupervised('data/enwik9')

4.3 Set Hyper‑parameters

model = fasttext.train_unsupervised('data/enwik9', model="cbow", dim=300, epoch=1, lr=0.1, thread=8)

4.4 Verify Model Effectiveness

model.get_nearest_neighbors('sports')
model.get_nearest_neighbors('music')
model.get_nearest_neighbors('dog')

4.5 Save and Reload the Model

model.save_model("enwik9.bin")
model = fasttext.load_model("enwik9.bin")

5. Visualizing Word Embeddings

import torch
from torch.utils.tensorboard import SummaryWriter
import fileinput

writer = SummaryWriter()
# 100 random 50-dimensional vectors standing in for trained embeddings
embedded = torch.randn(100, 50)
# One label per row, read from a 100-line vocabulary file
meta = list(map(lambda x: x.strip(), fileinput.FileInput("./vocab100.csv")))
writer.add_embedding(embedded, metadata=meta)
writer.close()

Start TensorBoard to view the embeddings:

tensorboard --logdir runs --host 0.0.0.0

Conclusion

Text tensor representation is the foundation for processing natural language with machine learning. One‑hot encoding, Word2Vec, and other dense embedding techniques transform words into vectors that capture semantic relationships. FastText provides an efficient way to train such embeddings, and hyper‑parameter tuning along with TensorBoard visualization helps assess model quality.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: text representation, fasttext, Word2vec, tensor, one-hot
Written by JavaEdge

First-line development experience at multiple leading tech firms; now a software architect at a Shanghai state-owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.