How to Convert Text into Tensors: One‑Hot, Word2Vec, FastText & Visualization
This tutorial explains how to transform raw text into tensor representations using one‑hot encoding, Word2Vec, and FastText, provides step‑by‑step code examples, discusses their advantages and drawbacks, and shows how to visualize embeddings with TensorBoard.
1. What is Text Tensor Representation?
Text tensor representation converts textual data into a tensor (usually a matrix) so that each word becomes a vector and the sequence of vectors forms a matrix representing the whole text.
['人生','该','如何','起头'] => [[1.32, 4.32, 0.32, 5.2 ],
                               [3.1,  5.43, 0.34, 3.2 ],
                               [3.21, 5.32, 2,    4.32],
                               [2.54, 7.32, 5.12, 9.54]]

2. Why Use Tensor Representations?
Representing text as tensors enables computers to process and understand natural language for downstream analysis and tasks such as classification, similarity search, and generation.
3. Common Methods
3.1 One‑Hot Encoding
Each word is encoded as a vector of length *n* (the vocabulary size) with a single 1 at the word’s index and 0 elsewhere.
['改变','要','如何','起手'] => [[1, 0, 0, 0],
                               [0, 1, 0, 0],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]]

Implementation Example
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
from keras.preprocessing.text import Tokenizer  # with newer Keras: from tensorflow.keras.preprocessing.text import Tokenizer

vocab = {"周杰伦", "陈奕迅", "王力宏", "李宗盛", "吴亦凡", "鹿晗"}

# Fit the tokenizer so every word in the vocabulary gets an integer index
t = Tokenizer(num_words=None, char_level=False)
t.fit_on_texts(vocab)

for token in vocab:
    zero_list = [0] * len(vocab)
    # Tokenizer indices start at 1, so shift down by 1 for list positions
    token_index = t.texts_to_sequences([token])[0][0] - 1
    zero_list[token_index] = 1
    print(token, "one-hot encoding:", zero_list)

# Persist the fitted tokenizer so the same word-to-index mapping can be reused
tokenizer_path = "./Tokenizer"
joblib.dump(t, tokenizer_path)

Pros: Simple and easy to implement.
Cons: Cannot capture relationships between words; high dimensionality for large vocabularies leads to high memory usage.
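To reuse the saved mapping later, reload the tokenizer and rebuild a one-hot vector; a minimal sketch, assuming the file saved above:

import joblib

# Restore the fitted tokenizer and encode one word from the same vocabulary
t = joblib.load("./Tokenizer")
token = "李宗盛"
zero_list = [0] * len(t.word_index)
zero_list[t.texts_to_sequences([token])[0][0] - 1] = 1
print(token, "one-hot encoding:", zero_list)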
3.2 Word2Vec
Word2Vec is an unsupervised learning technique that produces dense word vectors by training a shallow neural network. It offers two training architectures:
3.2.1 CBOW (Continuous Bag of Words)
Predicts a target word from its surrounding context words.
3.2.2 Skip‑Gram
Predicts surrounding context words from a given target word.
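This tutorial trains its vectors with FastText in section 4, but as an illustration of the two architectures, here is a minimal Word2Vec sketch using the gensim library (an assumption; gensim is not part of the original code, and the toy corpus is far too small for meaningful vectors):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["natural", "language", "processing"],
             ["deep", "learning", "for", "language"]]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skip_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(skip_model.wv["language"].shape)  # (100,) dense vector for one word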
3.3 Word Embedding
Word embedding is the general term for mapping words onto dense, fixed-length real-valued vectors; in this broad sense, the vectors produced by Word2Vec or FastText are word embeddings. In a narrow sense, it refers to the embedding layer of a neural network, which learns an embedding matrix during training.
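A minimal sketch of the narrow sense, using a PyTorch nn.Embedding layer (the vocabulary size and dimensions here are arbitrary):

import torch
import torch.nn as nn

# A learnable lookup table: 6 words in the vocabulary, 4 dimensions per vector
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)

token_ids = torch.tensor([0, 2, 5])  # integer indices of three words
vectors = embedding(token_ids)       # tensor of shape (3, 4)
print(vectors.shape)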
4. Training Word Vectors with FastText
4.1 Obtain Training Data
wget -c http://mattmahoney.net/dc/enwik9.zip -P data
unzip data/enwik9.zip -d data

4.2 Train the Model
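One practical note: the official fastText word-vector tutorial first strips enwik9's XML/Wiki markup before training, roughly:

perl wikifil.pl data/enwik9 > data/fil9

Training directly on the raw dump, as below, also runs, but markup tokens will end up in the vocabulary.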
import fasttext

# With no extra arguments, train_unsupervised uses the skipgram model
model = fasttext.train_unsupervised('data/enwik9')

4.3 Set Hyper‑parameters
model = fasttext.train_unsupervised('data/enwik9', "cbow", dim=300, epoch=1, lr=0.1, thread=8)4.4 Verify Model Effectiveness
model.get_nearest_neighbors('sports')
model.get_nearest_neighbors('music')
model.get_nearest_neighbors('dog')
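Beyond eyeballing nearest neighbors, two words can be compared directly; a small sketch using numpy (the word pair here is arbitrary):

import numpy as np

# Cosine similarity between two word vectors as another sanity check
v1 = model.get_word_vector('sports')
v2 = model.get_word_vector('football')
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))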
4.5 Save and Reload the Model

model.save_model("enwik9.bin")
model = fasttext.load_model("enwik9.bin")

5. Visualizing Word Embeddings
import torch
import fileinput
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # writes event files under ./runs by default

# Demo data: 100 random 50-dimensional vectors standing in for trained embeddings
embedded = torch.randn(100, 50)

# One label per line in vocab100.csv; the count must match the number of vectors
meta = list(map(lambda x: x.strip(), fileinput.FileInput("./vocab100.csv")))

writer.add_embedding(embedded, metadata=meta)
writer.close()

Start TensorBoard to view the embeddings:
tensorboard --logdir runs --host 0.0.0.0
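To visualize the fastText vectors trained in section 4 instead of random demo data, a small sketch (assuming the enwik9.bin model saved above; the TensorBoard command is unchanged):

import fasttext
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter

model = fasttext.load_model("enwik9.bin")
words = model.get_words()[:100]  # first 100 vocabulary words by frequency
vectors = torch.from_numpy(np.stack([model.get_word_vector(w) for w in words]))

writer = SummaryWriter()
writer.add_embedding(vectors, metadata=words)
writer.close()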
Conclusion

Text tensor representation is the foundation for processing natural language with machine learning. One‑hot encoding gives a simple but sparse representation, while Word2Vec, FastText, and other dense embedding techniques transform words into vectors that capture semantic relationships. FastText provides an efficient way to train such embeddings, and hyper‑parameter tuning along with TensorBoard visualization helps assess model quality.