How to Convert Text into Numerical Features for NLP: Tokenization, One‑Hot Encoding, and Word Embedding

This article walks through the essential steps of turning raw natural language into machine‑readable numbers, covering categorical vs. numerical features, one‑hot encoding of categorical data, tokenization, building vocabularies, and using word embeddings, illustrated with an IMDB sentiment‑analysis example in Keras.

Lisa Notes

Natural Language Processing (NLP) aims to enable computers to understand, process, and generate human language. To feed text into machine‑learning models, the text must be transformed into numerical representations.

1. Data Processing – Categorical vs. Numerical Features

Machine‑learning models operate on numbers, so every feature must be numeric before training. Age (e.g., 25) is already a numerical feature, whereas attributes such as gender ("male", "female") or nationality are categorical features. Mapping categories directly to integers (e.g., USA = 1, China = 2, India = 3) invites meaningless arithmetic: 1 + 2 = 3 would suggest that USA + China = India. One‑hot encoding avoids this by representing each category as a binary vector containing a single 1; for countries, a 197‑dimensional vector can be used, one dimension per country.
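A minimal sketch of one‑hot encoding in plain Python, using only the three countries from the example rather than all 197. The helper names build_category_index and one_hot are illustrative and do not come from any library.

```python
def build_category_index(categories):
    """Assign each distinct category a position, in order of first appearance."""
    index = {}
    for category in categories:
        if category not in index:
            index[category] = len(index)
    return index


def one_hot(category, index):
    """Return a binary vector with a single 1 at the category's position."""
    vector = [0] * len(index)
    vector[index[category]] = 1
    return vector


countries = ["USA", "China", "India"]
country_index = build_category_index(countries)

usa = one_hot("USA", country_index)
china = one_hot("China", country_index)
india = one_hot("India", country_index)

# Adding two one-hot vectors does not produce a third country's code.
usa_plus_china = [a + b for a, b in zip(usa, china)]
```

With integer codes, USA + China collides with India; with one‑hot vectors the sum has two 1s and matches no single country.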

2. Text Processing – Tokenization

Tokenization splits a sentence into its smallest encoding units, here words. For example, the sentence “…To be or not to be, it is a question…” becomes the token list ["to", "be", "or", "not", "to", "be", "it", "is", "a", "question"], which contains eight distinct tokens: ["to", "be", "or", "not", "it", "is", "a", "question"]. The frequency of each distinct token is counted, and the tokens are mapped to integer indices in order of descending frequency, forming a vocabulary.
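The tokenization and vocabulary steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Keras implementation: the regular expression, the function names tokenize and build_vocabulary, and the choice to start indices at 1 (leaving 0 free for padding) are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(tokens):
    """Map each distinct token to an index starting at 1, most frequent first."""
    counts = Counter(tokens)
    # most_common() sorts by count; tokens with equal counts keep first-seen order.
    ranked = counts.most_common()
    return {token: rank for rank, (token, _) in enumerate(ranked, start=1)}


tokens = tokenize("To be or not to be, it is a question")
vocabulary = build_vocabulary(tokens)

# Replace every token with its vocabulary index.
sequence = [vocabulary[token] for token in tokens]
```

Here "to" and "be" each occur twice, so they receive the two smallest indices; the remaining tokens occur once and are numbered in the order they first appear.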

One‑hot encoding can then represent each token as a binary vector whose dimension equals the vocabulary size.

3. Word Embedding

One‑hot vectors for a large vocabulary (e.g., more than 100,000 English words) lead to extremely high dimensionality and computational cost. Word embeddings instead map each word to a dense, low‑dimensional vector, reducing dimensionality while preserving semantic relationships between words.
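An embedding can be viewed as a matrix with one row per token index, so looking up a word is the same as multiplying its one‑hot vector by that matrix. The sketch below shows this equivalence with toy sizes. The values are random placeholders (in a real model they are learned during training), and VOCAB_SIZE, EMBEDDING_DIM, embed and embed_via_one_hot are names chosen for this illustration.

```python
import random

VOCAB_SIZE = 8      # number of distinct tokens (toy value)
EMBEDDING_DIM = 3   # length of each dense vector (toy value)

rng = random.Random(0)
# One row per token index; the extra row 0 is left for a padding index.
embedding_matrix = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE + 1)
]


def embed(index):
    """Look up the dense vector for a token index."""
    return embedding_matrix[index]


def embed_via_one_hot(index):
    """Multiply a one-hot row vector by the matrix; this selects the same row."""
    one_hot = [1.0 if i == index else 0.0 for i in range(VOCAB_SIZE + 1)]
    return [
        sum(one_hot[i] * embedding_matrix[i][d] for i in range(VOCAB_SIZE + 1))
        for d in range(EMBEDDING_DIM)
    ]


dense = embed(3)
same_dense = embed_via_one_hot(3)
```

The direct lookup avoids ever building the one‑hot vector, which is why each word costs EMBEDDING_DIM numbers rather than a vector as long as the vocabulary.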

Practical Example – IMDB Sentiment Classification

The IMDB dataset contains 50,000 English movie reviews, split into 25,000 training and 25,000 test samples. The task is to predict whether a review is positive or negative. The workflow in Keras consists of four steps: Tokenization, Build Dictionary, One‑hot Encoding, and Align Sequence.

Environment Setup

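Experimental environment setup:
   Install Anaconda, create a new environment named "nlp-eng" (used for English text tasks, a personal habit), and configure the Keras framework;
   Set up Jupyter Notebook: change the default path and install the table‑of‑contents extension.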

Data Loading Challenges

The reviews are stored as individual .txt files, so they must be gathered before tokenization. With Python’s os module, os.listdir enumerates the files in a folder and os.path.join builds each file's full path; every file is then read and appended to a list, texts_train. The Tokenizer in Keras expects a list of strings, so this list is passed to tokenizer.fit_on_texts(texts_train) to build the dictionary.

tokenizer.fit_on_texts(texts_train)   # build the dictionary from the training data
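The file‑reading step might look like the sketch below. The function name read_reviews, the UTF‑8 encoding, and the demo file names and contents are assumptions for illustration; the actual dataset folders are not shown in the source, so the demo writes two small files to a temporary directory instead.

```python
import os
import tempfile


def read_reviews(directory):
    """Read every .txt file in a directory into a list of strings."""
    texts = []
    # os.listdir returns names in arbitrary order, so sort for reproducibility.
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(directory, filename)
        with open(path, encoding="utf-8") as handle:
            texts.append(handle.read())
    return texts


# Demo on a throwaway directory so the sketch runs without the dataset.
demo_files = [
    ("review_a.txt", "A wonderful film."),
    ("review_b.txt", "Dull and far too long."),
]
with tempfile.TemporaryDirectory() as demo_dir:
    for name, body in demo_files:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    texts_train = read_reviews(demo_dir)
```

The resulting texts_train is a plain list of strings, which is the input shape that fit_on_texts expects.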

After building the dictionary, each review is converted to a sequence of token indices, padded to a uniform length, and fed into an embedding layer followed by a classifier.
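The conversion and alignment steps can be sketched in plain Python. The helpers to_index_sequences and pad_to_length are hypothetical names for this example, and the tiny vocabulary is invented. Padding and truncating at the front of each sequence is a design choice here; to my understanding it matches the default behaviour of the Keras pad_sequences utility, but treat that as an assumption to check against the Keras documentation.

```python
def to_index_sequences(token_lists, vocabulary):
    """Replace each token with its index, skipping tokens not in the vocabulary."""
    return [
        [vocabulary[token] for token in tokens if token in vocabulary]
        for tokens in token_lists
    ]


def pad_to_length(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


# Toy vocabulary; index 0 is deliberately unused so it can mark padding.
vocabulary = {"the": 1, "movie": 2, "was": 3, "great": 4}
reviews = [
    ["the", "movie", "was", "great"],
    ["great", "movie"],
    ["the", "plot", "was", "great", "great", "movie"],  # "plot" is out of vocabulary
]

sequences = to_index_sequences(reviews, vocabulary)
aligned = pad_to_length(sequences, maxlen=4)
```

Every row of aligned has the same length, so the batch can be stacked into a single rectangular array before it reaches the embedding layer.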

Two practical tips are worth noting: missing positions in a short review are filled with index 0, which is kept free for that purpose rather than assigned to a word, and all sequences must be aligned to the same length before they are fed to the model.
