Fundamentals and Practical Applications of Text Mining: Workflow, Methods, and a Sentiment Analysis Case Study

This article outlines the end‑to‑end text‑mining workflow, from data acquisition and preprocessing to feature extraction, algorithm selection, and model evaluation, and walks through a sentiment‑analysis case study that combines LDA topic modeling with lexicon‑based sentiment scoring.

JD Tech Talk

Apr 19, 2019

In the era of the Internet, massive amounts of unstructured or semi‑structured textual data contain valuable patterns and knowledge that can be uncovered through text mining, a branch of data mining that extracts previously unknown, understandable, and usable information from text.

1. Text Data Acquisition – Data sources include ready‑made datasets (e.g., patents, academic papers) and web‑crawled information; the latter requires writing or using existing crawlers.

2. Data Preprocessing – Essential steps such as tokenization, numeric and date handling, word embedding, part‑of‑speech tagging, lemmatization, domain‑specific dictionary construction, and stop‑word removal prepare the corpus for feature extraction.
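The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own pipeline: the stop-word list, the domain dictionary, and the `<date>`/`<num>` placeholder tokens are hypothetical, and the regex tokenizer assumes English text. Part-of-speech tagging and lemmatization are left out because they normally rely on an external NLP library.

```python
import re

# Hypothetical stop-word list and domain dictionary, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "on"}
DOMAIN_TERMS = {"check in": "check_in", "room service": "room_service"}

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO-style dates only
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")        # integers and decimals
TOKEN_RE = re.compile(r"[a-z_<>]+")              # words plus placeholder tokens


def preprocess(text):
    """Lowercase, normalize dates and numbers, merge domain terms,
    tokenize, and drop stop words."""
    text = text.lower()
    # Replace dates before numbers so the digits inside a date are not split up.
    text = DATE_RE.sub(" <date> ", text)
    text = NUM_RE.sub(" <num> ", text)
    # Join multi-word domain terms into single tokens.
    for phrase, token in DOMAIN_TERMS.items():
        text = text.replace(phrase, token)
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The room service was slow on 2019-04-19 and cost 30.5 dollars")
```

Mapping numbers and dates to placeholder tokens keeps the vocabulary small while still recording that a quantity or date was mentioned.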

3. Feature Extraction and Selection – After preprocessing, features are obtained with weighting and selection techniques such as TF‑IDF, covariance, mutual information, information gain, cross‑entropy, and genetic algorithms, or with vectorization tools such as scikit‑learn's CountVectorizer, word2vec, and doc2vec. One example uses Gensim’s doc2vec to convert patents into document vectors for similarity comparison.
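To make the idea of comparing documents by their feature vectors concrete, here is a small standard-library sketch. Note the substitution: it uses TF-IDF vectors rather than doc2vec, so that it runs without Gensim, and the three "patent" snippets are invented for illustration. The IDF formula shown is the plain unsmoothed variant; libraries often use smoothed versions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            # Unsmoothed IDF: a term present in every document gets weight 0.
            idf = math.log(n_docs / df[term])
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy documents standing in for patent abstracts.
docs = [
    "battery electrode lithium cell".split(),
    "lithium battery anode cell".split(),
    "image sensor pixel array".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

The two battery documents share weighted terms and score above zero, while the battery and image-sensor documents share no terms and score exactly zero. Dense embeddings such as doc2vec are meant to capture similarity even when documents do not share exact words, which this sketch cannot do.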

4. Mining Algorithms – Traditional machine‑learning classifiers (logistic regression, Naïve Bayes, SVM) built on bag‑of‑words features are contrasted with deep‑learning models (TextCNN, Bi‑LSTM, RNN). Facebook's fastText model is highlighted for its simple linear architecture and speed: it averages word embeddings into a document representation and uses hierarchical softmax to speed up the output layer.
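As an illustration of the traditional bag-of-words approach, the following is a minimal multinomial Naïve Bayes classifier written with the standard library only. It is a sketch under simplifying assumptions (add-one smoothing, unseen words ignored at prediction time), and the four training reviews are invented; it is not the model used in the article.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(n / n_docs)
            for t in tokens:
                if t not in self.vocab:
                    continue  # words never seen in training carry no evidence
                count = self.term_counts[label][t]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set.
train = [
    ("clean room friendly staff", "pos"),
    ("great location clean", "pos"),
    ("dirty room rude staff", "neg"),
    ("noisy dirty bathroom", "neg"),
]
model = NaiveBayes().fit([text.split() for text, _ in train],
                         [label for _, label in train])
pred_a = model.predict("clean friendly".split())
pred_b = model.predict("dirty noisy".split())
```

Because the model treats each review as an unordered bag of words, it cannot tell "not clean" from "clean". Word order is the kind of information that sequence models such as Bi‑LSTM are designed to use.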

5. Model Evaluation – Standard metrics such as accuracy, precision, and error rate are used to assess model performance.
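The three metrics named above follow directly from comparing predicted and true labels. The sketch below assumes a binary task with a designated positive class; the function name and the example labels are illustrative.

```python
def evaluate(y_true, y_pred, positive="pos"):
    """Compute accuracy, precision, and error rate for a binary task."""
    n = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return {
        "accuracy": correct / n,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "error_rate": (n - correct) / n,
    }


y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
metrics = evaluate(y_true, y_pred)
```

Here three of five predictions are correct, and two of the three positive predictions are right, so accuracy is 0.6, error rate is 0.4, and precision is 2/3.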

Case Study: Sentiment Analysis of Online Reviews

The case study aims to analyze consumer sentiment in e‑commerce reviews. The workflow includes data collection (100 positive and 100 negative comments), preprocessing (tokenization, stop‑word removal, and construction of a term‑frequency matrix), LDA topic modeling, and sentiment scoring.

Method 1 – LDA Topic Modeling – LDA is a Bayesian topic model with Dirichlet priors that yields a document‑topic matrix and a topic‑term matrix. The model is estimated with Gibbs sampling, and perplexity is compared across candidate topic counts to choose the number of topics (k=5).
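For readers who want to see how Gibbs sampling produces the two matrices, here is a compact collapsed Gibbs sampler for LDA in plain Python, along with a perplexity function. It is a teaching sketch, not the article's implementation: the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions, and perplexity is computed on the training documents for brevity, whereas held-out documents are the usual choice.

```python
import math
import random


def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA over tokenized documents.

    Returns (theta, phi, vocab): theta is the document-topic matrix,
    phi is the topic-term matrix, vocab maps column index to term.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    ndk = [[0] * k for _ in docs]      # topic counts per document
    nkw = [[0] * v for _ in range(k)]  # term counts per topic
    nk = [0] * k                       # total tokens assigned to each topic
    z = []                             # topic assignment of every token
    for d, doc in enumerate(docs):     # random initialization
        z.append([])
        for w in doc:
            t = rng.randrange(k)
            z[d].append(t)
            ndk[d][t] += 1
            nkw[t][word_id[w]] += 1
            nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, t = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][t] -= 1
                nkw[t][wid] -= 1
                nk[t] -= 1
                # Full conditional p(z = j | all other assignments), unnormalized.
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][wid] + beta) / (nk[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1
                nkw[t][wid] += 1
                nk[t] += 1
    theta = [
        [(ndk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[j][w] + beta) / (nk[j] + v * beta) for w in range(v)]
        for j in range(k)
    ]
    return theta, phi, vocab


def perplexity(docs, theta, phi, vocab):
    """exp of the negative mean per-token log-likelihood; lower is better."""
    word_id = {w: i for i, w in enumerate(vocab)}
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][j] * phi[j][word_id[w]] for j in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Invented toy corpus with two obvious themes.
docs = [
    "room bed clean room".split(),
    "bed room quiet clean".split(),
    "breakfast coffee tasty breakfast".split(),
    "coffee tasty breakfast buffet".split(),
]
theta, phi, vocab = lda_gibbs(docs, k=2)
ppl = perplexity(docs, theta, phi, vocab)
```

To choose the number of topics, one would run `lda_gibbs` for several values of `k` and compare the resulting perplexities, typically picking the point where the curve stops improving noticeably.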

Method 2 – Sentiment Scoring – A sentiment lexicon is built, and pointwise mutual information (PMI) is used to compute the association between topic terms and positive/negative sentiment words. The sentiment score for each topic is obtained by weighting term sentiment with topic probabilities.
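One plausible way to implement this scoring is sketched below. The article does not give its exact formulas, so the details are assumptions: PMI is estimated from document-level co-occurrence, a term's sentiment is taken as its PMI with positive seed words minus its PMI with negative seed words, and a topic's score is the probability-weighted sum over its terms. The seed words, reviews, and topic distributions are invented.

```python
import math


def pmi(term, seeds, docs, eps=1e-9):
    """PMI between a term and a set of seed words, using document co-occurrence.

    docs is a list of token sets; eps keeps the logarithm finite for zero counts.
    """
    n = len(docs)
    has_term = [term in doc for doc in docs]
    has_seed = [any(s in doc for s in seeds) for doc in docs]
    p_term = sum(has_term) / n
    p_seed = sum(has_seed) / n
    p_joint = sum(1 for a, b in zip(has_term, has_seed) if a and b) / n
    return math.log2((p_joint + eps) / (p_term * p_seed + eps))


def term_sentiment(term, docs, pos_seeds, neg_seeds):
    """Positive when the term co-occurs more with positive than negative seeds."""
    return pmi(term, pos_seeds, docs) - pmi(term, neg_seeds, docs)


def topic_sentiment(topic_terms, docs, pos_seeds, neg_seeds):
    """Weight each term's sentiment by its probability under the topic."""
    return sum(
        prob * term_sentiment(term, docs, pos_seeds, neg_seeds)
        for term, prob in topic_terms.items()
    )


# Invented reviews (as token sets) and a two-word sentiment lexicon.
reviews = [
    {"breakfast", "delicious", "good"},
    {"breakfast", "fresh", "good"},
    {"wifi", "slow", "bad"},
    {"wifi", "broken", "bad"},
    {"breakfast", "cold", "bad"},
    {"wifi", "fast", "good"},
]
POS, NEG = {"good"}, {"bad"}

# Hypothetical topic-term probabilities, e.g. rows taken from an LDA topic-term matrix.
food_topic = {"breakfast": 0.7, "wifi": 0.3}
network_topic = {"wifi": 0.8, "breakfast": 0.2}

food_score = topic_sentiment(food_topic, reviews, POS, NEG)
network_score = topic_sentiment(network_topic, reviews, POS, NEG)
```

In this toy data "breakfast" appears twice with "good" and once with "bad", so it receives a positive score, and the food-dominated topic comes out positive, while the wifi-dominated topic comes out negative.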

Results and Evaluation – The five topics correspond to hotel facilities, food & accommodation, hotel features, surrounding environment, and location. Topics 1, 2, and 4 show negative sentiment, while topics 3 and 5 are positive, leading to actionable insights for service improvement.

Conclusion – The article summarizes the text‑mining pipeline and demonstrates its practical value through a sentiment‑analysis example, encouraging further exploration of deep‑learning techniques in text mining.
