Master Linux Text Analysis: Count Words, Characters, and Frequencies with Simple Shell Scripts
This guide shows how to use basic Linux commands such as wc, tr, fold, sort, and uniq to create a text file from a manual page, then extract and rank the most common words and characters, handle case sensitivity, remove punctuation, and filter long uncommon words.
Linux’s command line offers powerful one‑liners for text analysis. The article demonstrates how to generate a sample file containing the manual page of the man command and then apply a series of shell pipelines to count and rank words and characters.
Creating the sample text file
Run the following command to capture the man output into linuxmi.com.txt:
man man > linuxmi.com.txt
The resulting file contains the full manual page, which serves as the data source for subsequent analyses.
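Before building longer pipelines, wc gives a quick size check of any text file. The sketch below is illustrative only: it uses a small temporary file with made-up contents (not the manual page) so that the numbers are predictable, and the variable names are assumptions.

```shell
# Hypothetical two-line sample file, so the example does not depend on man.
sample=$(mktemp)
printf 'one two three\nfour five\n' > "$sample"

# -l counts lines, -w counts words, -m counts characters (-c would count bytes).
# The $(( )) wrapper strips the padding some wc implementations add.
lines=$(( $(wc -l < "$sample") ))
words=$(( $(wc -w < "$sample") ))
chars=$(( $(wc -m < "$sample") ))

echo "lines=$lines words=$words chars=$chars"   # lines=2 words=5 chars=24
rm -f "$sample"
```

Pointing the same three wc calls at linuxmi.com.txt reports the size of the captured manual page.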
Finding the most frequent words
Use this pipeline to list the top ten words (case‑insensitive, punctuation removed):
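cat linuxmi.com.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head
Each stage does one job: tr ' ' '\012' puts every space-separated word on its own line, the next two tr calls lowercase the text and delete punctuation, grep -v '[^a-z]' drops any line that still contains a character other than a lowercase letter, and sort | uniq -c | sort -rn | head counts the words and keeps the ten most frequent.
The output shows word counts such as 5773 the, 90 man, etc.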
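The same idea can be wrapped in a small function for reuse. This is a sketch of a variant, not the pipeline above verbatim: it splits on every run of non-letters with tr -cs instead of splitting on single spaces, which avoids empty lines. The function name top_words and the sample sentence are made up for illustration.

```shell
# top_words [N]: read text on stdin, print the N most frequent words (default 10).
top_words() {
    tr '[:upper:]' '[:lower:]' |      # fold case so "The" and "the" match
    tr -cs '[:alpha:]' '\n' |         # turn each run of non-letters into one newline
    grep -v '^$' |                    # drop an empty first line, if any
    sort | uniq -c | sort -rn |       # count identical words, highest count first
    head -n "${1:-10}"
}

# Made-up sample text; prints "3 the" then "2 dog" (counts are left-padded).
printf 'The cat saw the dog. The dog ran.\n' | top_words 2
```

With the sample file from this guide, the call would be top_words < linuxmi.com.txt.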
Displaying individual characters
To split a string into one‑character lines:
echo 'www.linuxmi.com' | fold -w1
The result lists each character on its own line.
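Once every character sits on its own line, ordinary line tools can work on characters. As a minimal sketch (the helper name count_char is hypothetical, and ASCII input is assumed), the following counts how often one given character occurs:

```shell
# count_char CHAR: read text on stdin, print how many times CHAR occurs.
# -F treats CHAR as a literal string, -x requires the whole line to match,
# -c prints the number of matching lines instead of the lines themselves.
count_char() {
    fold -w1 | grep -c -F -x -e "$1"
}

echo 'www.linuxmi.com' | count_char w   # prints 3
echo 'www.linuxmi.com' | count_char .   # prints 2
```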
Most frequent characters
Count and rank characters in the manual file:
fold -w1 < linuxmi.com.txt | sort | uniq -c | sort -rn | head
Typical output includes counts for symbols, letters, and digits.
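To see what the ranking looks like on input small enough to check by hand, here is a sketch that wraps the pipeline in a function. The name char_freq and the sample word are assumptions made for illustration.

```shell
# char_freq [N]: read text on stdin, print the N most frequent characters (default 10).
char_freq() {
    fold -w1 |                  # one character per line
    sort | uniq -c |            # group identical characters and count them
    sort -rn |                  # highest count first
    head -n "${1:-10}"
}

# Prints "3 a", "2 n", "1 b" (counts are left-padded by uniq -c).
printf 'banana\n' | char_freq 3
```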
Case‑insensitive character frequency
Convert to uppercase before counting to ignore case:
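fold -w1 < linuxmi.com.txt | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -rn | head -20
The case conversion comes before sort so that, for example, every a and A reaches uniq -c as one adjacent run of A lines. This yields counts for letters like E, A, T, etc.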
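The position of tr relative to sort matters because uniq -c only merges identical lines that are next to each other. The sketch below compares both orders on a made-up four-character input; the function names are hypothetical, and LC_ALL=C is set on sort only to make the ordering reproducible.

```shell
# Both functions read text on stdin and print "count CHAR" lines.
# LC_ALL=C makes sort order lines by byte value (all uppercase before lowercase).
sort_then_upper() {
    fold -w1 | LC_ALL=C sort | tr '[:lower:]' '[:upper:]' | uniq -c
}
upper_then_sort() {
    fold -w1 | tr '[:lower:]' '[:upper:]' | LC_ALL=C sort | uniq -c
}

printf 'aAbB\n' | sort_then_upper   # four lines: A and B each appear twice with count 1
printf 'aAbB\n' | upper_then_sort   # two lines: 2 A and 2 B
```

In locales whose collation interleaves upper and lower case, the first order may happen to produce merged counts, but it is not something to rely on.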
Removing punctuation from the count
Exclude punctuation symbols before counting:
fold -w1 < linuxmi.com.txt | tr '[:lower:]' '[:upper:]' | tr -d '[:punct:]' | sort | uniq -c | sort -rn | head -20
Processing multiple files together
Combine several text files and compute character frequencies:
cat *.txt | fold -w1 | tr '[:lower:]' '[:upper:]' | tr -d '[:punct:]' | sort | uniq -c | sort -rn | head -8
Finding rare long words (≥10 characters)
List uncommon words that are at least ten characters long:
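cat linuxmi.com.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..................' | head
Here sort -n puts the lowest counts first, so the rarest words lead the list. The grep pattern is 18 dots because it runs on the output of uniq -c: with GNU uniq the count occupies the first eight columns (a seven-character padded number plus a space), so a line of 18 or more characters holds a word of at least ten.
These one‑liners illustrate how simple Unix utilities can be combined to perform powerful text statistics without installing additional software.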
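The long-word filter can also be written with an explicit length test, which does not depend on how uniq -c pads its counts. This is a sketch of an alternative rather than the pipeline above: it uses awk to keep words of at least a given length and splits on non-letters with tr -cs instead of deleting punctuation and digits. The function name rare_long_words and the sample sentence are made up.

```shell
# rare_long_words [MIN]: read text on stdin, list the rarest words that are
# at least MIN characters long (default 10), lowest count first.
rare_long_words() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alpha:]' '\n' |                     # one word per line
    awk -v min="${1:-10}" 'length($0) >= min' |   # keep only long words
    sort | uniq -c | sort -n | head
}

# Prints "1 directories", "1 documentation", "2 configuration" (counts left-padded).
printf 'Configuration files and configuration directories differ from documentation.\n' |
    rare_long_words
```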
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
