Master Linux Text Analysis: Count Words, Characters, and Frequencies with Simple Shell Scripts

This guide shows how to use basic Linux commands such as wc, tr, fold, sort, and uniq to create a text file from a manual page, then extract and rank the most common words and characters, handle case sensitivity, remove punctuation, and filter long uncommon words.

Liangxu Linux

Jul 18, 2021

Linux’s command line offers powerful one‑liners for text analysis. The article demonstrates how to generate a sample file containing the manual page of the man command and then apply a series of shell pipelines to count and rank words and characters.

Creating the sample text file

Run the following command to capture the output of man into linuxmi.com.txt:

man man > linuxmi.com.txt

The resulting file contains the full manual page, which serves as the data source for the analyses that follow.
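Before analyzing the file, it can help to confirm how much text it holds. The following is a minimal sketch using wc; the function name file_stats and the throwaway sample file are assumptions made for illustration, so that the example runs even where man is not installed.

```shell
# Print line, word, and byte counts for the file named in $1.
# file_stats is a hypothetical helper, not part of the original article.
file_stats() {
    # tr strips the leading padding that some wc implementations add.
    printf 'lines: %s\n' "$(wc -l < "$1" | tr -d '[:space:]')"
    printf 'words: %s\n' "$(wc -w < "$1" | tr -d '[:space:]')"
    printf 'bytes: %s\n' "$(wc -c < "$1" | tr -d '[:space:]')"
}

# Demo on a small temporary file (hypothetical data).
sample=$(mktemp)
printf 'one two three\nfour five\n' > "$sample"
file_stats "$sample"
# lines: 2
# words: 5
# bytes: 24
rm -f "$sample"
```

Running file_stats linuxmi.com.txt after the man command above should report non-zero counts if the capture worked.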

Finding the most frequent words

Use this pipeline to list the top ten words (case‑insensitive, punctuation removed):

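cat linuxmi.com.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -E '^[a-z]+$' | sort | uniq -c | sort -rn | head

Each output line is a count followed by a word, most frequent first, with common words such as the and man near the top. The grep keeps only non‑empty, purely alphabetic tokens, so the blank lines produced by runs of spaces are not counted. Exact counts depend on the man page installed on your system.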
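The word-ranking pipeline can also be wrapped in a small function that reads standard input, which makes it easy to try on short strings. This is a sketch under assumptions: the name top_words and its optional count argument are hypothetical, and it splits on any whitespace with tr -s rather than on spaces only.

```shell
# Rank words read from standard input by frequency.
# $1: how many words to show (defaults to 10). top_words is a hypothetical name.
top_words() {
    tr -s '[:space:]' '\n' |            # one token per line, squeezing repeats
        tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" match
        tr -d '[:punct:]' |             # drop punctuation characters
        grep -E '^[a-z]+$' |            # keep non-empty alphabetic tokens only
        sort | uniq -c | sort -rn |     # count, then order by count descending
        head -n "${1:-10}"
}

# Demo on a short hypothetical sentence: "the" appears three times.
printf 'The cat and the dog.\nThe end\n' | top_words 1
```

On the real data, top_words < linuxmi.com.txt gives the same kind of ranking as the one‑liner above.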

Displaying individual characters

To split a string into one‑character lines:

echo 'www.linuxmi.com' | fold -w1

The result lists each character on its own line.

Most frequent characters

Count and rank characters in the manual file:

fold -w1 < linuxmi.com.txt | sort | uniq -c | sort -rn | head

Typical output includes counts for symbols, letters, and digits.

Case‑insensitive character frequency

Convert to uppercase before sorting and counting, so that a and A end up adjacent and are tallied together:

fold -w1 < linuxmi.com.txt | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -rn | head -20

This yields counts for letters like E, A, T, etc.

Removing punctuation from the count

Delete punctuation before splitting the text into characters, so that the removed symbols do not leave empty lines behind to be counted:

tr -d '[:punct:]' < linuxmi.com.txt | fold -w1 | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -rn | head -20
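If only letters matter, a variation is to keep alphabetic characters and discard everything else (spaces, digits, punctuation, newlines) in one step. The sketch below swaps in tr -cd '[:alpha:]' for the punctuation-deleting step; the function name letter_freq and its optional count argument are hypothetical.

```shell
# Rank letters read from standard input, ignoring case.
# $1: how many letters to show (defaults to 20). letter_freq is a hypothetical name.
letter_freq() {
    tr -cd '[:alpha:]' |                # keep letters, delete every other character
        tr '[:lower:]' '[:upper:]' |    # fold case before counting
        fold -w1 |                      # one letter per line
        sort | uniq -c | sort -rn |     # count, then order by count descending
        head -n "${1:-20}"
}

# Demo on a hypothetical string: L occurs three times in "Hello, World!".
printf 'Hello, World!\n' | letter_freq 1
```

Because the filter is a whitelist, the ranking contains no blank or whitespace entries.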

Processing multiple files together

Combine several text files and compute character frequencies:

cat *.txt | tr -d '[:punct:]' | fold -w1 | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -rn | head -8
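Concatenating the files gives one combined ranking. When a per-file view is more useful, the same steps can run in a loop. This is a sketch only: top_letter is a hypothetical helper, and the two demo files are throwaway data created in a temporary directory.

```shell
# Print the single most frequent letter (uppercased) in the file named in $1.
# top_letter is a hypothetical helper, not part of the original article.
top_letter() {
    tr -cd '[:alpha:]' < "$1" |
        tr '[:lower:]' '[:upper:]' |
        fold -w1 |
        sort | uniq -c | sort -rn |
        head -n 1 | awk '{print $2}'
}

# Demo on two small files (hypothetical data).
dir=$(mktemp -d)
printf 'banana\n' > "$dir/a.txt"
printf 'bookkeeper\n' > "$dir/b.txt"
for f in "$dir"/*.txt; do
    printf '%s: %s\n' "${f##*/}" "$(top_letter "$f")"
done
# a.txt: A
# b.txt: E
rm -rf "$dir"
```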

Finding rare long words (≥10 characters)

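List uncommon words that are at least ten characters long. The final grep matches lines of 18 or more characters because GNU uniq -c prefixes each word with a count right‑aligned in seven columns plus a space, which leaves at least ten characters for the word itself:

cat linuxmi.com.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..................' | head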
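The length filter can also be applied to the words themselves, before counting, which avoids depending on how uniq -c pads its output. The following sketch does that with awk; the function name rare_long_words and both of its optional arguments are hypothetical.

```shell
# List the least frequent long words read from standard input.
# $1: minimum word length (defaults to 10)
# $2: how many words to show (defaults to 10)
# rare_long_words is a hypothetical name.
rare_long_words() {
    tr -s '[:space:]' '\n' |                        # one token per line
        tr '[:upper:]' '[:lower:]' |                # fold case
        tr -d '[:punct:][:digit:]' |                # drop punctuation and digits
        awk -v minlen="${1:-10}" 'length($0) >= minlen' |  # keep long words only
        sort | uniq -c | sort -n |                  # count, rarest first
        head -n "${2:-10}"
}

# Demo on a hypothetical sentence: only the two 13-letter words survive,
# and "documentation" (seen once) is listed before "configuration" (seen twice).
printf 'configuration of the configuration documentation\n' | rare_long_words
```

Passing a different first argument, for example rare_long_words 15 < linuxmi.com.txt, changes the length threshold without editing a regular expression.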

These one‑liners illustrate how simple Unix utilities can be combined to perform powerful text statistics without installing additional software.
