
Master Regular Expressions in Shell: Practical sed & gawk Guide

This tutorial explains how to create and use regular expressions with the sed editor and gawk program in shell scripts, covering basic concepts, pattern types, special characters, quantifiers, grouping, and real‑world examples such as counting files, validating phone numbers, and verifying email addresses.

Liangxu Linux

Regular Expressions

In shell scripting, sed and gawk rely on regular expressions (regex) to select or transform text streams.

What a regular expression is

A regex is a pattern template that Linux utilities match against input text. If a line of input satisfies the pattern, the line is processed; otherwise it is skipped.

Regex engines

POSIX Basic Regular Expressions (BRE) – the default in most utilities, including sed (which implements only a subset, for speed).

POSIX Extended Regular Expressions (ERE) – used by gawk and many modern tools.

BRE and ERE differ in which metacharacters are special and in the syntax for quantifiers.
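
As a small illustration of that difference, the sketch below runs the same interval match through sed twice: once in its default BRE mode, and once with -E, which switches GNU and BSD sed to ERE. The sample word and the variable names are only illustrative.

```shell
# BRE (sed default): the interval braces must be escaped
bre_out=$(echo "beet" | sed -n '/be\{1,2\}t/p')

# ERE (sed -E): the same interval is written without backslashes
ere_out=$(echo "beet" | sed -E -n '/be{1,2}t/p')

echo "BRE: $bre_out"
echo "ERE: $ere_out"
```

Both commands print the input line; only the spelling of the quantifier changes.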

Basic BRE patterns

Literal text matches itself. Example:

echo "This is a test" | sed -n '/test/p'
This is a test

Regex matching is case‑sensitive; the pattern /this/p fails because the input contains a capital T.

echo "This is a test" | sed -n '/this/p'   # no output
echo "This is a test" | sed -n '/This/p'   # prints the line

Special characters and escaping

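Metacharacters such as ., *, ^, $, and [] have special meaning in every pattern; ?, +, {}, (), and | are also special in ERE. To match one of them literally, prefix it with a backslash (\). In BRE the ERE-only characters are already literal, so do not escape them there: GNU sed treats \+, \?, \|, and \( as operators.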

echo 'The cost is $4.00' | sed -n '/\$/p'   # single quotes stop the shell from expanding $4
The cost is $4.00
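
The same rule applies to the dot. The sketch below, with made-up sample strings, shows what can go wrong when a dot is left unescaped and how the backslash fixes it.

```shell
# An unescaped dot matches any character, so "3x14" slips through
loose=$(echo "3x14" | sed -n '/3.14/p')

# Escaping the dot requires a literal period
strict=$(echo "3x14" | sed -n '/3\.14/p')
literal=$(echo "3.14" | sed -n '/3\.14/p')

printf 'loose=[%s] strict=[%s] literal=[%s]\n' "$loose" "$strict" "$literal"
```

The last line prints loose=[3x14] strict=[] literal=[3.14].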

Anchors

^ anchors a pattern to the start of a line; $ anchors it to the end.

echo "Books are great" | sed -n '/^Books/p'
Books are great

Combining both anchors (e.g. /^book$/) matches a line that consists solely of the word book.
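
A minimal sketch of the combined anchors, using invented input lines. It also shows /^$/, a common use of both anchors with nothing between them, which matches empty lines.

```shell
# Only the line that is exactly "book" survives the double anchor
exact=$(printf 'book\nbooks\nmy book\n' | sed -n '/^book$/p')

# /^$/ matches empty lines; the d command deletes them
compact=$(printf 'one\n\ntwo\n' | sed '/^$/d')

echo "$exact"
echo "$compact"
```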

Dot character

. matches any single character except a newline.

echo "The cat is sleeping" | sed -n '/.at/p'
The cat is sleeping
echo "at ten oclock we" | sed -n '/.at/p'   # no output: no character precedes "at"

Character classes

Square brackets define a set of characters. Ranges can be expressed with a hyphen.

echo "cat" | sed -n '/[ch]at/p'
cat

Case‑insensitive matching can be expressed with a class that includes both cases:

echo "Yes" | sed -n '/[Yy]es/p'
Yes
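
Ranges are easiest to see with digits. The sketch below checks for a five-digit ZIP code; is_zip is a hypothetical helper name used only for this illustration.

```shell
# Five characters, each in the range 0-9, anchored to the whole line
is_zip() {
    echo "$1" | sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p'
}

is_zip "60633"     # prints 60633
is_zip "6063a"     # prints nothing: the last character is not a digit
is_zip "223001"    # prints nothing: six digits do not fit between the anchors
```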

Negated character classes

Placing ^ as the first character inside brackets negates the class.

echo "cat" | sed -n '/[^ch]at/p'   # no match because the preceding character is c or h
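
For contrast, here is a short sketch with a word that does match. The helper name not_ch_at is hypothetical. Note that a negated class must still match one character, so a bare "at" is rejected too.

```shell
# Print the word only if some character other than c or h is followed by "at"
not_ch_at() {
    echo "$1" | sed -n '/[^ch]at/p'
}

not_ch_at "bat"    # prints bat
not_ch_at "cat"    # prints nothing
not_ch_at "at"     # prints nothing: the class must still match one character
```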

Intervals (brace quantifiers)

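In ERE, {m} means exactly m repetitions, {m,} means at least m, and {m,n} means between m and n repetitions; BRE writes the same intervals as \{m,n\}. gawk 4.0 and later recognize this syntax by default; older versions need the --re-interval option, which newer versions still accept.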

echo "bet" | gawk --re-interval '/be{1,2}t/{print}'
bet

Alternation (pipe)

The pipe | provides logical OR between alternatives.

echo "The cat is asleep" | gawk '/cat|dog/{print}'
The cat is asleep

Grouping

Parentheses group sub‑patterns so that quantifiers apply to the whole group.

echo "Saturday" | gawk '/Sat(urday)?/{print}'
Saturday

Practical shell examples

Counting files in $PATH

Convert the colon‑separated $PATH to a space‑separated list, iterate over each directory, and count the entries.

#!/bin/bash
# Count files in each directory listed in $PATH
mypath=$(echo "$PATH" | sed 's/:/ /g')
for dir in $mypath; do
    count=0
    for file in "$dir"/*; do
        [ -e "$file" ] && count=$((count + 1))
    done
    echo "$dir - $count"
done

Validating US phone numbers

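The following ERE matches a US number written as an area code (first digit 2-9, optionally in parentheses), a three-digit exchange, and a four-digit line number, with a space, hyphen, or dot after the area code and after the exchange. Examples are (345) 678-9012, 345-678-9012, 345.678.9012, and 345 678 9012: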

^\(?[2-9][0-9]{2}\)?( |-|\.)[0-9]{3}( |-|\.)[0-9]{4}$

Script isphone reads standard input and prints only the lines that consist of a valid phone number:

#!/bin/bash
# Filter out invalid US phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?( |-|\.)[0-9]{3}( |-|\.)[0-9]{4}$/{print}'

Validating email addresses

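The following regex accepts usernames containing letters, digits, and ._+- and hostnames containing letters, digits, and ._-; the top‑level domain is limited to 2‑5 letters. gawk honors the backslash escapes inside the brackets, whereas POSIX tools such as sed and grep treat a backslash inside brackets as a literal character: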

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Script isemail prints only matching lines.

Key differences between sed and gawk

sed and gawk default to different regex flavors: gawk uses ERE, including {m,n} intervals, while sed uses BRE unless it is started with -E (GNU and BSD sed), which switches it to ERE.

These examples demonstrate how regular expressions enable powerful text filtering, data validation, and automation in shell scripts.
