Mastering Awk: From Basics to Advanced Text Processing
This comprehensive guide introduces Awk, explains its command‑line syntax, script structure, patterns, built‑in variables, arrays, functions, operators, statements, I/O handling, and practical examples, enabling readers to harness Awk for powerful text processing tasks on Unix-like systems.
Table of Contents
What is Awk
Command‑line Syntax
Script Composition
Pattern
Regular Expression
Expressions
Arrays
Built‑in Variables: Delete ARGV element, Add ARGV element, ARGV and ARGC, CONVFMT and OFMT, ENVIRON, RLENGTH and RSTART
Operators
Statement
Math Functions
String Functions: sub, gsub, index, length, match, split, sprintf, substr, tolower, toupper
I/O Functions: getline, close, system
Awk, together with sed and grep, is often called the "three swords" of Linux. While all three can match text, sed and awk can also edit text, whereas grep cannot. Sed is a non‑interactive, stream‑oriented editor; awk is a pattern‑matching programming language that supports variables, functions, loops, and conditional statements, making it more powerful than simple command‑line tools.
Using Awk you can:
Treat a text file as a database of records and fields.
Use variables while processing the database.
Perform arithmetic and string operations.
Employ common programming structures such as conditionals and loops.
Format output.
Define custom functions.
Execute UNIX commands within an Awk script.
Process the output of UNIX commands.
We start with the most basic command‑line syntax and gradually explore Awk’s programming capabilities.
Command‑line Syntax
Awk has two forms of command‑line syntax, similar to sed: awk [options] 'program' [file ...] and awk [options] -f progfile [file ...].
The program part is analogous to a sed script; it consists of a sequence of pattern { action } pairs. When a record matches a pattern, the associated action is executed. In the first form, the program must appear as the first non‑option argument.
Awk parses input into records (default separator is a newline) and fields (default separator is whitespace). The record separator can be changed with the built‑in variable RS, and the field separator with FS or the -F option. Fields are accessed as $1, $2, …, while $0 holds the entire record.
Standard command‑line options include:
-F ERE: set the field separator to an extended regular expression.
-f progfile: read an Awk script from a file (multiple -f files are concatenated in order).
-v assignment: assign a variable before processing begins (e.g., -v var=value).
Examples:
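A minimal sketch of the first form combined with -F; the sample input is made up for illustration.

```shell
# Split each record on ":" and print the second field.
printf 'alpha:beta:gamma\n' | awk -F: '{ print $2 }'
# prints: beta
```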
Access a variable set with -v inside the script:
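One way this might look, assuming a hypothetical variable named greeting:

```shell
# -v assigns greeting before the BEGIN block runs, so BEGIN can already read it.
awk -v greeting=hello 'BEGIN { print greeting, "world" }'
# prints: hello world
```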
The BEGIN pattern runs before any input is processed; END runs after all input has been processed.
Records and Fields
In a database, a table consists of records (rows) and fields (columns). Awk treats a text file similarly: each line is a record, split into fields by the field separator. You can change the separator, for example to colon for /etc/passwd:
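A small sketch using one illustrative line in /etc/passwd format rather than the real file:

```shell
# Field 1 is the user name and field 7 is the login shell.
printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F: '{ print $1, $7 }'
# prints: root /bin/bash
```

Replacing the printf pipeline with /etc/passwd as a file argument should apply the same program to the real file.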
Access fields with $1, $2, …, $NF (last field) and $(NF-1) (second‑last). The built‑in variable NF holds the number of fields in the current record.
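The field references above can be sketched on a made-up four-field record:

```shell
# NF is the field count; $NF and $(NF-1) are the last and second-last fields.
printf 'a b c d\n' | awk '{ print NF, $NF, $(NF-1) }'
# prints: 4 d c
```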
Script Composition
An Awk script is a series of pattern { action } blocks. If the pattern is omitted, the action runs for every input line. A simple example that prints each line:
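A minimal sketch of an action with no pattern, using made-up input:

```shell
# With no pattern, the action runs for every record; print alone prints $0.
printf 'first\nsecond\n' | awk '{ print }'
```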
Functions can be defined as:
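A sketch of a function definition; square is a hypothetical helper chosen for illustration.

```shell
awk '
# Functions are defined at the top level, next to the pattern-action blocks.
function square(x) {
    return x * x
}
BEGIN { print square(7) }'
# prints: 49
```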
Function parameters are local; every other variable is global, even one that is first assigned inside a function:
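The scoping rule can be sketched as follows; bump and count are hypothetical names.

```shell
awk '
# n is a parameter, so changes to it stay inside bump.
# count is not a parameter, so the assignment changes the global variable.
function bump(n) {
    n = n + 1
    count = count + 1
}
BEGIN {
    n = 10
    bump(n)
    print n, count
}'
# prints: 10 1
```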
Statements can be separated by newlines or semicolons; a backslash (\) can continue a long statement onto the next line:
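A short sketch of both separators and of line continuation; the variable names are arbitrary.

```shell
awk 'BEGIN {
    a = 1; b = 2          # two statements on one line, separated by a semicolon
    total = a + b + \
            10            # the backslash joins this line to the previous one
    print total
}'
# prints: 13
```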
Pattern
Patterns determine when an action is executed. Types include:
/regular expression/: extended regular expression.
Relational expression (e.g., $1 > 5).
BEGIN: runs before the first record.
END: runs after the last record.
pattern, pattern: address range, similar to sed.
Example: print lines containing the digit 3:
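A minimal sketch with made-up numeric input:

```shell
# A regular-expression pattern selects the records that match it.
printf '12\n13\n35\n40\n' | awk '/3/ { print }'
# prints 13 and 35, one per line
```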
Negate a pattern with !:
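The same made-up input, with the pattern negated:

```shell
# ! inverts the match, so only records without a 3 are printed.
printf '12\n13\n35\n40\n' | awk '!/3/ { print }'
# prints 12 and 40, one per line
```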
Logical AND (&&) and OR (||) can combine patterns:
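A sketch of both operators on the same illustrative input; with no action, the default is to print the record.

```shell
# AND: records that contain both a 1 and a 3
printf '12\n13\n35\n40\n' | awk '/1/ && /3/'
# prints: 13

# OR: records that contain a 2 or a 4
printf '12\n13\n35\n40\n' | awk '/2/ || /4/'
# prints 12 and 40, one per line
```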
Match a field with an expression: $n ~ /ere/:
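A minimal sketch that tests only the second field of made-up input:

```shell
# Print the first field of records whose second field starts with "b".
printf 'a bar\nb foo\n' | awk '$2 ~ /^b/ { print $1 }'
# prints: a
```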
Print only the first line:
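One way to do this is a relational pattern on NR, sketched here with made-up input:

```shell
# NR is the current record number, so NR == 1 is true only for the first line.
printf 'one\ntwo\nthree\n' | awk 'NR == 1'
# prints: one
```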
Regular Expression
For a thorough review of regular‑expression syntax, see the POSIX specification or related articles.
Expressions
Expressions combine constants, variables, operators, and functions. Variables may be user‑defined, built‑in (uppercase), or field variables ($n). An uninitialized variable has both the string value "" (empty string) and the numeric value 0; which one applies depends on the context in which it is used.
Arrays
Arrays are associative; indices can be numbers or strings. Assign with array[index] = value, iterate with for (item in array), and test membership with if (item in array). Complete example:
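A small sketch that counts occurrences of a made-up first field; count is a hypothetical array name.

```shell
printf 'apple\npear\napple\n' | awk '
{ count[$1]++ }                 # index the array by the first field
END {
    if ("apple" in count)       # membership test that does not create the element
        print "apple", count["apple"]
    n = 0
    for (item in count) n++     # traversal order is unspecified
    print n, "distinct"
}'
# prints:
# apple 2
# 2 distinct
```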
Built‑in Variables
Awk provides many built‑in variables. Important ones include:
ARGC : number of command‑line arguments (size of ARGV).
ARGV : array of command‑line arguments (excluding options).
CONVFMT : format for converting numbers to strings (default "%.6g").
OFMT : format for numbers when printed (default "%.6g").
ENVIRON : associative array of environment variables.
FILENAME : name of the current input file.
NR : total number of records read so far.
FNR : number of records read from the current file.
FS : field separator (default a single space, which splits on runs of whitespace).
NF : number of fields in the current record.
RS : record separator (default newline).
OFS : output field separator (default a single space).
ORS : output record separator (default newline).
RLENGTH : length of the substring matched by match().
RSTART : start position of the substring matched by match().
ARGV and ARGC
Similar to C’s int main(int argc, char **argv). ARGV[0] holds the command name, the remaining elements hold file names and variable assignments, and ARGC is the number of elements. Example usage:
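A sketch that lists the arguments; a.txt and x=1 are placeholder arguments, and because the program has only a BEGIN block, no file is actually opened.

```shell
awk 'BEGIN {
    print ARGC                      # counts ARGV[0] as well
    for (i = 1; i < ARGC; i++)      # start at 1 to skip the command name
        print i, ARGV[i]
}' a.txt x=1
# prints:
# 3
# 1 a.txt
# 2 x=1
```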
You can modify ARGV to add, delete, or replace elements. Deleting an element skips the corresponding file:
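A sketch with two temporary files created only for this example; the file names are made up.

```shell
tmp=$(mktemp -d)
printf 'from a\n' > "$tmp/a.txt"
printf 'from b\n' > "$tmp/b.txt"

# Removing ARGV[1] in BEGIN means a.txt is never opened.
awk 'BEGIN { delete ARGV[1] } { print }' "$tmp/a.txt" "$tmp/b.txt"
# prints: from b

rm -r "$tmp"
```

Assigning the empty string to the element (ARGV[1] = "") is another way to make awk skip it.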
Adding an element:
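A sketch that appends a file to ARGV from inside BEGIN; extra is a hypothetical variable carrying the path of a temporary file.

```shell
tmp=$(mktemp -d)
printf 'from a\n' > "$tmp/a.txt"

# No file is given on the command line; BEGIN adds one before input starts.
awk -v extra="$tmp/a.txt" 'BEGIN {
    ARGV[ARGC] = extra
    ARGC++
}
{ print }'
# prints: from a

rm -r "$tmp"
```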
CONVFMT and OFMT
CONVFMT controls how numbers are converted to strings internally (for example, during concatenation); the default is "%.6g". Changing it:
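A minimal sketch; the value of x is arbitrary.

```shell
awk 'BEGIN {
    x = 3.14159265
    s = x ""                # concatenation forces a number-to-string conversion
    print s
    CONVFMT = "%.2f"
    s = x ""
    print s
}'
# prints 3.14159, then 3.14
```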
OFMT affects number‑to‑string conversion during output:
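A sketch of the effect on print, again with an arbitrary value:

```shell
awk 'BEGIN {
    x = 3.14159265
    print x                 # uses the default OFMT
    OFMT = "%.2f"
    print x
    print 42                # integer values are printed as integers
}'
# prints 3.14159, then 3.14, then 42
```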
ENVIRON
ENVIRON is an associative array of environment variables. Example:
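A minimal sketch that reads one variable, assuming HOME is set in the environment:

```shell
# The index is the variable name as a string.
awk 'BEGIN { print ENVIRON["HOME"] }'
```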
You can pass values to Awk via environment variables:
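A sketch using a hypothetical variable named TARGET_USER, set only for this one command:

```shell
TARGET_USER=alice awk 'BEGIN { print "user:", ENVIRON["TARGET_USER"] }'
# prints: user: alice
```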
Iterate over ENVIRON:
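A sketch of the loop; the output depends on the environment, and the order is unspecified.

```shell
awk 'BEGIN {
    for (name in ENVIRON)
        print name "=" ENVIRON[name]
}'
```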
RLENGTH and RSTART
Both are set by match(). RLENGTH is the length of the matched substring; RSTART is its start position (1‑based). Example:
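A minimal sketch on a made-up string:

```shell
awk 'BEGIN {
    if (match("foo123bar", /[0-9]+/))
        print RSTART, RLENGTH
}'
# prints: 4 3
```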
Operators
Awk supports arithmetic, relational, logical, string concatenation, and ternary operators. See the Expressions in awk section of the man page for a complete list.
Statement
Common statements include print, printf, delete, break, continue, exit, and next. Example of printf:
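A sketch of printf with widths and precision; the input record is made up.

```shell
# %-10s left-aligns a string, %3d pads an integer, %6.2f fixes two decimals.
printf 'widget 4 2.5\n' | awk '{ printf "%-10s %3d %6.2f\n", $1, $2, $2 * $3 }'
# prints: widget       4  10.00
```

Unlike print, printf does not append a newline, so the format string carries its own \n.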
break exits a loop; continue skips to the next iteration. delete removes an array element. exit stops reading input, runs the END block (when called outside END), and then terminates the program. next skips the rest of the current record and reads the next one.
Output redirection examples:
Write specific columns to separate files:
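A sketch that writes into a temporary directory; dir, names.txt, and values.txt are hypothetical names.

```shell
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' |
awk -v dir="$tmp" '{
    print $1 > (dir "/names.txt")    # opened once, then reused for later records
    print $2 > (dir "/values.txt")
}'
cat "$tmp/names.txt" "$tmp/values.txt"
# prints a, b, 1, 2, one per line
rm -r "$tmp"
```

The parentheses around the concatenated file name keep the redirection target unambiguous across implementations.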
Pipe output to a command (e.g., sort -n):
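A minimal sketch with made-up numbers:

```shell
# Every print feeds the same sort process; its output appears when awk exits.
printf '10\n9\n100\n' | awk '{ print $1 | "sort -n" }'
# prints 9, 10, 100, one per line
```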
Math Functions
Awk provides standard math functions: atan2(y,x), cos(x), sin(x), exp(x), log(x), sqrt(x), int(x), rand() (returns a random number in [0,1)), and srand([expr]) to set the seed.
Example of generating a random number:
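A minimal sketch; some implementations start from the same seed on every run, so the value may repeat between runs.

```shell
awk 'BEGIN { print rand() }'
```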
Set a different seed with srand() to obtain different sequences across runs:
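A sketch that seeds from the time of day; two runs within the same second may still produce the same value.

```shell
awk 'BEGIN { srand(); print rand() }'
```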
Generate an integer between 1 and n:
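One common approach, sketched with n passed in through -v:

```shell
# rand() * n lies in [0, n); int() truncates it to 0..n-1; adding 1 gives 1..n.
awk -v n=6 'BEGIN { srand(); print int(rand() * n) + 1 }'
```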
String Functions
Awk includes many string manipulation functions.
sub
sub(ere, repl[, in]) replaces the first occurrence of ere with repl in the string in (default $0) and returns the number of replacements (0 or 1).
Example:
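A minimal sketch on a made-up record:

```shell
# Only the first "one" is replaced; n holds the replacement count.
echo 'one two one' | awk '{ n = sub(/one/, "1"); print n, $0 }'
# prints: 1 1 two one
```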
In the replacement string, & represents the matched text.
Example using &:
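A short sketch that wraps the matched text in brackets:

```shell
echo 'hello world' | awk '{ sub(/world/, "[&]"); print }'
# prints: hello [world]
```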
gsub
gsub(ere, repl[, in]) performs a global substitution (all matches).
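For comparison with sub, a sketch on the same made-up record:

```shell
# Both occurrences are replaced, so the returned count is 2.
echo 'one two one' | awk '{ n = gsub(/one/, "1"); print n, $0 }'
# prints: 2 1 two 1
```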
index
index(s, t) returns the position (1‑based) of substring t in s, or 0 if not found.
Example:
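A minimal sketch showing a hit and a miss:

```shell
awk 'BEGIN { print index("foobar", "bar"), index("foobar", "xyz") }'
# prints: 4 0
```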
length
length([s]) returns the length of string s; if omitted, $0 is used.
Example:
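A sketch of both the default and the explicit argument:

```shell
# length() measures $0; length("awk") measures the given string.
echo 'hello' | awk '{ print length(), length("awk") }'
# prints: 5 3
```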
match
match(s, ere) searches s for the regular expression ere. It returns the start position or 0 if no match, and sets RSTART and RLENGTH.
Example:
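A sketch that combines match with substr to extract the matched text; the message string is made up.

```shell
awk 'BEGIN {
    s = "error code 404 returned"
    if (match(s, /[0-9]+/))
        print substr(s, RSTART, RLENGTH)
}'
# prints: 404
```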
split
split(s, a[, fs]) splits string s into array a using field separator fs (default FS). Returns the number of fields.
Example:
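A minimal sketch that splits a made-up date string; parts is a hypothetical array name.

```shell
awk 'BEGIN {
    n = split("2024-01-15", parts, "-")
    for (i = 1; i <= n; i++)
        print i, parts[i]
}'
# prints:
# 1 2024
# 2 01
# 3 15
```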
The order of for (i in a) is unspecified, so iterate with for (i=1; i<=n; i++) to visit the elements in their original order.
sprintf
sprintf(fmt, expr, ...) works like printf but returns the formatted string instead of printing.
Example:
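A short sketch that stores the formatted text in a variable before printing it:

```shell
# %05.1f pads to width 5 with zeros and keeps one decimal place.
awk 'BEGIN { s = sprintf("%05.1f|%s", 3.14159, "pi"); print s }'
# prints: 003.1|pi
```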
substr
substr(s, m[, n]) returns the substring of s starting at position m (1‑based) with length n. If n is omitted, the rest of the string is returned.
Example:
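A minimal sketch of both call forms:

```shell
awk 'BEGIN { s = "abcdef"; print substr(s, 2, 3), substr(s, 4) }'
# prints: bcd def
```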
tolower / toupper
tolower(s) converts s to lower case; toupper(s) converts to upper case.
Examples:
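A minimal sketch of both functions on the same string:

```shell
awk 'BEGIN { print tolower("Hello World"); print toupper("Hello World") }'
# prints hello world, then HELLO WORLD
```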
I/O Functions
getline
expression | getline [var] reads a line from the output of the command expression; getline [var] < file reads a line from a file instead. If var is supplied, the line is stored there; otherwise $0 and NF are updated.
Example reading from a file:
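A sketch that reads a temporary file line by line; file and line are hypothetical variable names.

```shell
tmp=$(mktemp)
printf 'alpha\nbeta\n' > "$tmp"

awk -v file="$tmp" 'BEGIN {
    # getline returns 1 for a line, 0 at end of file, -1 on error,
    # so comparing with > 0 also stops the loop when the file is missing.
    while ((getline line < file) > 0)
        print "read:", line
    close(file)
}'
# prints read: alpha, then read: beta

rm "$tmp"
```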
Without a variable, the line becomes the current record:
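A sketch using a command pipe as the source; the echo command stands in for any command that prints a line.

```shell
awk 'BEGIN {
    "echo one two three" | getline      # no variable: $0 and NF are updated
    print NF, $2
}'
# prints: 3 two
```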
close
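close("command") closes a pipe or file opened by getline or by output redirection; the argument must be the same string that opened it. Use it with care: closing a stream inside the loop that reads it reopens the stream from the start on the next read, which can loop forever.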
Example:
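A sketch that runs the same command twice; cmd, first, and second are hypothetical names.

```shell
awk 'BEGIN {
    cmd = "echo hello"
    cmd | getline first
    close(cmd)                 # without this, the next read would hit end of output
    cmd | getline second       # the command is started again
    print first, second
}'
# prints: hello hello
```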
system
system("command") executes an external command.
Example:
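A minimal sketch that captures the return value, which is normally the command's exit status:

```shell
awk 'BEGIN {
    status = system("exit 3")
    print "exit status:", status
}'
# prints: exit status: 3
```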
Conclusion
This article provides a concise yet comprehensive overview of Awk, covering its syntax, script structure, patterns, built‑in variables, arrays, functions, operators, statements, and I/O handling. Readers are encouraged to experiment with the examples and explore Awk’s powerful text‑processing capabilities.
MaGe Linux Operations
