Text Processing: grep, sed & awk
In a production environment, raw text is everywhere — application logs containing thousands of lines, nginx access logs growing at tens of thousands of requests per minute, CSV exports from monitoring systems, configuration files spanning hundreds of servers. The ability to search, extract, transform, and summarise that text from the command line — without writing a Python script, without opening a file in an editor, without waiting for a dashboard to load — is one of the highest-leverage skills a DevOps engineer can possess. Three tools do the heavy lifting: grep, sed, and awk. Each has a distinct purpose and the three compose cleanly through pipes.
grep selects lines. sed transforms text (character-level substitution and deletion). awk computes over structured fields (arithmetic, conditionals, aggregates). When in doubt, use the weakest tool that solves the problem — a grep one-liner is faster to write, faster to run, and easier for the next engineer to read than an equivalent awk program.
grep — Searching Text at Scale
grep prints every line from its input that matches a pattern. Its name comes from the ed editor command g/re/p (globally match a regular expression and print). The default pattern language is basic regular expressions (BRE); the -E flag enables extended regular expressions (ERE), and -P enables Perl-compatible regular expressions (PCRE) on systems that support it.
Flags you will use daily in production:
- -i: case-insensitive match
- -v: invert match (print lines that do NOT match)
- -r / -R: recursive search through directories
- -l: print only file names that contain a match
- -n: prefix each output line with its line number
- -c: print a count of matching lines per file
- -A N / -B N / -C N: print N lines After / Before / around each match (context)
- -o: print only the matching portion of the line, not the whole line
- -E: extended regex (alternation |, quantifiers + and ?, grouping with ())
- --color=auto: highlight matches (set this in your shell profile)
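As a rough illustration of a few of these flags, the sketch below runs them against a small made-up log. The file name app.log and its contents are hypothetical, created in a temporary directory only for this example.

```shell
# Hypothetical sample log, written to a temp dir for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/app.log" <<'EOF'
2024-01-01 10:00:01 INFO  service started
2024-01-01 10:00:05 ERROR db connection refused
2024-01-01 10:00:06 WARN  retrying in 5s
2024-01-01 10:00:11 error db connection refused
2024-01-01 10:00:12 INFO  shutting down
EOF

# -c with -i: count lines that contain "error" in any letter case.
error_count=$(grep -ci 'error' "$workdir/app.log")

# -v drops INFO lines; -n prefixes each surviving line with its line number.
non_info=$(grep -vn 'INFO' "$workdir/app.log")

# -o with -E: print only the matched text, using ERE alternation.
levels=$(grep -oE 'ERROR|WARN' "$workdir/app.log")

echo "error_count=$error_count"
echo "$non_info"
echo "$levels"
```

Because -o prints one match per line, its output is convenient to feed into sort and uniq -c when you want a frequency table.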
grep -F for literal strings: When your search pattern contains characters that are special in regex (., *, [, $), use grep -F (fixed string) to skip regex interpretation entirely. Searching for 10.0.1.5 with plain grep matches any character where the dots appear. grep -F "10.0.1.5" matches the literal string. This matters when searching logs for IP addresses, stack trace class names (e.g. java.lang.NullPointerException), or any string that looks like a regex.
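The difference can be seen with two made-up lines, only one of which contains the literal address. The sample text is hypothetical.

```shell
# Two hypothetical log lines: only the first contains the literal address.
sample=$(printf '%s\n' 'client 10.0.1.5 connected' 'client 10a0b1c5 connected')

# Plain grep: each "." matches any single character, so both lines match.
regex_hits=$(printf '%s\n' "$sample" | grep -c '10.0.1.5')

# grep -F: the pattern is a literal string, so only the first line matches.
fixed_hits=$(printf '%s\n' "$sample" | grep -cF '10.0.1.5')

echo "regex: $regex_hits, fixed: $fixed_hits"   # prints: regex: 2, fixed: 1
```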
sed — Stream Editing for Transformation
sed (stream editor) reads input line by line, applies a script of editing commands, and writes to standard output. It never modifies files in place unless you pass -i. The single most-used sed command is s/pattern/replacement/flags — substitute. Understanding sed means understanding this one command deeply.
The substitution flags that matter:
- g: replace all occurrences on the line (not just the first)
- i: case-insensitive match (GNU sed)
- p: print the line after substitution (useful with -n to print only changed lines)
- N (a number such as 2 or 3): replace only the Nth occurrence on the line
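A minimal sketch of how these flags change the result, using a made-up input line. The GNU-only i flag is left out so the example should run on both GNU and BSD sed.

```shell
line='foo foo Foo foo'   # hypothetical input

# No flag: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/foo/bar/')   # bar foo Foo foo

# g: every match on the line is replaced (matching stays case-sensitive).
all=$(printf '%s\n' "$line" | sed 's/foo/bar/g')         # bar bar Foo bar

# 2: only the second match on the line is replaced.
second=$(printf '%s\n' "$line" | sed 's/foo/bar/2')      # foo bar Foo foo

# -n suppresses the default output; p prints only lines where a
# substitution actually happened.
changed=$(printf '%s\n' 'alpha' 'beta' | sed -n 's/alpha/ALPHA/p')   # ALPHA

printf '%s\n' "$first_only" "$all" "$second" "$changed"
```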
sed -i differences between GNU and BSD: On Linux (GNU sed), sed -i '' 's/a/b/' file fails because -i takes its optional backup suffix attached to the flag, so the separate empty string is read as the sed script and the real script is treated as a file name. On macOS (BSD sed), -i requires a suffix argument, and sed -i '' is the correct in-place edit without backup. To write portable scripts, use sed -i.bak (suffix attached, always creates a backup) or invoke perl -pi -e instead, which behaves consistently. In Dockerfiles and CI pipelines targeting Linux containers this distinction rarely bites you, but it will the moment a teammate runs your script on a Mac.
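A small sketch of the portable form, assuming a hypothetical config file named site.conf created in a temporary directory:

```shell
workdir=$(mktemp -d)
printf 'listen 80;\n' > "$workdir/site.conf"   # hypothetical config file

# -i.bak (suffix attached, no space) is accepted by both GNU and BSD sed.
sed -i.bak 's/listen 80;/listen 8080;/' "$workdir/site.conf"

cat "$workdir/site.conf"       # the edited file
cat "$workdir/site.conf.bak"   # the untouched backup
```

If the backups are unwanted, a script can delete the .bak file once it has verified the edit.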
awk — Field-Oriented Data Processing
awk is a complete programming language designed around the concept of records and fields. By default, it splits each input line (record) on whitespace into fields ($1, $2, ..., $NF for the last field, $0 for the whole line). It runs a pattern-action program against every record. The canonical form is awk '/pattern/ { action }'.
Built-in variables and special patterns that appear constantly in real scripts:
- NR: current line (record) number across all files
- NF: number of fields in the current record
- FS: field separator (default: whitespace; set with -F)
- OFS: output field separator (default: space)
- $0: the entire current line
- $1 ... $NF: individual fields
- BEGIN { }: runs once before any input is read
- END { }: runs once after all input is consumed
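To make the pattern-action form concrete, here is a minimal sketch over made-up request records. The column layout (method, path, status, bytes) is an assumption for this example only.

```shell
# Hypothetical input: method, path, status, bytes (whitespace-separated).
requests=$(printf '%s\n' \
  'GET /index.html 200 512' \
  'GET /missing 404 128' \
  'POST /api/login 200 256' \
  'GET /api/users 500 64')

# Pattern-action: for records whose third field is 200, add the last field.
# END runs once after all input; NR then holds the total record count.
summary=$(printf '%s\n' "$requests" |
  awk '$3 == 200 { total += $NF } END { print total, NR }')

# -F sets the input separator; OFS controls how print joins its arguments.
users=$(printf 'root:x:0\nalice:x:1000\n' |
  awk -F: 'BEGIN { OFS = "=" } { print $1, $3 }')

echo "$summary"   # prints: 768 4
echo "$users"
```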
Composing the Three Tools: Real-World Pipelines
The real power of these tools emerges when you pipe them together. Each tool in the chain is a specialist that does one thing extremely well. You read the chain left-to-right as a data-processing narrative.
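One possible pipeline, sketched against a made-up access log. The four-column layout (ip, method, path, status) is an assumption for illustration, and sort is added at the end to order the report.

```shell
# Hypothetical access log: ip, method, path, status.
log=$(printf '%s\n' \
  '10.0.0.1 GET /api/users?id=1 500' \
  '10.0.0.2 GET /api/users?id=2 200' \
  '10.0.0.1 GET /api/orders?id=9 500' \
  '10.0.0.3 GET /api/users?id=3 500')

# grep keeps only 5xx lines, sed strips query strings,
# awk counts requests per path, sort puts the noisiest path first.
report=$(printf '%s\n' "$log" |
  grep -E ' 5[0-9]{2}$' |
  sed 's/?[^ ]*//' |
  awk '{ count[$3]++ } END { for (p in count) print count[p], p }' |
  sort -rn)

echo "$report"
```

Read left to right, the chain says: select the failures, normalise the paths, then aggregate.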
Performance Considerations at Scale
When processing log files that are gigabytes in size — common in production — tool choice and order have real performance implications:
- Put grep first in the pipeline. It rejects lines early, reducing the data that sed and awk have to process. Filtering 100 million lines down to 10,000 before handing them to awk is orders of magnitude faster than feeding all 100 million to awk.
- Use grep -F for fixed strings. Regex matching has overhead; plain string matching (Boyer-Moore algorithm) is significantly faster when you do not need regex.
- Prefer awk over a shell loop for column arithmetic. A shell loop handles one line per iteration and typically forks one or more subprocesses per line (cut, expr, bc and the like); awk processes millions of lines in a single process invocation.
- For very large files, consider LC_ALL=C. Prefixing a command with LC_ALL=C disables multi-byte character handling and can make grep and awk 3-5x faster on ASCII-dominant logs.
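The sketch below contrasts a shell loop with a single awk invocation for the same column sum, and shows LC_ALL=C applied to one command. The generated data is hypothetical, and the example makes no timing claims; measure with time on your own logs.

```shell
# Hypothetical data: 1000 lines with a number in column 2.
data=$(awk 'BEGIN { for (i = 1; i <= 1000; i++) print "row", i }')

# Shell loop: correct, but the shell interprets one line per iteration.
loop_sum=0
while read -r label n; do
  loop_sum=$((loop_sum + n))
done <<EOF
$data
EOF

# awk: the same arithmetic in a single process.
awk_sum=$(printf '%s\n' "$data" | awk '{ s += $2 } END { print s }')

# LC_ALL=C applies only to this one command; -F skips regex matching.
c_hits=$(printf '%s\n' "$data" | LC_ALL=C grep -cF 'row 10')

echo "loop=$loop_sum awk=$awk_sum hits=$c_hits"
```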
Save the pipelines you rely on in a scripts/ directory. Pipelines that live only in your shell history are operational knowledge that vanishes when you leave the team. Document the input format, the expected output, and the external dependencies at the top of the file.
Common Failure Modes
- grep returns exit code 1 (no match) and aborts a script under set -e. In scripts that use set -e (covered in the next lesson), a grep that finds nothing exits with code 1, which aborts the script when grep is the last command of its pipeline or when pipefail is set. Guard against this with grep ... || true when zero matches is an acceptable outcome.
- sed -i on a symlink replaces the link, not the target. GNU sed writes a new file and renames it over the path you named, so by default the symlink becomes a regular file and the original target is left unchanged. When your config file is a symlink (common in Ansible-managed environments), pass --follow-symlinks (GNU sed) to edit the target instead, and verify the result with ls -la after editing.
- awk field numbering is 1-based, not 0-based. $0 is the whole line; $1 is the first field. Engineers from Python or Go backgrounds often pick the wrong field number and silently get garbage output, so always verify with a small sample first.
- Locale-sensitive regex in awk. Character classes like [[:alpha:]] match differently depending on the locale. Set LC_ALL=C for predictable, ASCII-only matching in production scripts.
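The set -e guard from the first item can be sketched as follows. The log content is made up, and the status variable is only one way to tell "no match" from a real error.

```shell
set -e   # abort the script on any unguarded non-zero exit status

# Hypothetical log with no ERROR lines.
log='INFO service started'

# Without "|| true", grep's exit status 1 (no match) would abort here.
errors=$(printf '%s\n' "$log" | grep 'ERROR' || true)

# grep exits 1 for "no match" and 2 or higher for a real failure,
# so capturing the status lets a script react to each case separately.
status=0
printf '%s\n' "$log" | grep -q 'ERROR' || status=$?

echo "errors='${errors}' status=${status}"   # prints: errors='' status=1
```

Capturing the status is the stricter choice: || true also hides genuine failures such as an unreadable file.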
With grep, sed, and awk in your toolkit — and an understanding of how to compose them — you can answer in seconds questions that would otherwise require writing a Python script, loading data into a database, or waiting for a BI dashboard to refresh. At scale, that speed is the difference between a 10-minute incident and a 10-second diagnosis.