Shell Scripting & Automation

Robust Scripts: Errors & Safety

The difference between a hobbyist shell script and one that runs unattended in production at 3 AM is not cleverness — it is defensive engineering. Bash's default behavior is dangerously permissive: it continues executing after a command fails, silently expands unset variables to empty strings, and lets a failed command inside a pipeline hide behind the exit code of the last command. None of that is acceptable when your script is deleting old backups, rotating secrets, or deploying to production. This lesson covers the canonical set of guards that every professional shell script must have, and the tooling that enforces them automatically.

The Big Three: set -euo pipefail

These three flags, combined on one line, transform Bash from a forgiving interpreter into a strict one. Place them immediately after the shebang — no exceptions.

    #!/usr/bin/env bash
    set -euo pipefail

Here is what each flag does and why it matters:

  • -e (errexit): Exit immediately when any command returns a non-zero status. Without this flag, Bash happily executes rm -rf /var/data even if the preceding mkdir /var/data silently failed. With -e, the script stops at the point of failure rather than cascading into a worse state.
  • -u (nounset): Treat any reference to an unset variable as an error. The classic disaster looks like: rm -rf "${DEPLOY_DIR}/" where DEPLOY_DIR is a typo or was never exported. Without -u, this silently expands to rm -rf "/". With -u, the script aborts with DEPLOY_DIR: unbound variable.
  • -o pipefail (pipefail): Make a pipeline return the exit status of the rightmost command that failed, rather than always returning the exit status of the last command. Without this, false | true exits 0 — a silent failure swallowed by the pipeline. With pipefail, it exits 1.
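Production pitfall: where -e does not apply. Bash ignores -e for commands tested by if, while, or until, for every command in an && or || list except the last, and for commands negated with !. Subshells created with ( ) inherit -e, but command substitutions drop it by default (Bash 4.4 and later can keep it with shopt -s inherit_errexit), and an assignment such as local out=$(cmd) reports the status of local, not of cmd. Always test failure paths, not just the happy path.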
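The effect of each flag is easy to check in isolation. The following is a minimal sketch rather than part of any template: it runs each one-liner in a child bash process, so a strict-mode abort ends only the child, and records what happened. The variable names and NO_SUCH_VAR_FOR_DEMO are made up for the illustration.

```shell
#!/usr/bin/env bash
# Each case runs in a child bash so that a strict-mode abort ends only the child.

# pipefail: the first stage fails, the last stage succeeds.
plain_status=0
bash -c 'false | true' || plain_status=$?
strict_status=0
bash -c 'set -o pipefail; false | true' || strict_status=$?

# nounset: expanding a variable that was never set (hypothetical name).
nounset_status=0
bash -c 'set -u; unset NO_SUCH_VAR_FOR_DEMO; echo "${NO_SUCH_VAR_FOR_DEMO}"' \
  2>/dev/null || nounset_status=$?

# errexit: the echo after the failing command must never run.
errexit_output=$(bash -c 'set -e; false; echo "still running"' || true)

# Command substitution: by default the inner shell does not keep -e,
# so the echo after the failing command still runs.
substitution_output=$(bash -c 'set -e; out=$(false; echo "survived"); echo "${out}"' || true)

echo "pipeline status without pipefail: ${plain_status}"
echo "pipeline status with pipefail:    ${strict_status}"
echo "nounset child exit status:        ${nounset_status}"
echo "errexit output:      '${errexit_output}'"
echo "substitution output: '${substitution_output}'"
```

The last case is the one most worth remembering: a failure inside $( ) can pass unnoticed even in a script that starts with set -e.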

Traps: Guaranteed Cleanup

A trap is a handler that the shell executes when a specified signal or pseudo-signal occurs. The two traps every production script needs are EXIT and ERR.

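EXIT fires whenever the script terminates: when it exits normally, hits a set -e failure, or is ended by a catchable signal such as SIGINT or SIGTERM (SIGKILL cannot be trapped). Use it to clean up temporary files, release locks, or log completion. ERR fires when a command fails under the same conditions that trigger set -e, making it the right place to emit a structured error message. By default an ERR trap is not inherited by functions, command substitutions, or subshells; enable errtrace (set -E) if failures inside functions should trigger it too.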

    #!/usr/bin/env bash
    set -euo pipefail

    # ---------- cleanup trap ----------
    TMPDIR_WORK=""

    cleanup() {
      local exit_code=$?
      if [[ -n "${TMPDIR_WORK}" && -d "${TMPDIR_WORK}" ]]; then
        rm -rf "${TMPDIR_WORK}"
        echo "[cleanup] removed temp dir ${TMPDIR_WORK}" >&2
      fi
      exit "${exit_code}"
    }
    trap cleanup EXIT

    # ---------- error trap ----------
    on_error() {
      local line=$1
      echo "[ERROR] Script failed at line ${line}" >&2
    }
    trap 'on_error ${LINENO}' ERR

    # ---------- script body ----------
    TMPDIR_WORK=$(mktemp -d)
    echo "Working in ${TMPDIR_WORK}"
    cp /etc/app/config.yml "${TMPDIR_WORK}/"
    # ... do real work ...
    echo "Done"

Several points deserve attention. First, cleanup captures $? immediately, because any subsequent command would overwrite it. Second, the function checks that the temp directory variable is non-empty and that the path actually exists before attempting deletion; this guards against the case where the script failed before mktemp ran. Third, cleanup re-exits with the original exit code so that whatever invoked your script (CI runner, systemd, cron) receives the right status.
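One interaction worth testing on its own is how the ERR trap behaves when the failing command sits inside a function. The sketch below is an illustration under stated assumptions, not a recommended structure: run_case and work are hypothetical names, and each case runs in a child bash so the comparison itself cannot abort.

```shell
#!/usr/bin/env bash
# run_case is a hypothetical helper for this demo: it starts a child bash with
# the given 'set' line, installs an ERR trap, and then fails inside a function.
run_case() {
  local set_line=$1
  bash -c "
    ${set_line}
    trap 'echo \"ERR trap fired\"' ERR
    work() { false; }
    work
  " 2>&1 || true
}

# Same script, with and without errtrace (-E).
without_errtrace=$(run_case 'set -euo pipefail')
with_errtrace=$(run_case 'set -Eeuo pipefail')

echo "set -euo pipefail : '${without_errtrace}'"
echo "set -Eeuo pipefail: '${with_errtrace}'"
```

If the script body lives in functions, adding -E is what lets the error handler report failures that happen inside them.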

Lock files and traps: Scripts that must not run concurrently (database migrations, file-system compactions) use a lock file combined with a trap. Take the lock with flock, or write the script's PID to a file such as /var/run/myscript.pid and remove that file in the EXIT trap. PID files of this kind are a long-standing convention; mysqld_safe and nginx both record their process ID this way.
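A minimal sketch of the PID-file variant might look like the following. It assumes a writable temp location instead of /var/run (which usually needs root), and the names PID_FILE, LOCK_HELD, acquire_lock, and release_lock are hypothetical. It does not detect stale PID files left behind by a killed process; flock avoids that problem where it is available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical path for illustration only.
PID_FILE="${PID_FILE:-${TMPDIR:-/tmp}/myscript.pid}"
LOCK_HELD=false

acquire_lock() {
  # With noclobber, '>' refuses to overwrite an existing file, so creating
  # the PID file doubles as the "is another instance running?" check.
  if ( set -o noclobber; echo "$$" > "${PID_FILE}" ) 2>/dev/null; then
    LOCK_HELD=true
  else
    echo "[lock] ${PID_FILE} exists; another instance may be running" >&2
    return 1
  fi
}

release_lock() {
  # Remove the file only if this process created it.
  if [[ "${LOCK_HELD}" == true ]]; then
    rm -f "${PID_FILE}"
    LOCK_HELD=false
  fi
}
trap release_lock EXIT
```

A script would call acquire_lock once near the top and exit if it returns non-zero; the EXIT trap then removes the file on every exit path that Bash can intercept.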

Defensive Variable Handling

Beyond -u, Bash provides expansion operators that let you express intent precisely and fail fast with a meaningful message.

    #!/usr/bin/env bash
    set -euo pipefail

    # Require the variable to be set AND non-empty; print message on failure
    : "${DATABASE_URL:?DATABASE_URL must be set and non-empty}"

    # Use a default if variable is unset or empty (safe fallback, not an error)
    LOG_LEVEL="${LOG_LEVEL:-info}"

    # Use a default only if unset (but allow empty string)
    TIMEOUT="${TIMEOUT-30}"

    echo "Connecting to ${DATABASE_URL} with log level ${LOG_LEVEL}"

The :? form is the idiomatic guard at the top of any script that depends on environment variables injected by a CI system or secrets manager. When DATABASE_URL is missing, the script stops immediately with a clear message rather than passing an empty string to a downstream command that produces a cryptic error much later.
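The difference between the colon and non-colon forms only shows up when a variable is set but empty. The sketch below compares them side by side; DEMO_TIMEOUT and DEMO_REQUIRED are hypothetical variables used only for this comparison, and the :? guard runs inside a command substitution so that its abort does not end the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Case 1: the variable is unset, so both forms fall back to the default.
unset DEMO_TIMEOUT
unset_dash="${DEMO_TIMEOUT-30}"
unset_colon_dash="${DEMO_TIMEOUT:-30}"

# Case 2: the variable is set but empty. Only ':-' treats empty as missing.
DEMO_TIMEOUT=""
empty_dash="${DEMO_TIMEOUT-30}"
empty_colon_dash="${DEMO_TIMEOUT:-30}"

echo "unset: '-' gives '${unset_dash}', ':-' gives '${unset_colon_dash}'"
echo "empty: '-' gives '${empty_dash}', ':-' gives '${empty_colon_dash}'"

# ':?' aborts the shell that evaluates it, so evaluate it in a subshell and
# capture both the message and the exit status.
unset DEMO_REQUIRED
guard_status=0
guard_message=$( { : "${DEMO_REQUIRED:?DEMO_REQUIRED must be set}"; } 2>&1 ) \
  || guard_status=$?

echo "guard status: ${guard_status}"
echo "guard message: ${guard_message}"
```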

Safe Temporary Files

Hardcoded temp paths like /tmp/my-script.tmp are a security vulnerability (symlink attacks) and a concurrency bug (two instances collide). Always use mktemp.

    #!/usr/bin/env bash
    set -euo pipefail

    WORKDIR=$(mktemp -d)   # unique directory, e.g. /tmp/tmp.XjK3m9
    OUTFILE=$(mktemp)      # unique file, e.g. /tmp/tmp.aB2cP7

    # Safe: predictable prefix, unpredictable suffix
    LOCKFILE=$(mktemp -t deploy.XXXXXX)

    # One EXIT trap for everything: a second 'trap ... EXIT' would replace the first
    trap 'rm -rf "${WORKDIR}" "${OUTFILE}" "${LOCKFILE}"' EXIT
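When temporary paths are created at different points in a script, one way to keep cleanup in a single place is to register each path as it is created and let one EXIT handler remove them all. Bash keeps only one handler per signal, so a later trap on EXIT replaces an earlier one. This is a sketch under that assumption; CLEANUP_PATHS, register_cleanup, and remove_registered are illustrative names, not a standard API.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Paths to delete on exit (illustrative registry, not a built-in feature).
CLEANUP_PATHS=()

register_cleanup() {
  CLEANUP_PATHS+=("$1")
}

remove_registered() {
  local path
  # The ${arr[@]+...} guard keeps 'set -u' quiet on older Bash versions
  # when the array is still empty.
  for path in ${CLEANUP_PATHS[@]+"${CLEANUP_PATHS[@]}"}; do
    rm -rf -- "${path}"
  done
  CLEANUP_PATHS=()
}
trap remove_registered EXIT

# Usage: create first, register immediately afterwards.
work_dir=$(mktemp -d)
register_cleanup "${work_dir}"

out_file=$(mktemp)
register_cleanup "${out_file}"

echo "working in ${work_dir}, writing to ${out_file}"
```

Registration happens in the main shell, not inside $( ), because a change made to the array in a command substitution would be lost when that subshell ends.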

Static Analysis with ShellCheck

ShellCheck is a static analysis tool that finds bugs in shell scripts before you run them. It catches unquoted variables, wrong conditional syntax, POSIX-vs-bash incompatibilities, and dozens of other common mistakes. Many engineering teams run ShellCheck in the CI pipeline as a mandatory lint gate: a shell script that does not pass ShellCheck does not merge.
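To make the first category concrete, here is a small sketch of the runtime bug behind ShellCheck's SC2086 warning (an unquoted expansion is subject to word splitting and globbing). count_args and report_name are hypothetical names; the code only demonstrates the shell behaviour and does not invoke ShellCheck itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Prints how many arguments it received.
count_args() { echo "$#"; }

report_name="quarterly report.txt"

# Unquoted: the shell splits the value on whitespace before the call.
# This is the pattern ShellCheck reports as SC2086.
unquoted_count=$(count_args ${report_name})

# Quoted: the value arrives as a single argument.
quoted_count=$(count_args "${report_name}")

echo "unquoted expansion produced ${unquoted_count} arguments"
echo "quoted expansion produced ${quoted_count} argument"
```

With a command like rm or cp in place of count_args, the unquoted form would operate on two nonexistent files instead of the one intended.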

Figure: Script Safety Layers (ShellCheck, set flags, traps, variable guards)

  • Layer 1, Static Analysis: shellcheck script.sh (runs in CI before merge). Catches unquoted vars, bad syntax, POSIX mismatches.
  • Layer 2, Runtime Guards: set -euo pipefail (stops on failure, unset var, pipe error). Fails fast; prevents silent error propagation.
  • Layer 3, Signal Handlers: trap cleanup EXIT and trap on_error ERR. Guaranteed cleanup; structured error logging.
  • Layer 4, Input Validation: ${VAR:?msg}, mktemp, [[ -f ]], flock.

Four overlapping safety layers that every production-grade Bash script should have.

Install ShellCheck and run it locally before committing:

    # Install
    sudo apt-get install shellcheck   # Debian/Ubuntu
    brew install shellcheck           # macOS

    # Run against a single script
    shellcheck deploy.sh

    # Run against all scripts in the repo (used in CI)
    find . -name "*.sh" -print0 | xargs -0 shellcheck

    # Inline disable for a specific rule (use sparingly, and explain why).
    # Word splitting is intentional here.
    # shellcheck disable=SC2086
    eval ${DYNAMIC_CMD}

ShellCheck integrates with VS Code (shellcheck extension), Vim (ALE), and GitHub Actions (ludeeus/action-shellcheck). A typical CI step looks like:

    - name: Lint shell scripts
      uses: ludeeus/action-shellcheck@master
      with:
        scandir: './scripts'
        severity: warning

The full defensive template: A common practice in mature engineering organizations is to start every production script from a template that combines the techniques in this lesson: shebang, set -euo pipefail, a cleanup trap, an error trap, and validated required variables. Keep such a template in your team's internal toolbox and enforce it through a linter. Scripts that deviate should carry an explicit written justification, not just a comment.

Combining Everything: A Safe Script Skeleton

Here is the canonical skeleton that combines all the techniques in this lesson. Copy it as the starting point for every new production script.

    #!/usr/bin/env bash
    # Description: (one line describing what this script does)
    # Usage: ./script.sh [--dry-run] <target>
    # Author: team-sre@company.com

    # -E (errtrace) lets the ERR trap below fire inside functions such as main
    set -Eeuo pipefail

    # ---- required environment variables ----
    : "${APP_ENV:?APP_ENV must be set (staging|production)}"
    : "${DEPLOY_TOKEN:?DEPLOY_TOKEN must be set}"

    # ---- optional with defaults ----
    LOG_LEVEL="${LOG_LEVEL:-info}"
    DRY_RUN="${DRY_RUN:-false}"

    # ---- temp workspace ----
    WORKDIR=$(mktemp -d)
    LOGFILE=$(mktemp)

    # ---- cleanup on any exit ----
    cleanup() {
      local code=$?
      rm -rf "${WORKDIR}" "${LOGFILE}"
      if [[ ${code} -ne 0 ]]; then
        echo "[FATAL] Script exited with code ${code}" >&2
      fi
      exit "${code}"
    }
    trap cleanup EXIT

    # ---- error location ----
    trap 'echo "[ERROR] at line ${LINENO}: ${BASH_COMMAND}" >&2' ERR

    # ---- main logic ----
    main() {
      echo "[INFO] env=${APP_ENV} dry_run=${DRY_RUN}"
      # ... your commands here ...
    }

    main "$@"

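The main "$@" pattern, which places all logic in a main function and calls it on the last line, means that by the time main runs Bash has read the whole file and every helper function is defined. That prevents subtle bugs caused by calling a function before it is defined, and it makes the script easier to test in isolation. Note that sourcing the skeleton as written still runs main; to source it safely from other scripts or tests, call main only when the file is executed directly, for instance by comparing ${BASH_SOURCE[0]} with $0.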
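A common way to make a main-style script safe to source is to guard the final call so that it runs only when the file is executed directly. The following is a minimal sketch of that idea; greet is a hypothetical helper standing in for real logic, and the ":-" fallback is there only so that set -u does not trip when BASH_SOURCE is empty (for example under bash -c).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper standing in for real work.
greet() {
  echo "hello, ${1}"
}

main() {
  greet "${1:-world}"
}

# When the file is executed, BASH_SOURCE[0] and $0 name the same file.
# When it is sourced, $0 is the caller, so main is skipped and only the
# function definitions are loaded.
if [[ "${BASH_SOURCE[0]:-}" == "${0}" ]]; then
  main "$@"
fi
```

A test script can then source the file and call greet or main with chosen arguments. Keep in mind that sourcing also applies the file's set options to the calling shell.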