As scripts grow beyond a screen of text, a common trap emerges: copy-pasted blocks, tangled logic, and a maintenance nightmare the moment requirements change. Top-tier engineering teams treat shell scripts like any other codebase — with functions, clear interfaces, and exit-code contracts. This lesson covers how to write functions that are safe, composable, and testable, and how to structure scripts so a new team member can navigate them confidently at 3 AM during an incident.
Declaring and Calling Functions
Bash supports two syntactically equivalent ways to declare a function. The POSIX-compatible form is preferred for portability:
#!/usr/bin/env bash
set -euo pipefail
# POSIX-compatible declaration (preferred)
log_info() {
echo "[INFO] $(date +%Y-%m-%dT%H:%M:%S) $*"
}
# Bash-specific declaration (works, but less portable)
function log_error {
echo "[ERROR] $(date +%Y-%m-%dT%H:%M:%S) $*" >&2
}
# Calling functions — identical to calling any command
log_info "Deployment started"
log_error "Disk space below threshold"
Two important points: first, a function must be defined before it is called — Bash reads top to bottom. Second, notice $* passes all arguments as a single string, while "$@" preserves word boundaries and is preferred when forwarding arguments to sub-commands.
Local Variables: Keeping State Contained
Without the local keyword every variable you set inside a function is global. In a script with twenty functions this produces subtle, hard-to-debug collisions. The rule at production scale: every variable inside a function is local unless it is intentionally shared.
#!/usr/bin/env bash
set -euo pipefail
get_free_disk_mb() {
local mount_point="${1:-/}"
local free_kb
free_kb=$(df -k "$mount_point" | awk 'NR==2 {print $4}')
echo $(( free_kb / 1024 ))
}
check_disk_space() {
local threshold_mb="${1:-500}"
local free_mb
free_mb=$(get_free_disk_mb "/")
if (( free_mb < threshold_mb )); then
log_error "Low disk space: ${free_mb}MB free (threshold: ${threshold_mb}MB)"
return 1
fi
log_info "Disk OK: ${free_mb}MB free"
}
check_disk_space 1000
Notice how free_kb and free_mb are both declared local. When get_free_disk_mb returns, those names vanish and cannot pollute the caller's scope.
Production pitfall — the unset local trap:local varname=$(command) always exits with code 0, even when the command fails. This is because local itself is the last command and succeeds. With set -e, the failure is swallowed. Split the declaration: local varname on one line, then varname=$(command) on the next.
Exit Codes as a Contract
Functions communicate success or failure to their callers via exit codes, exactly like any Unix command. Code 0 means success; any non-zero value means failure. This is the core contract that enables if statements, &&/|| chaining, and the behavior of set -e to work correctly.
#!/usr/bin/env bash
set -euo pipefail
wait_for_service() {
local host="$1"
local port="$2"
local max_retries="${3:-30}"
local retry_interval=2
local attempt=0
while (( attempt < max_retries )); do
if nc -z -w 2 "$host" "$port" 2>/dev/null; then
log_info "Service ${host}:${port} is reachable"
return 0
fi
(( attempt++ )) || true
log_info "Waiting for ${host}:${port} — attempt ${attempt}/${max_retries}"
sleep "$retry_interval"
done
log_error "Service ${host}:${port} did not become reachable after ${max_retries} attempts"
return 1
}
# The caller decides what to do with success or failure
if ! wait_for_service "db.internal" 5432 15; then
log_error "Database unreachable — aborting deployment"
exit 1
fi
The return statement sets the function exit code. When a function ends without an explicit return, the exit code of the last command executed becomes the function's exit code — a useful default, but one you should be conscious of.
Key idea — exit code semantics: Standardize your non-zero codes across a script. Using return 1 for every failure works in small scripts; for larger automation platforms, define named exit codes at the top (readonly ERR_DISK=2 ERR_NETWORK=3 ERR_AUTH=4) so monitoring systems can categorize failures automatically.
Capturing Return Values
Bash functions cannot return arbitrary strings the way most languages do — return only carries an integer. The idiomatic pattern is to echo the result to stdout and capture it with command substitution.
#!/usr/bin/env bash
set -euo pipefail
# Function "returns" a value by printing to stdout
get_git_sha() {
local length="${1:-7}"
git rev-parse --short="${length}" HEAD
}
current_sha=$(get_git_sha)
long_sha=$(get_git_sha 12)
log_info "Deploying commit: ${current_sha} (full: ${long_sha})"
# Returning multiple values: use global variables (sparingly) or stdout columns
get_container_stats() {
local container_name="$1"
docker stats --no-stream --format "{{.CPUPerc}} {{.MemUsage}}" "$container_name"
}
read -r cpu mem <<< "$(get_container_stats "api-server")"
log_info "API container — CPU: ${cpu} MEM: ${mem}"
Organizing Large Scripts: The Standard Layout
Google's Shell Style Guide and Netflix's internal tooling playbook both converge on the same structural pattern for scripts longer than roughly 100 lines. Following this pattern means any engineer on your team can scan a script and instantly know where to look for a given piece of logic.
The five-section layout used in production shell scripts at scale. Each section has a single responsibility.
The most important structural pattern is the main function with an entry point guard. This allows your script to be both executed directly and sourced by other scripts or test frameworks without side effects.
#!/usr/bin/env bash
# =============================================================
# deploy.sh — rolling deployment helper
# Usage: ./deploy.sh <environment> [--dry-run]
# =============================================================
set -euo pipefail
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly LOG_FILE="/var/log/deploy.log"
readonly SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"
# --- utility functions ---------------------------------------
log_info() { echo "[INFO] $(date +%T) $*" | tee -a "$LOG_FILE"; }
log_error() { echo "[ERROR] $(date +%T) $*" | tee -a "$LOG_FILE" >&2; }
die() { log_error "$*"; exit 1; }
require_command() {
local cmd="$1"
command -v "$cmd" >/dev/null 2>&1 || die "Required command not found: ${cmd}"
}
# --- business logic ------------------------------------------
preflight_checks() {
require_command docker
require_command kubectl
require_command curl
check_disk_space 2000
log_info "Preflight checks passed"
}
check_disk_space() {
local threshold_mb="$1"
local free_mb
free_mb=$(df -k / | awk 'NR==2 {print int($4/1024)}')
(( free_mb >= threshold_mb )) || die "Insufficient disk: ${free_mb}MB free"
}
notify_slack() {
[[ -z "$SLACK_WEBHOOK" ]] && return 0
local message="$1"
curl -sf -X POST "$SLACK_WEBHOOK" \
-H "Content-Type: application/json" \
-d "{\"text\": \"${message}\"}" >/dev/null
}
# --- main orchestration --------------------------------------
main() {
local env="${1:?Usage: deploy.sh <environment>}"
local dry_run="${2:-}"
log_info "=== Deployment to ${env} started ==="
preflight_checks
notify_slack ":rocket: Deploy to *${env}* started by $(whoami)"
if [[ "$dry_run" == "--dry-run" ]]; then
log_info "Dry-run mode — skipping actual deploy steps"
else
log_info "Deploying..."
# real deploy logic here
fi
log_info "=== Deployment to ${env} complete ==="
notify_slack ":white_check_mark: Deploy to *${env}* succeeded"
}
# --- entry point guard ---------------------------------------
# Allows this file to be sourced for testing without executing main()
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi
Pro practice — SCRIPT_DIR for portable includes:readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" gives you the absolute path to the directory containing the current script, regardless of where it is called from. Use it to source sibling files: source "${SCRIPT_DIR}/lib/logging.sh". This is the canonical pattern at Google, GitHub, and HashiCorp for multi-file shell projects.
Building a Shared Library
When multiple scripts share the same utility functions, extract them into a library file and source it. This is how large DevOps codebases avoid copy-paste drift:
Functions defined in a sourced file exist in the calling script's environment for its entire lifetime. This gives you modular, reusable code without any package manager or interpreter overhead.
The die Helper Pattern
Every production script at Netflix, Stripe, and similar organizations has a die function. It centralizes the "print an error and exit with failure" pattern, keeping the rest of the script clean:
die() {
log_error "$*"
# Optional: send alert, clean up temp files, etc.
exit 1
}
# Usage — replaces multi-line error handling throughout the script
[[ -f "$CONFIG_FILE" ]] || die "Config file not found: ${CONFIG_FILE}"
[[ "$EUID" -eq 0 ]] || die "This script must run as root"
[[ -n "$API_KEY" ]] || die "API_KEY environment variable is required"
Well-organized functions, strict locals, exit-code contracts, and the five-section layout are not stylistic preferences — they are the difference between scripts that survive production incidents and scripts that create them. Treat your shell code with the same discipline you apply to any other language in your stack.