Site Reliability Engineer: Beginner to Expert
A comprehensive roadmap to master Site Reliability Engineering from Linux and networking fundamentals to advanced SLOs, observability, incident management, and automation on AWS.
This roadmap takes you from the fundamentals of Linux and systems thinking through to advanced observability, chaos engineering, and SRE organisational culture. Each stage builds on the last: master reliability principles and AWS tooling together, treating every production system as an opportunity to learn, automate, and improve. The goal is not zero failures, but fast recovery and a continuous reduction in the impact of failures.
SRE Foundations
What is SRE?
Google's SRE origin, the SRE vs DevOps distinction, and the core principles of reliability engineering.
The SRE Role & Responsibilities
On-call engineering, toil reduction, capacity planning, and the balance between feature velocity and reliability.
Reliability as a Feature
Why reliability is a product requirement, not just an operational concern.
The SRE Workbook
Practical guidance and case studies that complement the original SRE book.
Linux & Systems Fundamentals
Linux Process Management
ps, top, htop, kill, nice, systemd units, and process states.
File Systems & I/O
inodes, mount points, df/du, lsof, and diagnosing disk I/O bottlenecks with iostat.
Memory & CPU Analysis
vmstat, free, /proc/meminfo, perf, and understanding OOM killer behaviour.
Kernel & System Calls
strace, ltrace, dmesg, and reading kernel logs for low-level system debugging.
Networking Fundamentals for SREs
TCP/IP & the OSI Model
How packets travel from source to destination, TCP handshake, and connection states.
DNS Deep Dive
Resolution chain, TTL, dig/nslookup, CNAME vs A records, and DNS propagation issues.
HTTP/HTTPS & TLS
Request lifecycle, status codes, headers, TLS handshake, and certificate chains.
Network Troubleshooting Tools
curl, traceroute, mtr, tcpdump, netstat/ss, and Wireshark for packet analysis.
Shell Scripting & Automation Basics
Bash Scripting
Variables, conditionals, loops, functions, error handling (set -euo pipefail), and exit codes.
Text Processing
grep, awk, sed, cut, sort, uniq, and jq for structured log and JSON processing.
Cron & Scheduling
crontab syntax, systemd timers, and avoiding overlapping job execution.
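One common way to avoid overlapping runs is to take an exclusive, non-blocking file lock at the start of the job and exit if it is already held. The sketch below assumes a Unix-like host (it relies on `fcntl.flock`); the names `single_instance` and `AlreadyRunning` are hypothetical helpers, not part of any library.

```python
import fcntl
import os
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when a previous run of the job still holds the lock."""


@contextmanager
def single_instance(lock_path):
    # Open (or create) the lock file; the lock belongs to this open descriptor.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise AlreadyRunning(lock_path) from exc
        yield
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

A scheduled script would wrap its main work in `with single_instance("/tmp/nightly-report.lock"):` (the path is illustrative) and treat `AlreadyRunning` as a signal to skip this run.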
Python for SRE
Scripting operational tasks in Python with requests, subprocess, boto3, and argparse.
Version Control & GitOps Practices
Git for Operations
Branching strategies, pull requests, tagging releases, and git blame for audit trails.
GitOps Principles
Git as the single source of truth: declarative config, automated reconciliation, and self-healing.
Runbooks as Code
Store operational runbooks in Git alongside the systems they describe for version control and discoverability.
Containerisation & Orchestration
Docker Operations
Container lifecycle, resource limits (--cpus, --memory), log drivers, and health checks.
Kubernetes for SREs
Pod lifecycle, rolling deployments, resource requests/limits, liveness and readiness probes.
Amazon EKS Operations
Managed node groups, Fargate profiles, cluster upgrades, and EKS add-ons.
Horizontal Pod Autoscaler & KEDA
Scale workloads based on CPU, memory, and custom metrics from SQS or Prometheus.
Service Level Indicators (SLIs)
What are SLIs?
SLIs are quantitative measures of service behaviour: availability, latency, throughput, and error rate.
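As a rough illustration, an availability SLI and a latency SLI can both be expressed as "good events divided by total events". The sketch below assumes request records with a status code and a latency; the `Request` type, the 5xx-means-bad rule, and the threshold are illustrative choices rather than a fixed standard.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the client
    latency_ms: float    # time taken to serve the request


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests served within the latency threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)
```

Expressing every SLI as a ratio between 0 and 1 makes it straightforward to compare against a percentage target later.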
Choosing the Right SLIs
User-centric SLIs vs system metrics: measure what the user experiences, not what is easy to collect.
The Four Golden Signals
Latency, traffic, errors, and saturation: Google's framework for monitoring any system.
RED & USE Methods
Rate/Errors/Duration for request-driven services; Utilisation/Saturation/Errors for resources.
Service Level Objectives (SLOs)
Defining SLOs
SLO = SLI over a time window with a target threshold. Learn how to set achievable, meaningful targets.
Error Budgets
Error budget = 1 - SLO target. Use it to balance reliability work vs feature velocity.
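The error budget formula can be turned into concrete numbers. The sketch below is a minimal worked example; the function names and the 30-day default window are assumptions made for illustration.

```python
def error_budget(slo_target):
    """Fraction of events allowed to fail, e.g. 0.999 gives roughly 0.001."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime the budget permits over the window, for a time-based SLO."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, good_events, total_events):
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_bad = error_budget(slo_target) * total_events
    if allowed_bad == 0:
        return None  # a 100% target or zero traffic leaves no budget to measure
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% target over 30 days, `allowed_downtime_minutes(0.999)` works out to about 43.2 minutes, and a service with 500 failures in 1,000,000 requests has spent roughly half of its budget.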
Error Budget Policies
Define what happens when the budget is consumed: feature freeze, reliability sprints, or escalation.
SLO Review Cadence
Regularly review and adjust SLOs as user expectations and system capabilities evolve.
Service Level Agreements (SLAs)
SLA vs SLO vs SLI
The hierarchy: SLI measures it, SLO targets it internally, SLA commits to it externally.
Setting Conservative SLAs
SLAs should be set looser than the internal SLOs, so that missing an SLO gives a safety buffer before a customer commitment is breached.
SLA Breach Consequences
Financial penalties, service credits, and reputational impact of breached SLAs.
Metrics & Prometheus
Prometheus Architecture
Scrape model, targets, exporters, the Prometheus data model, and labels.
PromQL
Instant vectors, range vectors, rate(), irate(), histogram_quantile(), and recording rules.
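To build intuition for what a counter rate means, the sketch below computes an average per-second increase over a list of `(timestamp, value)` samples and treats any drop in value as a counter reset. It is a simplified model only: PromQL's `rate()` additionally extrapolates towards the edges of the range window, which this code does not attempt. The function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase of a counter across (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value is counted as fresh increase.
    """
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed
```

Handling resets matters because counters restart from zero whenever the exporting process restarts; a naive "last minus first" would report a negative rate.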
Alerting Rules
Write Prometheus alerting rules with for clauses, severity labels, and runbook annotations.
Amazon Managed Service for Prometheus
Serverless Prometheus-compatible monitoring at scale on AWS with AMP.
Dashboards & Visualisation
Grafana Fundamentals
Panels, variables, annotations, alerts, and dashboard-as-code with Grafonnet/Jsonnet.
Amazon Managed Grafana
Fully managed Grafana on AWS with native CloudWatch, AMP, and X-Ray data sources.
Dashboard Design Principles
Design dashboards for on-call use: clear signal hierarchy, minimal noise, and links to runbooks.
Amazon CloudWatch Dashboards
Build operational dashboards with CloudWatch metrics, alarms, and Logs Insights widgets.
Logging & Log Management
Structured Logging
JSON log format, log levels, correlation IDs, and consistent field naming conventions.
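A minimal sketch of structured logging with Python's standard `logging` module follows. The field names (`timestamp`, `level`, `logger`, `message`, `correlation_id`) and the `checkout` logger name are assumptions chosen for the example; the point is that every line is one JSON object with the same keys.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # correlation_id arrives through the `extra` argument, if given.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name="checkout"):
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger()
    request_id = str(uuid.uuid4())  # one ID shared by all logs of a request
    log.info("payment authorised", extra={"correlation_id": request_id})
```

Passing the same `correlation_id` on every log call made while handling a request lets a later query pull all related lines together.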
Amazon CloudWatch Logs
Log groups, log streams, metric filters, subscription filters, and Logs Insights queries.
Centralised Logging with OpenSearch
Ship logs from Lambda, EC2, and ECS to Amazon OpenSearch for full-text search and dashboards.
AWS for Fluent Bit & Fluentd
Collect and route logs from containers and EC2 instances to CloudWatch or OpenSearch.
Distributed Tracing
Distributed Tracing Concepts
Traces, spans, context propagation, sampling, and the OpenTelemetry data model.
AWS X-Ray
Instrument Lambda, ECS, and API Gateway; read service maps and analyse trace segments.
AWS Distro for OpenTelemetry (ADOT)
Collect traces and metrics with the AWS-supported OpenTelemetry distribution.
Correlating Logs, Metrics & Traces
Use trace IDs in logs, link metrics to traces, and build a unified observability workflow.
Alerting & On-Call Management
Alerting Best Practices
Alert on symptoms, not causes; avoid alert fatigue; and ensure every alert has a runbook.
Amazon CloudWatch Alarms
Composite alarms, anomaly detection alarms, metric math alarms, and SNS integration.
Alertmanager
Route, deduplicate, group, and silence Prometheus alerts; integrate with PagerDuty and Slack.
On-Call Rotations & Escalation Policies
Design fair rotations, primary/secondary escalation, and override schedules in PagerDuty or OpsGenie.
Incident Management
Incident Lifecycle
Detection, triage, response, mitigation, resolution, and follow-up: the six phases of an incident.
Incident Commander Role
Coordinate responders, manage communication, and make decisions under pressure.
Severity Levels
Define SEV1–SEV4 (or P1–P4) with clear criteria for customer impact and response SLAs.
Incident Communication
Internal Slack war rooms, external status pages (Statuspage.io), and stakeholder updates.
AWS Systems Manager OpsCenter
Centralise operational issues, correlate findings from GuardDuty, Config, and CloudWatch.
Post-Incident Reviews (PIRs)
Blameless Post-Mortems
Focus on systemic failures, not individual mistakes; psychological safety is critical.
Post-Mortem Template
Impact summary, timeline, root cause, contributing factors, action items, and lessons learned.
Root Cause Analysis (RCA) Techniques
5 Whys, fishbone diagrams, and fault tree analysis for structured root cause investigation.
Action Item Tracking
Assign owners, set deadlines, and track PIR action items to completion in Jira or Linear.
Chaos Engineering
Chaos Engineering Principles
Hypothesis-driven experimentation, blast radius control, and GameDay practices.
AWS Fault Injection Service (FIS)
Run controlled fault injection experiments on EC2, ECS, EKS, RDS, and more with AWS FIS.
LitmusChaos on EKS
CNCF chaos engineering framework for Kubernetes: pod kill, network latency, and disk fill experiments.
GameDays
Structured team exercises that simulate real outage scenarios to validate runbooks and incident response.
Toil Identification & Reduction
What is Toil?
Manual, repetitive, automatable work that scales with service growth and provides no enduring value.
Measuring Toil
Track toil as a percentage of engineer time; the SRE goal is to keep toil below 50%.
Toil Reduction Strategies
Automate tickets, self-service provisioning, runbook automation, and event-driven remediation.
AWS Systems Manager Automation
Create SSM Automation runbooks to execute common operational tasks without manual intervention.
Infrastructure as Code for SREs
Terraform for SREs
Write, plan, and apply infrastructure changes safely: remote state, workspaces, and drift detection.
AWS CloudFormation & CDK
Manage AWS resources declaratively with CloudFormation stacks or CDK constructs.
Ansible for Configuration Management
Idempotent server configuration, patching automation, and drift remediation with Ansible.
IaC Testing & Validation
Lint and test infrastructure code with tflint, Checkov, cfn-lint, and Terratest.
CI/CD Reliability
Deployment Strategies
Rolling, blue/green, canary, and feature flag deployments: risk profiles and rollback mechanisms.
AWS CodeDeploy
Automate EC2, Lambda, and ECS deployments with health checks and automatic rollback.
GitHub Actions for SRE
Automate operational workflows: AMI baking, certificate rotation, and compliance checks.
Deployment Observability
Track deployment frequency, lead time, change failure rate, and MTTR (DORA metrics).
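Three of these metrics can be derived from a simple list of deployment records, as sketched below. The `Deployment` type is hypothetical, and restore time is measured from the deployment timestamp as a simplifying assumption; many teams measure it from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # when service was restored


def deployment_frequency_per_day(deployments, window_days):
    """Average number of deployments per day over the window."""
    return len(deployments) / window_days


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure in production."""
    if not deployments:
        return None
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_restore(deployments):
    """Average time from a failed deployment to restored service."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deployments
        if d.failed and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

In practice the records would come from the CI/CD system and the incident tracker rather than being built by hand.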
Capacity Planning
Demand Forecasting
Analyse traffic trends, seasonality, and growth projections to model future resource needs.
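As a starting point for growth projections, a straight-line trend can be fitted to historical peaks with ordinary least squares. The sketch below is deliberately naive: it ignores seasonality entirely, and the function names and evenly spaced periods are assumptions for the example.

```python
def fit_linear_trend(values):
    """Least-squares line through values observed at periods 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, periods_ahead):
    """Project the fitted trend a number of periods past the last observation."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + periods_ahead)
```

With monthly peak requests per second of 100, 110, 120, and 130, `forecast(values, 2)` projects roughly 150 two months out; a real model would add seasonal terms and headroom on top.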
Load Testing
Simulate production load with k6, Locust, or Artillery to validate capacity assumptions.
AWS Compute Optimizer
ML-based right-sizing recommendations for EC2, Lambda, ECS, and EBS volumes.
Auto Scaling & Predictive Scaling
Configure dynamic and predictive scaling policies to handle demand spikes automatically.
Reliability Patterns
Circuit Breaker Pattern
Prevent cascading failures by temporarily blocking requests to a failing downstream service.
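A minimal circuit breaker can be sketched as a small state machine: closed (calls pass), open (calls are rejected), and half-open (one trial call after a timeout). The class below is an illustrative, single-threaded sketch; the thresholds, the injectable `clock`, and the class names are assumptions, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream calls are blocked")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each downstream request in `breaker.call(...)` and handle `CircuitOpenError` by failing fast or serving a fallback.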
Retry & Exponential Backoff
Retry transient failures with jitter and backoff to avoid thundering herd problems.
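The sketch below retries a callable with exponentially growing delays and "full jitter", meaning each sleep is drawn at random between zero and the current ceiling. The defaults, the set of retryable exceptions, and the injectable `sleep` parameter are assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt: base, 2x base, 4x base, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random sleep in [0, ceiling] keeps clients out of lockstep.
            sleep(random.uniform(0, ceiling))
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, which avoids retrying failures that will never succeed.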
Bulkhead Pattern
Isolate failures by partitioning resources: separate thread pools, queues, or service instances.
Rate Limiting & Throttling
Protect services from overload using token bucket or leaky bucket rate limiting.
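A token bucket can be sketched in a few lines: tokens refill at a steady rate up to a fixed capacity, and each request spends one. The class below is a single-threaded illustration; the `clock` parameter exists only to make the behaviour easy to test and is an assumption of this sketch.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill for the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput; a rejected request would typically receive an HTTP 429 response.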
Graceful Degradation
Serve reduced functionality rather than full failure when dependencies are unavailable.
Security for SREs
Least Privilege & IAM Roles
Apply least privilege to automation scripts, Lambda functions, and EC2 instance profiles.
Secrets Management
Retrieve credentials at runtime from AWS Secrets Manager or Parameter Store; no hardcoded secrets.
AWS GuardDuty for SREs
Integrate GuardDuty findings into incident response workflows and PagerDuty escalation.
Vulnerability Management
Automate patching with AWS Systems Manager Patch Manager and track CVEs with Amazon Inspector.
Cost Reliability
Cost Anomaly Detection
Use AWS Cost Anomaly Detection to alert on unexpected spend spikes automatically.
Tagging & Cost Attribution
Tag all resources consistently to enable service-level cost attribution and showback.
Spot Instance Reliability
Build fault-tolerant workloads on Spot with interruption handlers and mixed instance policies.
AWS Budgets & Alerts
Set cost and usage budgets with alert thresholds to catch runaway workloads early.
Runbook & Documentation Culture
Writing Effective Runbooks
Clear, step-by-step procedures with expected outcomes, rollback steps, and escalation paths.
AWS Systems Manager Run Command
Execute runbook steps remotely across EC2 fleets without SSH using SSM Run Command.
Architecture Decision Records (ADRs)
Document why systems are designed the way they are to aid future incident diagnosis.
Operational Readiness Reviews (ORRs)
Gate new services going on-call with a checklist: SLOs defined, runbooks written, dashboards built.
Advanced Observability
OpenTelemetry Collector
Deploy the OTel Collector as a sidecar or DaemonSet to pipeline traces, metrics, and logs.
Continuous Profiling
Profile CPU and memory in production with Pyroscope or Amazon CodeGuru Profiler.
Real User Monitoring (RUM)
Capture frontend performance from real user sessions with Amazon CloudWatch RUM.
Synthetic Monitoring
Simulate user journeys 24/7 with CloudWatch Synthetics canaries to detect outages proactively.
SRE Maturity & Org Culture
SRE Team Models
Embedded vs centralised vs consulting SRE: choosing the right model for your organisation.
DORA Metrics
Deployment frequency, lead time, change failure rate, and MTTR as engineering performance indicators.
Production Readiness Reviews
A structured checklist every new service must pass before receiving SRE support and going on-call.
SRE Maturity Model
Assess and grow your SRE practice across five dimensions: monitoring, incident response, toil, SLOs, and culture.