Site Reliability Engineer Beginner to Expert

A comprehensive roadmap to master Site Reliability Engineering from Linux and networking fundamentals to advanced SLOs, observability, incident management, and automation on AWS.

27 Stages
All Levels

This roadmap takes you from the fundamentals of Linux and systems thinking through to advanced observability, chaos engineering, and SRE organisational culture. Each stage builds on the last: master reliability principles and AWS tooling together, treating every production system as an opportunity to learn, automate, and improve. The goal is not zero failures, but fast recovery and continuous reduction of their impact.

01

SRE Foundations

4 topics · 3 required · 1 recommended
Understand what Site Reliability Engineering is, where it came from, and how it differs from traditional ops.

What is SRE?

Required

Google's SRE origin, the SRE vs DevOps distinction, and the core principles of reliability engineering.

The SRE Role & Responsibilities

Required

On-call engineering, toil reduction, capacity planning, and the balance between feature velocity and reliability.

Reliability as a Feature

Required

Why reliability is a product requirement, not just an operational concern.

The SRE Workbook

Recommended

Practical guidance and case studies that complement the original SRE book.

02

Linux & Systems Fundamentals

4 topics · 3 required · 1 recommended
Master the operating system skills every SRE depends on daily.

Linux Process Management

Required

ps, top, htop, kill, nice, systemd units, and process states.

File Systems & I/O

Required

inodes, mount points, df/du, lsof, and diagnosing disk I/O bottlenecks with iostat.

Memory & CPU Analysis

Required

vmstat, free, /proc/meminfo, perf, and understanding OOM killer behaviour.

Kernel & System Calls

Recommended

strace, ltrace, dmesg, and reading kernel logs for low-level system debugging.

03

Networking Fundamentals for SREs

4 topics · 4 required
Understand the network layers that underpin distributed systems and cloud infrastructure.

TCP/IP & the OSI Model

Required

How packets travel from source to destination, TCP handshake, and connection states.

DNS Deep Dive

Required

Resolution chain, TTL, dig/nslookup, CNAME vs A records, and DNS propagation issues.

HTTP/HTTPS & TLS

Required

Request lifecycle, status codes, headers, TLS handshake, and certificate chains.

Network Troubleshooting Tools

Required

curl, traceroute, mtr, tcpdump, netstat/ss, and Wireshark for packet analysis.

04

Shell Scripting & Automation Basics

4 topics · 3 required · 1 recommended
Write Bash scripts to automate operational tasks and reduce manual toil.

Bash Scripting

Required

Variables, conditionals, loops, functions, error handling (set -euo pipefail), and exit codes.

Text Processing

Required

grep, awk, sed, cut, sort, uniq, and jq for structured log and JSON processing.

Cron & Scheduling

Required

crontab syntax, systemd timers, and avoiding overlapping job execution.
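
One common way to avoid overlapping runs is a non-blocking file lock taken at the start of the job. The sketch below assumes a Unix host (it relies on `fcntl`) and a hypothetical lock path chosen by the caller; `single_instance` is an illustrative name, not part of any standard tool.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this run holds the lock, False if another run already does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor releases the lock if this run held it.
        os.close(fd)
```

A scheduled job would wrap its work in `with single_instance(path) as ok:` and exit early when `ok` is false, so a slow run is never joined by a second copy.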

Python for SRE

Recommended

Scripting operational tasks in Python with requests, subprocess, boto3, and argparse.

05

Version Control & GitOps Practices

3 topics · 2 required · 1 recommended
Use Git to version infrastructure, runbooks, and automation code.

Git for Operations

Required

Branching strategies, pull requests, tagging releases, and git blame for audit trails.

GitOps Principles

Required

Git as the single source of truth: declarative config, automated reconciliation, and self-healing.

Runbooks as Code

Recommended

Store operational runbooks in Git alongside the systems they describe for version control and discoverability.

06

Containerisation & Orchestration

4 topics · 3 required · 1 recommended
Operate containerised workloads and understand the platforms SREs are responsible for.

Docker Operations

Required

Container lifecycle, resource limits (--cpus, --memory), log drivers, and health checks.

Kubernetes for SREs

Required

Pod lifecycle, rolling deployments, resource requests/limits, liveness and readiness probes.

Amazon EKS Operations

Required

Managed node groups, Fargate profiles, cluster upgrades, and EKS add-ons.

Horizontal Pod Autoscaler & KEDA

Recommended

Scale workloads based on CPU, memory, and custom metrics from SQS or Prometheus.

07

Service Level Indicators (SLIs)

4 topics · 4 required
Identify and measure the metrics that best represent user experience for your services.

What are SLIs?

Required

SLIs are quantitative measures of service behaviour: availability, latency, throughput, and error rate.
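
As a rough illustration, an SLI is often expressed as the ratio of good events to total events. The sketch below assumes a hypothetical `Request` record with a status code and latency. Treating any non-5xx response as "good" is an assumption for the example, not a universal rule.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Fraction of requests that did not end in a server error (5xx)."""
    if not requests:
        return None
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [Request(200, 120), Request(200, 340), Request(503, 900), Request(404, 80)]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample, 300))  # 0.5
```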

Choosing the Right SLIs

Required

User-centric SLIs vs system metrics: measure what the user experiences, not what is easy to collect.

The Four Golden Signals

Required

Latency, traffic, errors, and saturation: Google's framework for monitoring any system.

RED & USE Methods

Required

Rate/Errors/Duration for request-driven services; Utilisation/Saturation/Errors for resources.

08

Service Level Objectives (SLOs)

4 topics · 3 required · 1 recommended
Set meaningful reliability targets and use them to make data-driven engineering decisions.

Defining SLOs

Required

SLO = SLI over a time window with a target threshold, and how to set achievable, meaningful targets.

Error Budgets

Required

Error budget = 1 - SLO target. Use it to balance reliability work vs feature velocity.
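
To make the arithmetic concrete, here is a small sketch that turns an SLO target into allowed downtime and into a remaining-budget fraction. The function names and the sample numbers are illustrative assumptions.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed 'bad' minutes in the window: (1 - SLO target) * window length."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1 - slo_target) * total_events
    return 1 - bad_events / allowed_bad


# A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
print(round(error_budget_minutes(0.999, 30), 1))
# 250 failed requests out of 1,000,000 spends about a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```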

Error Budget Policies

Required

Define what happens when the budget is consumed: feature freeze, reliability sprints, or escalation.

SLO Review Cadence

Recommended

Regularly review and adjust SLOs as user expectations and system capabilities evolve.

09

Service Level Agreements (SLAs)

3 topics · 2 required · 1 recommended
Understand the contractual reliability commitments and how SLOs feed into them.

SLA vs SLO vs SLI

Required

The hierarchy: SLI measures it, SLO targets it internally, SLA commits to it externally.

Setting Conservative SLAs

Required

SLAs should be set looser than the internal SLOs, so that missing an SLO leaves a safety buffer before customer commitments are breached.

SLA Breach Consequences

Recommended

Financial penalties, service credits, and reputational impact of breached SLAs.

10

Metrics & Prometheus

4 topics · 3 required · 1 recommended
Collect and query time-series metrics to understand system health and trends.

Prometheus Architecture

Required

Scrape model, targets, exporters, the Prometheus data model, and labels.

PromQL

Required

Instant vectors, range vectors, rate(), irate(), histogram_quantile(), and recording rules.

Alerting Rules

Required

Write Prometheus alerting rules with for clauses, severity labels, and runbook annotations.

Amazon Managed Service for Prometheus

Recommended

Serverless Prometheus-compatible monitoring at scale on AWS with AMP.

11

Dashboards & Visualisation

4 topics · 3 required · 1 recommended
Build meaningful dashboards that surface actionable information for on-call engineers.

Grafana Fundamentals

Required

Panels, variables, annotations, alerts, and dashboard-as-code with Grafonnet/Jsonnet.

Amazon Managed Grafana

Recommended

Fully managed Grafana on AWS with native CloudWatch, AMP, and X-Ray data sources.

Dashboard Design Principles

Required

Design dashboards for on-call use: a clear signal hierarchy, minimal noise, and links to runbooks.

Amazon CloudWatch Dashboards

Required

Build operational dashboards with CloudWatch metrics, alarms, and Logs Insights widgets.

12

Logging & Log Management

4 topics · 2 required · 2 recommended
Collect, centralise, and query logs to diagnose incidents and understand system behaviour.

Structured Logging

Required

JSON log format, log levels, correlation IDs, and consistent field naming conventions.
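
A minimal sketch of structured logging with the standard library follows. The field names (`timestamp`, `level`, `correlation_id`, and so on) are one possible convention chosen for illustration, and `JsonFormatter` is a hypothetical helper rather than a built-in class.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Passed by callers via extra={"correlation_id": ...}; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted", extra={"correlation_id": "req-123"})
```

Carrying the same correlation ID through every service that handles a request lets one query pull together all the log lines for that request.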

Amazon CloudWatch Logs

Required

Log groups, log streams, metric filters, subscription filters, and Logs Insights queries.

Centralised Logging with OpenSearch

Recommended

Ship logs from Lambda, EC2, and ECS to Amazon OpenSearch for full-text search and dashboards.

AWS Fluent Bit & Fluentd

Recommended

Collect and route logs from containers and EC2 instances to CloudWatch or OpenSearch.

13

Distributed Tracing

4 topics · 3 required · 1 recommended
Track requests across microservices to diagnose latency and pinpoint failure sources.

Distributed Tracing Concepts

Required

Traces, spans, context propagation, sampling, and the OpenTelemetry data model.

AWS X-Ray

Required

Instrument Lambda, ECS, and API Gateway; read service maps and analyse trace segments.

AWS Distro for OpenTelemetry (ADOT)

Recommended

Collect traces and metrics with the AWS-supported OpenTelemetry distribution.

Correlating Logs, Metrics & Traces

Required

Use trace IDs in logs, link metrics to traces, and build a unified observability workflow.

14

Alerting & On-Call Management

4 topics · 4 required
Design actionable alerting and sustainable on-call rotations that minimise engineer burnout.

Alerting Best Practices

Required

Alert on symptoms not causes, avoid alert fatigue, and ensure every alert has a runbook.

Amazon CloudWatch Alarms

Required

Composite alarms, anomaly detection alarms, metric math alarms, and SNS integration.

Alertmanager

Required

Route, deduplicate, group, and silence Prometheus alerts; integrate with PagerDuty and Slack.

On-Call Rotations & Escalation Policies

Required

Design fair rotations, primary/secondary escalation, and override schedules in PagerDuty or OpsGenie.

15

Incident Management

5 topics · 4 required · 1 recommended
Detect, respond to, and resolve incidents quickly while communicating clearly with stakeholders.

Incident Lifecycle

Required

Detection, triage, response, mitigation, resolution, and follow-up: the six phases of an incident.

Incident Commander Role

Required

Coordinate responders, manage communication, and make decisions under pressure.

Severity Levels

Required

Define SEV1–SEV4 (or P1–P4) with clear criteria for customer impact and response SLAs.

Incident Communication

Required

Internal Slack war rooms, external status pages (Statuspage.io), and stakeholder updates.

AWS Systems Manager OpsCenter

Recommended

Centralise operational issues, correlate findings from GuardDuty, Config, and CloudWatch.

16

Post-Incident Reviews (PIRs)

4 topics · 4 required
Learn from incidents systematically to prevent recurrence and build institutional knowledge.

Blameless Post-Mortems

Required

Focus on systemic failures, not individual mistakes; psychological safety is critical.

Post-Mortem Template

Required

Impact summary, timeline, root cause, contributing factors, action items, and lessons learned.

Root Cause Analysis (RCA) Techniques

Required

5 Whys, fishbone diagrams, and fault tree analysis for structured root cause investigation.

Action Item Tracking

Required

Assign owners, set deadlines, and track PIR action items to completion in Jira or Linear.

17

Chaos Engineering

4 topics · 2 required · 2 recommended
Proactively inject failures into production systems to discover weaknesses before they cause incidents.

Chaos Engineering Principles

Required

Hypothesis-driven experimentation, blast radius control, and GameDay practices.

AWS Fault Injection Service (FIS)

Required

Run controlled fault injection experiments on EC2, ECS, EKS, RDS, and more with AWS FIS.

LitmusChaos on EKS

Recommended

CNCF chaos engineering framework for Kubernetes: pod kill, network latency, and disk fill experiments.

GameDays

Recommended

Structured team exercises that simulate real outage scenarios to validate runbooks and incident response.

18

Toil Identification & Reduction

4 topics · 4 required
Identify, measure, and systematically eliminate operational toil to free up engineering capacity.

What is Toil?

Required

Manual, repetitive, automatable work that scales with service growth and provides no enduring value.

Measuring Toil

Required

Track toil as a percentage of engineer time; the SRE goal is to keep toil below 50%.
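
A simple way to track this is to log hours per work category and compute the toil share. The category names and hours below are made-up sample data, and deciding which categories count as toil is a judgment each team has to make for itself.

```python
def toil_percentage(hours_by_category, toil_categories):
    """Share of logged hours spent on categories the team classifies as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return 100 * toil / total


week = {"tickets": 10, "manual deploys": 6, "project work": 20, "on-call": 4}
print(toil_percentage(week, {"tickets", "manual deploys"}))  # 40.0
```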

Toil Reduction Strategies

Required

Automate tickets, self-service provisioning, runbook automation, and event-driven remediation.

AWS Systems Manager Automation

Required

Create SSM Automation runbooks to execute common operational tasks without manual intervention.

19

Infrastructure as Code for SREs

4 topics · 2 required · 2 recommended
Manage infrastructure reliably and repeatably using code, the foundation of operational consistency.

Terraform for SREs

Required

Write, plan, and apply infrastructure changes safely: remote state, workspaces, and drift detection.

AWS CloudFormation & CDK

Required

Manage AWS resources declaratively with CloudFormation stacks or CDK constructs.

Ansible for Configuration Management

Recommended

Idempotent server configuration, patching automation, and drift remediation with Ansible.

IaC Testing & Validation

Recommended

Lint and test infrastructure code with tflint, Checkov, cfn-lint, and Terratest.

20

CI/CD Reliability

4 topics · 3 required · 1 recommended
Make deployment pipelines reliable, observable, and safe to run at high frequency.

Deployment Strategies

Required

Rolling, blue/green, canary, and feature flag deployments, with their risk profiles and rollback mechanisms.

AWS CodeDeploy

Required

Automate EC2, Lambda, and ECS deployments with health checks and automatic rollback.

GitHub Actions for SRE

Recommended

Automate operational workflows such as AMI baking, certificate rotation, and compliance checks.

Deployment Observability

Required

Track deployment frequency, lead time, change failure rate, and MTTR (DORA metrics).
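
The four metrics can be derived from a plain list of deployment records. The sketch below assumes a hypothetical `Deployment` record. Real pipelines would pull these timestamps from their CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set when a failed deploy is recovered


def dora_summary(deployments, period_days):
    """Compute the four DORA metrics over a list of deployments."""
    count = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": count / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / count if count else None,
        "change_failure_rate": len(failures) / count if count else None,
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```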

21

Capacity Planning

4 topics · 4 required
Forecast resource needs and provision ahead of demand to prevent capacity-driven incidents.

Demand Forecasting

Required

Analyse traffic trends, seasonality, and growth projections to model future resource needs.

Load Testing

Required

Simulate production load with k6, Locust, or Artillery to validate capacity assumptions.

AWS Compute Optimizer

Required

ML-based right-sizing recommendations for EC2, Lambda, ECS, and EBS volumes.

Auto Scaling & Predictive Scaling

Required

Configure dynamic and predictive scaling policies to handle demand spikes automatically.

22

Reliability Patterns

5 topics · 4 required · 1 recommended
Apply proven engineering patterns to make distributed systems resilient to partial failures.

Circuit Breaker Pattern

Required

Prevent cascading failures by temporarily blocking requests to a failing downstream service.
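
A minimal, single-threaded sketch of the idea: after a number of consecutive failures the breaker "opens" and rejects calls until a timeout passes, then lets one trial call through. The class name, thresholds, and injectable `clock` are illustrative assumptions. Production libraries add thread safety and richer half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```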

Retry & Exponential Backoff

Required

Retry transient failures with jitter and backoff to avoid thundering herd problems.
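
A sketch of retrying with exponential backoff and "full jitter", where each wait is a random value between zero and the current backoff cap. The function name, default delays, and the choice of which exceptions count as transient are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retryable=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying on transient errors with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads clients out so they do not retry in lockstep.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting. Callers would normally leave them at their defaults.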

Bulkhead Pattern

Recommended

Isolate failures by partitioning resources: separate thread pools, queues, or service instances.
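
One lightweight way to sketch a bulkhead is a semaphore that caps how many calls to a given dependency may be in flight at once, so a slow dependency cannot absorb every worker. `Bulkhead` and `BulkheadFullError` are hypothetical names for this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a struggling dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance keeps a failure in one from exhausting capacity needed by the others.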

Rate Limiting & Throttling

Required

Protect services from overload using token bucket or leaky bucket rate limiting.
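
A minimal token bucket sketch: tokens refill at a steady rate up to a fixed capacity, and each request spends one. The class name and the injectable `clock` are assumptions for illustration. This version is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```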

Graceful Degradation

Required

Serve reduced functionality rather than full failure when dependencies are unavailable.
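
As a small sketch, a caller can fall back to the last known good value, and then to a static default, when a dependency fails. `fetch_with_fallback` and the `"live"`/`"stale"`/`"default"` labels are hypothetical names chosen for this example.

```python
def fetch_with_fallback(key, fetch, cache, default):
    """Return (value, source), degrading from live data to cache to a default."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            return cache[key], "stale"   # dependency down: serve last good value
        return default, "default"        # nothing cached: serve a safe default
    cache[key] = value                   # remember the fresh value for later
    return value, "live"
```

Returning the source alongside the value lets the caller log or display that the response is degraded.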

23

Security for SREs

4 topics · 3 required · 1 recommended
Integrate security into reliability practices: secure systems fail less and recover faster.

Least Privilege & IAM Roles

Required

Apply least privilege to automation scripts, Lambda functions, and EC2 instance profiles.

Secrets Management

Required

Retrieve credentials at runtime from AWS Secrets Manager or Parameter Store; no hardcoded secrets.
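
A minimal sketch of runtime retrieval with boto3's Secrets Manager client. It assumes the secret is stored as a JSON string. The secret name in the usage note and the `load_secret` helper are hypothetical.

```python
import json


def load_secret(secret_id, client=None):
    """Fetch a secret from AWS Secrets Manager and parse it as JSON."""
    if client is None:
        # Imported lazily so the module loads even where boto3 is not installed.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret was stored as a JSON document in SecretString.
    return json.loads(response["SecretString"])
```

A caller might use `load_secret("prod/db-credentials")["password"]` at startup, with the IAM role of the workload granting read access to that one secret.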

Amazon GuardDuty for SREs

Recommended

Integrate GuardDuty findings into incident response workflows and PagerDuty escalation.

Vulnerability Management

Required

Automate patching with AWS Systems Manager Patch Manager and track CVEs with Amazon Inspector.

24

Cost Reliability

4 topics · 3 required · 1 recommended
Prevent cost incidents and ensure financial reliability alongside technical reliability.

Cost Anomaly Detection

Required

Use AWS Cost Anomaly Detection to alert on unexpected spend spikes automatically.

Tagging & Cost Attribution

Required

Tag all resources consistently to enable service-level cost attribution and showback.

Spot Instance Reliability

Recommended

Build fault-tolerant workloads on Spot with interruption handlers and mixed instance policies.

AWS Budgets & Alerts

Required

Set cost and usage budgets with alert thresholds to catch runaway workloads early.

25

Runbook & Documentation Culture

4 topics · 3 required · 1 recommended
Build a culture of operational documentation that empowers any engineer to respond to incidents.

Writing Effective Runbooks

Required

Clear, step-by-step procedures with expected outcomes, rollback steps, and escalation paths.

AWS Systems Manager Run Command

Required

Execute runbook steps remotely across EC2 fleets without SSH using SSM Run Command.

Architecture Decision Records (ADRs)

Recommended

Document why systems are designed the way they are to aid future incident diagnosis.

Operational Readiness Reviews (ORRs)

Required

Gate new services going on-call with a checklist: SLOs defined, runbooks written, dashboards built.

26

Advanced Observability

4 topics · 2 required · 2 recommended
Go beyond basic monitoring to build a deep, unified view of complex distributed systems.

OpenTelemetry Collector

Required

Deploy the OTel Collector as a sidecar or DaemonSet to pipeline traces, metrics, and logs.

Continuous Profiling

Recommended

Profile CPU and memory in production with Pyroscope or Amazon CodeGuru Profiler.

Real User Monitoring (RUM)

Recommended

Capture frontend performance from real user sessions with Amazon CloudWatch RUM.

Synthetic Monitoring

Required

Simulate user journeys 24/7 with CloudWatch Synthetics canaries to detect outages proactively.

27

SRE Maturity & Org Culture

4 topics · 3 required · 1 recommended
Grow SRE practices across the organisation and measure the maturity of your reliability programme.

SRE Team Models

Required

Embedded vs centralised vs consulting SRE: choosing the right model for your organisation.

DORA Metrics

Required

Deployment frequency, lead time, change failure rate, and MTTR as engineering performance indicators.

Production Readiness Reviews

Required

A structured checklist every new service must pass before receiving SRE support and going on-call.

SRE Maturity Model

Recommended

Assess and grow your SRE practice across five dimensions: monitoring, incident response, toil, SLOs, and culture.
