Site Reliability Engineer Beginner to Expert

A comprehensive roadmap to master Site Reliability Engineering from Linux and networking fundamentals to advanced SLOs, observability, incident management, and automation on AWS.

Published: 21 Apr, 2026

27 Stages

All Levels

Sre Cloud Devops #sre #aws #observability #reliability #incident-management #automation

This roadmap takes you from the fundamentals of Linux and systems thinking through to advanced observability, chaos engineering, and SRE organisational culture. Each stage builds on the last: master reliability principles and AWS tooling together, treating every production system as an opportunity to learn, automate, and improve. The goal is not zero failures, but fast recovery and a continuous reduction in the impact of failures.

Contents

SRE Foundations
Linux & Systems Fundamentals
Networking Fundamentals for SREs
Shell Scripting & Automation Basics
Version Control & GitOps Practices
Containerisation & Orchestration
Service Level Indicators (SLIs)
Service Level Objectives (SLOs)
Service Level Agreements (SLAs)
Metrics & Prometheus
Dashboards & Visualisation
Logging & Log Management
Distributed Tracing
Alerting & On-Call Management
Incident Management
Post-Incident Reviews (PIRs)
Chaos Engineering
Toil Identification & Reduction
Infrastructure as Code for SREs
CI/CD Reliability
Capacity Planning
Reliability Patterns
Security for SREs
Cost Reliability
Runbook & Documentation Culture
Advanced Observability
SRE Maturity & Org Culture

1

SRE Foundations

4 topics·3 required·1 recommended

Understand what Site Reliability Engineering is, where it came from, and how it differs from traditional ops.

What is SRE?

Required

Google's SRE origin, the SRE vs DevOps distinction, and the core principles of reliability engineering.

Google SRE Book

The SRE Role & Responsibilities

Required

On-call engineering, toil reduction, capacity planning, and the balance between feature velocity and reliability.

Reliability as a Feature

Required

Why reliability is a product requirement, not just an operational concern.

The SRE Workbook

Recommended

Practical guidance and case studies that complement the original SRE book.

The SRE Workbook

2

Linux & Systems Fundamentals

4 topics·3 required·1 recommended

Master the operating system skills every SRE depends on daily.

Linux Process Management

Required

ps, top, htop, kill, nice, systemd units, and process states.

Linux man pages

File Systems & I/O

Required

inodes, mount points, df/du, lsof, and diagnosing disk I/O bottlenecks with iostat.

Memory & CPU Analysis

Required

vmstat, free, /proc/meminfo, perf, and understanding OOM killer behaviour.

Kernel & System Calls

Recommended

strace, ltrace, dmesg, and reading kernel logs for low-level system debugging.

3

Networking Fundamentals for SREs

4 topics·4 required

Understand the network layers that underpin distributed systems and cloud infrastructure.

TCP/IP & the OSI Model

Required

How packets travel from source to destination, TCP handshake, and connection states.

Cloudflare Learning: OSI Model

DNS Deep Dive

Required

Resolution chain, TTL, dig/nslookup, CNAME vs A records, and DNS propagation issues.

HTTP/HTTPS & TLS

Required

Request lifecycle, status codes, headers, TLS handshake, and certificate chains.

Network Troubleshooting Tools

Required

curl, traceroute, mtr, tcpdump, netstat/ss, and Wireshark for packet analysis.

4

Shell Scripting & Automation Basics

4 topics·3 required·1 recommended

Write Bash scripts to automate operational tasks and reduce manual toil.

Bash Scripting

Required

Variables, conditionals, loops, functions, error handling (set -euo pipefail), and exit codes.

Bash Guide

Text Processing

Required

grep, awk, sed, cut, sort, uniq, and jq for structured log and JSON processing.

Cron & Scheduling

Required

crontab syntax, systemd timers, and avoiding overlapping job execution.
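
One common way to avoid overlapping runs is to have the job take a non-blocking file lock and exit if a previous run still holds it. The sketch below assumes a Unix host (it relies on the standard-library fcntl module); the function name acquire_job_lock and the lock path are hypothetical.

```python
import fcntl


def acquire_job_lock(lock_path):
    """Return an open lock file if this run got the lock, else None.

    The lock is held for as long as the returned file object stays open,
    and is released when it is closed or the process exits.
    """
    handle = open(lock_path, "a")  # append mode: create if missing, never truncate
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    lock = acquire_job_lock("/tmp/nightly-report.lock")  # hypothetical path
    if lock is None:
        print("previous run still in progress; skipping")
    else:
        print("lock acquired; running job")
        lock.close()
```

A scheduled job would call this first and skip its work when the result is None, so a slow run is never joined by a second copy.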

Python for SRE

Recommended

Scripting operational tasks in Python with requests, subprocess, boto3, and argparse.

5

Version Control & GitOps Practices

3 topics·2 required·1 recommended

Use Git to version infrastructure, runbooks, and automation code.

Git for Operations

Required

Branching strategies, pull requests, tagging releases, and git blame for audit trails.

Pro Git Book

GitOps Principles

Required

Git as the single source of truth: declarative config, automated reconciliation, and self-healing.

Runbooks as Code

Recommended

Store operational runbooks in Git alongside the systems they describe for version control and discoverability.

6

Containerisation & Orchestration

4 topics·3 required·1 recommended

Operate containerised workloads and understand the platforms SREs are responsible for.

Docker Operations

Required

Container lifecycle, resource limits (--cpus, --memory), log drivers, and health checks.

Docker Docs

Kubernetes for SREs

Required

Pod lifecycle, rolling deployments, resource requests/limits, liveness and readiness probes.

Kubernetes Docs

Amazon EKS Operations

Required

Managed node groups, Fargate profiles, cluster upgrades, and EKS add-ons.

Horizontal Pod Autoscaler & KEDA

Recommended

Scale workloads based on CPU, memory, and custom metrics from SQS or Prometheus.

7

Service Level Indicators (SLIs)

4 topics·4 required

Identify and measure the metrics that best represent user experience for your services.

What are SLIs?

Required

SLIs are quantitative measures of service behaviour: availability, latency, throughput, and error rate.

Google SRE: SLIs
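
As a rough illustration, an SLI is often expressed as the ratio of good events to total events. The sketch below assumes that "good" means a non-5xx response for availability and a response under a latency threshold for latency; the function names and the 300 ms default are illustrative choices, not part of any standard.

```python
def availability_sli(status_codes):
    """Fraction of requests that did not fail with a server error (5xx)."""
    codes = list(status_codes)
    if not codes:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(1 for code in codes if code < 500)
    return good / len(codes)


def latency_sli(latencies_ms, threshold_ms=300.0):
    """Fraction of requests served faster than threshold_ms."""
    samples = list(latencies_ms)
    if not samples:
        return None
    fast = sum(1 for value in samples if value < threshold_ms)
    return fast / len(samples)


if __name__ == "__main__":
    print(availability_sli([200, 200, 503, 404]))    # 3 of 4 requests are good
    print(latency_sli([100, 250, 400, 900], 300.0))  # 2 of 4 requests are fast
```

Counting 4xx responses as good is itself a design choice: client errors usually say more about the caller than about the service.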

Choosing the Right SLIs

Required

User-centric SLIs vs system metrics: measure what the user experiences, not what is easy to collect.

The Four Golden Signals

Required

Latency, traffic, errors, and saturation: Google's framework for monitoring any system.

RED & USE Methods

Required

Rate/Errors/Duration for request-driven services; Utilisation/Saturation/Errors for resources.

8

Service Level Objectives (SLOs)

4 topics·3 required·1 recommended

Set meaningful reliability targets and use them to make data-driven engineering decisions.

Defining SLOs

Required

SLO = SLI over a time window with a target threshold; how to set achievable, meaningful targets.

Google SRE: SLOs

Error Budgets

Required

Error budget = 1 - SLO target. Use it to balance reliability work vs feature velocity.
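
To make the arithmetic concrete, here is a minimal sketch that turns an availability SLO into allowed downtime and into a remaining request-based budget. The function names and the 30-day window are assumptions made for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed minutes of unavailability for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A remaining budget near zero is the signal to shift effort from feature work to reliability work.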

Error Budget Policies

Required

Define what happens when the budget is consumed: feature freeze, reliability sprints, or escalation.

SLO Review Cadence

Recommended

Regularly review and adjust SLOs as user expectations and system capabilities evolve.

9

Service Level Agreements (SLAs)

3 topics·2 required·1 recommended

Understand the contractual reliability commitments and how SLOs feed into them.

SLA vs SLO vs SLI

Required

The hierarchy: SLI measures it, SLO targets it internally, SLA commits to it externally.

Google SRE: Terminology

Setting Conservative SLAs

Required

SLAs should be set looser than the internal SLOs, so that missing an SLO leaves a safety buffer before a customer commitment is breached.

SLA Breach Consequences

Recommended

Financial penalties, service credits, and reputational impact of breached SLAs.

10

Metrics & Prometheus

4 topics·3 required·1 recommended

Collect and query time-series metrics to understand system health and trends.

Prometheus Architecture

Required

Scrape model, targets, exporters, the Prometheus data model, and labels.

Prometheus Docs

PromQL

Required

Instant vectors, range vectors, rate(), irate(), histogram_quantile(), and recording rules.

Alerting Rules

Required

Write Prometheus alerting rules with for clauses, severity labels, and runbook annotations.

Amazon Managed Service for Prometheus

Recommended

Serverless Prometheus-compatible monitoring at scale on AWS with AMP.

AMP Docs

11

Dashboards & Visualisation

4 topics·3 required·1 recommended

Build meaningful dashboards that surface actionable information for on-call engineers.

Grafana Fundamentals

Required

Panels, variables, annotations, alerts, and dashboard-as-code with Grafonnet/Jsonnet.

Grafana Docs

Amazon Managed Grafana

Recommended

Fully managed Grafana on AWS with native CloudWatch, AMP, and X-Ray data sources.

Amazon Managed Grafana Docs

Dashboard Design Principles

Required

Design dashboards for on-call use: clear signal hierarchy, minimal noise, and links to runbooks.

Amazon CloudWatch Dashboards

Required

Build operational dashboards with CloudWatch metrics, alarms, and Logs Insights widgets.

12

Logging & Log Management

4 topics·2 required·2 recommended

Collect, centralise, and query logs to diagnose incidents and understand system behaviour.

Structured Logging

Required

JSON log format, log levels, correlation IDs, and consistent field naming conventions.

AWS Logging Best Practices
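
A minimal sketch of these ideas using only the Python standard library: a formatter that emits one JSON object per log line and carries a correlation ID passed via the logging extra argument. The field names, the JsonFormatter class, and the build_logger helper are illustrative assumptions rather than a prescribed schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with stable field names."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by callers through extra={"correlation_id": ...}; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name="checkout", stream=None):
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    request_id = str(uuid.uuid4())
    log.info("payment accepted", extra={"correlation_id": request_id})
    log.warning("retrying ledger write", extra={"correlation_id": request_id})
```

Because both lines share the same correlation_id, a log query can pull up every event for one request regardless of which component emitted it.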

Amazon CloudWatch Logs

Required

Log groups, log streams, metric filters, subscription filters, and Logs Insights queries.

Centralised Logging with OpenSearch

Recommended

Ship logs from Lambda, EC2, and ECS to Amazon OpenSearch for full-text search and dashboards.

AWS Fluent Bit & Fluentd

Recommended

Collect and route logs from containers and EC2 instances to CloudWatch or OpenSearch.

13

Distributed Tracing

4 topics·3 required·1 recommended

Track requests across microservices to diagnose latency and pinpoint failure sources.

Distributed Tracing Concepts

Required

Traces, spans, context propagation, sampling, and the OpenTelemetry data model.

OpenTelemetry Docs

AWS X-Ray

Required

Instrument Lambda, ECS, and API Gateway; read service maps and analyse trace segments.

AWS X-Ray Docs

AWS Distro for OpenTelemetry (ADOT)

Recommended

Collect traces and metrics with the AWS-supported OpenTelemetry distribution.

Correlating Logs, Metrics & Traces

Required

Use trace IDs in logs, link metrics to traces, and build a unified observability workflow.

14

Alerting & On-Call Management

4 topics·4 required

Design actionable alerting and sustainable on-call rotations that minimise engineer burnout.

Alerting Best Practices

Required

Alert on symptoms not causes, avoid alert fatigue, and ensure every alert has a runbook.

Google SRE: Alerting

Amazon CloudWatch Alarms

Required

Composite alarms, anomaly detection alarms, metric math alarms, and SNS integration.

Alertmanager

Required

Route, deduplicate, group, and silence Prometheus alerts; integrate with PagerDuty and Slack.

On-Call Rotations & Escalation Policies

Required

Design fair rotations, primary/secondary escalation, and override schedules in PagerDuty or OpsGenie.

15

Incident Management

5 topics·4 required·1 recommended

Detect, respond to, and resolve incidents quickly while communicating clearly with stakeholders.

Incident Lifecycle

Required

Detection, triage, response, mitigation, resolution, and follow-up: the phases of an incident.

PagerDuty Incident Response Guide

Incident Commander Role

Required

Coordinate responders, manage communication, and make decisions under pressure.

Severity Levels

Required

Define SEV1–SEV4 (or P1–P4) with clear criteria for customer impact and response SLAs.

Incident Communication

Required

Internal Slack war rooms, external status pages (Statuspage.io), and stakeholder updates.

AWS Systems Manager OpsCenter

Recommended

Centralise operational issues, correlate findings from GuardDuty, Config, and CloudWatch.

16

Post-Incident Reviews (PIRs)

4 topics·4 required

Learn from incidents systematically to prevent recurrence and build institutional knowledge.

Blameless Post-Mortems

Required

Focus on systemic failures, not individual mistakes; psychological safety is critical.

Google SRE: Post-Mortems

Post-Mortem Template

Required

Impact summary, timeline, root cause, contributing factors, action items, and lessons learned.

Root Cause Analysis (RCA) Techniques

Required

5 Whys, fishbone diagrams, and fault tree analysis for structured root cause investigation.

Action Item Tracking

Required

Assign owners, set deadlines, and track PIR action items to completion in Jira or Linear.

17

Chaos Engineering

4 topics·2 required·2 recommended

Proactively inject failures into production systems to discover weaknesses before they cause incidents.

Chaos Engineering Principles

Required

Hypothesis-driven experimentation, blast radius control, and GameDay practices.

Principles of Chaos Engineering

AWS Fault Injection Service (FIS)

Required

Run controlled fault injection experiments on EC2, ECS, EKS, RDS, and more with AWS FIS.

AWS FIS Docs

LitmusChaos on EKS

Recommended

CNCF chaos engineering framework for Kubernetes: pod kill, network latency, and disk fill experiments.

GameDays

Recommended

Structured team exercises that simulate real outage scenarios to validate runbooks and incident response.

18

Toil Identification & Reduction

4 topics·4 required

Identify, measure, and systematically eliminate operational toil to free up engineering capacity.

What is Toil?

Required

Manual, repetitive, automatable work that scales with service growth and provides no enduring value.

Google SRE: Eliminating Toil

Measuring Toil

Required

Track toil as a percentage of engineer time; the SRE goal is to keep toil below 50%.

Toil Reduction Strategies

Required

Automate tickets, self-service provisioning, runbook automation, and event-driven remediation.

AWS Systems Manager Automation

Required

Create SSM Automation runbooks to execute common operational tasks without manual intervention.

SSM Automation Docs

19

Infrastructure as Code for SREs

4 topics·2 required·2 recommended

Manage infrastructure reliably and repeatably using code, the foundation of operational consistency.

Terraform for SREs

Required

Write, plan, and apply infrastructure changes safely: remote state, workspaces, and drift detection.

Terraform Docs

AWS CloudFormation & CDK

Required

Manage AWS resources declaratively with CloudFormation stacks or CDK constructs.

Ansible for Configuration Management

Recommended

Idempotent server configuration, patching automation, and drift remediation with Ansible.

IaC Testing & Validation

Recommended

Lint and test infrastructure code with tflint, Checkov, cfn-lint, and Terratest.

20

CI/CD Reliability

4 topics·3 required·1 recommended

Make deployment pipelines reliable, observable, and safe to run at high frequency.

Deployment Strategies

Required

Rolling, blue/green, canary, and feature flag deployments: risk profiles and rollback mechanisms.

AWS Deployment Strategies

AWS CodeDeploy

Required

Automate EC2, Lambda, and ECS deployments with health checks and automatic rollback.

GitHub Actions for SRE

Recommended

Automate operational workflows: AMI baking, certificate rotation, and compliance checks.

Deployment Observability

Required

Track deployment frequency, lead time, change failure rate, and MTTR (DORA metrics).
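
Two of these metrics reduce to simple arithmetic over deployment records. The sketch below assumes a hypothetical Deployment record with a failed flag and an optional time_to_restore; real pipelines would derive these from deploy and incident data.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Deployment:
    failed: bool                                  # did this change cause a failure in production?
    time_to_restore: Optional[timedelta] = None   # how long recovery took, if it failed


def change_failure_rate(deployments):
    """Share of deployments that led to a failure, or None with no deployments."""
    if not deployments:
        return None
    return sum(1 for d in deployments if d.failed) / len(deployments)


def mean_time_to_restore(deployments):
    """Average restore time across failed deployments, or None if none failed."""
    durations = [
        d.time_to_restore
        for d in deployments
        if d.failed and d.time_to_restore is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    history = [
        Deployment(failed=False),
        Deployment(failed=True, time_to_restore=timedelta(minutes=30)),
        Deployment(failed=False),
        Deployment(failed=True, time_to_restore=timedelta(minutes=90)),
    ]
    print(change_failure_rate(history))   # half of the deployments failed
    print(mean_time_to_restore(history))  # average of 30 and 90 minutes
```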

21

Capacity Planning

4 topics·4 required

Forecast resource needs and provision ahead of demand to prevent capacity-driven incidents.

Demand Forecasting

Required

Analyse traffic trends, seasonality, and growth projections to model future resource needs.

Google SRE: Forecasting

Load Testing

Required

Simulate production load with k6, Locust, or Artillery to validate capacity assumptions.

AWS Compute Optimizer

Required

ML-based right-sizing recommendations for EC2, Lambda, ECS, and EBS volumes.

AWS Compute Optimizer Docs

Auto Scaling & Predictive Scaling

Required

Configure dynamic and predictive scaling policies to handle demand spikes automatically.

22

Reliability Patterns

5 topics·4 required·1 recommended

Apply proven engineering patterns to make distributed systems resilient to partial failures.

Circuit Breaker Pattern

Required

Prevent cascading failures by temporarily blocking requests to a failing downstream service.

AWS: Circuit Breaker
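
A minimal, single-threaded sketch of the pattern: after a configurable number of consecutive failures the breaker opens and rejects calls immediately, then lets one trial call through once a timeout has elapsed. The class name, the thresholds, and the injectable clock are assumptions for illustration; production libraries add locking and richer half-open handling.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; request blocked")
            # Timeout elapsed: half-open, so this one call is allowed as a trial.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each downstream request in breaker.call(...) and treat CircuitOpenError as a signal to fail fast or serve a fallback.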

Retry & Exponential Backoff

Required

Retry transient failures with jitter and backoff to avoid thundering herd problems.
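
One way to sketch this is the "full jitter" variant, where each wait is a random value between zero and an exponentially growing, capped ceiling. The retry helper, its defaults, and the choice of retryable exceptions are assumptions made for illustration.

```python
import random
import time


def retry(operation, attempts=5, base=0.5, cap=30.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient errors with capped, jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(cap, base * (2 ** attempt))
            # Random delay spreads clients out so they do not retry in lockstep.
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry(flaky, base=0.01))
```

Only errors listed in retry_on are retried; anything else propagates immediately, which keeps permanent failures from being retried pointlessly.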

Bulkhead Pattern

Recommended

Isolate failures by partitioning resources: separate thread pools, queues, or service instances.

Rate Limiting & Throttling

Required

Protect services from overload using token bucket or leaky bucket rate limiting.
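
A token bucket can be sketched in a few lines: tokens refill at a steady rate up to a fixed capacity, and each request spends one. The class below is a single-threaded illustration with a hypothetical name and an injectable clock; it is not a drop-in production limiter.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5.0, capacity=10)
    accepted = sum(1 for _ in range(25) if bucket.allow())
    print(f"accepted {accepted} of 25 back-to-back requests")
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput once the burst is spent.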

Graceful Degradation

Required

Serve reduced functionality rather than full failure when dependencies are unavailable.

23

Security for SREs

4 topics·3 required·1 recommended

Integrate security into reliability practices: secure systems fail less and recover faster.

Least Privilege & IAM Roles

Required

Apply least privilege to automation scripts, Lambda functions, and EC2 instance profiles.

AWS IAM Best Practices

Secrets Management

Required

Retrieve credentials at runtime from AWS Secrets Manager or Parameter Store; no hardcoded secrets.

AWS GuardDuty for SREs

Recommended

Integrate GuardDuty findings into incident response workflows and PagerDuty escalation.

Vulnerability Management

Required

Automate patching with AWS Systems Manager Patch Manager and track CVEs with Amazon Inspector.

24

Cost Reliability

4 topics·3 required·1 recommended

Prevent cost incidents and ensure financial reliability alongside technical reliability.

Cost Anomaly Detection

Required

Use AWS Cost Anomaly Detection to alert on unexpected spend spikes automatically.

AWS Cost Anomaly Detection

Tagging & Cost Attribution

Required

Tag all resources consistently to enable service-level cost attribution and showback.

Spot Instance Reliability

Recommended

Build fault-tolerant workloads on Spot with interruption handlers and mixed instance policies.

AWS Budgets & Alerts

Required

Set cost and usage budgets with alert thresholds to catch runaway workloads early.

25

Runbook & Documentation Culture

4 topics·3 required·1 recommended

Build a culture of operational documentation that empowers any engineer to respond to incidents.

Writing Effective Runbooks

Required

Clear, step-by-step procedures with expected outcomes, rollback steps, and escalation paths.

PagerDuty: Runbook Guide

AWS Systems Manager Run Command

Required

Execute runbook steps remotely across EC2 fleets without SSH using SSM Run Command.

Architecture Decision Records (ADRs)

Recommended

Document why systems are designed the way they are to aid future incident diagnosis.

Operational Readiness Reviews (ORRs)

Required

Gate new services going on-call with a checklist: SLOs defined, runbooks written, dashboards built.

26

Advanced Observability

4 topics·2 required·2 recommended

Go beyond basic monitoring to build a deep, unified view of complex distributed systems.

OpenTelemetry Collector

Required

Deploy the OTel Collector as a sidecar or DaemonSet to pipeline traces, metrics, and logs.

OTel Collector Docs

Continuous Profiling

Recommended

Profile CPU and memory in production with Pyroscope or Amazon CodeGuru Profiler.

Real User Monitoring (RUM)

Recommended

Capture frontend performance from real user sessions with Amazon CloudWatch RUM.

CloudWatch RUM Docs

Synthetic Monitoring

Required

Simulate user journeys 24/7 with CloudWatch Synthetics canaries to detect outages proactively.

27

SRE Maturity & Org Culture

4 topics·3 required·1 recommended

Grow SRE practices across the organisation and measure the maturity of your reliability programme.

SRE Team Models

Required

Embedded vs centralised vs consulting SRE: choosing the right model for your organisation.

Google SRE: Organisational Change

DORA Metrics

Required

Deployment frequency, lead time, change failure rate, and MTTR as engineering performance indicators.

Production Readiness Reviews

Required

A structured checklist every new service must pass before receiving SRE support and going on-call.

SRE Maturity Model

Recommended

Assess and grow your SRE practice across five dimensions: monitoring, incident response, toil, SLOs, and culture.
