DevOps Engineer Beginner to Expert
A comprehensive roadmap to master DevOps engineering from Linux fundamentals to advanced cloud-native and platform engineering concepts.
This roadmap guides you from Linux fundamentals through to advanced platform engineering and MLOps. Each stage builds on the last, so work through them sequentially to develop a deep, well-rounded DevOps skill set. Mark topics as you complete them, and revisit earlier stages to reinforce your foundations as you progress.
Linux Fundamentals
File System & Navigation
Understand the Linux directory tree, navigate with cd/ls/pwd, and manage files with cp/mv/rm.
Users, Groups & Permissions
Manage users with useradd/usermod, set file permissions with chmod/chown, and understand sudo.
Package Management
Install and update software with apt, yum/dnf, and snap. Understand package repositories.
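To make the permissions topic in this stage more concrete, here is a minimal sketch of how the octal modes passed to chmod map to the rwx strings shown by ls -l. The helper name describe_mode is an illustrative assumption, not part of any tool mentioned above.

```python
import stat


def describe_mode(octal_mode: str) -> str:
    """Turn an octal permission string such as '754' into 'rwxr-xr--'."""
    bits = int(octal_mode, 8)
    # stat.filemode expects the file-type bits too; S_IFREG marks a regular file.
    # The first character of its result is the file type, so it is dropped here.
    return stat.filemode(stat.S_IFREG | bits)[1:]


print(describe_mode("644"))  # rw-r--r--
print(describe_mode("754"))  # rwxr-xr--
```

Each octal digit covers one of owner, group, and others, and is the sum of read (4), write (2), and execute (1).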
Shell Scripting
Bash Basics
Variables, conditionals (if/else), loops (for/while), and functions in Bash.
Text Processing
Use grep, awk, sed, cut, and sort to process and transform text streams.
Cron Jobs & Scheduling
Schedule recurring tasks using crontab and understand cron syntax.
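One way to get comfortable with cron syntax is to expand a single field by hand. The sketch below is a simplified illustration, not a full cron parser: expand_field is a hypothetical helper that handles only *, single values, ranges, comma lists, and step values, and it ignores names such as MON or JAN.

```python
from typing import List


def expand_field(field: str, lo: int, hi: int) -> List[int]:
    """Expand one cron field into the sorted list of values it matches.

    lo and hi are the bounds for the field, e.g. 0 and 59 for minutes.
    """
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step_n))
    return sorted(values)


# "*/15 9-17 * * 1-5": every 15 minutes, 09:00 to 17:45, Monday to Friday.
print(expand_field("*/15", 0, 59))  # [0, 15, 30, 45]
print(expand_field("9-17", 0, 23))  # [9, 10, 11, 12, 13, 14, 15, 16, 17]
print(expand_field("1-5", 0, 6))    # [1, 2, 3, 4, 5]
```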
Networking Basics
TCP/IP & DNS
How IP addressing, subnets, DNS resolution, and routing work.
HTTP/HTTPS & TLS
Request/response lifecycle, status codes, headers, and TLS certificate fundamentals.
Firewalls & Ports
Configure iptables/ufw, understand inbound/outbound rules, and common service ports.
Load Balancing Concepts
Round-robin, least-connections, and health checks at the network layer.
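The two selection strategies named above can be sketched in a few lines. This is a toy model under stated assumptions: the class names and backend labels are made up for illustration, and health checks are left out to keep the example short.

```python
from itertools import cycle


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new connection to the backend with the fewest active ones."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        # Ties go to the backend that was listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when a connection closes so the counter stays accurate.
        self.active[backend] -= 1


rr = RoundRobin(["a", "b", "c"])
print([rr.pick() for _ in range(4)])  # ['a', 'b', 'c', 'a']
```

Round-robin ignores load entirely, while least-connections adapts when some requests hold their connection longer than others.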
Version Control with Git
Core Git Workflow
init, clone, add, commit, push, pull, and status commands.
Branching & Merging
Feature branches, merge vs rebase, resolving merge conflicts.
Git Workflows
Gitflow, trunk-based development, and pull request best practices.
Tags & Releases
Semantic versioning with git tags and using GitHub/GitLab releases.
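A short sketch of why semantic version tags should be compared numerically rather than as strings. The tag names are invented for illustration, and the pattern below deliberately ignores pre-release and build-metadata suffixes.

```python
import re

# Matches tags such as "v1.4.2" or "1.4.2" (MAJOR.MINOR.PATCH only).
SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> tuple:
    """Return (major, minor, patch) as integers, or raise ValueError."""
    match = SEMVER_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def bump(tag: str, part: str) -> str:
    """Compute the next tag for a 'major', 'minor', or 'patch' release."""
    major, minor, patch = parse_tag(tag)
    if part == "major":
        return f"v{major + 1}.0.0"
    if part == "minor":
        return f"v{major}.{minor + 1}.0"
    if part == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


tags = ["v1.9.0", "v1.10.0", "v1.2.3"]
latest = max(tags, key=parse_tag)
print(latest)                 # v1.10.0 (a plain string sort would pick v1.9.0)
print(bump(latest, "minor"))  # v1.11.0
```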
Containerization with Docker
Docker Architecture
Images, containers, the Docker daemon, and the Docker Hub registry.
Writing Dockerfiles
FROM, RUN, COPY, ENV, EXPOSE, CMD, and ENTRYPOINT instructions.
Docker Compose
Define and run multi-container applications with docker-compose.yml.
Image Optimization
Multi-stage builds, layer caching, .dockerignore, and minimizing image size.
Container Registries
Docker Hub
Push and pull public/private images from Docker Hub.
Private Registries
Use AWS ECR, GitHub Container Registry, or self-hosted Harbor.
Image Scanning
Scan for vulnerabilities using Trivy or Docker Scout before deploying.
CI/CD Fundamentals
CI/CD Concepts
The pipeline stages: build, test, lint, package, deploy, and release.
Pipeline as Code
Define pipelines in YAML (GitHub Actions, GitLab CI) rather than through a UI.
Artifacts & Caching
Cache dependencies and pass build artifacts between pipeline stages.
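Dependency caches are commonly keyed on a hash of the lockfile, so the cache is reused until the dependencies actually change. The sketch below shows that idea only; cache_key and the "deps" prefix are hypothetical names, and real CI systems provide their own built-in syntax for this.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Build a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    # A shortened digest keeps the key readable; 16 hex characters is an arbitrary choice.
    return f"{prefix}-{digest[:16]}"


old = cache_key("deps", b"requests==2.31.0\n")
same = cache_key("deps", b"requests==2.31.0\n")
new = cache_key("deps", b"requests==2.32.0\n")

print(old == same)  # True: identical lockfile, cache hit
print(old == new)   # False: dependency changed, cache miss
```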
GitHub Actions
Workflow Syntax
on, jobs, steps, uses, and run: the building blocks of a GitHub Actions workflow.
Reusable Workflows & Composite Actions
DRY your pipelines by sharing logic across repositories.
Secrets & Environments
Manage sensitive values, approval gates, and environment-specific variables.
GitLab CI/CD
.gitlab-ci.yml Basics
Stages, jobs, scripts, image, and before_script in GitLab CI.
GitLab Runners
Register and configure shared and self-hosted runners for job execution.
Environments & Deployments
Track deployments per environment and use dynamic child pipelines.
Cloud Computing Fundamentals
Cloud Service Models
IaaS vs PaaS vs SaaS: understand which layers you manage and which the provider manages.
Regions, Availability Zones & Edge
How cloud providers distribute infrastructure globally for availability.
Shared Responsibility Model
What the cloud provider secures vs what you are responsible for.
AWS Core Services
IAM: Identity & Access Management
Users, groups, roles, policies, and the principle of least privilege.
EC2 & Auto Scaling
Launch instances, configure AMIs, security groups, and Auto Scaling Groups.
S3: Simple Storage Service
Buckets, objects, versioning, lifecycle policies, and static website hosting.
VPC: Virtual Private Cloud
Subnets, route tables, internet gateways, NAT gateways, and VPC peering.
RDS & Databases
Managed relational databases, Multi-AZ deployments, and read replicas.
Infrastructure as Code: Terraform
Terraform Core Concepts
Providers, resources, data sources, variables, and outputs.
State Management
Local vs remote state, terraform.tfstate, state locking with S3 + DynamoDB.
Modules
Write reusable, composable Terraform modules and use the public registry.
Workspaces & Environments
Manage dev/staging/prod environments using workspaces or directory isolation.
Configuration Management
Ansible Fundamentals
Inventories, playbooks, tasks, handlers, roles, and ad-hoc commands.
Ansible Roles & Galaxy
Structure playbooks with roles and reuse community roles from Ansible Galaxy.
Idempotency
Understand why idempotent tasks are critical for reliable automation.
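Idempotency means a task can run any number of times and leave the system in the same state, reporting a change only on the run that actually altered something. The sketch below imitates that behaviour for a single line in a config file; ensure_line and the file name are illustrative assumptions, not Ansible's implementation.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file. Return True only if a change was made."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # Desired state already holds, so do nothing.
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "sshd_config"  # hypothetical file, created in a temp dir
    print(ensure_line(config, "PermitRootLogin no"))  # True: line was added
    print(ensure_line(config, "PermitRootLogin no"))  # False: nothing to change
    print(config.read_text().count("PermitRootLogin no"))  # 1
```

A non-idempotent version that always appended the line would grow the file on every run, which is exactly the drift that reliable automation has to avoid.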
Kubernetes Core Concepts
Cluster Architecture
Control plane (API server, etcd, scheduler, controller-manager) and worker nodes (kubelet, kube-proxy).
Pods, Deployments & ReplicaSets
The smallest deployable unit, declarative rollouts, and replica management.
Services & Networking
ClusterIP, NodePort, LoadBalancer service types, and DNS within the cluster.
ConfigMaps & Secrets
Decouple configuration from container images and manage sensitive data.
Kubernetes Workloads & Storage
StatefulSets & DaemonSets
Run stateful applications with stable network identities and per-node daemons.
Persistent Volumes & Claims
PV, PVC, StorageClasses, and dynamic provisioning for stateful data.
Jobs & CronJobs
Run batch tasks and scheduled workloads inside a cluster.
Resource Requests & Limits
Set CPU and memory requests/limits to ensure fair scheduling and stability.
Kubernetes Advanced Operations
Ingress & Ingress Controllers
Expose HTTP/S routes with NGINX or Traefik ingress controllers and TLS termination.
Horizontal & Vertical Pod Autoscaling
HPA based on CPU/custom metrics and VPA for right-sizing resource requests.
RBAC
Role-Based Access Control: ClusterRoles, Roles, RoleBindings, and ServiceAccounts.
Network Policies
Restrict pod-to-pod communication using Kubernetes Network Policies.
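For the HPA topic in this stage, the core scaling rule described in the Kubernetes documentation is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The sketch below applies that rule with a tolerance band and min/max clamping. The function name and default values are assumptions for illustration, and the real controller also handles details such as pod readiness and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Estimate the replica count an HPA would aim for."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
print(desired_replicas(4, current_metric=62, target_metric=60))  # 4 (within tolerance)
print(desired_replicas(4, current_metric=30, target_metric=60))  # 2
```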
Helm: Kubernetes Package Manager
Helm Chart Structure
Chart.yaml, values.yaml, the templates/ directory, and reusable helper templates in _helpers.tpl.
Templating with Go Templates
Use {{ .Values }}, conditionals, loops, and named templates in Helm.
Chart Repositories & OCI Registries
Host charts on GitHub Pages, Artifact Hub, or push to OCI-compatible registries.
Helm Hooks & Tests
Run pre/post-install jobs and validate deployments with helm test.
GitOps
GitOps Principles
Declarative config, versioned history, automated reconciliation, and self-healing.
ArgoCD
Deploy and sync Kubernetes manifests automatically from a Git repository with ArgoCD.
Flux CD
CNCF-graduated GitOps toolkit for continuous delivery to Kubernetes.
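The automated reconciliation principle in this stage boils down to repeatedly comparing desired state (from Git) with actual state (in the cluster) and acting on the difference. The sketch below models both as plain dictionaries; plan, reconcile, and the resource names are hypothetical, and this is not how ArgoCD or Flux are implemented internally.

```python
def plan(desired: dict, actual: dict) -> list:
    """List the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the planned actions to `actual` in place and return it."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actual


desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}  # what Git says
actual = {"web": {"replicas": 1}, "old": {"replicas": 1}}   # what is running

print(plan(desired, actual))  # [('update', 'web'), ('create', 'api'), ('delete', 'old')]
reconcile(desired, actual)
print(plan(desired, actual))  # []: nothing left to do, the states match
```

Running the loop again after a manual change to the actual state would produce a new plan that reverts it, which is the self-healing behaviour described above.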
Observability: Logging
Structured Logging
JSON log formats, log levels, correlation IDs, and log context best practices.
ELK / EFK Stack
Collect with Fluentd/Filebeat, store in Elasticsearch, visualize in Kibana.
Loki & Grafana
Lightweight log aggregation with Loki, queried via LogQL in Grafana.
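As an illustration of the structured logging topic in this stage, the sketch below emits one JSON object per log line and carries a correlation ID through the standard library's extra mechanism. The field names, the "checkout" logger, and the ID value are assumptions chosen for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"correlation_id": ...}); None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"correlation_id": "req-123"})
```

Because every line is valid JSON with consistent keys, a log backend can filter on level or correlation_id without fragile text parsing.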
Observability: Metrics
Prometheus
Scrape metrics with Prometheus, write PromQL queries, and configure alerting rules.
Grafana Dashboards
Build dashboards, panels, and variables to visualize Prometheus metrics.
Exporters
Use node_exporter, kube-state-metrics, and blackbox_exporter to expose host, cluster-object, and endpoint-probe metrics.
Alertmanager
Route, deduplicate, and silence alerts; integrate with PagerDuty, Slack, and email.
Observability: Tracing
Distributed Tracing Concepts
Traces, spans, context propagation, and the OpenTelemetry data model.
Jaeger & Tempo
Deploy Jaeger or Grafana Tempo as a tracing backend and query trace data.
Instrumenting Applications
Add OpenTelemetry SDKs to Node.js, Python, and Go services.
Security: DevSecOps
Static Application Security Testing (SAST)
Scan source code for vulnerabilities using Semgrep, SonarQube, or Bandit.
Dependency Scanning (SCA)
Detect vulnerable third-party packages using Dependabot, Snyk, or OWASP Dependency-Check.
Container Image Scanning
Scan Docker images for CVEs with Trivy or Grype in your CI pipeline.
Secrets Detection
Prevent credentials from reaching Git with detect-secrets, GitGuardian, or gitleaks.
Security: Cloud & Kubernetes Hardening
CIS Benchmarks
Apply CIS benchmarks for Linux, Docker, and Kubernetes to harden configurations.
Pod Security Standards
Enforce the privileged, baseline, and restricted profiles with Pod Security Admission (PSA), which replaced PodSecurityPolicy (PSP) after its removal in Kubernetes 1.25.
OPA / Kyverno
Define and enforce admission policies in Kubernetes with Open Policy Agent or Kyverno.
Secrets Management
Store and inject secrets using HashiCorp Vault or AWS Secrets Manager.
Service Mesh
Service Mesh Concepts
Sidecar proxy pattern, control plane vs data plane, and mTLS between services.
Istio
Traffic management, circuit breaking, retries, and observability with Istio.
Linkerd
Lightweight CNCF service mesh focused on simplicity and low resource overhead.
Infrastructure Testing
Terraform Testing
Use Terratest or terraform test to write unit and integration tests for modules.
Kitchen-Terraform / Checkov
Policy-as-code compliance scanning for Terraform with Checkov or tfsec.
Kubernetes Manifest Testing
Lint and validate manifests with kubeconform (the maintained successor to kubeval) and Polaris.
Cost Optimization
Cloud Cost Visibility
AWS Cost Explorer, tagging strategies, and budget alerts.
Right-sizing & Reserved Instances
Match instance types to workload needs and use Savings Plans or Reserved Instances.
Spot Instances & Preemptible VMs
Run fault-tolerant workloads on Spot/Preemptible instances for savings of up to 90% compared with on-demand pricing.
FinOps Practices
Cross-team cost accountability, showback/chargeback models, and FinOps Foundation principles.
Disaster Recovery & Backup
RTO & RPO
Define Recovery Time Objective and Recovery Point Objective for each service.
Backup Strategies
Automated backups for databases, volumes, and object storage with retention policies.
Multi-Region & Cross-Zone Architecture
Design for availability zone and region failure using active-active or active-passive patterns.
Site Reliability Engineering (SRE)
SLIs, SLOs & SLAs
Define measurable reliability targets and track them with error budgets.
Error Budgets
Use error budgets to make data-driven decisions about feature releases vs reliability work.
Incident Management
Runbooks, on-call rotations, post-mortems, and blameless culture.
Chaos Engineering
Proactively test system resilience with tools like Chaos Monkey, LitmusChaos, or AWS FIS.
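The error-budget idea in this stage reduces to simple arithmetic: the budget is the share of the window in which the service is allowed to be unavailable. The sketch below assumes a time-based availability SLO and a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 43.2
# After a 21.6-minute outage, roughly half the budget is left.
print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

When the remaining fraction approaches zero, the usual policy is to pause risky feature releases and prioritize reliability work until the window rolls over.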
Platform Engineering
Internal Developer Platforms
Concepts behind IDPs: golden paths, self-service infrastructure, and platform teams.
Backstage
Deploy Spotify Backstage as a developer portal with a software catalog and TechDocs.
Crossplane
Provision cloud resources from Kubernetes using Crossplane Compositions and XRDs.
Advanced Kubernetes Patterns
Custom Resource Definitions (CRDs)
Extend the Kubernetes API with your own resource types.
Operators
Encode operational knowledge as Kubernetes controllers using Operator SDK or Kubebuilder.
Admission Webhooks
Mutating and validating admission webhooks for policy enforcement and injection.
Advanced Scheduling
Node affinity, taints/tolerations, topology spread constraints, and priority classes.
Multi-Cloud & Hybrid Cloud
Multi-Cloud Strategy
Evaluate use cases for multi-cloud vs single-cloud: portability, vendor lock-in, and compliance.
Terraform Multi-Provider
Manage AWS, GCP, and Azure resources within a single Terraform configuration.
Federated Kubernetes
Manage workloads across multiple clusters with Cluster API or Google Anthos.
AI & MLOps Foundations
MLOps Concepts
CI/CD for ML models, experiment tracking, model registries, and feature stores.
Kubeflow Pipelines
Orchestrate ML workflows on Kubernetes using Kubeflow.
GPU Workloads on Kubernetes
Configure NVIDIA device plugins and resource limits for GPU-accelerated pods.