Building reliable systems requires a shift in mindset and a focus on scalability, automation, and resilience. This course introduces participants to the core practices of Site Reliability Engineering (SRE), a discipline that blends software engineering with IT operations. Participants will explore service-level objectives, error budgets, and modern operational strategies.
Learning Outcomes:
Understand the principles and origins of Site Reliability Engineering
Describe the relationship between reliability and system performance
Apply SRE tools and techniques such as SLIs, SLOs, and error budgets
Analyse incidents and implement learning through blameless postmortems
Key Topics:
History and principles of SRE
Monitoring, observability, and incident response
Automation and toil reduction strategies
Service-level indicators and error budget policies
Exam Details
This course is designed to build participants’ understanding of key concepts and practices covered in the DevOps Institute (DOI) SRE Foundation certification.
The official SRE Foundation certification exam is included in the course fee. Participants will explore real-world case studies and engage with topics aligned with the SRE Foundation exam content, including error budgets, service level objectives (SLOs), service level indicators (SLIs), monitoring strategies, toil reduction, and observability.
The course has been developed by referencing key SRE sources and contributions from industry thought leaders and organisations actively adopting SRE practices.
To maximise success, participants are strongly encouraged to complement the course with additional self-study, revision of course materials, and dedicated practice before attempting the exam.
Module 1: SRE Principles and Practices
- What is Site Reliability Engineering?
- SRE and DevOps: What is the Difference?
- SRE Principles and Practices
Module 2: Service Level Objectives and Error Budgets
- Service Level Objectives
- Error Budgets
- Error Budget Policies
Module 3: Reducing Toil
- What is Toil?
- Why Toil is Bad
- Doing Something About Toil
Module 4: Monitoring and Service Level Indicators
- SLIs - Service Level Indicators
- Monitoring
- Observability
Module 5: SRE Tools and Automation
- Automation Defined
- Automation Focus
- Hierarchy of Automation Types
- Secure Automation
- Automation Tools
Module 6: Antifragility and Learning from Failure
- Why Learn from Failure
- Benefits of Antifragility
- Shifting the Organisational Balance
Module 7: Organisational Impact of SRE
- Why Organisations Embrace SRE
- Patterns for SRE Adoption
- SRE Job Description
- Sustainable Incident Response
- Blameless Postmortems
- SRE and Scale
Module 8: SRE, Other Frameworks, Trends
- SRE and Other Frameworks
- SRE Evolution