Glossary

SRE

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering into the IT operations domain to create scalable and reliable software systems.

Site Reliability Engineering, often abbreviated as SRE, is a set of principles and practices that aim to bridge the gap between development and operations within a software engineering context. Originating at Google, SRE focuses on creating automated solutions for operational aspects such as deployment, scaling, and reliability, while maintaining a close alignment with business objectives.

SRE teams use code to manage and automate infrastructure and operations tasks. This includes writing software for service monitoring, performance tuning, incident response, and capacity planning. The goal is to improve system reliability and efficiency, while also enabling rapid development and release of new features.

One key principle of SRE is the concept of error budgets, which are defined as a quantitative measure of the acceptable level of risk of downtime. This concept helps balance the need for system reliability with the pace of innovation.

Another core practice is the use of Service Level Objectives (SLOs) and Service Level Indicators (SLIs). SLOs are explicit goals for system reliability, while SLIs are metrics that indicate the current level of service. Monitoring these helps SREs make data-driven decisions for system improvements.

By integrating development and operational skills, SRE promotes a collaborative culture where engineers share responsibilities and focus on automation to solve problems at scale, leading to more resilient systems and better user experiences.