The overwhelming majority of a software system's lifespan is spent in use, not in design or
implementation. So why does conventional wisdom insist that software engineers focus primarily
on the design and development of large-scale computing systems? In this collection of essays
and articles, key members of Google's Site Reliability Team explain how and why their
commitment to the entire lifecycle has enabled the company to successfully build, deploy,
monitor, and maintain some of the largest software systems in the world. You'll learn the
principles and practices that enable Google engineers to make systems more scalable, reliable,
and efficient, lessons directly applicable to your organization.

This book is divided into four sections:

Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices.
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE).
Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems.
Management: Explore Google's best practices for training, communication, and meetings that your organization can use.