Site Reliability Engineering: How Google Runs Production Systems
- Length: 552 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2016-04-16
- ISBN-10: 149192912X
- ISBN-13: 9781491929124
- Sales Rank: #23036
Building and operating distributed systems is fundamental to large-scale production infrastructure, but doing so in a scalable, reliable, and efficient way requires a great deal of good design and trial and error. In this collection of essays and articles, key members of the Site Reliability Team at Google explain how the company has successfully navigated these deep waters over the past decade.
You’ll learn how Google continuously monitors and deploys some of the largest software systems in the world, how its Site Reliability Engineering team learns and improves after outages, and how the team uses error budgets to balance risk-taking against reliability.
Table of Contents
Part I. Introduction
Chapter 1. Introduction
Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE
Part II. Principles
Chapter 3. Embracing Risk
Chapter 4. Service Level Objectives
Chapter 5. Eliminating Toil
Chapter 6. Monitoring Distributed Systems
Chapter 7. The Evolution of Automation at Google
Chapter 8. Release Engineering
Chapter 9. Simplicity
Part III. Practices
Chapter 10. Practical Alerting from Time-Series Data
Chapter 11. Being On-Call
Chapter 12. Effective Troubleshooting
Chapter 13. Emergency Response
Chapter 14. Managing Incidents
Chapter 15. Postmortem Culture: Learning from Failure
Chapter 16. Tracking Outages
Chapter 17. Testing for Reliability
Chapter 18. Software Engineering in SRE
Chapter 19. Load Balancing at the Frontend
Chapter 20. Load Balancing in the Datacenter
Chapter 21. Handling Overload
Chapter 22. Addressing Cascading Failures
Chapter 23. Managing Critical State: Distributed Consensus for Reliability
Chapter 24. Distributed Periodic Scheduling with Cron
Chapter 25. Data Processing Pipelines
Chapter 26. Data Integrity: What You Read Is What You Wrote
Chapter 27. Reliable Product Launches at Scale
Part IV. Management
Chapter 28. Accelerating SREs to On-Call and Beyond
Chapter 29. Dealing with Interrupts
Chapter 30. Embedding an SRE to Recover from Operational Overload
Chapter 31. Communication and Collaboration in SRE
Chapter 32. The Evolving SRE Engagement Model
Part V. Conclusions
Chapter 33. Lessons Learned from Other Industries
Chapter 34. Conclusion
Appendix A. Availability Table
Appendix B. A Collection of Best Practices for Production Services
Appendix C. Example Incident State Document
Appendix D. Example Postmortem
Appendix E. Launch Coordination Checklist
Appendix F. Example Production Meeting Minutes