Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way. Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on. We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification. That practice and pace should be the norm instead of the exception.
This post is part of an SRE series grounded in the ideas inspired by the Google SRE book. Every Ops team I know is underwater and doesn’t have the time to catch their breath. Why does the load increase and leave Ops behind? It’s because IT is increasingly fragmented and