For all the things Google does that I have strong opinions about, some of their SRE training material is pure gold. I just finished reading the chapter on eliminating toil, and it really hits home for me.
If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.
- Carla Geisser, Google SRE
So the more time you spend keeping things running, the less time you have to make them better, both for you and for your end users.
I won’t copy out the rest of the article; I’d highly recommend reading it and the rest of the SRE book. (I really need to take my own advice.)