I’m frustrated with the DevOps vs. NoOps debate. I don’t feel like it’s helping us define where we want to go. Instead I think we’re stuck arguing semantics. On the one hand, there seems to be a certain amount of violent agreement. On the other hand, we haven’t actually defined our terms. What does “ops” mean to each of the debaters? Does NoOps mean “no operations activities”, or “no operations organization”, or “no operations budget”? Furthermore, I think the discussion is happening at the wrong level.
Various Traditional Operations Activities
I do think there are some things about which we can all agree. First, various traditional operations activities are going away:
- IaaS means most of us will no longer rack servers, swap hard drives, run CAT-5 cable, etc.
- PaaS means most of us will no longer configure firewalls or load balancers, install database or web server software, etc.
- Configuration automation means most of us will no longer manually install applications, patches, public ssh keys, etc.
Silo’d Development and Ops Cultures and Behaviors
Second, silo’d development and ops cultures and behaviors are problematic. When development tries to maximize change while ops tries to minimize it, the result is lower efficiency, responsiveness, quality, and mutual respect, all at the same time. What’s needed is a unified focus on simultaneously maximizing agility and reliability.
What isn’t changing or going away is the need for operational excellence. Neither is accountability for operational excellence going away. If I run my SaaS application on Heroku, and Heroku goes down, does my CEO choke Heroku’s throat, or Amazon’s, or mine? Even if I’ve outsourced all the traditional ops activities to the cloud, I still have to ensure my application’s performance, scalability, and resiliency. I still have to do things such as:
- Evaluate vendors to determine who can best meet my business’s operational needs
- Design, test, and implement tools and procedures for dealing with vendor outages and performance/scalability bottlenecks
- Figure out what to do if one of my vendor choices turned out to be a bad one
- Monitor everything from application availability to dynamic usage costs and make intelligent decisions based on the metrics I gather
- Understand and manage the relationships between the outsourced pieces I’ve integrated into my application
- Plan for the future of my business, my application, and the underlying IT landscape
From an ITSM perspective, almost none of the core ITIL activities completely disappear. Do these activities count as “ops”? If you use my definition, which is “accountability for operational excellence”, the answer is yes. According to someone else’s definition, the answer might be no. To be honest, to channel Brian Katz, I’m not sure I care. However we label them, these activities need to happen, someone needs to be accountable for them, and they need to be integrated into an overall, coherent set of activities focused on delivering value to customers. As I’ve stated elsewhere, in the era of IT-as-a-Service, customer value includes operational excellence.
There could still be argument about who carries out these activities. Again, I’m not sure it matters. To the statement “the developers carry the pagers”, for example, one might respond that “carrying the pager” is the very definition of ops! In any case, carrying the pager means there’s an expectation that something may go wrong, and that when it does, someone needs to respond to it. Differentiating whether the outage is caused by a platform problem or a code problem is just more siloing. The key fact is that there’s an outage, and it needs to be rectified. If developers are doing system architecture, and carrying pagers, then maybe “developer” is no longer an accurate job title.
NoOps often gets linked with PaaS
I’m a big fan of PaaS, at least in principle. I do, however, see a certain amount of talk that seems to imply that PaaS “solves all problems” and “enables blissful ignorance”. As I mentioned above, and as I have personally experienced, using a hosted PaaS solution does not migrate operational accountability away from oneself. As others such as James Urquhart and John Allspaw have expressed much more eloquently than I could, cloud hides certain kinds of problems, only to replace them with new and more interesting ones. AWS will have outages. Heroku and Azure will have outages. PaaS platforms will be unable to auto-magically scale up applications due to underlying IaaS congestion. If I spread my application functionality across Heroku and AppFog, and they both run on top of AWS, then I still have a common point of failure/degradation in my architecture. If I leverage a SaaS service as part of my app, I may not even know where it runs or what its OpExc (operational excellence, not operating expense 🙂) strategy is. The operational activities that the cloud doesn’t render obsolete, it makes more complex than ever.
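To make the Heroku/AppFog example concrete, here is a minimal sketch of how one might check an architecture for shared points of failure. The component names (`web-frontend`, `background-jobs`), the dependency map, and both helper functions are hypothetical and exist only for illustration; a real inventory would have to be gathered from each vendor, and for some SaaS services it may not be available at all.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each entry lists what that component runs on.
# The Heroku -> AWS and AppFog -> AWS edges mirror the example in the text;
# the two application components are invented for illustration.
DEPENDS_ON: Dict[str, List[str]] = {
    "web-frontend": ["Heroku"],
    "background-jobs": ["AppFog"],
    "Heroku": ["AWS"],
    "AppFog": ["AWS"],
}


def transitive_dependencies(component: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return everything `component` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def shared_points_of_failure(components: List[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Return the dependencies common to every listed component."""
    dep_sets = [transitive_dependencies(c, graph) for c in components]
    return set.intersection(*dep_sets) if dep_sets else set()


if __name__ == "__main__":
    # Both components sit on different PaaS vendors, yet share one provider.
    print(shared_points_of_failure(["web-frontend", "background-jobs"], DEPENDS_ON))
```

Spreading work across two PaaS vendors looks redundant at the top of the graph, but the intersection of the transitive dependencies shows what the two halves still have in common.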
It may very well be the case that traditional ops budgets and staff sizes will shrink, and that more and more traditional ops activities will happe