Are we DevOps, or SRE?

SRE vs. DevOps has been a raging battle for quite some time now. Ask any operations or infra team in today's bubbling tech shops and they will tell you, without a flutter, that they are SRE. Scratch a bit more and they will tell you how they follow the SRE principles, with stories from their daily grind, and link them to Google's SRE handbooks. Then drop the big question, "But isn't the same thing DevOps too?" and watch them get a bit confused and incoherent. Now ask, "Or maybe yours is more of a hybrid model than pure DevOps/SRE?" and you might turn a few heads and even leave some pondering.

Managing "Operations" as a disciplined practice has always been an arduous task. Many companies today have dedicated operations departments engaged in planning and execution but, more often than not, they fail badly. The tech operations landscape is no different. There are always unresolved questions about how to smoothly run a multitude of services and the business systems they make up. The best practices that seem to work on paper seldom operate well on the floor. Furthermore, copying established practices from other tech shops doesn't help either, once you realize the solutions are highly "context and culture" dependent. In reality, none of these problems are new; we have been thinking about how to better run our tangible systems and processes for millennia.

Technology shops today often manage operations as an evolved "cost center" business, with a direct focus on, amongst others, three distinct areas:

Executable infrastructure management.
Continuous CI/CD.
Production Business System Support.

The short-sightedness of this approach is still not widely understood. Every time the teams have a production system meltdown that leads to a support breakdown, and then to long calls and the related RCA and post-mortem cycles, they resort to experimentation and agree to introduce new tools and processes to ease things out. While the proposals appear to work on the whiteboard, in the hour-or-two-long meetings, they fail again the next time. The situation is even more intricate for companies that are big on scale and headcount. The bigger your house, the more violently it shakes in an earthquake.

With no reliable and orderly way to measure, control, and decipher the increasing complexities in production business systems, the technology operations landscape and its practitioners have evolved towards two recent solutions: DevOps and Site Reliability Engineering (SRE). While we talk about these individually, as if they were two separate reactions to the same problems, this article hopes to explain that they may in fact be more alike than different, complement each other, and can operate at different levels and in different teams within the same technology organization.

So let's check out what these are and how they help make things better.

So, What is DevOps?

DevOps is an ITSM (IT Service Management) framework that defines the mindset, culture, and ideology of working on IT projects as a collaboration among developers, operations, and QA teams.

At its core, DevOps is a methodology, a set of guiding principles.

There are 7 Primary guiding ideologies of DevOps:

Adopt agile methodologies
Foster a culture of collaborative learning
Automation - far and wide
Accept End-to-End responsibility
Shorten Feedback Loops - faster actions & reactions
Establish concrete monitoring & alerting insights
Establish a continuous improvement process

What does it entail?

DevOps is not just Continuous Integration, Deployment, Testing, Improvement etc.

These are high-level process guidelines that follow from one of its guiding principles.

DevOps is not a bunch of tools.

Tools help you implement your organization's / team's DevOps ideology.

DevOps != Rigorous Automation.

Automation helps complete DevOps along with shorter Feedback loops.

DevOps is not a new team / role.

It’s part of your existing work / team responsibilities.

DevOps is not an Agile / SCRUM replacement.

It leverages Agile to improve PMO capabilities over lean teams.

DevOps is not a one-size-fits-all strategy for every team / organization.

Different business / technology teams within the same organization can have different DevOps models.

DevOps is not “100% defect free” delivery.

It focuses on velocity of feedback loops and encourages evolutionary innovation.

What does modern DevOps look like?

The following image shows a practical realization of the DevOps methodology that is commonly adopted in technology companies.

Experienced technology leaders would quickly relate to this and also see the major shortcomings of this prevalent practice. Some that you might have faced yourself:

It has a lot of "continuous" things.
- But it doesn't explain when and by how much.
"Agile" is one big quadrant of it, and needs to be taken seriously. This also means you have one or more project managers onboard to streamline it.
- But it doesn't explain how to do it effectively and how to evolve the agile process with learnings.
CI is incomplete without testing.
- But it doesn't say when, where and what kind of testing to do.
CD is continuous as well.
- But what should our release strategy and release management look like?
- Rollback strategy? Blue-Green releases? Canary? Traffic mirroring?
And what about our executable infrastructure and automations?
- It offers no practical guidelines to move from "fighting production fires" to preventing them.

So while your proud and well-skilled polyglot dev teams:

Keep hammering away on their software features (small and big) AND
Have adopted processes and parlance (of their own) AND
Are releasing smaller changes frequently OR big bang changes occasionally AND
Are experimenting with new technology mixes (like eventual consistency, multi region service discovery, CQRS, highly elastic IMDBs etc.)

......they don't really have time to (and in many companies they don't have the skills, or don't believe it's their job role / responsibility to) address a lot of practical use cases that technology teams face every day on the floor, e.g.

How to do effective monitoring / Observability?
- Reactive or Proactive?
- Add discovery intelligence for forewarning.
- Site alerts - static, dynamic, relationships between cascading failures?
How to establish a process for supporting their individual services and then support the culmination of the business system consuming these services?
- Manage dependencies between services/apps?
- Their collective release strategy, versioning, rollbacks etc.?
- What about velocity of release cycles?
Care about things like:
- Business system reliability, and not just their service reliability?
- User experience indicators?
- MTTR, MTTF?
- SOPs (Standard Operating Procedures) for common issues faced in production and their resolutions?
- Investigations and Post Mortems (infrastructure, individual service or for an ecosystem of them)?
- RCA process and learning evolution?
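
Two of the indicators in the list above, MTTR and MTTF, reduce to simple arithmetic over incident records. The following is a minimal sketch, assuming a hypothetical `Incident` record with a start and a resolution timestamp; the field names and the sample data are illustrative only, and MTTF is taken here to mean the average healthy time between one recovery and the next failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the service became unavailable
    resolved: datetime  # when the service was restored


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: the average outage duration."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mttf(incidents: list[Incident]) -> timedelta:
    """Mean time to failure: average healthy time between a recovery and the next failure."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up sample data: three outages of 30, 60 and 30 minutes.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 3, 10, 30), datetime(2024, 1, 3, 11, 30)),
    Incident(datetime(2024, 1, 4, 11, 30), datetime(2024, 1, 4, 12, 0)),
]

print(mttr(incidents))  # 0:40:00
print(mttf(incidents))  # 1 day, 12:00:00
```

Tracked per service and again per business system, the same two numbers help show whether reliability problems sit in one service or in the way services depend on each other.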

Alright...What about SRE then?

SRE was born out of similar frustrations of balancing the operations side for fast-paced, polyglot systems/services teams at Google. Many processes and ideologies were experimented with at Google over time before they settled on what worked for their business scale, and it was Ben Treynor Sloss, a VP of engineering at Google, who coined the term and the approach. So while SRE, in its present form, was born from a small team that was more operationally focused, Ben provided the needed energy for treating the use cases from an engineering viewpoint.

In Google’s terms....

SRE applies Software Engineering principles to Operations (site & infra management, support, observability, major incident (MI) & postmortem handling etc.). SRE intends to bridge dev (functional & non-functional apps like internal SRE tooling and automations) and ops. The percentage of time spent on Ops and Dev varies and depends on the needs of the organization/team.

At its core, SRE is Google’s strategic extension of generic DevOps principles. In a way, SRE implements DevOps for Operations.

So what does it entail?

The theoretical ideology of SRE is based on the following 6 major principles. They provide the driving direction but leave it to the technology shops to implement them for their scale and setups.

Reduce Silo Barriers - Communicate More:
Bigger companies always have a complex organizational structure, with a lot of teams working in brittle vertical silos. They may adopt a polyglot microservices ideology, with each team building services that contribute to the same business platform/system, OR run the same setup for multiple business systems in parallel.

With the freedom that each technology team gets, coupled with the innate human desire to try new things and propel the project in the "direction they think is right", the technology teams are always pulling in different directions, not communicating with the rest of the company, unnecessarily complicating things in the name of engineering and, as a result, failing to see the big picture. This can lead to frustration within the teams, setbacks in deployments and, most importantly, high cost overruns, not just in dollar value but also in time erosion.

SRE doesn’t argue about silos and their construction in the company but instead focuses on how to get everyone to communicate freely in the same language. This can be done in multiple ways, e.g. by using the same tools, techniques, design expectations, support structure, issue handling process, and frameworks across the company, which in turn helps establish a uniform culture across teams with predefined expectations.

Accept failure as a norm - Embrace it:
Though DevOps attempts to intellectualize the control and handling of issues, sooner or later you, as a technology leader, will realize that failure is inevitable, especially in the cloud-native architecture landscape. Major DevOps processes in practice today attempt to drive failures down to zero and chase those elusive uptime targets. The bigger the scale and the more complex the business systems, the higher the probability of failure.

SRE attempts to handle it differently. The objective here is not to chase perfection but to accept imperfection and evolve with it. The aim is to arrive at a set of quantitative specifications that act as a prescription for balancing accidents and failures while maintaining the velocity of change. In more tangible terms, SRE strives to overlay budgets on everything we do to realize a business system release. Discussion of budgets is a more involved matter and a suitable topic for another post though.
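
One common way to put a number on acceptable imperfection is an error budget derived from an availability target. The following is a minimal sketch, assuming a hypothetical 99.9% target over a 30-day window; both figures and all function names are illustrative, not recommendations.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in the window for an availability target such as 0.999."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the downtime already incurred; negative means overspent."""
    return error_budget(slo_target, window) - downtime


# Hypothetical figures: 99.9% availability over 30 days.
window = timedelta(days=30)
budget = error_budget(0.999, window)  # roughly 43 minutes of allowed downtime
left = budget_remaining(0.999, window, downtime=timedelta(minutes=20))

print(f"budget: {budget}, remaining: {left}")
```

While the remaining budget is positive, the team can keep shipping changes; once it turns negative, the agreed response (for example, pausing feature releases) applies.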

Control Change - The Pace and The Size:
Everyone wants to move fast to production. We all want frequent releases (across the polyglot ecosystem), continually modernizing the business system and its product services, and keeping the development team members alive and simmering with the new and relevant technology. But do you want to:
"Move fast, break things and then learn from the experience?" OR
"Move fast enough to maintain predictable releases sans surprises and learn from the collective experience of your team and your own?"

Both DevOps and SRE are all for this change, but only SRE explains what it should look like. SRE lays more weight on reducing the cost of failure and of business system unavailability, and promotes faster recovery in case of unintended major issues. It lays emphasis on small, predictable releases that can be rolled back idempotently. DevOps does that too, but for SRE the internals of release management are not important, only the release strategy is. As long as the release strategy works within the bounds of the prescribed and agreed-upon budgets, SRE is happy, whether the build moves forward or is rolled back.
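
The idea that a release may move forward only while it stays within an agreed limit can be expressed as a simple gate. The following is a minimal sketch, assuming a hypothetical canary stage that reports request and error counts; the class names, the function and the threshold are all illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class CanaryStats:
    """Hypothetical counters gathered while a small release takes live traffic."""
    requests: int
    errors: int


def release_decision(stats: CanaryStats, max_error_rate: float) -> Decision:
    """Promote while the observed error rate stays within the agreed limit, else roll back."""
    if stats.requests == 0:
        # No traffic means no evidence; this sketch treats that as a reason to roll back.
        return Decision.ROLLBACK
    error_rate = stats.errors / stats.requests
    return Decision.PROMOTE if error_rate <= max_error_rate else Decision.ROLLBACK


# Hypothetical limit: at most 0.1% of canary requests may fail.
print(release_decision(CanaryStats(requests=10_000, errors=5), max_error_rate=0.001))
print(release_decision(CanaryStats(requests=10_000, errors=50), max_error_rate=0.001))
```

Either outcome is acceptable from the SRE point of view: the gate only checks that the decision respects the agreed limit, not how the release pipeline is built.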

Embrace Automation, Standardize Tooling:
Both DevOps and SRE encourage automation and fluid technology adoption. DevOps authorities stress bringing IaC (Infrastructure as Code) ideologies into the mainstream: treat your infrastructure, CI/CD, monitoring and, in general, everything as code.

SRE, on the other hand, lays extra focus on embracing consistency in technology, automation, tooling, and process adoption. The rationale is that this makes it easier to manage operations, visibility, auditing and reconciliation, and reduces toil and incompatibilities. This standardization further helps ensure that teams collaborate better in a consistent language of tools, techniques, design practices, and process expectations.

Measure Everything and Anything - Quantify it:
There is a golden saying in business management circles: "If you can’t measure it, you can’t improve it". When you think about this quote, it should immediately become apparent how true it is. Always talk in numbers, absolutes or relatives.

As the scale of the business systems and related automation grows exponentially, the need for constant feedback quickly sinks in. Persistent monitoring and passive observability become necessary to ensure that our tech shops are steadily moving in the right direction without losing focus on the other four principles listed above.

SRE ideologies introduce a standard set of metrics, collectively called Service Level Indicators (SLIs), to measure how well or badly things are doing. Standardizing the SLIs allows us to quantitatively establish what our Service Level Objectives (SLOs) should be. Based on that, we as a technology organization can agree on a set of Service Level Agreements (SLAs) that act as our torchbearers in times of chaos and MIs. Without effective metrics at all levels and the means to relate and decipher them, you are practically running blind on a mountain trail.
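
The relationship between the three terms can be sketched in a few lines. This is a minimal illustration, assuming a request-based availability SLI and hypothetical targets; the numbers are made up, and setting the SLA looser than the internal SLO is a common convention that is treated here as an assumption.

```python
SLO_TARGET = 0.999  # internal objective (hypothetical value)
SLA_TARGET = 0.995  # level promised to customers (hypothetical value)


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def evaluate(sli: float) -> str:
    """Compare a measured SLI against the objective first, then the agreement."""
    if sli >= SLO_TARGET:
        return "within SLO"
    if sli >= SLA_TARGET:
        return "SLO missed, SLA still met"
    return "SLA breached"


for good in (999_500, 997_000, 990_000):
    sli = availability_sli(good, 1_000_000)
    print(f"{sli:.4f}: {evaluate(sli)}")
```

The gap between the two targets gives the team room to react to a missed objective before a contractual commitment is affected.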

Drive Engineering - Get "Hands On":
SRE ideologies strive to attain and then maintain the production business system reliability but once this is done, SRE advises us to plan more than 50% of the team's effort on development work. This is a natural extension of prioritizing goals over a longer tenure. Systems reliability, once achieved effectively, turns into a passive observability exercise and the same operations team can leverage their engineering skills to contribute back to the core business advancements.

This means that SRE team members need to be well versed in architectural principles, cloud-native designs, full-stack & micro development practices and also have good knowledge of cloud providers and IaC practices. SRE teams need not contribute only to functional business services/applications. They can choose to support your internal tooling (e.g. your code repository providers, central issue & incident management platforms, your production clusters & service meshes, self-hosted data & integration services, monitoring & alerting platforms, etc.) or build new applications & automation to narrow the gap towards your SRE maturity.

Alright that's a mouthful....But what does practical SRE look like?

This picture sums up the practical ideologies of an SRE practice beautifully.

The image outlines the major SRE ideologies and categorizes them conveniently in three action groups viz:

Responsibilities of SRE - lays down the core duties or focus areas of an SRE team.
The Core Tenets of an SRE team - lays down the major doctrines of an SRE team.
The Goals of an SRE team - are the holy grail, the purpose of your SRE team.

An in-depth discussion of the ideologies is beyond the scope of this post, but it does merit a dedicated series touching on the practical aspects of each one of these.

Parting Thoughts....

SRE and DevOps may appear close on the surface, in both ideology and practice, in the overall landscape of IT operations, but dig deeper and you can see the differences. SRE takes a very pragmatic posture towards things, with clearly laid out practice guidelines, while DevOps is more of a free-spirited stallion. As shown, SRE provides tangible approaches on how to establish your practices early on in the journey. Concrete realizations of these guidelines are left to you, as a leader, to tailor to your specific technology shop.

SRE provides a quantitative approach to operations, backed by engineering merit. We observe everything & anything, automate to reduce toil, and promote engineering excellence to solve our business problems. Without quantification, SRE is just another infrastructure and support faction of DevOps. On the work floor, both use several of the same toolsets, similar approaches to code, change, incident and issue management, and the same data-based decision-making mindset. But it is the importance of communication-driven decision-making, not just between people but also between humans and systems, that makes SRE remarkably effective.

So, what do you think now? Is your technology company SRE enabled or following DevOps or maybe somewhere in between?
