One of the main challenges QA, DevOps and SRE teams face is knowing when a release is good to go. More often than not, we release not because we think it’s ready, but because it’s time and we have to. We face a dual problem. One part is coverage: our testing is only as good as our coverage, which is never 100%. In complex systems it’s very hard to foresee all the different code and infrastructure combinations, especially when releasing a new feature.
The second aspect is anomaly detection. The noise-to-signal ratio in large-scale systems, even in test environments, is high. If something abnormal does happen, how do we even know about it, short of users (internal or external) experiencing the anomalies themselves and telling us, or of us helplessly watching a business metric decline?
— OverOps (@overopshq) January 24, 2019
It seems the industry needs a more effective approach, one that applies machine learning and AI to automatically gate bad code from moving up the chain. This is another step towards software automation, where the code scores itself to tell us how well it’s doing, especially in comparison to how it was doing yesterday, or in a previous release. That sounds pretty sci-fi, so the question becomes: can we break this down into something we can actually implement? And how do we even evaluate what makes for a good release vs. a bad one?
To compare apples to apples, we need an objective measure to quantify the quality of a release, and to set the gates a release must pass before it moves from staging into production – or, in the case of production, gets to remain there. These are the four “gates” we came up with, ordered from the simplest to the most complex; let us know what you think.
Gate #1 – Error Volume
This gate is the most basic one: does this release increase the number of errors compared to its predecessor? The challenge here is capturing this data correctly, incorporating both errors that are logged and those that aren’t, such as HTTP errors and swallowed or uncaught exceptions, which might actually be more important than those in the logs.
The second stage is normalizing the data. Volume only makes sense when compared with throughput. You’re undoubtedly going to see higher volume when more throughput is pumped into the system, so normalizing the error volume into a percentage is critical. The next thing we need is the ability to deduplicate the data, so we can easily see what makes up the bulk of the volume, and whether that’s benign or severe.
Gate: The normalized error rate of an application should never increase between releases
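To make this gate concrete, here is a minimal sketch of the check in Python. The counts and throughput figures are hypothetical inputs; in practice they would come from your logging or monitoring pipeline.

```python
# Gate #1 sketch: compare normalized error rates between two releases.

def error_rate(error_count: int, total_calls: int) -> float:
    """Normalize raw error volume against throughput (errors per call)."""
    if total_calls == 0:
        return 0.0
    return error_count / total_calls

def gate_error_volume(prev: tuple[int, int], curr: tuple[int, int]) -> bool:
    """Pass if the candidate release's normalized error rate did not increase.

    Each argument is a hypothetical (error_count, total_calls) pair.
    """
    return error_rate(*curr) <= error_rate(*prev)

# Previous release: 1,200 errors over 400,000 calls (0.3%).
# Candidate release: 2,000 errors over 1,000,000 calls (0.2%).
# More raw errors, but a *lower* rate -- so this gate passes.
print(gate_error_volume((1200, 400_000), (2000, 1_000_000)))  # True
```

Note that the raw error count went up here while the gate still passes, which is exactly why normalization against throughput matters.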
Gate #2 – Unique Error Count
This leads us directly to the second gate: the ability to classify which errors comprise the volume, and whether that count has gone up or down since our last release. At the core of this is the ability to transform an ocean of separate events into time series that we can chart and split into core components. Once we transform this mass of code and log events into a set of analytics, we can begin to see where these errors are coming from – which apps, containers, locations in the code, and under what conditions.
This gives us a picture of the quality of the code, as well as the performance cost of having those events in the code and emitting them, in addition to their impact on the reliability of the app. This becomes more important when looking at key applications or reusable components (i.e. tiers) in the code such as payment processing or the DAL, where more errors can be a very negative indicator.
Gate: The number of unique errors, especially in key applications or code tiers, should not increase between releases.
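A minimal sketch of this gate, assuming each raw event can be fingerprinted by its application, code location and error type (the event fields and app names here are illustrative, not a real schema):

```python
from collections import Counter

# Gate #2 sketch: collapse a stream of raw error events into unique errors
# keyed by (app, code location, error type), then compare counts per release.

def unique_errors(events: list[dict]) -> Counter:
    """Group raw events into buckets by a simple fingerprint."""
    return Counter((e["app"], e["location"], e["type"]) for e in events)

def gate_unique_count(prev_events, curr_events, key_apps=()) -> bool:
    """Pass if the number of unique errors did not grow, overall and per key app."""
    prev, curr = unique_errors(prev_events), unique_errors(curr_events)
    if len(curr) > len(prev):
        return False
    # Stricter per-app check for key tiers (e.g. payment processing, DAL).
    for app in key_apps:
        prev_n = sum(1 for k in prev if k[0] == app)
        curr_n = sum(1 for k in curr if k[0] == app)
        if curr_n > prev_n:
            return False
    return True
```

The same `Counter` buckets can be sampled per time window to build the time series the gate description calls for; the fingerprint function is the piece that deserves the most care in a real system.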
Gate #3 – New Errors
Drawing directly from that, once we’ve broken down all the errors in our environments into components, we want to be able to quickly separate, out of what could be hundreds or sometimes even thousands of locations in the code, the ones that are new and were just introduced by this release.
Even if we are the caretakers of an enterprise application, one with an honored 15-year legacy that brings with it a mass of existing errors, we shouldn’t be introducing new ones into the environment. Or at least not ones that we did not expect. When dealing with a massive release or a major infrastructure change, you might encounter a large number of these “micro fractures” in the form of a dozen new errors – that’s just reality. So what do you do?
For this, it’s important to be able to prioritize events, to make sure that when new errors are introduced into the environment, they are never severe ones. So what makes for a severe error, you ask? There are probably two key attributes.
The first: is the error of a critical type, or in a critical code location? A socket exception is much more benign and can be perfectly normal, vs. a NullPointer, DivideByZero or IndexOutOfBounds, which are direct results of code and data not working well together. There is little excuse for these, and their impact on a customer will likely lead to unexpected results. We don’t want those.
The second key trait of severe new errors is a high error rate. There is a difference between new errors that happen during a deployment for a short period of time, vs. something that has happened more than 500 times in more than 10% of the calls (again – everything must be normalized). We don’t want those either.
Gate: New errors of a critical type, or with a high error rate, should block a release.
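Putting the two severity attributes together, a sketch of this gate might look like the following. The type names mirror the examples above, and the 500-occurrence / 10%-of-calls thresholds come straight from the text; treat both as assumptions to tune, not prescriptions.

```python
# Gate #3 sketch: a new error blocks the release if it is of a critical
# type, or fires at a high normalized rate.

CRITICAL_TYPES = {
    "NullPointerException",
    "ArithmeticException",       # DivideByZero surfaces as this in the JVM
    "IndexOutOfBoundsException",
}
MIN_COUNT = 500        # more than 500 occurrences...
MAX_CALL_RATIO = 0.10  # ...in more than 10% of the calls

def is_severe(error_type: str, count: int, total_calls: int) -> bool:
    """Attribute 1: critical type. Attribute 2: high normalized rate."""
    if error_type in CRITICAL_TYPES:
        return True
    return count > MIN_COUNT and count / max(total_calls, 1) > MAX_CALL_RATIO

def gate_new_errors(new_errors: list[dict], total_calls: int) -> bool:
    """Pass only if no newly introduced error is severe."""
    return not any(
        is_severe(e["type"], e["count"], total_calls) for e in new_errors
    )
```

A short-lived burst of socket exceptions during a rolling deployment would pass this gate, while a single new NullPointerException would block it, which matches the distinction drawn above.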
Gate #4 – Regressions & slowdowns
This brings us to the last, and probably most sophisticated, gate. Here we use the data obtained from the first two gates to look not at new errors, but at the behavior of existing errors within the system. In this case, we’re looking for regressions in the form of errors that pre-existed but are happening at a higher rate, or slowdowns that deviate from previous performance.
This is more complex, as in this gate things become relative. For something to be considered regressed or slow, it needs to be compared against itself, so a baseline must be established before ML can be applied. The second ingredient needed is tolerance: what level of regression or slowdown are we willing to tolerate?
It’s tempting to say “nothing at all!”, but life is far from perfect and environments are noisy. Instead, we want to be able to say that an increase of more than 50% counts as a regression, and an increase of more than 100% counts as a severe regression / slowdown, in which case a release should not be promoted – at least not without inspecting the anomaly. This is the most complex gate, but also the one with the strongest predictive quality with respect to potential outages and severity-1 incidents.
Gate: Severe regressions and slowdowns should block a release.
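The baseline-plus-tolerance idea can be sketched as follows, using the 50% / 100% thresholds suggested above. The metric values are hypothetical; the same comparison works whether the number is an error rate or a call latency.

```python
# Gate #4 sketch: classify each pre-existing error's rate (or a call's
# latency) against its baseline from the previous release.

def regression_level(baseline: float, current: float) -> str:
    """Return 'ok', 'regression' (>50% increase) or 'severe' (>100%)."""
    if baseline <= 0:
        return "ok"  # no meaningful baseline to regress against
    increase = (current - baseline) / baseline
    if increase > 1.00:
        return "severe"
    if increase > 0.50:
        return "regression"
    return "ok"

def gate_regressions(baselines: dict, currents: dict) -> bool:
    """Block the release only on severe regressions/slowdowns."""
    return not any(
        regression_level(baselines[key], currents.get(key, 0.0)) == "severe"
        for key in baselines
    )

# A metric going from 10 to 16 (+60%) is a regression worth inspecting,
# but only 10 -> 25 (+150%) would block the release outright.
print(regression_level(10.0, 16.0), regression_level(10.0, 25.0))
```

In a real pipeline the baseline would be a per-error (or per-transaction) time series from the previous release rather than a single number, but the tolerance logic stays the same.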
With these four gates, we wanted to define a benchmark that is powerful and broadly applicable to complex environments, but not so complex that it remains an academic exercise. What do you think about these gates, and which ones do you use in your environment? We’d love to hear your thoughts in the comments section! 🙂