Testing in Production

Robert Ruzitschka
3 min read · Jun 8, 2022


Testing in Production is an interesting topic, I have to say. Understanding what it means is highly valuable. Let’s dig in!

We are all more or less familiar with the concepts of test automation. This is also something we here at Agile Engineering Support have been pushing a lot. Having good automated test coverage of your system is the precondition for Continuous Delivery. It enables us to deliver frequently to production while keeping our system’s quality high.

There is no doubt about that and nothing has changed here: a high degree of test automation is absolutely necessary. But as engineers we know that our work is all about tradeoffs, and this is also true for automated testing. There is a cost associated with very high levels of test automation: we need to spend implementation as well as maintenance effort, and, as always, we are bound by the law of diminishing returns. Once we have reached a certain level of automation, the effort required for further improvement grows disproportionately.

But there is another aspect that we need to consider: our test systems won’t be able to replicate our production environment completely. Even if we keep the software and infrastructure stacks of our test and production systems in sync (which is an absolute necessity and another core pillar of Continuous Delivery), production will always be different:

  • different system load
  • different user data
  • different user behavior patterns
  • timing issues (latency, response times)
  • and many more

Sure, we can try to model all of these aspects, but our models will always be incomplete. That is in the nature of things. If we consider distributed systems based on a microservice architecture, or other complex architectures with a lot of asynchronous communication, the problem gets even more obvious.

Now what can we do with this piece of insight?

Should we spend even more effort on measuring the behavior of our production systems and then include the results in our test automation suites?

That is certainly an option and we can do this, but again we need to accept that we will *never* be able to cover all potential cases.
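
To make this a bit more concrete, here is a minimal sketch of what feeding production observations back into automated tests could look like: a sample of recorded (and anonymized) production requests is replayed against the test environment. The log file name, the test base URL and the expected-status field are purely illustrative assumptions, not a prescription.

```python
# Minimal sketch: replay a sample of recorded, anonymized production requests
# against the test environment and check that the responses stay healthy.
# File name, base URL and the expected-status field are illustrative assumptions.
import json
import urllib.request

TEST_BASE_URL = "https://test.example.com"  # assumed test environment


def replay_recorded_requests(log_file: str = "production_requests.json") -> None:
    with open(log_file) as f:
        # e.g. [{"path": "/api/orders", "expected_status": 200}, ...]
        recorded = json.load(f)

    for entry in recorded:
        with urllib.request.urlopen(TEST_BASE_URL + entry["path"]) as response:
            assert response.status == entry["expected_status"], (
                f"{entry['path']} returned {response.status}, "
                f"expected {entry['expected_status']}"
            )


if __name__ == "__main__":
    replay_recorded_requests()
```
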

The other option is to accept reality and think about how we can use our production systems to understand the impact of our changes. Now we enter the realm of *Testing in Production*!

I really want to make it clear that this does not mean we confront our customers with low-quality software because we don’t want to spend the effort on proper testing. Not at all!

On the contrary, it is about accepting that our tests always reflect only a limited view of reality, and about devising a strategy for how we can assure *optimum* quality.

There are several patterns that we can apply if we want to test new changes in production while keeping (the majority of) our customers unaffected. I’ll note just a few here:

  • Blue/Green Deployment and Canary Releases: We deploy the new version alongside the old one and expose only a small subset of our customers to it first. Once we have sufficient confidence that things work as intended, we switch all customers over.
  • Feature Flagging: We can toggle between the changed code and the old code at runtime. If things don’t work out as intended, we can easily fall back to the old implementation (see the sketch right after this list).
  • Parallel implementation: We don’t expose the new functionality to customers but run it in parallel with the old implementation and feed it with production data, so we can observe the impact and behavior of the new implementation (a sketch of this pattern follows a bit further below).
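
To make the feature-flagging pattern more tangible, here is a minimal sketch in Python. The flag source (an environment variable) and the pricing functions are hypothetical examples chosen for illustration; real setups typically read flags from a feature-flag service or configuration system so they can be flipped without redeploying.

```python
# Minimal feature-flag sketch: route between the old and the new implementation
# at runtime. The environment variable and the pricing functions are hypothetical;
# a real setup would typically use a feature-flag service or configuration system.
import os


def _new_price_calculation(order: dict) -> float:
    # new implementation that we want to validate in production
    return sum(item["price"] * item["quantity"] for item in order["items"])


def _old_price_calculation(order: dict) -> float:
    # existing, trusted implementation
    total = 0.0
    for item in order["items"]:
        total += item["price"] * item["quantity"]
    return total


def calculate_price(order: dict) -> float:
    # Falling back is as simple as flipping the flag, no redeployment needed.
    if os.getenv("ENABLE_NEW_PRICING", "false").lower() == "true":
        return _new_price_calculation(order)
    return _old_price_calculation(order)
```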

All of this allows us to observe the behavior of our changes in *production* while making sure that customers are not significantly impacted.
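
The parallel-implementation pattern can be sketched in a similarly minimal way, reusing the hypothetical pricing functions from the feature-flag example above: the customer only ever sees the result of the old implementation, while the new one runs on the same production input and any divergence is logged for later analysis.

```python
# Minimal sketch of the parallel ("shadow") pattern, reusing the hypothetical
# pricing functions from the feature-flag sketch above. Customers only ever get
# the old result; the new implementation runs on the same input for comparison.
import logging

logger = logging.getLogger("shadow_comparison")


def handle_pricing_request(order: dict) -> float:
    result = _old_price_calculation(order)  # this is what the customer gets

    try:
        shadow_result = _new_price_calculation(order)
        if abs(shadow_result - result) > 0.01:
            logger.warning("Shadow mismatch for order %s: old=%s new=%s",
                           order.get("id"), result, shadow_result)
    except Exception:
        # Failures in the shadow path must never affect the customer-facing response.
        logger.exception("Shadow implementation failed")

    return result
```
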

A precondition for all of this is having observability tooling in place that gives us good insight into the behavior of our production systems. This is key!
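
As a minimal illustration of what such instrumentation could look like, the sketch below counts requests and measures latency per implementation variant, assuming the Prometheus Python client is used to expose metrics. The metric names, labels and the helper function are assumptions made for this example.

```python
# Minimal observability sketch: count requests and measure latency per
# implementation variant so rollout or rollback decisions can be based on
# real production data. Assumes the Prometheus Python client; metric names,
# labels and the helper function are illustrative.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("pricing_requests_total",
                   "Pricing requests by implementation variant and outcome",
                   ["variant", "outcome"])
LATENCY = Histogram("pricing_latency_seconds",
                    "Pricing latency by implementation variant",
                    ["variant"])


def observed_call(variant: str, func, *args, **kwargs):
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        REQUESTS.labels(variant=variant, outcome="success").inc()
        return result
    except Exception:
        REQUESTS.labels(variant=variant, outcome="error").inc()
        raise
    finally:
        LATENCY.labels(variant=variant).observe(time.perf_counter() - start)


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the monitoring system to scrape
```
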

So let me summarize: “Testing in Production” means that we accept that our test systems can’t model production behavior completely. Based on a solid foundation of automated test coverage, we spend effort on instrumenting our production systems so that we can verify the “real life” behavior of our changes with only minimal impact on the majority of our customers.

Robert Ruzitschka

Physicist working in Software Engineering for many years. DevOps Community Lead/Engineering Coach. Austrian based in Vienna.