LitmusChaos case study

Imagine this: you have built an e-commerce website that has become very popular, and now it is time for a huge sale. The site gets far more traffic than usual, so resource usage climbs as well. You have autoscalers in place, but resources are consumed faster than they can be provisioned, and eventually your entire system crashes.

System resilience

The scenario above describes what we would call a non-resilient system: one where something is going to fail under heavy load. Nothing breaks under normal circumstances, but when your infrastructure resources are consumed at a higher-than-normal rate, the system can fail.

Now you might think, "This is an easy fix. I can just scale up my cluster and request more resources, so there is no reason for it to run out." And yes, that would keep the cluster from running out of resources. But it does not make the system resilient, and requesting more resources than you need leads to unnecessary cloud costs.

So what exactly is a resilient system? Let's understand it with an example. You have a Kubernetes cluster running with some application in it. You are using a LoadBalancer service to allow your users to access your dashboard.

Now let's say some unexpected event happens, and your LoadBalancer service gets deleted for whatever reason. How will people access your dashboard? Sure, Kubernetes can self-heal objects that a controller manages, but that can take a little time to kick in. Now imagine that, instead of a service, a ReplicaSet was deleted.

It is going to take a while to be recreated, and until then, your users cannot access your application dashboard.

TL;DR: Resilient systems are defined by how well they can function and recover when unexpected behavior occurs.

What is Chaos Engineering?

So we talked about what resilient systems are. Now, why does that matter? This case study focuses on Chaos Engineering, which is essentially what we discussed above. The only difference is that we cause the chaos ourselves, in a controlled manner.

Instead of some unexpected event deleting my LoadBalancer service, I will delete it manually and see how my system reacts to this element of "Chaos". Why do I want to do this?

Once I introduce a little bit of chaos in my systems, I can see exactly how the system will react. This will help me to improve it, and in turn, make it resilient and robust. So even if an unexpected event occurs, and one of my resources gets deleted, the users should still be able to use the application without facing any kind of disruptions.
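To make the idea concrete, here is a minimal sketch of a controlled experiment: check the steady state, inject a fault on purpose, then observe downtime and recovery. It runs against a toy in-memory model rather than a real cluster. `SimulatedService`, `run_experiment`, and the notion of "ticks" are hypothetical names invented for this illustration; they are not part of Kubernetes or Litmus.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedService:
    """Toy stand-in for a replicated service (not a real Kubernetes object)."""
    desired_replicas: int
    recovery_ticks: int  # assumed time a controller needs to replace one replica
    healthy_replicas: int = field(init=False)
    _pending: list = field(default_factory=list, init=False)

    def __post_init__(self):
        self.healthy_replicas = self.desired_replicas

    def kill_replicas(self, count):
        """Inject the fault: remove up to `count` healthy replicas."""
        killed = min(count, self.healthy_replicas)
        self.healthy_replicas -= killed
        # Each killed replica comes back after `recovery_ticks` ticks.
        self._pending.extend([self.recovery_ticks] * killed)

    def tick(self):
        """Advance time by one tick and bring back replicas that are ready."""
        self._pending = [t - 1 for t in self._pending]
        self.healthy_replicas += sum(1 for t in self._pending if t <= 0)
        self._pending = [t for t in self._pending if t > 0]

    def is_available(self):
        return self.healthy_replicas > 0


def run_experiment(service, kill_count, max_ticks=20):
    """Check steady state, inject chaos, then measure downtime and recovery."""
    if not service.is_available():
        raise RuntimeError("steady state not met; aborting experiment")
    service.kill_replicas(kill_count)
    downtime = 0
    for _ in range(max_ticks):
        if not service.is_available():
            downtime += 1
        if service.healthy_replicas == service.desired_replicas:
            break
        service.tick()
    return {
        "downtime_ticks": downtime,
        "recovered": service.healthy_replicas == service.desired_replicas,
    }


single = run_experiment(SimulatedService(desired_replicas=1, recovery_ticks=3), kill_count=1)
replicated = run_experiment(SimulatedService(desired_replicas=3, recovery_ticks=3), kill_count=1)
print(single, replicated)
```

In this toy model the single-replica service is unavailable while its only replica is being replaced, whereas the three-replica service keeps serving throughout. A real chaos experiment aims to surface exactly this kind of difference before an unplanned failure does.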

Challenges

We talked about Chaos Engineering, a relatively new discipline that aims to solve the problem of building resilient systems. As you might already know, when you introduce a method to solve one set of challenges, another set of challenges arises. Chaos is no different.

In cloud-native systems, you have a single application, broken down into small chunks, which are then all connected using services or service meshes. Can you spot the problem here?

There are many moving components here, and each one handles a critical part of your application. If one of these components fails, your application is bound to face some downtime. And when you change your code to add a new feature or an enhancement, there is a chance of an outage even if all test cases have passed.

This outage in production can be devastating to the organization as a whole. And predicting such scenarios is not easy.

Another challenge with Chaos is that many chaos tools, commercial as well as open source, lack community collaboration and do not fully follow the principles of cloud-native technology.

LitmusChaos

Now let's come to our tool, LitmusChaos. It is an open-source chaos engineering platform that enables teams to identify weaknesses and potential outages in infrastructures by inducing chaos experiments in a controlled manner.

LitmusChaos is driven by the principles of Cloud-Native innovation and has given rise to the principles of Cloud-Native Chaos Engineering. Chaos engineering verifies the resilience of business services and helps DevOps pipelines proactively build code that is more resilient against software and infrastructure faults.

The project was started in late 2017 to provide simple chaos jobs in Kubernetes and has recently become a CNCF incubating project.

The project is used in production by more than 30 organizations, including large end users like Adidas, FIS, iFood, Cyren, Intuit, Lenskart, and Orange, as well as technology organizations like Red Hat and VMware.

Litmus over other tools

Litmus is a robust and powerful chaos engineering tool that offers functionality end users need today, such as multi-cloud support, GitOps with Chaos Engineering, a growing set of chaos experiments, chaos engineering for non-Kubernetes scenarios, and much more.

It comes with a user-friendly UI and lets you create more use cases and test chaos with various sample applications. It also enhances monitoring and observability, helps you conduct more gamedays and schedule chaos, lets teams rather than just individuals run chaos experiments, and strengthens security provisions. Most importantly, the project maintains a regular release cadence to support the community constantly.

Litmus also has a repository called the ChaosHub, an open-source marketplace hosting all the different chaos experiments offered by Litmus. The experiments are declarative and tunable as per your requirements. Use the hub interface to tune them, deploy them, and take that step towards resilience.
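As a rough illustration of what "declarative and tunable" can look like, the sketch below builds a `ChaosEngine`-style manifest for a pod-delete experiment as a plain Python dict and prints it as JSON. The field layout and the `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` tunables reflect my understanding of the Litmus `ChaosEngine` resource and should be checked against the documentation for the Litmus version you run. The helper `pod_delete_engine`, the engine name `demo-chaos`, and the service account `pod-delete-sa` are hypothetical.

```python
import json


def pod_delete_engine(app_namespace, app_label, duration_s=30, interval_s=10):
    """Build a ChaosEngine-style manifest as a dict (layout is an assumption)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "demo-chaos", "namespace": app_namespace},  # hypothetical name
        "spec": {
            "engineState": "active",
            # Which application the experiment targets.
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",  # hypothetical service account
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            # Tunables are passed as string-valued env entries.
                            "env": [
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


manifest = pod_delete_engine("default", "app=nginx", duration_s=60)
print(json.dumps(manifest, indent=2))
```

Tuning an experiment then means changing values in the manifest, such as the duration or the target label, rather than rewriting the experiment logic.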

Litmus also allows you to include chaos tests directly into your GitOps pipelines and lets you automate the chaos experiments. And automation means that you have less work to do.
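One way an automated pipeline might act on chaos experiments is to gate a release on their verdicts. The sketch below is a generic illustration, not the Litmus GitOps integration itself: the record shape (`experiment`, `verdict`), the verdict strings, and the `chaos_gate` function are assumptions made for this example.

```python
def chaos_gate(results, accepted=("Pass",)):
    """Return (ok, failed_names) for a list of simplified experiment results.

    Each result is a dict with hypothetical keys "experiment" and "verdict".
    Any verdict outside `accepted` blocks the pipeline.
    """
    failed = [r["experiment"] for r in results if r["verdict"] not in accepted]
    return (len(failed) == 0, failed)


results = [
    {"experiment": "pod-delete", "verdict": "Pass"},
    {"experiment": "node-drain", "verdict": "Fail"},
]
ok, failed = chaos_gate(results)
if not ok:
    print("Blocking release; failed experiments:", ", ".join(failed))
```

In a real pipeline, the results would come from whatever your chaos tooling reports, and a failing gate would typically stop the deployment step.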

Success stories

LitmusChaos has helped multiple large organizations create robust and resilient systems. You can read some of these stories below.

Who should be using Litmus

While Litmus is mainly used by DevOps, SRE, and QA engineers, recent industry trends suggest that everyone involved in the development process should use LitmusChaos to create chaos experiments and test their applications before pushing them to production.

Get involved

You can choose from a list of sub-dependent repositories to contribute to. Below are a few of them: