What is chaos engineering?
Let's first define what it is NOT.
It is NOT engineering chaos.
It does not mean manufacturing chaos: scripting failures to test in your environments so that you can prepare measures to keep your service available. Doing so leads to complacency. The problem with this approach is that you end up with a predefined list of 'expected chaos' and 'expected measures' to fix each problem. Yet the real sources of chaos in a system are hard to pinpoint. You can't predict where and when chaos will happen. Having a list of recovery measures might help, but not fully.
Production systems are never clean, serene, calm, or smooth.
Just like in nature, chaos originates in many forms. It could come from your users, cloud providers, administrators (lol), developers, hackers, strategies, processes, applications, and even YOU. Also, as your application is exposed to more of these sources, your risk surface grows. The more dependencies and interconnected parts there are, the more risk can be introduced into the system, though that is not always the case.
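To put a rough number on "more dependencies, more risk", here is a minimal sketch. It assumes every dependency must be up for a request to succeed and that failures are independent, which real systems often violate. The function name and the 99.9% figure are made up for illustration.

```python
from math import prod

def serial_availability(availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumption: failures are independent of each other.
    """
    return prod(availabilities)

# Illustrative numbers only: each dependency is up 99.9% of the time.
for n in (1, 5, 20):
    print(n, "dependencies:", round(serial_availability([0.999] * n), 4))
```

Under these assumptions, chaining 20 such dependencies pulls the whole path down to roughly 98%, even though each part looks fine on its own.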
Making a system distributed often means breaking it down into microservices, yeah? As mentioned, increasing interconnectedness with other systems can lead to unforeseen issues. Yet the same distributed design can also help you maintain your system's uptime and availability. Sounds like a paradox, doesn't it? It is not.
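Why is it not a paradox? One hedged way to see it: extra parts placed side by side as replicas can raise availability instead of lowering it. The sketch below assumes independent failures and a failover that always works, which is optimistic. The function name and the numbers are hypothetical.

```python
def redundant_availability(availability, replicas):
    """Availability of a service that is up if at least one replica is up.

    Assumptions: replicas fail independently and failover itself never fails.
    """
    return 1 - (1 - availability) ** replicas

# Illustrative numbers only: each replica is up 99% of the time.
for replicas in (1, 2, 3):
    print(replicas, "replicas:", redundant_availability(0.99, replicas))
```

So interconnected parts cut both ways: parts that all have to work multiply your risk, while parts that can stand in for each other reduce it.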
The world is chaotic, and so are systems. The best practice for complex and chaotic systems is chaos engineering.
Now, let's define chaos engineering for what it IS.
It is engineering OUT of chaos. One way of engineering out of chaos is to reduce the number of moving parts that add entropy to a system.