Netflix Upgrades its Powerful "Chaos Monkey" Open Cloud Utility
Few organizations have the cloud expertise that Netflix has, and it may come as a surprise to some people to learn that Netflix regularly open sources key, tested and hardened cloud tools that it has used for years. We've reported on Netflix open sourcing a series of interesting "Monkey" cloud tools as part of its "simian army," which it has deployed as a series of satellite utilities orbiting its central cloud platform.
Netflix previously released Chaos Monkey, a utility that improves the resiliency of Software as a Service by randomly choosing to turn off servers and containers at optimized times. Now, Netflix has announced the upgrade of Chaos Monkey, and it's worth checking in on this tool.
In its announcement, Netflix describes the tool's origins and the new release in its own words:
Years ago, we decided to improve the resiliency of our microservice architecture. At our scale it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning. If we don’t have proper redundancy and automation, these disappearing servers could cause service problems.
The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours. Some people thought this was crazy, but we couldn’t depend on the infrequent occurrence to impact behavior. Knowing that this would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation to survive this type of incident without any impact to the millions of Netflix members around the world.
We value Chaos Monkey as a highly effective tool for improving the quality of our service. Now Chaos Monkey has evolved. We rewrote the service for improved maintainability and added some great new features. The evolution of Chaos Monkey is part of our commitment to keep our open source software up to date with our current environment and needs.
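As a rough illustration of the selection rule described in the announcement (pick a server at random, but only during business hours), a short Python sketch may help. It is not taken from Chaos Monkey's source. The 9-to-5 weekday window, the function names `in_business_hours` and `pick_victim`, and the instance IDs are all assumptions made for the example.

```python
import random
from datetime import datetime, time
from typing import List, Optional

# Hypothetical business-hours window; the article does not state the real one.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """Return True on Monday-Friday between BUSINESS_START and BUSINESS_END."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(
    instances: List[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Return one randomly chosen instance ID.

    Returns None outside business hours or when there are no instances,
    so that terminations only happen while engineers are around to react.
    """
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # made-up instance IDs
    print(pick_victim(fleet, datetime.now(), random.Random()))
```

Restricting the random choice to working hours reflects the reasoning in the announcement: failures are brought forward to a time when engineers can observe them and respond, rather than left to surface unexpectedly.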
Note that Chaos Monkey 2.0 is fully integrated with Spinnaker, Netflix's continuous delivery platform. Service owners set their Chaos Monkey configs through the Spinnaker apps, Chaos Monkey gets information about how services are deployed from Spinnaker, and Chaos Monkey terminates instances through Spinnaker.
Since Spinnaker works with multiple cloud backends, Chaos Monkey does as well. In the Netflix environment, Chaos Monkey terminates virtual machine instances running on AWS and Docker containers running on Titus, the company's container cloud.
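The division of labor described above, in which a delivery platform supplies the configuration, the deployment information, and the termination mechanism, can be sketched in Python as follows. This is only an illustration under stated assumptions: `FakePlatform`, `Instance`, `run_once`, and all of their methods are hypothetical names invented here, and none of them correspond to the real Spinnaker or Chaos Monkey APIs.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Instance:
    instance_id: str
    backend: str  # for example "aws" (virtual machine) or "titus" (container)


class FakePlatform:
    """In-memory stand-in for the role Spinnaker plays in the article.

    Hypothetical: it only mimics the three responsibilities mentioned
    (per-app config, deployment info, termination) without any cloud access.
    """

    def __init__(self, apps: Dict[str, List[Instance]], enabled: Set[str]):
        self.apps = apps
        self.enabled = enabled
        self.terminated: List[Instance] = []

    def chaos_enabled(self, app: str) -> bool:
        # Service owners opt in per app.
        return app in self.enabled

    def list_instances(self, app: str) -> List[Instance]:
        # Return a copy so callers cannot mutate the platform's state.
        return list(self.apps.get(app, []))

    def terminate(self, instance: Instance) -> None:
        # The platform hides which backend the instance runs on.
        for members in self.apps.values():
            if instance in members:
                members.remove(instance)
        self.terminated.append(instance)


def run_once(
    platform: FakePlatform, app: str, rng: random.Random
) -> Optional[Instance]:
    """Terminate one random instance of `app` if its owner has opted in."""
    if not platform.chaos_enabled(app):
        return None
    instances = platform.list_instances(app)
    if not instances:
        return None
    victim = rng.choice(instances)
    platform.terminate(victim)
    return victim


if __name__ == "__main__":
    platform = FakePlatform(
        {"api": [Instance("i-1", "aws"), Instance("t-1", "titus")]},
        enabled={"api"},
    )
    print(run_once(platform, "api", random.Random()))
```

Because `run_once` talks only to the platform object, it never needs to know whether the victim is a virtual machine or a container, which mirrors the article's point that multi-backend support comes from going through the delivery platform.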
You can peruse Netflix's overall open source software resource center on GitHub. The company is steadily releasing proven tools that can be quite useful for administrators. Netflix has also said that it has more tools to be open sourced soon.