Jeremy Edberg and Ariel Tseitlin of Netflix shared how they survived the recent Amazon Web Services outage almost unscathed in their post, Post-mortem of October 22, 2012 AWS Degradation.
Everyone has the best intentions when building software. Good developers and architects think about error handling, corner cases, and building resilient systems. However, thinking about them isn’t enough. To ensure resiliency on an ongoing basis, you need to always test your system’s capabilities and its ability to handle rare events. That’s why we built the Simian Army: Chaos Monkey to test resilience to instance failure, Latency Monkey to test resilience to network and service degradation, and Chaos Gorilla to test resilience to zone outage. A future improvement we want to make is expanding the Chaos Gorilla to make zone evacuation a one-click operation, making the decision even easier. Once we build up our muscles further, we want to introduce Chaos Kong to test resilience to a complete regional outage.
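To make the Chaos Monkey idea concrete, here is a toy sketch of the principle: terminate one randomly chosen instance and check that enough healthy instances remain. This is not Netflix's implementation. The function names (`chaos_monkey`, `service_available`), the instance IDs, and the `min_healthy` threshold are all hypothetical, and the fleet is just an in-memory set rather than real AWS instances.

```python
import random


def chaos_monkey(instances, rng):
    """Pick one instance at random and return (survivors, victim).

    `instances` is a set of hypothetical instance IDs; nothing is
    actually terminated, the victim is only removed from the set.
    """
    victim = rng.choice(sorted(instances))  # sort for a reproducible choice
    return instances - {victim}, victim


def service_available(instances, min_healthy):
    """Assume the service is up while at least `min_healthy` instances remain."""
    return len(instances) >= min_healthy


# Seeded generator so the run is repeatable.
rng = random.Random(42)
fleet = {"i-a", "i-b", "i-c"}  # made-up instance IDs

survivors, victim = chaos_monkey(fleet, rng)
print(f"terminated {victim}; still available: {service_available(survivors, 2)}")
```

With three instances and a minimum of two, the service stays up after any single failure. Raise `min_healthy` to 3 and the same run reports an outage, which is the kind of gap this style of testing is meant to expose before a real failure does.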
It's great to see Netflix sharing how they are building a highly available service on AWS.