One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
On the face of it, it seems insane. Why would you intentionally kill parts of your site? Yet in practice, being able to handle failure consistently means you are never ever surprised.
Edit: My friend Vinny Magno writes:
Automobile assembly lines started doing the same thing (I think Ford was the first, though I might be wrong).A single line can produce multiple models back to back. So for example, a Focus may be followed by an Explorer and then an Escape. At one point, the chassis and body are aligned and welded together. Obviously welding a Focus body to an Explorer chassis would be a problem, but the automation is so good that the defect rate was incredibly low. Since such a defect happened so infrequently, operators became complacent and failed to catch it the times that it did occur. To solve the problem, Ford purposefully upped the error rate to keep the operators sharp.
Turns out there's precedence!