Security through chaos engineering: No more 'set it and forget it'
Organizations that don’t stress-test the security of their infrastructure “are slowly drifting into the unknown,” one security architect says.
Some network administrators no doubt yearn for the good old days when network security meant simply putting in a solid firewall and setting up access controls. Thanks to the cloud and the proliferation of mobile devices that take users beyond an agency’s network perimeter, however, those days are gone.
Increasingly, agencies are using endpoint detection and response tools to constantly monitor network activity. These EDR tools presume there are already intruders in the network; their goal is to hunt them down.
“A thousand events may happen every single minute,” said Brian Hussey, vice president for cyber threat protection and response at Trustwave, a cybersecurity company that offers EDR services. “You're going to be able to monitor and capture every single one of those, bring them into our data center, and correlate them -- not across just one computer but across entire networks.”
However, some cybersecurity experts argue that identifying and remediating vulnerabilities isn’t enough -- solutions must be baked in and vigorously tested.
Aaron Rinehart, chief enterprise security architect for UnitedHealth Group, argues in a January OpenSource.com article that what is needed is “security chaos engineering” -- building security instrumentation into network infrastructure during its design and then testing that infrastructure through novel experiments that uncover vulnerabilities before damage is done. Chaos engineering helps IT managers build confidence in a distributed system's ability to withstand extreme and unexpected conditions.
“How often do we proactively instrument what we designed, built, and are operationally managing to determine if the controls are failing?” Rinehart wrote. “Most organizations do not discover that their security controls were failing until a security incident results from that failure. The worst time to find out your security investment failed is during a security incident at 3 a.m.”
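To make the idea of proactively checking a control concrete, here is a minimal Python sketch. It is an illustration only, not Rinehart's tooling: the function names, the port-based policy and the "control" itself are all hypothetical. The experiment deliberately introduces a policy violation into a copy of the system state and reports whether the control notices it.

```python
from typing import Callable, Set

# Hypothetical control signature: (open ports, allowed ports) -> flagged ports.
Control = Callable[[Set[int], Set[int]], Set[int]]


def port_scan_control(open_ports: Set[int], allowed_ports: Set[int]) -> Set[int]:
    """Hypothetical detective control: flag ports that are open but not allowed."""
    return open_ports - allowed_ports


def run_control_experiment(
    open_ports: Set[int],
    allowed_ports: Set[int],
    injected_port: int,
    control: Control = port_scan_control,
) -> bool:
    """Inject a disallowed open port and return True if the control flags it."""
    if injected_port in allowed_ports:
        raise ValueError("injected port must violate the policy")
    # Work on a copy so the experiment never mutates the caller's state.
    tampered = set(open_ports) | {injected_port}
    return injected_port in control(tampered, allowed_ports)
```

A result of False would mean the control failed silently, which is the kind of failure Rinehart says most organizations find only during an incident.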
In addition to building security instrumentation into hardware, Rinehart told GCN that organizations should conduct exercises to stress-test systems.
“Humans make mistakes,” he said. “Let’s not just assume that we are all geniuses and we build things perfectly and we don't ever make mistakes.”
As an example of a chaos engineering exercise, Rinehart suggested overloading a network’s load balancer to determine what happens when it fails. An actual operational exercise, he said, generally lasts about three to four hours, with two hours for conducting the exercise and the rest of the time reserved for a post-mortem to analyze the results.
Rinehart recommended first running the exercise in a test environment rather than on an organization’s production environment. “Then you slowly increase the scope of the exercise from a test environment to a production environment as you get more mature,” he said.
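The load-balancer exercise described above can be sketched as a toy simulation in Python. This is a hedged illustration, not a real tool: `ToyLoadBalancer` and `run_overload_experiment` are hypothetical names, and the model assumes the balancer simply rejects any requests beyond its total backend capacity. A real exercise would target actual infrastructure, starting in a test environment as Rinehart recommends.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLoadBalancer:
    """Hypothetical stand-in for a real load balancer; not any vendor's API."""

    backend_capacity: int  # requests each backend can serve per tick
    backends: int          # number of backends behind the balancer
    served: int = 0
    rejected: int = 0

    def handle(self, requests: int) -> None:
        """Serve up to total capacity for one tick and reject the overflow."""
        capacity = self.backend_capacity * self.backends
        ok = min(requests, capacity)
        self.served += ok
        self.rejected += requests - ok


def run_overload_experiment(
    lb: ToyLoadBalancer, start: int, step: int, ticks: int
) -> Optional[int]:
    """Ramp load each tick; return the first load level that caused rejections."""
    load = start
    for _ in range(ticks):
        before = lb.rejected
        lb.handle(load)
        if lb.rejected > before:
            return load  # the point at which the balancer began shedding load
        load += step
    return None  # capacity was never exceeded during the experiment
```

With three backends of capacity 100 and a ramp that starts at 100 and rises by 100 per tick, the function returns 400, the first load level above the combined capacity of 300. The served and rejected counters are the sort of data a post-mortem would then review.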
Rinehart acknowledged there is resistance in some organizations to adopting security chaos engineering because of the time and costs of conducting exercises. “It’s going to be a while before it becomes mainstream,” he said. Ultimately, though, Rinehart believes the costs of not securing organizations’ computer infrastructure will push its adoption. So far, he said, the practice hasn’t been implemented widely beyond cutting-edge Silicon Valley technology companies such as Google.
Organizations that don’t test the security of their infrastructure, he said, “are slowly drifting into the unknown. That's what chaos engineering is attacking.”
Rinehart also said the name itself may be getting in the way of its adoption. “The name ‘chaos engineering’ is too provocative,” he said. “People get the wrong idea. It’s not about chaos. It's about not making assumptions that what you built is always operating [the way] you thought it was.”