Accepted metrics for measuring the severity of security incidents, such as mean time to repair (MTTR), may not be as reliable as previously thought and may not be giving IT security teams the information they need, according to Verica's latest "Open Incident Database Report" (VOID).
The report is based on 10,000 incidents from nearly 600 companies ranging from Fortune 100s to startups. The amount of data gathered enables a deeper level of statistical analysis to determine patterns and debunk previous industry assumptions that lacked statistical evidence, Verica said.
“Enterprises are running some of the most sophisticated infrastructure in the world, supporting many parts of our daily lives, without most of us even thinking about it — until something isn’t working,” says Nora Jones, CEO and co-founder of Jeli, one of the report’s sponsors. “Their businesses heavily rely on site reliability, and yet incidents are not going away as technology gets more and more complex.”
Enterprises need to be making data-driven decisions on how they approach organizational resilience, she adds. “Most organizations are running incident management decisions based on long-standing assumptions,” Jones points out.
Share Information to Understand Incidents
Courtney Nash, lead research analyst at Verica and creator of the VOID report, explains that, much as airlines set aside competitive concerns from the late 1990s onward in order to share information, enterprises have an immense body of commoditized knowledge they could use to learn from one another and push the industry forward, while making what gets built safer for everyone.
“Collecting these reports matters because software has long moved on from hosting pictures of cats online to running transportation, infrastructure, power grids, healthcare software and devices, voting systems, autonomous vehicles, and many critical [often safety-critical] societal functions,” Nash says.
David Severski, senior security data scientist at the Cyentia Institute, points out that enterprises can only see their own incidents, which limits the ability to see and avoid broader trends affecting other organizations.
“Incident databases and reports like [VOID] help them escape tunnel vision and hopefully act before they experience problems themselves,” he says.
Duration and Severity Are ‘Shallow’ Data
How organizations experience incidents varies, as does how long it takes to resolve those incidents, regardless of severity. Even which scenarios get recognized as an "incident," and at what level, varies among colleagues within an organization and is not consistent across organizations, the report cautions.
Nash explains that duration and severity are "shallow" data: They are appealing because they appear to make clear, concrete sense of messy, surprising situations that don't lend themselves to simple summaries. In practice, however, measuring duration isn't very useful.
“The duration of an incident yields little internally actionable information about the incident, and severity is often negotiated in different ways, even on the same team,” Nash says.
Severity may be used as a proxy for customer impact or, in other cases, engineering effort required to fix or urgency. “It is subjectively assigned, for varying reasons, including to draw attention to or get assistance for an incident, to trigger — or avoid triggering — a post-incident review, or to garner management approval for desired funding, headcount, and so on,” Nash says.
There’s no correlation between the duration and severity of incidents, according to the report. Companies can have long or short incidents that are very minor, existentially critical, and nearly every combination in between.
“Not only can duration or severity not tell a team how reliable or effective they are, but they also don’t convey anything useful about the event’s impact or the effort required to deal with the incident,” Nash says.
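The report's no-correlation finding can be illustrated with a minimal sketch. This is not VOID's methodology or data; the incidents below are fabricated, with severity assigned independently of a skewed duration distribution, and a rank correlation (Spearman's rho) is computed by hand to show it lands near zero:

```python
import random
from statistics import mean

# Illustrative only (not VOID data): hypothetical incidents where severity
# (1 = critical .. 4 = minor) is assigned independently of duration.
random.seed(42)
incidents = [
    {"severity": random.randint(1, 4),
     "duration_min": random.lognormvariate(3.5, 1.2)}  # heavily skewed durations
    for _ in range(1000)
]

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # group tied values and give them their average rank
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

rho = spearman([i["severity"] for i in incidents],
               [i["duration_min"] for i in incidents])
print(f"Spearman rho: {rho:.3f}")  # near zero: severity says little about duration
```

When severity labels are negotiated for organizational reasons, as Nash describes, this independence is roughly what real incident data ends up looking like: knowing an incident's severity tells you almost nothing about how long it will run.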
Analyze Past Incidents
“While MTTR isn’t useful as a metric, no one wants their incidents to go on any longer than they must,” she says. “To respond better, companies must first study how they’ve responded in the past with more in-depth analysis, which will teach them about a host of previously unforeseen factors, both technical and organizational.”
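One reason a mean-based metric like MTTR misleads, given how skewed incident durations tend to be, can be shown with a toy example (hand-picked numbers, not drawn from the report): a single long incident drags the mean far from the typical case.

```python
from statistics import mean, median

# Hypothetical durations (minutes) for one team's incidents in a quarter.
# One day-long outlier is enough to distort the mean.
durations = [12, 18, 25, 30, 35, 40, 55, 70, 90, 1440]

mttr = mean(durations)      # 181.5 -- pulled up by the single outlier
typical = median(durations)  # 37.5 -- closer to what most incidents looked like

print(f"MTTR (mean): {mttr:.1f} min")
print(f"Median:      {typical:.1f} min")
# A quarter-over-quarter swing in MTTR may reflect one unusual incident,
# not any change in how the team actually responds.
```

With heavy-tailed data like this, trend lines built on the mean mostly track outliers, which is part of why deeper qualitative analysis of individual incidents yields more than the aggregate number.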
Jones adds that the culture of an organization will also play a role in how teams tag incidents and to what degree.
“This all goes back to the people of an organization — the people building the infrastructure, maintaining the infrastructure, resolving incidents, and then reviewing them,” she says. “This is all done by people.”
From her perspective, no matter how automated our technology gets, people are still the most adaptable part of the system and the reason for continued success.
“This is why you must recognize these socio-technical systems as just that, and then approach your incident analysis with the same understanding,” Jones says.
Severski says the security industry is full of opinions on what should be done to improve things, noting that Cyentia continues to analyze large datasets in its Information Risk Insights Study (IRIS) research.
“Basing our recommendations on actual failures and lessons learned from this is a far more effective approach,” he says. “We place a high value on studying real-world incidents.”