Last time, we talked a bit about the "New School" mode of thinking and dug into some of the publicly available numbers on data breaches. I mentioned two sources of security data and in this next post, I'll dig into the second source: DLP Risk Assessment data.
First, an open admission: for obvious reasons of confidentiality, we simply cannot publish this data in detail; however, a summary of what we see will clearly make my point. With that said, let's dive into why we have a very different perspective on the nature and causes of data breaches.
We've conducted several hundred live, on-site engagements using highly accurate content-aware DLP systems to identify actual data exposure events. This body of data should interest practitioners because we miss very few exposure events. Publicly reported breach stats appear to have a serious false-negative problem, and looking more deeply at what really happens on the wire, rather than what people report, is very enlightening.
Our systems watch for data exposure events across a variety of threat vectors (data-in-motion, data-at-rest, and data on endpoint systems). Here's a basic summary of our findings:
* For data-in-motion exposure events, we see that roughly 1 in 400 messages sent to the outside world (email, webmail, FTP, IM, etc.) contains highly confidential data.
* For data-at-rest exposure events, we see that approximately 1 in 50 files stored on public file shares, open wikis, SharePoint sites, databases, etc. scattered across the enterprise LAN contains highly confidential data, often with essentially wide-open access control (i.e., global read/write for EVERYONE).
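To put those two rates in perspective, here's a quick back-of-the-envelope sketch. Only the 1-in-400 and 1-in-50 rates come from our assessments; the enterprise volumes in the example are hypothetical, chosen purely for illustration:

```python
# Rough estimate of exposure events implied by the observed rates.
# The rates come from our risk assessments; the volumes below are
# hypothetical placeholders for a mid-size enterprise.

MSG_EXPOSURE_RATE = 1 / 400   # data-in-motion: confidential outbound messages
FILE_EXPOSURE_RATE = 1 / 50   # data-at-rest: confidential files on open shares

def estimated_exposures(outbound_messages: int, scanned_files: int) -> dict:
    """Return the expected counts of confidential-data exposure events."""
    return {
        "in_motion": round(outbound_messages * MSG_EXPOSURE_RATE),
        "at_rest": round(scanned_files * FILE_EXPOSURE_RATE),
    }

# Hypothetical volumes: 2M outbound messages per month, 10M files
# sitting on shared storage.
print(estimated_exposures(2_000_000, 10_000_000))
# {'in_motion': 5000, 'at_rest': 200000}
```

Even with conservative volume assumptions, the implied event counts dwarf the handful of incidents a typical enterprise publicly reports in a year.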
Granted, the DataLossDB and the Symantec DLP stats are not directly comparable data sets, but what our customers conclude from our data and their DLP risk assessments tells a very different story than the publicly reported breaches. The conclusions most customers draw from a risk assessment are:
* Data exposure events are significantly higher than expected
* Well-meaning insider risk is a top-rated risk requiring urgent attention
* Large amounts of malicious activity go undetected until DLP is deployed
* The spread of confidential data across unauthorized storage locations is huge
There are some pretty big differences between the conclusions customers draw from our data and what the DataLossDB data indicate. No one will argue that stolen laptops aren't a problem, but the conclusions customers draw don't really reinforce the implicit conclusions you'd take away from the DataLossDB.
What could possibly explain the difference?
Implicit Sample Bias
Publicly reported breaches, by their very nature, are subject to huge amounts of sample bias. This bias comes in several forms, but a key source is that DLP-style detection of data exposure is still relatively new. Highly accurate, scalable systems that can detect and block exposure simply weren't feasible until recent technical developments, many of them pioneered at Vontu (now part of Symantec). It's no wonder most practitioners (let alone most enterprises) have no idea about the magnitude or true nature of their data loss problem. The tools haven't been available until now.
A separate form of sample bias surrounds laptop-theft breach events. Unlike breaches via email or data-at-rest exposure behind the firewall, stolen laptops are almost always detected. Why? When Joe Salesman comes back to work without his laptop because it was ripped off from his car, it's blindingly obvious! When Marge from the M&A team loses her computer, she can't work until she gets a new one, immediately prompting the awkward question: "So Marge… uhmm, where's your laptop?" The natural follow-on conversation about what data was on those systems leads to these frequently issued breach disclosure filings.
A final form of bias comes from the fact that security teams are under-supplied with tools to detect breaches of content (i.e., DLP) but amply supplied with tools to detect perimeter security breaches. The consequences of this imbalance are clear: teams are (relatively) well equipped to report on a compromise of the perimeter but woefully under-informed about a range of breach events that would make their executive team flip out if they knew about them.
Towards Better Sources of Data
As a program for change, "The New School of Information Security" is hard to disagree with. Shostack and Stewart's call to "Use hard data to guide how you react" is a perfectly reasonable recommendation.
I argue here that sample bias in the publicly reported breach data skews the picture of which risks are most active in the enterprise today. This data has a large amount of built-in bias, and a higher-quality source of data on the origins of data exposure will improve risk management enormously.
Where are we going to find better data? Content-aware countermeasures like DLP. Do you know where your confidential data is stored? Where it's being sent? How it's being used? Most major enterprises simply don't know the answers to these basic questions. DLP systems can provide direct, meaningful hard data on the nature of these risks, and quantifiable measures that answer these questions are probably the most meaningful set of security metrics an enterprise could hope to have.
Before you conclude that laptop theft is your number one threat, you owe it to yourself to run a DLP risk assessment and at least gather and analyze exposure events from your own systems to determine which of the many risks you face is the most serious.
Founder, Data Loss Prevention Division