Closing the data floodgates

I grew up in south Florida, probably one of the flattest places in the country. We had no mountains, hills or even mounds — nothing but flat in all directions. There was one diversion from the flat when I was a kid — an odd ravine along a residential street. We referred to it as the “deep deep” and drove by for a look every chance we got.

Over 30 years ago, I moved to Atlanta, a land of hills and valleys. My house backs up to a floodplain area with a ravine that makes the “deep deep” in Miami look small by comparison. Since I see it every day from my window, I really don’t think much about it anymore.

So, what does this reminiscence have to do with preventing data loss? I would suggest that the underlying problem is the same. Companies concerned about losing key data, such as the elements regulated by HIPAA and PCI, begin by having people watch their communication channels (email, USB drives, etc.) for the presence of such data and filter out the critical items. It seems an easy task at first, but after the hundredth email message, the reviewers' eyes glaze over and they start to miss data items, just as I look out my window and no longer notice my ravine. Thus, there is a legitimate need for some automated approach to monitoring communication channels for inappropriate data.

Data loss prevention (DLP), sometimes called data leak protection, is an attempt to monitor common communication channels for the presence of controlled data, and to mask the data, preventing its transmission in readable format.

According to the SANS Institute, the earliest DLP products hit the market in 2006, with the product class gaining steam in 2007. In 2010, Fujitsu published a detailed research paper on the topic, laying out the specifics of how a DLP system needed to function. The market has grown into a booming business since then, with products from a variety of sources, both commercial and open source.

From a high level, the idea is to employ automation to watch a communication channel for the outflow (and in some cases, inflow) of controlled data by its pattern, for instance a Social Security or credit card number. When such a pattern is found, the system can mask the data automatically to prevent its unauthorized disclosure, and also log the source for investigation. A key prerequisite to the loss prevention process is understanding what data elements you have, and where they are. According to Randy Trzeciak, the technical manager of the CERT Insider Threat Center at the Carnegie Mellon Software Engineering Institute, “If you don’t know what they are and who has access, then it is hard to either detect or protect.”
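To make the pattern-and-mask idea concrete, here is a minimal Python sketch. It does not reflect how any particular DLP product works. The two regular expressions, the mask_message function and the audit_log list are hypothetical names invented for this illustration, and the patterns are deliberately simplistic. The one standard piece is the Luhn checksum, which payment card numbers are designed to satisfy and which helps weed out random digit strings.

```python
import re

# Illustrative patterns only (assumptions for this sketch); real products
# use far more robust detection than two regular expressions.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_message(text: str, source: str, audit_log: list) -> str:
    """Mask SSN-like and card-like values, logging the source of each hit."""

    def mask_ssn(match):
        audit_log.append((source, "ssn"))
        return "***-**-" + match.group()[-4:]

    def mask_card(match):
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):
            return match.group()  # not a plausible card number; leave as-is
        audit_log.append((source, "card"))
        return "*" * (len(digits) - 4) + digits[-4:]

    text = SSN_PATTERN.sub(mask_ssn, text)
    return CARD_PATTERN.sub(mask_card, text)
```

Calling mask_message("SSN 123-45-6789 and card 4111 1111 1111 1111 ok", "alice@example.com", log) with an empty list log returns "SSN ***-**-6789 and card ************1111 ok" and leaves two entries in log, one for each masked item, so an investigator can see where the data came from.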

Drilling down a bit, most of the products look for predefined data types, such as credit card numbers, as well as custom-defined patterns. They offer several additional means to locate data to be masked, including keyword matching, predefined dictionaries and match expressions. They can also integrate with other technologies, including Microsoft Active Directory, SMTP servers, databases and custom code via API, and can automatically discover and monitor workstations, giving the products many vantage points from which to watch for data leakage.
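As a rough sketch of how keyword dictionaries and custom-defined patterns might sit side by side, consider the Python below. The Policy class, its fields and the "MRN-" record number format are all hypothetical, chosen only for illustration. Commercial products expose this kind of configuration through their own consoles and APIs, which are not shown here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy: a keyword dictionary plus custom regex patterns."""
    name: str
    keywords: set = field(default_factory=set)
    patterns: list = field(default_factory=list)

    def findings(self, text: str) -> list:
        """Return (kind, value) pairs for every keyword or pattern hit."""
        hits = []
        # Keyword matching: compare lowercase words against the dictionary.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for keyword in sorted(self.keywords & words):
            hits.append(("keyword", keyword))
        # Custom patterns: report each matching substring.
        for pattern in self.patterns:
            for match in pattern.finditer(text):
                hits.append(("pattern", match.group()))
        return hits


# Example policy; the record number format is an assumption for this sketch.
health_policy = Policy(
    name="health",
    keywords={"diagnosis", "patient"},
    patterns=[re.compile(r"\bMRN-\d{6}\b")],
)
```

Running health_policy.findings("Patient diagnosis attached, MRN-004217.") reports both keywords and the record number. A real deployment would then decide, based on those findings, whether to mask, block or simply log the message.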