For the last several months, the Hanzo team has been building artificial intelligence models using grant funds that we received from Innovate UK’s Sustainable Innovation Fund. The grant was designed to help companies recover from the COVID-19 pandemic. We’ve been looking into ways to extend Hanzo Hold for Slack, our purpose-built Slack ediscovery tool, to address the new workplace risks caused by the abrupt transition to remote work. A little while ago, I wrote an update about the model we’re building to detect human resources risks like discrimination, threatening language, and bullying on the Slack platform.

Today, I want to shift focus a bit and talk about a completely separate model we’re building to detect data leakage. This model is geared toward identifying disclosures of personal data (names, Social Security numbers, and so on) and of organisational intellectual property such as patent applications, and it aims to alert organisations early so they can remediate proactively. The model doesn’t care whether those disclosures are accidental or intentional; it’s just looking for information that might be problematic. Let’s take a closer look at the risk we’re addressing, and then I’ll walk you through our current solution.

Defining the Risk: Data Leakage on Collaboration Platforms

One of the significant challenges modern companies face is identifying and mitigating data leakage: the unauthorised sharing of sensitive information. As I just mentioned, that information may be protected personal data like a full name or an address, or it might be internal company information such as trade secrets or intellectual property. There are many reasons why sensitive data can leak from one system to another, such as human error, malicious intent, poor data protection policies, or technical fault. With businesses increasingly globalised, more people working from home, and an ever-growing reliance on enterprise collaboration platforms like Slack to communicate, the risk of data leakage is a bigger problem than ever before.

The consequences of data leakage are growing right along with the risk. Companies that are subject to the EU’s General Data Protection Regulation (GDPR) or similar laws like the California Consumer Privacy Act (CCPA) must tightly control access to and use of personal data. Failure to report and respond to leaks of personal information can incur expensive fines and penalties. Worse, data protection regulatory structures vary by state and nation, which means companies can’t afford haphazard data control. Nor are legal and regulatory penalties the only detrimental consequences: leakage of personal data can also pave the way for identity theft or hacking. And, of course, leakage of sensitive corporate information can irreparably damage a company’s financial and reputational well-being.

Because of the risk to a corporation’s reputation described above, data leakage is an urgent concern for the Data Science team at Hanzo. We know that both speed and accuracy are critical when responding to the accidental or malicious sharing of sensitive data. Finding the relevant messages and identifying their senders are major priorities when a leak occurs.

But this is where things get tricky, because leaked sensitive information can take a wide variety of formats and, even more dauntingly, it is buried in a mass of data, like the proverbial needle in a haystack. It would be no problem to spot a data leak that occurred right in front of your eyes, but picking through the sheer volume of data involved in a typical enterprise Slack dataset? That’s why we need help from artificial intelligence tools.

To get into an example, let’s suppose a Social Security number (SSN) was accidentally shared on the wrong channel on a collaboration platform. If you know roughly when and where the leak happened, you can probably find the relevant message quickly. But what if you have tens of millions of messages spanning several years across hundreds of channels with thousands of users? How can you quickly find every instance where an SSN was posted? Or, if you manage to find one leak, how confident can you be that it’s the only one?
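To make the search problem concrete, here is a minimal sketch of the most naive approach: a regular expression that looks for strings shaped like an SSN. This is not our model, which is trained to recognise personal data in many more formats and contexts; the pattern, the function name, and the message fields (`channel`, `user`, `text`) are all assumptions made for this illustration.

```python
import re

# Assumption: SSNs written as 123-45-6789. The lookaheads skip number
# ranges that are never issued (area 000, 666, 900-999; group 00; serial 0000).
SSN_PATTERN = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")


def find_potential_ssns(messages):
    """Return the messages whose text contains something shaped like an SSN."""
    return [m for m in messages if SSN_PATTERN.search(m["text"])]


# Hypothetical messages, invented for this example.
messages = [
    {"channel": "payroll", "user": "alice", "text": "Her SSN is 123-45-6789, please update."},
    {"channel": "general", "user": "bob", "text": "Lunch at 12:30?"},
    {"channel": "general", "user": "carol", "text": "Ticket 000-12-3456 is closed."},
]

flagged = find_potential_ssns(messages)
for m in flagged:
    print(m["channel"], m["user"])
```

A pattern like this misses SSNs written without dashes and flags look-alike identifiers, which is part of why a trained model is a better fit at enterprise scale.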

To quickly zoom in on and visualise where incidents occurred, we created a calendar view that gives the user a very high-level picture of the enterprise’s data channels. For example, this is the distribution of all messages, by date, for a given corpus that we’re going to be investigating.


Creating a Solution

Using the artificial intelligence that we’ve trained to identify personal data, we can create a filter to overlay on this calendar: the “heatmap” view. One quick note about our dataset: since we didn’t want to “plant” data for ourselves to find, but we also didn’t want to set ourselves up to fail, we selected a dataset that we knew could, by its nature, include some SSNs and other personal data in some channels. At the outset, we didn’t know whether it actually contained any of the information we’d be looking for, or, if it did, how many instances there were, where they occurred, or what context they occurred in.

That said, when we filtered that entire corpus to only show messages that contained potential SSNs, the results were striking. In this example, our filter reduced the number of messages to review from 2.7 million to 626, a reduction of more than three orders of magnitude, in just a few seconds. You can clearly see that the instances were clustered together in time and occurred only on a few channels. Depending on the dataset, this could indicate a coordinated malicious attack, a systematic failure, or an intentional discussion within a secure channel. Whatever the cause, this analysis gives users the tools they need to find potential incidents and assess the damage quickly.
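The data behind a heatmap like this boils down to counting flagged messages per day and per channel. A minimal sketch, assuming each flagged message is a dictionary with a `channel` name and a `time` stored as a Python `datetime` (both field names, and the sample data, are hypothetical):

```python
from collections import Counter
from datetime import datetime


def heatmap_counts(flagged_messages):
    """Count flagged messages in each (calendar day, channel) cell."""
    counts = Counter()
    for m in flagged_messages:
        counts[(m["time"].date(), m["channel"])] += 1
    return counts


# Hypothetical flagged messages, invented for this example.
flagged = [
    {"channel": "payroll", "time": datetime(2020, 9, 14, 9, 5)},
    {"channel": "payroll", "time": datetime(2020, 9, 14, 9, 40)},
    {"channel": "hr-help", "time": datetime(2020, 9, 14, 16, 0)},
    {"channel": "payroll", "time": datetime(2020, 9, 17, 11, 15)},
]

counts = heatmap_counts(flagged)
for (day, channel), n in sorted(counts.items()):
    print(day, channel, n)
```

Cells with high counts are the ones a reviewer would want to open first.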


In this example, even though just over 600 messages contained instances of personal information being shared, those messages belonged to a far smaller number of incidents. Clustering data leakage like this gives a better understanding of the risk. While it may be helpful to know that 600 messages over a two-year period included potential personal information, it is much more useful to know that the vast majority of that data sharing took place over a few days, in a few channels.
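One simple way to turn flagged messages into incidents is to group messages from the same channel that arrive close together in time. The sketch below is an illustration rather than a description of our production clustering: the six-hour gap, the function name, and the message fields are all assumptions.

```python
from datetime import datetime, timedelta


def group_into_incidents(flagged_messages, max_gap=timedelta(hours=6)):
    """Group flagged messages into incidents.

    Two messages belong to the same incident when they are in the same
    channel and no more than max_gap apart from the previous one.
    """
    incidents = []
    open_incident = {}  # channel -> most recent incident in that channel
    for m in sorted(flagged_messages, key=lambda m: m["time"]):
        current = open_incident.get(m["channel"])
        if current and m["time"] - current[-1]["time"] <= max_gap:
            current.append(m)
        else:
            current = [m]
            incidents.append(current)
            open_incident[m["channel"]] = current
    return incidents


# Hypothetical flagged messages, invented for this example.
flagged = [
    {"channel": "payroll", "time": datetime(2020, 9, 14, 9, 5)},
    {"channel": "payroll", "time": datetime(2020, 9, 14, 9, 40)},
    {"channel": "hr-help", "time": datetime(2020, 9, 14, 16, 0)},
    {"channel": "payroll", "time": datetime(2020, 9, 17, 11, 15)},
]

incidents = group_into_incidents(flagged)
for incident in incidents:
    print(incident[0]["channel"], incident[0]["time"].date(), len(incident))
```

Here four flagged messages collapse into three incidents; on a real corpus the ratio of messages to incidents is usually what makes the review manageable.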

We can get into even finer detail by combining our filter with a channel view. For example, by limiting the search to one channel, we find another data-sharing incident a few days later:


Of course, it’s valuable to identify instances where one type of personal data, such as SSNs, is leaked. Still, the risk is even higher when SSNs are revealed in proximity to other personal data. To address that risk, we can have our model rapidly scan surrounding messages for traces of other types of personal information, such as names, addresses, phone numbers, or email addresses. This ability to check for several kinds of personal information allows users to assess the risks in even more detail. Instead of having just a raw count of SSNs posted, the end user can see those incidents mapped by time, channel, user, and proximity to other personal data, so they can both locate incidents and prioritise their efforts.
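As a rough illustration of the proximity idea, the sketch below looks at the messages on either side of each SSN hit and reports which other kinds of personal data appear nearby. The regular expressions are deliberately simple stand-ins for a trained model, and the function names, the `window` parameter, and the sample messages are assumptions for this example. The input is assumed to be one channel’s messages in time order.

```python
import re

# Assumption: simplistic patterns, used here only to stand in for a model.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def pii_types(text):
    """Return the set of pattern names that match the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def nearby_pii(channel_messages, window=2):
    """For each message containing an SSN, list other PII types found
    within `window` messages before or after it."""
    results = []
    for i, message in enumerate(channel_messages):
        if "ssn" not in pii_types(message["text"]):
            continue
        start = max(0, i - window)
        stop = min(len(channel_messages), i + window + 1)
        found = set()
        for neighbour in channel_messages[start:stop]:
            found |= pii_types(neighbour["text"])
        results.append((i, found - {"ssn"}))
    return results


# Hypothetical channel history, invented for this example.
channel = [
    {"text": "Can someone update Jane's record?"},
    {"text": "Reach her at jane.doe@example.com"},
    {"text": "SSN is 123-45-6789"},
    {"text": "Phone is 555-123-4567"},
    {"text": "Thanks!"},
]

hits = nearby_pii(channel, window=1)
print(hits)
```

An SSN that sits next to an email address and a phone number would rank above one posted in isolation.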

We’re also working on technology that will alias any potential personal data so that employees are not exposed to sensitive personal information. The team tasked with finding data leakage incidents is like a first responder reacting to an emergency: you don’t want your emergency crew to create an additional risk while dealing with the presenting issue. Aliasing sensitive information mitigates the risk of further sharing that information, so your cleanup team can respond to the incident without being exposed to the actual sensitive data.
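To show what aliasing could look like, here is a minimal sketch that swaps each SSN-shaped string for a short keyed token. It is one possible approach under stated assumptions, not the technology we are building: the pattern, the token format, and the function name are invented for this example, and `secret_key` stands in for a key that only the aliasing service would hold.

```python
import hashlib
import hmac
import re

# Assumption: SSNs written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def alias_ssns(text, secret_key):
    """Replace each SSN-shaped string with a keyed alias.

    The same SSN always maps to the same alias under a given key, so a
    reviewer can see that one value was leaked repeatedly without ever
    seeing the value itself.
    """
    def _alias(match):
        digest = hmac.new(secret_key, match.group().encode(), hashlib.sha256)
        # Truncated for readability; a real system would weigh collision risk.
        return f"[SSN-{digest.hexdigest()[:8]}]"

    return SSN_PATTERN.sub(_alias, text)


key = b"example-key-not-for-production"  # hypothetical key
original = "Her SSN is 123-45-6789. Again: 123-45-6789."
aliased = alias_ssns(original, key)
print(aliased)
```

Using a keyed hash rather than a plain one matters here, because the space of possible SSNs is small enough to brute-force an unkeyed hash.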

Looking Ahead to What’s Next

The scale of information exchange in our globalised, remote-working world is terrifically vast, and, of course, no system or human is perfect. Somewhere down the line, even if a company never hires a “bad actor,” someone will make a mistake, and sensitive data will leak into the wrong place. It’s not a question of if a data leak will occur, but rather a matter of when. Even supposing that the chances come down to one in a million, you probably have 10 million messages in your enterprise communication systems, so it would be wise to be prepared to respond as quickly and safely as possible.
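The back-of-the-envelope arithmetic here is easy to work through. The sketch below assumes, purely for illustration, that every message independently has a one-in-a-million chance of containing a leak; real leaks cluster, as the heatmap shows, so treat this as a rough guide. The function names are ours for this example.

```python
import math


def expected_leaks(n_messages, p_leak):
    """Expected number of leaking messages."""
    return n_messages * p_leak


def prob_at_least_one(n_messages, p_leak):
    """Probability of at least one leak, i.e. 1 - (1 - p)^n.

    Computed with log1p/expm1 to stay accurate when p is tiny.
    """
    return -math.expm1(n_messages * math.log1p(-p_leak))


n = 10_000_000  # messages in the corpus
p = 1e-6        # assumed chance that any one message leaks

print(round(expected_leaks(n, p), 2))
print(round(prob_at_least_one(n, p), 5))
```

Under those assumptions, 10 million messages would be expected to contain about 10 leaks, and the chance of at least one is above 99.99%.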

We haven’t entirely wrapped up this project yet, so watch this space for another post reflecting on what we created and what we learned along the way.
