[co-authors: Jonathan Flood*, Justine Sim**, Jennifer Hamilton***, Michael Amaral****]
Editor’s Note: On June 23, 2021, HaystackID shared an educational webcast designed to inform and update legal and data discovery professionals on how to consider and operationalize data mapping processes utilized in the planning, preparation, and migration of complex data sets consisting of high-risk, sensitive, and low-risk data requiring management and migration.
While the full recorded presentation is available for on-demand viewing, provided for your convenience is a transcript of the presentation as well as a copy (PDF) of the presentation slides.
[Webcast Transcript] Operationalizing Data Mapping: From Practices and Protocols to Proven Processes
Complex information governance challenges require a comprehensive and complete approach to understanding an organization’s data profile through the lenses of compliance, privacy, and litigation. Data mapping is a critical component of this understanding. The practices, protocols, and processes around this vital component can make the difference between an organization’s comfort or concern regarding its data profile.
In this presentation, industry experts shared how to consider and operationalize data mapping processes utilized in the planning, preparation, and migration of complex data sets consisting of high-risk, sensitive, and low-risk data requiring management and migration.
+ The Importance of Data Mapping
+ Strategy to Tactics: Data Mapping Considerations
+ Documentation and Data Mapping: A Key to Success
+ Working the Workflow: From Protocols to Processes
+ Technology Matters: Tools and Techniques
+ Overcoming Challenges: Compliance, Privacy, and Litigation
+ Reporting: From DSARs to Discovery
+ Jonathan Flood – Mr. Flood is the Director of EU Discovery Ops at HaystackID. Jonathan is a thought leader who has worked with top-tier law firms in Ireland, in addition to vendors, financial institutions, government agencies, and regulatory bodies.
+ Justine Sim – Ms. Sim is an Information Technology Analyst for John Deere.
+ Jennifer Hamilton, JD – Ms. Hamilton is the Deputy GC for Global Discovery and Privacy at HaystackID. Jenny is the former head of John Deere’s Global Evidence Team.
+ Michael Amaral – As the Director of Global Client Advisory, Mr. Amaral leads the Privacy Management Technology group at HaystackID. Mr. Amaral is a technologist with over 20 years of experience helping companies solve complex data problems in compliance, governance, privacy, and litigation matters.
Hello, and I hope you’re having a great week. My name is Rob Robinson, and on behalf of the entire team at HaystackID, I’d like to thank you for attending today’s presentation and discussion titled Operationalizing Data Mapping from Practices and Protocols to Proven Processes.
Today’s webcast is part of HaystackID’s monthly educational efforts conducted on the BrightTALK network and designed to ensure listeners are proactively prepared to achieve their cybersecurity, computer forensics, eDiscovery, and legal review objectives. Our expert presenters for today’s webcast include three of the industry’s foremost subject matter experts on privacy and governance in eDiscovery.
Our first expert is Jennifer Hamilton. Jenny is the Deputy General Counsel for Global Discovery and Privacy at HaystackID, an industry veteran who is very active in associations and advisory groups throughout the industry. Jenny is also the former head of John Deere’s Global Evidence team. Our second expert is Justine Sim. Justine is an Information Technology Analyst for John Deere, and she has extensive operational expertise in areas ranging from data mapping to DSAR responses. Last but certainly not least, our final expert is Mike Amaral. As the Director of Technology Innovation at HaystackID, he’s a technologist with over 20 years of experience helping companies solve complex data problems in compliance, governance, privacy, and litigation matters. And while not able to attend today based on the imminent arrival of a new family member, I’d also like to acknowledge Jonathan Flood. Jonathan is HaystackID’s Director of eDiscovery Ops, and he was integral in the planning and preparation for today’s webcast.
Welcome, Jenny, Mike, and Justine.
Thanks, Rob. I really appreciate that fantastic introduction, and we’re really excited to talk to you today about operationalizing data mapping and here’s the agenda. We are going to talk about the importance of data mapping, key challenges, our mapping inventory strategy, the workflow process, our ever-favorite, data cleaning and migration, and last but not least, documenting and reporting.
So, the question is, what is data mapping? Like any specific subject area in the world of privacy, eDiscovery, cyber, and litigation, we need to start with setting the table so that we can all align around a common understanding of data mapping, or doing a data inventory. What does it mean?
So, I’m going to kick it over to Mike Amaral to give his views on what is a data map, in your experience, Mike.
Jenny, thank you. What is a data map? I think that depends on who you ask and what their goal is, and as we go through the presentation, we’ll talk about some of these things. But really, the question I always ask is: how do we know what we need to protect, what is out of compliance, or where our data sits, if we don’t understand this? So, a data map is really a layout of where your data sits in your enterprise.
I was just going to add that the map not only should contain where the data is, but what the data contains. So, we’ve got maps of where files are contained, but if we don’t know what is in those files, does it really help that we know where they are?
I would agree with that. So, I think touching on Mike’s point there that depending on who you ask, the map’s going to look different. So, if I put my eDiscovery hat on, I’m mostly focused on where – depending on the case – internal employee data is located, versus where customer data is located, versus warranty and system and manufacturing kind of data. Then when you look at it from a data subject response perspective, I’m usually diving into things a little bit deeper to understand really where those data flows are going, how is the data getting input, how is it getting exported or sent somewhere else, if it is even, in general. And then you’ve got the cyber ops world, which is a whole other ballgame of trying to figure out where the data is and, like Mike said, how best to protect it.
Just before we leave this concept, you touched on the subject request. Can you define that for the audience, maybe level set on that?
Yes, so the subject request would be in compliance with data privacy laws. So, as a data subject, or citizen of wherever you reside, you may be entitled to certain protections when it comes to your personal data, and you can request that a company provide you with either access to what data they have on you or a copy of it. You can also request that it be remediated, updated, or deleted, depending on which law applies to you.
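The request types Justine describes, access, copy, rectification, and deletion, can be sketched as a simple dispatcher. This is an illustrative Python sketch only, not any vendor's workflow; the in-memory record store, field names, and `SubjectRequest` type are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory store of personal data, keyed by data subject.
# The structure and contents are illustrative only.
records = {"alice@example.com": {"name": "Alice", "country": "IE"}}

@dataclass
class SubjectRequest:
    subject_id: str
    kind: str                      # "access", "copy", "rectify", or "delete"
    updates: dict = field(default_factory=dict)

def handle_request(req: SubjectRequest) -> dict:
    """Dispatch a data subject request to the matching obligation."""
    data = records.get(req.subject_id)
    if data is None:
        return {"status": "no-data"}
    if req.kind in ("access", "copy"):
        return {"status": "ok", "data": dict(data)}   # disclose what is held
    if req.kind == "rectify":
        data.update(req.updates)                      # correct the record
        return {"status": "updated"}
    if req.kind == "delete":
        del records[req.subject_id]                   # erase on request
        return {"status": "deleted"}
    raise ValueError(f"unknown request type: {req.kind}")
```

In practice, each branch would trigger an identity-verification and fulfillment workflow rather than a direct dictionary operation, but the dispatch structure is the same.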
Excellent, and we’ll get into some more detail on that in a minute. But I think what I’m hearing, and in my own experience is consistent with what you’re saying that it really depends on the audience. What is a data map? What is inventory? And to different audiences, a data map may mean different things or may be used for different things. So, let’s dive into that a little deeper.
In doing so, we’re looking at the data mapping, the evolution of data mapping, should I say, in four different phases where if you look at the history of organizations mapping their data.
We have the first phase which we call the Wild West, where it was this idea, this more abstract idea of, hey, in order to meet eDiscovery obligations, we need to understand more about our IT systems, and more about the data we have, so that we can certify to the Government or to the judge, or the opposing party that we’ve run a complete search, and we’ve been diligent in our discovery efforts. And this is going back into the early 2000s when the Zubulake opinions started to come out, and the Federal rules were revised, that made it clear that attorneys couldn’t shrug their shoulders when they were asked about their clients’ IT systems, and kick it over to the IT department to certify what was searched and what wasn’t searched.
So, that’s what I would call the Wild West. I’ll open the floor to see if you have any other ideas and experiences in terms of the early stages of the excitement around data mapping.
I can add to that, Jenny. As you say, when it first started, we all went and started to try and organize our data better when we had to discover it. So, we would have a layout of an IT share that had an accounting department, and this is where you’re supposed to keep your data.
But unfortunately, there really weren’t many controls around keeping it there. And so, as we discovered, or as we continue to discover, there are more and more places that data seems to hide. And so, that Wild West then turned into the second phase, which is the Audit Style.
Big companies brought in big companies to audit where all their data was in transactional workflows, and that was great. It was a great snapshot of what we did when they looked at it. But the problem with that is it was out of date as soon as it was complete.
Right, so tell us why. What happens when you’re going through the process? Maybe you can describe a little bit more, from your experience, the mechanics of trying to document or data map? How are you doing it?
It is a long and tedious process. Interviewing people on where do you store your data, and taking that back and then going and checking that, OK, that’s where the data are stored. But by the time you’ve finished the interview process and the verification process, policies have changed, new systems have been added, and it makes the whole thing irrelevant.
It was great to look at it, and maybe 75% of it’s still there, but it’s not a full picture of what your data is.
Right, it’s ever-evolving. In my experience, and Justine can jump in too, the needs of the business are not such that IT is sitting down and diagramming out where all the data sits, what servers it’s on, and in what part of the world. We have this misperception, particularly as lawyers, when we start this journey and start interviewing clients and talking to people in IT, like you mentioned, Mike, that they are going to know where the data is, and to what level of granularity, and that they have this all already neatly diagrammed somewhere in IT, and that if you just ask the right person, they’re going to snap their fingers and produce it.
And so, my experience is that it’s not like that at all, and you really are creating this from whole cloth. Part of the reason is that the documentation they need to support the business is very different from the information we’re trying to tease out, which is more of a high-level understanding and education about where the data is, so that we can quickly find things as issues arise. And then, Mike, you pointed out that even if you’re able to do this, and it is a tedious process to get to that point, and again, Justine can comment on the experience she’s had, you get this map, this spreadsheet, or this diagram that isn’t really accurate by the time you come to that understanding with IT about what exactly you’re looking for them to document.
Yes, I can add to that. That’s definitely been a lot of my experience. So, we would start off in kind of this phase one/phase two mesh of interviewing. A lot of my experience is based on interviewing and building relationships with people, so trying to understand, like you said, exactly who’s in charge of what. Custodian A might have part of the puzzle documented, but then you’ve got to talk to three other people to get to the person who has the retention, or, I’ll say, the outflow of where the data goes. So, a lot of conversations, a lot of documenting, and it can be as simple as Word documents to document the interviews. But I do like to think of phases one and two as almost like firefighting, and a lot of it was case-dependent.
So, I might be only looking at a few systems that are relevant to this case, and then after that case is done, you kind of move on, and the next case arises and you look at three different ones. But by the time you get back to needing System A from Case One, it’s outdated and you’re just repeating the same process over and over.
I was just going to say, I think the goal for all of us is to move into doing more with less and trying to figure out how to operationalize that. So, using planning or systems to approach this mapping in a way that makes it a repeatable process, so you get the information that you need, but be in a more proactive, I’ll say, stance than always playing catch up.
Let me ask you this, Justine. Before GDPR went into effect, whose job in IT was it to maintain this map or information?
That’s a great question. I don’t know that it was anyone’s specific job. I think each… and my experience has been that each data owner or system owner would look at things differently. So, as you said, the business might not have a need to have all of the questions that we would ask for an eDiscovery related matter documented. They might have where the system is physically located, if it’s on-premise or off-premise, how long things are retained, but not necessarily all the other bits of information we’d want to pull out.
So, what changed with GDPR coming online?
Yes, so GDPR introduced the, I’ll say, idea or notion that anyone could show up at any time and ask for, essentially, a data map, a listing of what systems you have and where the data flows in and out of. So, I think that brought on a lot more attention and urgency to at least have a base-level idea of where your data is and where it sits.
And then, Mike, why don’t you take us forward. I think you’ve started describing phase three, so add anything you have to that, and then take us into phase four, what the future of data mapping looks like.
Well, yes, we are currently, today, in phase three, and we’re starting to see some tools that really help assist us in building that map, with all kinds of new connectors and what they can look at, and that’s really where the map is headed. What we see is a living, breathing index of all your data in real-time. And this really is what we all would love to see: for compliance, you’ve got real-time information about documents that are about to fall off of retention; for privacy, we know where all of our private data sits; and for cyber and breach response, if a system gets hacked, we know what was on that system and who was on that system, and so our reporting becomes easier.
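The "living, breathing index" Mike describes could be modeled minimally like this. This is a Python sketch under stated assumptions: the `IndexEntry` fields, retention rule, and helper functions are invented for illustration, and real tools track far richer metadata.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Minimal sketch of one entry in a living data index (fields are illustrative).
@dataclass
class IndexEntry:
    path: str            # where the document sits
    system: str          # which enterprise system holds it
    created: date
    retention_days: int  # simple retention rule for the sketch
    contains_pi: bool    # flagged by a prior classification pass

def expiring(index, today, within_days=30):
    """Compliance view: documents whose retention ends within N days."""
    return [e for e in index
            if today <= e.created + timedelta(days=e.retention_days)
                     <= today + timedelta(days=within_days)]

def breach_exposure(index, system):
    """Breach-response view: what personal data sat on a hacked system."""
    return [e.path for e in index if e.system == system and e.contains_pi]
```

The point of the sketch is that one index can serve several audiences: the same entries answer both the retention question and the breach-exposure question.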
And to get everything into one living, breathing kind of tool that the enterprise can use for all of its things, I think is a really nice goal to have. I don’t know that we’ll ever get all the groups to agree on a single tool to use, but that’s another discussion.
Yes, I agree with that, and I would just add one more thing to that last slide. I think, in my opinion, and experience, that most companies are probably not all in one phase. Likely with old legacy systems and the new systems coming on board, you’re in a mix of all of those phases, if not a few.
And this talks a little bit about what we talked about in phase four, why did we do this stuff? We talked a little bit about, in the early days, we want it for legal purposes, and then compliance came in, and now there’s privacy and cyber. All of these different groups have their own version of what a data map is and what the best way to do it is. And I think that causes some struggles when looking at information, as a whole.
Yes, I think it does, because it’s like four different puzzles, but with largely the same pieces, they’re all a little different. And I think even if you were just, in the beginning, trying to better understand where the relevant data was located, by developing your own map or your own record, you really didn’t learn how to map it until after you’d already started, and then there was a lot of iterations.
You start over and, OK, now I better understand what it is I’m looking for and what information I’m going to need out of this. It’s like doing your first document review of a banker’s box of invoices: you get the file from the partner, and the partner says, “Here’s, generally, what the case is about, now review the invoices and see if there’s anything interesting in there, anything relevant”. And so, then you go through the invoices, and you’re like, “OK, this is an engineering invoice, this looks really relevant; this is a completely unrelated bill to accounting”. Now that I understand better, after I’ve looked at the data, what I need to know, I can develop the issues and go back and review.
So, that is my view: as Justine said, you’re jumping between phases, it’s not a linear approach, you’re learning as you go. It’s two steps forward, one step back, if that makes any sense, on any one of these. And then all these different issues have come online: records information management, or file cleanup day for compliance, and they want to understand where the data is located. And then you have GDPR and CCPA, so now there’s another view of what’s important to include in the map. So, you’re constantly moving back and forth between iterations of maps, what you’re collecting, and why you’re collecting it.
So, let’s move forward here and continue to talk about key challenges maybe in a little more detail, starting with forming the working community. Justine, do you have any thoughts on how this has been a challenge, in your experience?
Yes, so this, I will likely say, is the foundation of your whole existence in the data mapping world. So, it’s very important. It also can be very challenging, because at times, like we said earlier, it’s not necessarily anyone’s job responsibility or anything that they feel they’re tied to or looked at for job promotions on, but it’s very important to have those conversations and build the relationships.
So, legal, your legal team, your IT resources who should know where some of the data is. You’ve got risk management. In my experience, there’s a lot that can come from data governance groups, things like that. So, it’s a lot… I guess, to me, coming from a technical background, it was a lot more of having conversations and building that kind of rapport with everyone than I thought it would be.
I thought, “Oh, we’ll come in, we’ll just document where these systems are and be done”, very technical, like straightforward. And it’s been a lot more work and importance on the relationship building and the work committee than I really would have thought of to begin with.
I think the next piece would just be understanding the risk. And I think, Jenny, you did touch a little bit on that earlier with the banker’s box and the review kind of situation. You don’t really know what you have or where the risks are until you start diving in and understanding where the data is.
Absolutely, and aligning on the goals, and we talked about that. In the cyber world, in the business world, the company’s interested in where the crown jewels are. The big issue that comes up in our business is that we assist in locating personal information and patient health records to determine what the risk is in any sort of dataset, and making sure that that review is done with those goals in mind.
We touched on this earlier, and I welcome your additional comments on this, but just deciding where to start, what systems to start with. And I always joke about one of the first conversations I had with anyone in IT, in my prior role, was really a question for IT, what systems, tell me a little bit about our infrastructure environment, what kinds of systems we have. And the question back to legal was, well, which system do you want to know about? Can you tell us about all of them? And they’re like, no, which one do you want to start with? Well, I don’t know, what do you have? Well, what do you want to start with?
So, that was the first conversation, right, Justine?
Yes, I think that’s a very common chicken and egg conversation, and especially, I think, where IT and legal tend to swirl a lot, it’s answering a question with a question.
So, in my experience, the best approach I’ve had is to pick something at a high level, either looking through discovery requests or working with the privacy team. Depending on which hat you’re wearing, just pick a system. If you know this is my CRM, my customer system, I hear about this a lot, then just start there. To get the starting piece, you just have to start.
So, it might lead you down a bit of a bunny trail, but it does help you map out. The map’s never going to be a pretty spreadsheet, or linear, line by line, it’s really a spiderweb of systems that usually are interconnected in some way or fashion, so picking, depending on what’s your priority. When we talked about aligning on goals, so your priority and your goals and just going on from there.
So, if it’s… again, like if it’s data subject related, for CCPA, we started with a lot of our customer data. Where are the heavy hitter customer data systems that we know we have? If it’s for legal and it’s very case-specific to an employment suit, then I might start with an internal HR system.
Mike, what advice would you have with the audience, what’s a takeaway, in your mind, of how to decide where to start and then lead us into defining the data content.
Defining where to start is… if we’ve got a committee and we’ve got the enterprise thinking about data mapping as a priority, it doesn’t really matter where to start; it’s whatever the priority is. And that brings us to defining the data content. For privacy, what data are we looking for? Just because we’re looking for privacy data doesn’t mean we can’t also meet the needs of litigation, cyber, and compliance while that same data is being gone through.
So, we can start building the map with a lot of our current processes once we define, as a group, what data we want to collect about our systems.
Agreed. I would add, I think that’s been one of the bigger frustrations when we started rolling out some of this – and Jenny, you can agree or disagree with this – was that some of the feedback we got on the business side was, “Well, I’ve already talked to several of these other groups, why are you asking me these questions again?”
So, to Mike’s point, the more you can grab your groups together and get your working committee aligned on the goals and the data content, the, I think, easier of a rollout you’ll have in your company.
Yes, definitely. And then we’ve got the concept of defining the data content, so we discussed that before. It’s the same thing: you don’t necessarily know where the crown jewels are, that’s why you’re doing the exercise, but you want to start somewhere. So those are areas of risk that we’ve identified, but then there’s a new area we really haven’t identified, which is third-party data and doing the vendor inventory. Increasingly, organizations have really large supply management departments and rely heavily on third parties, contractors, subcontractors, consultants, and eDiscovery and cyber providers like HaystackID. So you’ve got this whole other area of third-party data, not to mention the customer data and employee data we already talked about, where you’ve got risk, and you need to start with identifying where it is and, as Mike mentioned earlier, the actual content.
Let’s move into more the strategic discussion on doing your inventory or mapping these different items. Mike, can I ask you to kick us off and take us through a day in the life of developing your strategy around this.
Yes, and really to start with, your data inventory really is a living index, and it doesn’t have to be up to the second, but there are enough tools out there today that can provide you feedback on your data on a daily, weekly, or monthly basis. And once you do that, you really have, as you said earlier, kind of the full-blown picture of what’s going on in your environment: where your PI is stored, what documents are on or about to fall off of retention. And you could even go further than that with some of the new tools.
But really, I always like to work from the end goal back. So, how do we have to build this index? We have to (1) identify where the data sits, and then (2) scan the data to find out what information we can pull back from it, whether it’s traditional eDiscovery metadata that tells us the dates and the filetypes and where they’re sitting, or whether we can actually apply some new AI tools to pull out PI, PHI, and PII and tell us where that’s located on our systems, including scanning the data for content and other things that fit with your compliance needs.
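The two steps Mike outlines, locate the data and then scan it for content, can be sketched with simple pattern matching. This is a toy illustration only; the two regexes and the inventory shape are assumptions, and commercial scanners use far richer pattern sets plus machine learning classifiers.

```python
import re

# Two stand-in detectors; real PI/PII scanning is much more sophisticated.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(documents):
    """Build a tiny inventory: for each located document, note which
    PI types appear in its content."""
    inventory = []
    for path, text in documents.items():      # step 1: where the data sits
        hits = sorted(k for k, rx in PATTERNS.items()
                      if rx.search(text))     # step 2: scan the content
        inventory.append({"path": path, "pi_types": hits,
                          "has_pi": bool(hits)})
    return inventory
```

For simplicity the sketch scans in-memory strings; a real crawler would walk file shares and extract text from native formats before this classification step.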
In eDiscovery, we’ve been doing contract identification for a while, but does every company out there understand where all their data sits, where all their contracts are, or where all their important documents are? With all the breaches that have been in the news lately, you look at what’s the best way to protect what’s most important, because the criminals are going to find a way to get at your data. It’s not an if, it’s a when. And really, we want to know where our most valuable information is, so we can spend our money really protecting that. Then if they get in, maybe they get something that’s of no value, but when they get in, you know what your exposure is very quickly with having this type of data map.
I’m taking a lot away from what you’re saying, I’m hearing, kind of reading between the lines that it’s important to scale the inventory because it is a living index. So, you can’t gather so much information that all you do is take people off their day job, sit around and inventory things, but you do need to prioritize based on risk, and cyber risk is obviously a huge issue. It is all over the news. There’s an executive order about it, the G7 is talking about it, and there are new roles being created faster than we can fill them, in terms of addressing the reactive part of a cyber breach.
And what we’re talking about today is developing a strategy that’s proactive. When the criminals – to your point – when criminals access or exfiltrate your company’s data, they’re not getting anything very exciting, because you have more controls and protection around the higher risk information, the employee data, the patient health records, and the stuff that maybe they can get to, the lower hanging fruit isn’t very exciting, so what. You don’t necessarily have reporting obligations, and you can sleep at night.
So, I’m taking a lot of that away from what you’re saying. We’ve talked about the tools and can you tell us a little bit more about the tools and what are some exciting things you’re seeing specifically that can help us with this more proactive approach.
We have some existing tools in the market. We’re a Nuix partner, and Nuix puts out a really good set of tools that can scan data for this type of information and provide it back. There’s no kind of IG front end on it, or mapping front end on it, but that can be worked in with other components; we’ve done it with Nuix and Elasticsearch, and some other things, to make some intelligence from the data. Because as you said earlier, just having all this data in an index is one thing, but how do we present that to people so they can understand it and use it for their business? And that’s where some of the new tools, the OneTrusts and the BigIDs of the world, which really came out of privacy, are starting to see that building a total data map for security and governance is becoming more and more important. They’ve got good tools for building that workflow around your third-party risk, making sure you’ve got an inventory of your vendors and what data they have, and having a good conceptual map built. And then you can scan that data and pull back the information when needed.
But I think, also, interestingly, we’ve gone from… in the old days, we had a records department with kind of the overseer, the data manager of that records department. And with IT systems growing far and wide, I don’t know that there’s any one person in any one company that can tell you where any particular data is.
I think this is an issue that my team had to hear me talk about ad nauseam, and that’s the free lunch we thought we were going to get by cutting down on the number of administrative assistants that we all had access to, to help us organize and manage records. The thinking was, well, we don’t really need secretarial help the way we had used it in the past, as computers became more prevalent.
So, there was some displacement of that work. You didn’t need someone to type what you were dictating, to use kind of a legal context. And so, then the secretaries and administrative assistants were simply organizing the case files. But even then, we were storing more and more of the case information on our computers, and we felt that their work had been displaced. In my view, though, we misunderstood what their real role was; the value of having them was to keep us organized. Attorneys and many other specialized professionals are not good at this, and we certainly aren’t educated in library science. There’s no Dewey Decimal system; there’s no organizational system already laid out for you as you come into an organization that says, “OK, here’s where you’re going to store these things, and here’s the folder structure that’s consistent, so use this folder structure to store your case files”, again, keeping to a law department kind of analogy. Instead, it’s like, “Well, just do whatever you want”.
My understanding from benchmarking over the years is that this is pretty common. It’s not the other way around, where people come in and there’s already a structure for them, set up by somebody who is a professional at organizing records, like we had with administrative assistants. So, when we say data librarian, that’s what we’re talking about: a data scientist, a data librarian, an administrative assistant, somebody who is a records information specialist who can provide that. That, to me, is where this is being driven from. But the reason we have all this risk and don’t know where things are located is that it’s been largely left up to people for whom this is not a core competency or core role; they’re supposed to be doing the work, not figuring out the best way to organize it and having it executed consistently across the organization.
So, I will now get off that soapbox, but one of the suggestions here is to bring back the data librarian or administrative assistant and have someone really take charge. And I really like where this is going, Mike; it makes me think we have a real opportunity. Now that we have all these amazing tools, we can better understand what people do in the natural course of their job, where they’re putting records, and understand why they’re putting them there. Is it because it makes sense? Is it because it’s easy and fast? And we can develop an approach that’s custom-tailored to the organization, the data type, and the psychology of how people operate when they are trying to do their work but keep things in a place where they can find them. I think that insight could drive a better system for an organization, so that it could then be driven through the DNA of the organization, through the records information management group or the data governance group, and companies can start getting budget to do this, understanding that it cuts off a lot of risk down the road.
I don’t know, Justine, if you have any comments on that, or any examples from your life where you think any one of these strategies makes a difference?
Yes, so I would agree with all the comments made here. I think, specific to the data librarian, one thing that I keep thinking of as we talk about some of these tools and resources is that even if a data librarian, in the sense of a true person, might not be an option currently, you can leverage some of these tools to become that data librarian for you. If you institute some kind of automated assessment and send those out to owners, then that becomes your inventory and your living index, where different groups can come and collaborate in one tool and get the information they need.
But the data librarian, obviously, is important, and I feel like some of these tools are starting to go down the path of becoming that librarian for you. And if not, then there should be some kind of data librarian role or responsibility within an organization; if it’s not a person, then it’s an added responsibility for the applicable people.
And I just want to throw in an aside here, there was a question from the audience. Yes, this is being recorded, so certainly step out if you need to and come back, but we welcome anybody’s questions, so keep them coming.
There’s so much meat on the bone here, and I don’t want to belabor it, but I do want to circle back to this idea of you’ve got tools, and then you have the index that needs to be accessible and needs to be iterative. And there’s a distinction there between the location and storing of the map, making it accessible to the people whose job it is to update it and to the people who need it, and then the tools used to crawl, scan, index, and organize the data so that it can be cleaned, migrated, or remediated based on risk.
But between those two tools, those are pretty big chunks of budget and require quite a bit of employee engagement. We talked about roles. And today, GDPR, CCPA, and other regulations are driving more budget into this area; we’ve been waiting for it for some time, and now we’re starting to get our wish. But if you have any comments, Justine or Mike, for companies who… not every company can afford all of these tools, so how do you help them make decisions about where to start and which tools would be your go-to? How could they do this in a more budget-friendly way?
Yes, I can take the first stab at that. So, I would say if I had zero budget, in the grim future of not getting any money, you could always start with a good old spreadsheet, a list, something.
To me, the key takeaway for this, if nothing else, is to just start. You’re obviously not going to be compliant with anything if you have nothing documented, so you’ve got to start documenting and coming up with your list. Then, in my experience, what I have found is that as I’m documenting things and keeping track of my time and effort spent to perform a data map, I can take that to my management, my supervisor, whoever I’m presenting in front of, and say, “Look, I’ve spent 80 hours on this, and I’ve got two systems documented. If you want to be compliant in this space, we need to have a focus on this. Look at this nice tool over here that can help automate some of this or take some of that burden off, so we can be compliant in a timely fashion, rather than hiring more people to continue doing it manually.”
I think what I’ve experienced with you, Justine, is you kind of just have to dive in and start doing it, and then demonstrate the amount of time these things take, before you can propose a return on investment for any one of these tools.
Mike, I’m guessing you have some helpful advice for the audience on this as well.
Like Justine and Jenny have said, you don’t need to spend money to build a data map. You can send out questionnaires or have a conversation with your fellow employees about where data is stored and what the process is, and start building that data…
There are two maps, really. One is where the data sits, and the other is how the data flows, which is almost just as important. How long does it stay there? Does it get shared with other systems? Does it go outside of our walls? But you can do all of that with questionnaires to the different department heads, understanding what data they touch and where it is, and build a spreadsheet.
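The two-map idea described here, an inventory of where data sits plus a register of how it flows, can be captured in even the simplest structured format. Below is a minimal sketch in Python of what those questionnaire answers might look like once collected into CSV; all system names, owners, and field names are hypothetical, and a real inventory would, of course, carry whatever fields your questionnaires ask for:

```python
import csv
import io

# Two complementary maps: an inventory (where data sits) and
# a flow register (how data moves between systems).
INVENTORY_FIELDS = ["system", "owner", "data_types", "location", "retention"]
FLOW_FIELDS = ["source", "destination", "data_shared", "leaves_org"]

inventory = [
    {"system": "HR-App", "owner": "Jane Doe", "data_types": "PII",
     "location": "on-prem", "retention": "7 years"},
]
flows = [
    {"source": "HR-App", "destination": "Payroll-SaaS",
     "data_shared": "names, salaries", "leaves_org": "yes"},
]

def to_csv(rows, fields):
    """Serialize questionnaire answers to CSV so both maps stay reviewable."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(inventory, INVENTORY_FIELDS))
```

The point of keeping the two maps as separate, structured lists is that each can be updated independently as owners respond, and both remain something any department can read without special tooling.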
Now, there are some inexpensive tools, but if you really, like Justine said, want to get a budget for it, pick an application that you think needs to be mapped, or some unstructured data that you’re concerned about, and have a test run. Go out and index that data, pull back just a slice of it, lay out what the risk was in that data, and use that to go get your budget for your tools.
Because if you can identify risk in data now with privacy and security, I think you can really move the needle towards getting what you need to build this living, breathing index.
And in addition to that, we touched on the data librarian, making sure you have the right roles engaged in this process. What we didn’t discuss is, when you’re sending out your questionnaires, where do you start? Where’s the biggest bang for your buck? What are the types of roles that are apt to answer questions about what’s in a certain system that’s helpful for the data map?
Justine, are there any particular people who are kind of a shortcut, so you don’t spin your wheels just spewing it out across the enterprise and hoping somebody takes the time to answer it?
Yes, definitely. So, I would say, in my experience where I am, we have what are called product owners for each IT system. I’ll say the magic potion for what we’ve been doing is we try to have one IT product owner and then one on the business side. So, I feel like it’s always this balance of IT and business, or IT and legal. That way, you get both sides of the story, as well as the translation. The translation piece, from a high level, seems like it should be really easy, but in my experience, when we’ve talked to just one of the roles without the other, we end up missing pieces and going back and forth between the two.
So, I would say anybody that’s more on the business side, or maybe, like you said, a department head, somebody on the process side of the tool or dataset that you’re looking at, and then the person who is on the IT or administration side of it, the product owner. So, we would always say, “Hey, Jenny, I see that you’re the owner of this system. Who do you work with on the IT side to make any changes or to set it up, who kind of runs the admin side?” And that was usually the question that would get us to the other person.
And that leads us to our last point on strategy: is everybody using the same definitions? There are technology-based definitions, and a lot of them are legal terms of art. The point about getting both the businessperson and the technical person involved in the conversation, people who have some ownership or stake in a system, is really well made, because you’re speaking at least two different languages. I think with privacy coming on, maybe it’s three different languages now, because privacy doesn’t always sit in legal, [inaudible] compliance, and it’s being infiltrated, or I guess, exfiltrated throughout the organization. So, everybody is using similar terms similarly, but also differently, and how does that affect the documentation you’re trying to create, the data map, if you’re not all on the same page with what things mean?
Let’s move forward here to the concept that the goal is to develop a workflow process, and that workflow is driven by the goals of this team and the ultimate strategy of the business. Being able to demonstrate that you’re connecting it both to the project goals and to the business’s overall goals is helpful as you describe what you’re doing and why you’re spending that time, especially in large organizations where people are wearing many different hats and have lots of work on their plate.
So, let’s talk about what a workflow might look like. Mike, I don’t know if you have any views on this that you would like to share.
Again, all of these workflows, as we’ve mentioned all along, should be ongoing. For a data cleaning and migration workflow, as we’re building the living index, we want to identify things that are duplicative, and then we want to say, “OK, what are we going to do with duplicative data? Which one do we want to keep? Who takes priority? Do we want to keep them all?” As we go through and scan the systems for this data, we get reports on duplicates, and we get reports on documents that are outside of retention policy and documents that are outside of our IT policy. You’ve got MP3s on a personal share, all documents that need to have something done to them. So, you go in and you’ve got to have… at some point we have rules: either we don’t care if it’s over our retention policy and it’s getting deleted, or if it’s this, we do that. And then there are other documents we’re going to have to have people look at.
What’s our next step with that? Do we want to take these and say, OK, we want to keep these 10 documents, but we want to migrate them to a new location? We’re going to validate what we’ve done, and then we’re going to rerun the process and keep scanning, so as these things happen, it keeps going over and over again. And while this is going on, our map is updated. When we remove a document, the map’s got to be updated. When we migrate documents from one area of the system to another, we’ve got to update the map. We’ve got to validate that everything is right, and then we’ve got to rerun the process.
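As a rough illustration of the scan-and-report step described in this workflow, flagging duplicates and files outside a retention window, here is a minimal sketch. It is not any particular vendor’s tool, just a standard-library pass over a file share that groups files by content hash and notes anything older than an assumed retention period (the seven-year figure and the directory layout are hypothetical):

```python
import hashlib
import os
import time
from collections import defaultdict

# Hypothetical retention policy: anything untouched for 7 years gets flagged.
RETENTION_SECONDS = 7 * 365 * 24 * 3600

def scan(root):
    """Walk a share, grouping files by content hash and flagging stale ones."""
    by_hash = defaultdict(list)
    stale = []
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash[digest].append(path)  # identical content -> same digest
            if now - os.path.getmtime(path) > RETENTION_SECONDS:
                stale.append(path)
    # Only hashes seen more than once are duplicates.
    duplicates = {h: paths for h, paths in by_hash.items() if len(paths) > 1}
    return duplicates, stale
```

The output of a pass like this is exactly the kind of report the speakers describe: a duplicate list to which your keep/delete rules apply, and a stale list for retention review, after which the map is updated and the scan is rerun.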
Yes, this is no joke. This is a pretty heavy lift. What do you say, Mike, are things getting a little easier with the new technologies and with better understanding of the goals of this type of workflow, or is the complexity increasing because of the increasing volume of data and the increasing risk of what’s in our systems?
Well, it’s getting easier. Now, we have tools to help us identify data that is at risk. But the problem is we have more and more places where data is being stored. We’ve got our old on-prem systems, and most corporations are now moving to at least some form of cloud, whether it’s M365, Box, or Salesforce. We’ve got cloud tools that are now storing our customers’ data that we are, ultimately, responsible for.
Salesforce has some responsibility for that, but ultimately, it’s our customer data, and as systems change and we have more and more data points out there, this just keeps growing. So, the potential for risk keeps growing, and that’s why your map in this process has to keep growing: you have to keep adding new endpoints to it, understanding what that data is, whether it should be there, and where you need to move it if it shouldn’t be. But the tools are making it an easier process.
And, I think, the vendor partners, like HaystackID, are upping their game in terms of being able to take some of this work off of an organization’s plate, so that they don’t have to buy the tools, implement them, and develop the know-how. The partners, like HaystackID, doing this work for many companies are in a great position to do some natural benchmarking, so an organization can know where it stands vis-à-vis its peers. This was a really important part of the work in-house: simply the benchmarking, making sure that you were keeping up with the industry in where you’re allocating your resources and your time, doing only what you can do in-house versus finding the specialists that you can delegate things to, while still managing budgets and cost. Certain things are more expensive to do in-house, and others are less expensive to do in-house.
Where do you see the future, and I’m going to put you on the spot, Justine, where do you see the future of this type of work living? Because sometimes it starts off being work done by a supplier partner, and sometimes it starts off in-house and it migrates out. Is this a true partnership between in-house and your service provider? So, I don’t know if you can start with where things are today, in your view, just overall in the industry, and where do you think it will head. Will it go inside-out or outside-in?
Sure, so I would say, in my experience, most of it is currently in-house. I think a lot of it is – even the fact that we’re having this conversation, people are still trying to wrap their heads around the best way to perform this data mapping and operationalizing it.
So, I see it as more on the in-house side currently, and I think as – and in my experience with a lot of things – as the process and workflows become more operationalized, then the pendulum starts to swing the other way. As in-house, as we’re getting better at doing this, then usually the vendors and tools and things like that, that you can take advantage of, are also getting better at doing this. And then it becomes a mix of do I want to completely have this go out, and it’s just something that I manage from the inside? Or is it a partnership where it’s a little bit in and a little bit out? So, I do think we’re more on the inside now, but I would venture to guess this will start to swing the other way as things mature.
I don’t know, Mike, if you have any thoughts on that.
No, I agree, and we’re starting to see some of that. With vendors like ourselves, we’ve been dealing with data for a long time, so we know how to operationalize it, understand it, and run it through. And as I said earlier, if we can help in the beginning, setting it up, getting you your first index, and teaching you how to keep it living and breathing, that’s great, and you can learn what you’re going to do with it from there. But I do think it’s going to, like everything else out there, become a service, and you can buy pieces or portions of it, manage some of it yourself, or none of it.
All right, here we are, we’re sliding to home plate, Documenting and Reporting. Let’s assume we’ve gone through our data cleaning, remediation, migration, and now what? What’s the final phase of the development or maintenance of a data map? Justine, do you want to get started?
Yes, so thinking back on our conversation today, again, it doesn’t necessarily have to be the prettiest data map you’ve ever seen. I think the important part is having the documentation there and understanding that you need to be documenting from day one; as tedious as that may seem, you’ll be glad that you have it.
So, the other piece of this, that I would say, that’s important is that it’s never complete, it’s never going to be done. You might have a system completed for now, but I think as we’ve been discussing, it’s going to have to be updated and maintained moving forward. So, I think having that mindset of continuous improvement and knowing that it’s a circular process is certainly helpful.
But, I guess, at the end, long story short, you should have some kind of map, whether it’s a spreadsheet or something in a tool that you can visualize. A lot of these tools have reports where you can just click a button and out comes a map, a literal map of where your datasets are located. You just have to have something that you can defend, that you could provide to somebody and feel comfortable defending as a picture of where your data is and how it’s flowing.
Mike, I think you had something to add to that.
I agree, and really, the documentation, we want to make this a repeatable process, because it has to be repeated. And everything changes as you run through it. As regulations change, what reporting you need to put out is going to change. Your map is going to change as new systems get added and old systems get removed. And what sits on those systems may change as those systems are upgraded or changed or where they store the data may be different.
Once the process is documented, you repeat it over and over and over again. And I know we’ve said that a lot, but it’s really important that this thing is up to date, so when something happens, it’s an easy, quick solution to it, and the information is readily available and you know what your exposure is when it happens.
So, as we’ve discussed throughout the call, and I think that we can end on a good note there, it’s really about risk identification and risk mitigation: not getting sidetracked, given the enormity of this work, by details or information that isn’t important to managing the risk, and then developing your own organization-specific approach and goals. And not being too detailed, as I mentioned, but also not being too generic and having to go back to the well over and over. You don’t want to keep pinging the same people with similar questions and take up too much of their time and grace. So, there’s definitely a balancing act.
I do notice, if I can jump to a question, and this will be our last discussion: “What is the preferred data structure for a data map if we want to run analytics on it?” That is an excellent question. Mike, do you have any view on that?
Well, I’ll give you the standard answer: it depends. But again, what analytics tool are you going to run? If you can get data about your inventory out to any kind of structured format, so a CSV, an Excel file, whatever, there are ways to pull that into tools like Elastic with a Kibana dashboard on it to run some advanced analytics. Or there’s Qlik Sense, which is a BI tool, or you can pull the data into some of Microsoft’s business intelligence tools to run analytics on it.
So, I think, basically, if you can get a structured format where you understand what fields you’re storing and what data they contain, you can apply any one of those tools to it.
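As a small example of that point, once the inventory is in a structured format with known fields, almost any tool can consume it and produce the aggregates a dashboard would chart. Here is a minimal sketch using only the standard library; the field names and the embedded CSV rows are hypothetical stand-ins for a real inventory export:

```python
import csv
import io
from collections import Counter

# A hypothetical data-map export in CSV form; in practice this would come
# from your inventory tool or spreadsheet.
EXPORT = """system,location,contains_pii
HR-App,on-prem,yes
CRM,cloud,yes
Wiki,cloud,no
"""

def pii_by_location(csv_text):
    """Count PII-bearing systems per storage location, the kind of
    aggregate a Kibana or BI dashboard would chart."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["location"] for row in reader
                     if row["contains_pii"] == "yes")
    return dict(counts)

print(pii_by_location(EXPORT))  # {'on-prem': 1, 'cloud': 1}
```

The same CSV could just as easily be indexed into Elastic or loaded into a BI tool; the key, as noted above, is simply knowing what fields you’re storing and what data they contain.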
Wonderful. Any closing comments, key takeaways for our audience?
Yes, if I’ve got one, it’s find a place and start. The hardest part is always getting started. Once you’ve started, it will all start to flow, because, to use an old saying, you don’t know what you don’t know. And there’s so much intelligence out there just within the data, where we store it, and who has access to it, that the quicker you get to a map, the more quickly you’ll be able to use it.
I would add that benchmarking can be a big help here, understanding that organizations may have different risk profiles and being mindful of that. But the ones who do have a higher risk profile also usually have some experience under their belt in performing these activities, and they’ve tried a lot of the tools because they have bigger budgets and bigger risks.
So, don’t be afraid to connect with your peers or with benchmarking organizations who have some energy around this type of discussion.
Yes, I would add, I think both of those are great, and along with that, start having the conversations internally. The power of having a group of departments or a group of colleagues with a similar goal is great. I think you would be in a much better stance to make improvements and mature that program, as well as to secure additional resources or budgeting, or what have you.
But have the conversations. I think people may assume that these spaces are different, and because they need different data pieces, that they’ll just go off on their own. But really, I think everyone is much better served trying to work together.
Teamwork prevails. That’s a great note to end on. Be mindful that we’re here for you; we would love to help you. If you want to know more about our capabilities, reach out to us. If you want to talk some more about this, obviously, we’re energized about it, we’re passionate about it, and there is a lot of white space around it, so it’s an area to continue to converse and innovate in, and to see the fruits of the labor in terms of every organization’s risk management.
Excellent, Jenny. This is Rob Robinson, and I’d like to thank you and the entire team for the excellent information and insight today. We also want to thank each and every one of you who had an opportunity to attend. We know how valuable your time is, and we appreciate you sharing it with us today.
As Jenny mentioned earlier in the presentation, today’s content is being recorded for future viewing, and a copy of the presentation materials will be available for all attendees or any visitors to either BrightTALK or the HaystackID website. And the on-demand version will be available immediately after the webcast.
We also hope that each and every one of you has an opportunity to attend our next monthly webcast. That’s scheduled for July 13th at 12 p.m., and it will be on the topic of Data Breach Discovery and Review As Part of Defensible Incident Responses. So, we hope you can attend. You can register on the HaystackID website.
And again, thank you for attending today’s webcast, and this formally concludes today’s webcast. Have a great day.
CLICK HERE TO DOWNLOAD THE PRESENTATION SLIDES
CLICK HERE FOR THE ON-DEMAND PRESENTATION
*Director of EU Discovery Ops at HaystackID
**Information Technology Analyst for John Deere
***Deputy GC for Global Discovery and Privacy at HaystackID
****Director of Global Client Advisory at HaystackID