Well-managed data analytics and other eDiscovery tools can help to bring order to a huge dataset, reduce data volumes for review and get you to the relevant documents more quickly.
Data is being created at exponential rates, yet investigations lawyers are not granted the luxury of additional time or resources to review it. Data volumes are predicted to grow from 45 zettabytes in 2019 to 175 zettabytes by 2025. It is estimated that, in 2019, 188 million emails were sent every minute of every day.
Let’s take a look at the most useful tools.
Sensible grouping of data: clustering
Where do you even start when trying to make sense of a massive number of unknown documents? A clustering tool groups documents together based on their content, so that documents relating to the same topics sit together. You don’t have to tell the tool what the topics are; it works them out for itself by “looking” at the data. Clustering allows you to gain a high-level understanding of the themes and concepts prevalent in your dataset quickly. Clusters of documents with relevant themes can be prioritised for review, and the clusters may also reveal unexpected themes which warrant further investigation.
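To make the idea concrete, here is a minimal sketch of content-based grouping. It is an illustration only, not how any particular review platform works: the greedy single-pass approach, the `threshold` value and the sample documents are all assumptions chosen to keep the example short. Commercial tools use far more sophisticated models.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose seed
    document it resembles closely enough, otherwise it starts a new one.
    No topics are supplied in advance."""
    clusters = []  # list of (seed_vector, [document indices])
    for i, doc in enumerate(docs):
        vec = vectorise(doc)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


# Hypothetical sample documents
docs = [
    "Invoice for the shipment of steel pipes",
    "Lunch on Friday at the usual place",
    "Revised invoice for steel pipes shipment attached",
    "Lunch on Friday works for me",
]
print(cluster(docs))  # [[0, 2], [1, 3]]
```

The two invoice documents and the two lunch documents end up in separate groups, even though nobody told the code that “invoices” or “lunch” were topics.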
Who was talking to whom? – communications analysis
Communications analysis is a visual, interactive tool which allows you to identify the key communicators (ie people) within your dataset and who they were communicating with most frequently. Each “entity” (such as a person or distribution group) is represented by a node, with lines between the nodes to indicate lines of communication. The size of the nodes and the thickness of the lines reflect the volume of communications.
This information can also help to reveal additional custodians who are communicating frequently with your key custodians and who may have been overlooked during the collection phase.
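The data behind such a picture is simply a count of messages per entity and per pair of entities. The sketch below shows one way this could be tallied; the `messages` list, the function names and the `collected` set of custodians are hypothetical, and a real tool would read senders and recipients from email metadata rather than a hand-written list.

```python
from collections import Counter

# Hypothetical (sender, recipients) pairs extracted from email metadata
messages = [
    ("alice", ["bob"]),
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("dave", ["alice"]),
    ("carol", ["dave"]),
]


def build_graph(messages):
    """Count communications per entity (node size) and per pair (line thickness)."""
    nodes = Counter()
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            edges[frozenset((sender, recipient))] += 1
            nodes[sender] += 1
            nodes[recipient] += 1
    return nodes, edges


def uncollected_contacts(edges, custodian, collected):
    """List people who communicated with a key custodian but whose data was not collected."""
    contacts = Counter()
    for pair, count in edges.items():
        if custodian in pair and len(pair) == 2:
            (other,) = pair - {custodian}
            if other not in collected:
                contacts[other] = count
    return contacts.most_common()


nodes, edges = build_graph(messages)
print(nodes.most_common(1))  # [('alice', 5)]
print(uncollected_contacts(edges, "alice", collected={"alice", "bob"}))
```

Here alice is the largest node, and carol and dave surface as contacts of alice who were not part of the assumed collection.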
Getting rid of duplicate emails: threading
A common problem in review is the prevalence of duplicates even after “deduplication” was applied when the data was processed into a review database. There are many reasons for this, including metadata variances caused by server clocks and different email aliases appearing on otherwise identical emails.
Email threading technology analyses email metadata and text in order to group together emails that are part of a single thread or conversation. As part of the analysis, it identifies any remaining duplicates containing the same content as another message, using a “fuzzier” tolerance that lets it catch additional duplicates which were not weeded out during processing.
In addition, email threading will identify which emails in each chain contain unique content not present in other emails in the chain (for instance, an email which is at the end of an email chain). Reviewing only these flagged emails allows you to cover all unique content within the email chain, yet saves time and cost by culling out emails and attachments with repetitive content.
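The “unique content” test can be illustrated with a short sketch. This is a simplification under stated assumptions: it compares normalised body text only, treats `>` as the quote marker, and uses made-up email ids and bodies. Real threading tools also rely on headers, metadata and attachments.

```python
import re


def normalise(text):
    """Lower-case, strip leading quote markers and collapse whitespace
    so that quoted and original text can be compared."""
    text = re.sub(r"^[>\s]+", "", text, flags=re.MULTILINE)
    return " ".join(text.lower().split())


def emails_with_unique_content(thread):
    """Return the ids of emails whose content is not wholly contained in
    another email in the thread. Exact duplicates keep the lowest id only."""
    bodies = {eid: normalise(body) for eid, body in thread.items()}
    keep = []
    for eid, body in bodies.items():
        contained = any(
            other != eid
            and body in other_body
            and (body != other_body or other < eid)
            for other, other_body in bodies.items()
        )
        if not contained:
            keep.append(eid)
    return keep


# Hypothetical thread: e4 is a side branch that adds a new recipient
thread = {
    "e1": "Can you send the Q3 numbers?",
    "e2": "Attached.\n> Can you send the Q3 numbers?",
    "e3": "Thanks, got them.\n> Attached.\n> > Can you send the Q3 numbers?",
    "e4": "Adding Dana for visibility.\n> Can you send the Q3 numbers?",
}
print(emails_with_unique_content(thread))  # ['e3', 'e4']
```

Reviewing e3 and e4 covers everything said in the four emails, so e1 and e2 can be set aside.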
Culling irrelevant documents from review: technology assisted review
Technology assisted review exists in various forms, but the most commonly used today is known as “Continuous Active Learning” or CAL. CAL learns from the ongoing coding decisions applied by human reviewers to continuously rank and reprioritise the most likely relevant documents for review, allowing the reviewer to get to the more important documents sooner. As the review progresses, the relevance rate of the documents being queued for human review generally decreases until it is consistently low enough to draw a cut-off point. Documents ranked below that point can be culled from review after validation sampling.
For projects where documents do not need to be reviewed before production, the technology can be used to quickly separate documents into likely relevant and irrelevant categories without review of all of the likely relevant documents. This is especially useful for quick production use cases with tight turnaround times.
By using CAL, a review team can save time and cost, as it may only need to manually review a fraction of the document set. Unlike some previous iterations of technology assisted review, CAL is not reserved for large document reviews, as little upfront training is required to start using it.
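The review loop described in this section can be sketched in a few lines. Everything here is an assumption made for illustration: the word-weight scoring model, the batch size, the rule of stopping after one batch with no relevant documents, and the sample documents and `is_relevant` reviewer function. Real CAL tools use proper machine-learning classifiers, and any cut-off should be confirmed by validation sampling.

```python
from collections import defaultdict


def tokens(text):
    return set(text.lower().split())


def cal_review(docs, reviewer, batch_size=2):
    """Rank, review a batch, learn from the coding decisions, re-rank, repeat."""
    weights = defaultdict(float)  # word -> learned relevance weight
    unreviewed = list(range(len(docs)))
    reviewed, relevant = [], []

    def score(i):
        return sum(weights[t] for t in tokens(docs[i]))

    while unreviewed:
        # Re-rank everything not yet reviewed using what has been learned so far
        unreviewed.sort(key=lambda i: (-score(i), i))
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        hits = 0
        for i in batch:
            decision = reviewer(i)  # the human coding decision
            reviewed.append(i)
            for t in tokens(docs[i]):
                weights[t] += 1.0 if decision else -1.0
            if decision:
                relevant.append(i)
                hits += 1
        if hits == 0:  # simplistic cut-off: a whole batch with nothing relevant
            break
    return reviewed, relevant, unreviewed


# Hypothetical documents; indices 0, 2 and 4 are the relevant ones
docs = [
    "bribe payment to official",
    "team lunch menu",
    "payment routed via offshore account",
    "holiday party menu",
    "offshore account for official",
    "team holiday schedule",
    "lunch schedule update",
    "parking permit renewal",
]


def is_relevant(i):
    return i in {0, 2, 4}


reviewed, relevant, culled = cal_review(docs, is_relevant)
print(reviewed, relevant, culled)
```

In this toy run all three relevant documents are found after six of the eight documents have been reviewed, and the last two are left below the cut-off.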
Enhanced keyword searching: keyword expansion
Keyword searching is commonly used to cull a dataset prior to review. It is useful, but imperfect due to the nature of language. Words may have multiple meanings (eg pupil), synonyms (eg student, pupil) or regional variations (eg football vs soccer), or they may contain typographical errors.
“Keyword expansion” assists by finding not just your keywords, but also closely correlated terms or synonyms that exist in the dataset. The tool does not use external dictionaries but instead uses an “analytics index” built solely from the text of the dataset at hand. Keyword expansion can find matter-specific jargon or related code-words that have been intentionally used to obfuscate the true meaning of communications, which can be useful in an investigation.
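As a rough illustration of how related terms can be found from the dataset alone, the sketch below scores each term by how often it appears in the same documents as the keyword. The Jaccard overlap measure, the `min_score` threshold, the tiny stopword list and the sample documents are all assumptions; a commercial analytics index is built quite differently and far more robustly.

```python
from collections import defaultdict

STOPWORDS = {"a", "for", "is", "of", "the", "to"}  # tiny illustrative list


def build_index(docs):
    """Map each term to the set of documents it appears in.
    Built only from the dataset itself, with no external dictionary."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return index


def expand(keyword, index, min_score=0.5):
    """Return terms that tend to appear in the same documents as the keyword."""
    base = index.get(keyword, set())
    scored = []
    for term, doc_ids in index.items():
        if term == keyword:
            continue
        union = base | doc_ids
        score = len(base & doc_ids) / len(union) if union else 0.0
        if score >= min_score:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical dataset in which "package" may be a code-word
docs = [
    "transfer the package to the consultant tomorrow",
    "the consultant confirmed the package arrived",
    "package for the consultant is ready",
    "quarterly budget meeting tomorrow",
    "budget approved for the quarterly report",
]
index = build_index(docs)
print(expand("package", index))  # [('consultant', 1.0)]
```

Searching for “package” surfaces “consultant” as a closely correlated term, which a reviewer could then add to the search.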
Find more documents on a particular concept: concept searching
In much the same way that keyword expansion brings back related terms, concept searching brings back documents conceptually related to a chosen piece of text. The text can originate from a legal document such as a briefing memo or pleading, from known hot documents or a tip-off email, or even from a mock paragraph of text based on what the reviewer would envisage the “smoking gun” document to look like.
Because the results depend on the conceptual meaning of the search text rather than on the presence or absence of particular words, the tool can help to locate additional related documents even where they do not use the words searched.
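One simple way to see how a search can match on meaning rather than exact words is to describe each word by the company it keeps. In the sketch below, each term is represented by the terms it co-occurs with in the dataset, and a piece of text is represented by the sum of its terms' representations. This co-occurrence approach, the stopword list and the sample documents are assumptions for illustration, a crude stand-in for the conceptual indexes used by real tools.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"a", "on", "the", "to"}  # tiny illustrative list


def terms(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_term_vectors(docs):
    """Describe each term by the terms it shares documents with (itself included)."""
    vectors = defaultdict(Counter)
    for text in docs:
        unique = set(terms(text))
        for t in unique:
            vectors[t].update(unique)
    return vectors


def concept_vector(text, vectors):
    """Represent a piece of text as the sum of its terms' co-occurrence vectors."""
    total = Counter()
    for t in terms(text):
        if t in vectors:
            total.update(vectors[t])
    return total


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def concept_search(query, docs, vectors):
    """Return indices of conceptually related documents, best match first."""
    q = concept_vector(query, vectors)
    scored = [(cosine(q, concept_vector(d, vectors)), i) for i, d in enumerate(docs)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


# Hypothetical dataset
docs = [
    "wire the funds to the offshore account",
    "move the money to the offshore account",
    "football match on saturday",
    "saturday football tickets booked",
]
vectors = build_term_vectors(docs)
print(concept_search("wire funds", docs, vectors))  # [0, 1]
```

The second document is returned even though it contains neither “wire” nor “funds”, because it talks about the same offshore account.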
All of these tools, in the right hands, can massively cut down on the time and cost of document review. Most importantly, they reduce the chance of overlooking an important fact that could affect the outcome of an investigation.
So far in this blog series we have looked at collecting data efficiently and defensibly, and how to deal with email, chat and laptops. Our next ‘eDiscovery in investigations’ blog post will look at some top eDiscovery tips for production of documents to investigating authorities.