Unstructured Data: The Black Hole of Ediscovery



Big Data, Structured Data, Unstructured Data – these terms are becoming the buzzwords of ediscovery, but what do they mean?

Structured data refers to information residing inside complex applications, such as transactional and financial databases. It is data that you access in a variety of ways, depending on how the application presents it. For example, you might have several similar yet distinct finance reports that hold the same structured data but simply present it in different visual formats. Ultimately, structured data exists as segments of information inside a larger system, one that is often quite complex and contains many parts. While this type of data does continue to grow, and its format can make it challenging to handle as ESI (electronically stored information), it isn’t causing quite the same volume problems as we are seeing with “unstructured data”.

“Unstructured” or “loose” data might not be what you call it, but it’s what you are generally working with as ESI. These terms refer to all of the standalone, common files that make up the work done every day in corporations around the world. All of those e-mail messages, word processing documents, spreadsheets, and presentations, among other things, that are commonly sought as potentially relevant ESI in discovery are considered unstructured data.

And that unstructured data is the harbinger of Big Data and the root cause of a roughly 46% jump in enterprise storage volume from 2010 to 2012 (from 2,175 terabytes to 3,183 terabytes), as profiled in a recent infographic on ediscovery.com. But the scariest thing about unstructured data is that it’s a silent killer: most organizations don’t even know a problem exists until litigation is underway and (not surprisingly) something goes missing. Yikes.

While “Big Data” and the growing mass of “unstructured data” can make traditional manual ESI review completely cost-prohibitive, something often can be done. Predictive coding, for example, can provide a much-needed backbone for unstructured data by detecting linguistic patterns in documents and ranking them according to predicted relevancy. Moreover, depending on the capabilities of a provider’s technology, a vendor may be able to host these unstructured documents in a cheaper “nearline” storage location, in case serial litigation summons them again. Thus, once a document has been tethered to a custodian or date range in one project, you can leverage this information in future matters.
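
For readers curious what "detecting linguistic patterns and ranking by predicted relevancy" can look like in practice, here is a toy sketch in Python. It is not how any particular vendor's predictive coding works. It assumes a simple word-frequency model (naive Bayes style log-odds weights) trained on a handful of documents that reviewers have already coded, and every function name and sample document below is hypothetical, made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Learn a per-word weight from (text, is_relevant) pairs coded by reviewers.

    A positive weight means the word appears more often in relevant
    documents; a negative weight means it leans non-relevant.
    """
    counts = {True: Counter(), False: Counter()}
    for text, is_relevant in labeled_docs:
        counts[is_relevant].update(tokenize(text))

    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) for label, c in counts.items()}

    weights = {}
    for word in vocab:
        # Add-one smoothing keeps unseen words from producing log(0).
        p_rel = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_non = (counts[False][word] + 1) / (totals[False] + len(vocab))
        weights[word] = math.log(p_rel / p_non)
    return weights


def score(text, weights):
    """Sum the learned weights of the words in a document (unknown words count 0)."""
    return sum(weights.get(token, 0.0) for token in tokenize(text))


def rank(docs, weights):
    """Order unreviewed documents from most to least likely relevant."""
    return sorted(docs, key=lambda doc: score(doc, weights), reverse=True)


# Hypothetical seed set: a few documents already coded by human reviewers.
seed_set = [
    ("pricing agreement with the distributor for the merger", True),
    ("merger terms and distributor pricing schedule attached", True),
    ("lunch menu for the office party on friday", False),
    ("parking garage closed friday for cleaning", False),
]

# Hypothetical unreviewed documents to be prioritized for review.
unreviewed = [
    "reminder: office party lunch order due friday",
    "draft distributor pricing terms for the merger",
]

weights = train(seed_set)
ranked = rank(unreviewed, weights)
for doc in ranked:
    print(round(score(doc, weights), 2), doc)
```

Real predictive coding tools use far richer models and much larger training sets, but the workflow is similar in spirit: reviewers code a sample, the software learns which language signals relevance, and the remaining documents are ranked so that review effort goes to the likeliest candidates first.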