Messy Metadata: More Challenges With Collecting Data From Google Workspace

by Dave Ruel | Hanzo


Knowledge workers may never go back to the office as we once knew it. Now that companies and their employees have learned how well working from home can work, both for maintaining productivity and for workers' quality of life, remote work is unquestionably here to stay.

For many offices, Google Workspace is one of the tools that enabled the transition to a fully remote workforce. Google provides outstanding version control, tremendous data storage capacity, and effortless collaboration on shared documents.

Of course, the data that businesses generate on Google Workspace is potentially discoverable. For this reason, Google created a tool—Google Vault—that ostensibly helps organizations preserve and collect files relevant to litigation. But identifying, preserving, and collecting Google Workspace files using Google Vault presents several challenges.

Google Vault Limitations

I've written in more depth about some of these issues before. To recap, Google Vault can lead organizations to over-collect data because of the way it organizes and presents file information. Google Vault currently doesn't allow any visualization of a user's Google Drive or Shared Drive structure, so there's no easy way to navigate to specific files. Users also can't select individual files or folders to export, so they must export a custodian's entire Drive even if they only need a portion of it. There's also no straightforward way to view specific versions of documents. Even though Google Workspace keeps track of every version a user creates, a user can only access those versions on a document-by-document basis, which can be tedious and time-consuming.

Perhaps most importantly, Google Vault's exports aren't review-platform-ready. The primary problem is that Google Vault uses XML as the load file format, which a review platform can't understand without further processing. File names are also appended with a Google DocID, making it hard for users to figure out what the original file name was.
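To make the file-name problem concrete, here is a minimal Python sketch of recovering an original name from an exported one. It assumes the exported name has the form original stem, a separator, the DocID, then the extension; that layout, the underscore separator, and the function name strip_doc_id are illustrative assumptions, not a documented Google Vault guarantee. The DocID is passed in rather than guessed, because DocIDs can themselves contain underscores and hyphens.

```python
from pathlib import PurePath


def strip_doc_id(exported_name: str, doc_id: str, sep: str = "_") -> str:
    """Return the file name with a trailing '<sep><doc_id>' removed from its stem.

    Assumption: the export names files '<original stem><sep><DocID><ext>'.
    If the stem does not end with that suffix, the name is returned unchanged.
    """
    path = PurePath(exported_name)
    stem, ext = path.stem, path.suffix
    suffix = sep + doc_id
    if doc_id and stem.endswith(suffix):
        stem = stem[: -len(suffix)]
    return stem + ext


if __name__ == "__main__":
    # Hypothetical exported name and DocID, for illustration only.
    print(strip_doc_id("Q3 budget_1AbC-dEf_9.xlsx", "1AbC-dEf_9"))
```

Even under these assumptions, the cleanup only works when the DocID for each file is already known, which means consulting the separately exported metadata first.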

But let's take a closer look at a problem I haven't talked about much: metadata.

How Google Vault Manages Metadata

Metadata is crucial for ediscovery, both for managing data (identifying the correct version of files and rapidly searching for relevant information) and for production integrity. For example, if a litigation opponent sees that you've altered metadata while collecting and producing documents, they're likely to have some serious questions about what else you may have altered.

Unfortunately, Google Vault falls short of what ediscovery professionals need. Its handling of metadata has three shortcomings.

First, as mentioned, Google Vault separates metadata from its underlying files, exporting the metadata via XML files and labeling the loose documents themselves with both the file name and the internal Google DocID reference number. The user must then match each document back to its metadata record before a review platform can understand the export, dramatically increasing the time and effort needed to prepare data for review.
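The reassembly step can be pictured with a short Python sketch. The XML layout below (Document elements carrying a DocID attribute, with Tag children holding TagName and TagValue) is a simplified, hypothetical stand-in and should not be read as the actual Google Vault schema; the function names and sample values are likewise made up for illustration. The sketch reads the metadata into a dictionary keyed by DocID, then pairs each loose file with the record whose DocID appears in its name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata file; a real export will differ.
SAMPLE_XML = """\
<Documents>
  <Document DocID="1AbC-dEf_9">
    <Tag TagName="Title" TagValue="Q3 budget"/>
    <Tag TagName="Author" TagValue="alice@example.com"/>
  </Document>
  <Document DocID="2XyZ">
    <Tag TagName="Title" TagValue="Roadmap"/>
  </Document>
</Documents>
"""


def load_metadata(xml_text):
    """Map each DocID to a dict of its tag names and values."""
    root = ET.fromstring(xml_text)
    records = {}
    for doc in root.iter("Document"):
        tags = {t.get("TagName"): t.get("TagValue") for t in doc.iter("Tag")}
        records[doc.get("DocID")] = tags
    return records


def join_files(records, filenames):
    """Pair each file with the metadata record whose DocID its name contains.

    Returns (rows, unmatched). The longest matching DocID wins, so a short
    DocID that happens to be a substring of a longer one is not picked.
    """
    rows, unmatched = [], []
    for name in filenames:
        doc_id = max((d for d in records if d in name), key=len, default=None)
        if doc_id is None:
            unmatched.append(name)
        else:
            rows.append({"file": name, "DocID": doc_id, **records[doc_id]})
    return rows, unmatched


if __name__ == "__main__":
    records = load_metadata(SAMPLE_XML)
    rows, unmatched = join_files(records, ["Q3 budget_1AbC-dEf_9.xlsx", "orphan.pdf"])
    for row in rows:
        print(row)
    print("No metadata found for:", unmatched)
```

Tracking unmatched files separately matters here: a document that silently loses its metadata record is exactly the kind of gap that invites questions during production.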

Second, Google Vault omits critical metadata upon export. It fully excises some types of metadata, including:

the full file path description,
file version information,
parent folder information,
indications that a document has been deleted or moved, and
information about file sharing and access permissions.

But that's not all. Google Vault also overwrites the original metadata about a document's creation date, assigning the date of export instead. Since metadata, particularly metadata about dates, is a critical search component for discovery, losing that information can be problematic.

Third, omitted date metadata makes it hard to identify the correct version of a document. While Google Drive maintains every version of a document that the user creates, finding those versions and using them in ediscovery is a different matter. Without original metadata about the creation date for a file, it's virtually impossible to know that you're getting the correct version of a document, edited by the right person on the correct date.

What you need is a way to export information out of Google Workspace—in a review-ready format—without losing or altering any metadata.
