Recently, ACEDS hosted a webinar entitled “Point|Counterpoint: A Proposed TAR Framework,” during which a stellar panel of lawyers debated the issue: Redgrave’s Christine Payne and Kirkland & Ellis’ Michele Six represented the defense bar, and Suzanne Clark and Chad Roberts from eDiscovery CoCounsel represented the plaintiffs’ bar. Retired US Magistrate Judge James C. Francis, IV, now with JAMS, agreed to moderate the discussion. The presentation raised the question of whether using a scorecard or similar results-oriented framework could provide a solution for practitioners interested in assessing the success of a TAR project.
Because we had so many attendees, there were too many questions to answer during the webcast. The panelists graciously agreed to add their voices to this blog post to answer the questions we did not get to during the webinar. Note that, where appropriate, ACEDS combined similar or related questions so the panelists could respond efficiently.
What is the difference between data analytics and TAR?
PAYNE & SIX: The folks who can really answer this question are the vendors who control the marketing of all this. But to me, what sets TAR apart from general data analytics is the review piece – anything that makes a predictive guess at the way in which a document should be categorized for review (responsive/not responsive, priv/not priv, confidential/not confidential) counts as TAR.
ROBERTS & CLARK: “Analytics” is a non-specific term that has evolved in vendors’ marketing vocabulary to describe a variety of products and functionality. I think of it as a very broad bucket of various functions and features with special purposes, for example: conceptual clustering, similar document detection, development of keyword expansion through semantic similarity, data visualization, PII identification and redaction, etc. Generally, these are features that help you understand the data you have or that apply a process to your data set. TAR, as Ms. Payne points out, is specific to the machine learning process, where humans interact with the machine and the subjective judgments they make are then replicated to a larger set of documents.
Are you recommending TAR pre or post search terms?
PAYNE & SIX: A hot debate! Some courts say no search terms before TAR. A well-intended but misguided law review article says no search terms before TAR. Most data scientists, however, will say that you can absolutely use search terms before TAR. I think that’s the right answer, as long as you are designing everything as a system. In other words, it doesn’t make sense to design search terms for attorney review and then change your mind and uncritically layer TAR on top of that. The beauty of the report card model, however, is that it doesn’t matter. I repeat—it doesn’t matter. Do search terms, TAR, search terms again and then put a cherry on top. Just know that at the end of the process, there will be an objective report card waiting for you, and so you will need to demonstrate that your process—whatever it was—actually worked.
ROBERTS & CLARK: It is a hot debate! If you are approaching the task as an information retrieval scientist, you would never want to use search terms to cull a data set before a TAR process because it can be shown that the overall (“end-to-end”) recall rate will be degraded, sometimes very significantly degraded. So, don’t use search terms for the purpose of trying to make the outcome “better” by increasing the “richness” (technically speaking, the prevalence) of the data set thinking you will improve the quality of the production. In the work-a-day world outside of information retrieval contests, pre-culling the data set with search terms is simply done to reduce storage costs that are incurred hosting the data set in the TAR review platform to begin with. There is a good explanation of this here: 7 F.C.L.R. 1 (2014) Grossman & Cormack, Comments on “The Implications of Rule 26(g) on the Use of Technology Assisted Review”
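To illustrate the arithmetic behind the “end-to-end” point, here is a minimal Python sketch. It assumes that TAR only ever sees the documents that survive the search-term cull, so the two recall rates multiply. The function name and the 80% figures are hypothetical and are not taken from the panelists or from any particular platform.

```python
def end_to_end_recall(cull_recall: float, tar_recall: float) -> float:
    """Recall measured against the original collection, not the culled set.

    Assumption: responsive documents dropped by the search-term cull are
    never seen by TAR, so they can never be recovered later.
    """
    for name, value in (("cull_recall", cull_recall), ("tar_recall", tar_recall)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return cull_recall * tar_recall


# Hypothetical numbers: the search terms keep 80% of the responsive
# documents, and TAR then finds 80% of what remains.
overall = end_to_end_recall(0.80, 0.80)
print(f"End-to-end recall: {overall:.0%}")  # prints "End-to-end recall: 64%"
```

Under these assumed numbers, a TAR process that reports 80% recall on the culled set has actually reached only about 64% of the responsive documents in the original collection.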
Before TAR, parties never needed to be transparent about how they reviewed documents. What requires more transparency now?
PAYNE & SIX: Case law. Proponents of TAR offered up unprecedented transparency and cooperation in the early days to expand adoption. It then became part of the case law.
ROBERTS & CLARK: Transparency in electronic discovery methodology pre-dates TAR, actually. Here’s a great case about it: Google v. Samsung 2013 WL 1942163 (N.D. Cal 2013). To put a fine point on it though, because the question is about “how they reviewed documents,” review protocols (instructions to document reviewers about coding criteria) are typically couched as work product, but an overall methodology of review (TAR v. linear, etc.) is not necessarily work product or privileged.
In some ways search terms are like a very simple TAR model. Recall and precision can apply to search terms as well, but I’ve never seen those metrics negotiated for terms. What do you think are the primary reasons receiving parties expect and want so much more from TAR?
PAYNE & SIX: Case law. See above.
ROBERTS & CLARK: I think it is because TAR workflows were primarily responsible for introducing the notion of quality metrics into electronic discovery production. This happened when the early adopters and their data scientists were defending the methodology during contested hearings, and the notion of quality metrics was used as reassuring support for the process. Quality metrics are now becoming firmly entrenched in search term validation as well. Here’s a wonderfully written opinion on just that subject: City of Rockford v. Mallinckrodt ARD Inc., No. 17 CV 50107, No. 18 CV 379 (N.D. Ill. Aug. 7, 2018)
Why do lawyers believe manual linear review produces a better production? Recall and precision should be applied to both linear review and TAR. Linear review gets a pass.
PAYNE & SIX: I would put one of my attorney reviews up against any TAR tool in the country. I look at the conditions of the studies that conclude TAR is at least as good as manual review (they don’t go further than that), and I think “wow that was a very poorly designed manual review.” But to answer your actual question, the big difference is case law. TAR requires transparency and cooperation, attorney review does not. The idea of the report card is that everyone would have to fill it out, regardless of method. So, it evens the playing field.
ROBERTS & CLARK: Unless you’re willing to send things out the door without looking at them (it’s not as uncommon as you might think), TAR simply generates a smaller, more productive “linear review.” Even TAR 2.0 (Continuous Active Learning) is a linear review of sorts, with the CAL workflow simply helping you decide when you’re finished and when you can defensibly stop looking at more documents. As a practical matter, the best workflows all typically have some measure of TAR, search terms, and human review as components. Just like Garry Kasparov v. Deep Blue, Dave the Astronaut v. HAL, or John Henry v. The Steam Drill, TAR v. Linear Review is just another in a long line of Man v. Machine contests.
What is the best way to describe the difference between precision and recall? Also, what is the recommended way to calculate them?
PAYNE & SIX: Ah … this is where I call my data scientist friends to make sure I’m not talking out of school. We provided a definition that was checked and rechecked in our article.
ROBERTS & CLARK: Assume a certain information retrieval strategy is used to challenge a large data set and retrieve responsive items. Precision is the percentage of items in the retrieved set that are responsive. It is a measure of how accurate the retrieval strategy is. Recall is the percentage of all the responsive items in the entire data set that were retrieved. It is a measure of how complete the retrieval strategy is. Generally, for any given retrieval strategy, precision and recall have an inverse relationship. Maximizing precision tends to diminish recall. Maximizing recall tends to diminish precision.
Imagine a search strategy in a business dispute that uses a single search term to identify responsive documents. You could choose a single term (like the name of the adverse party) that would have high precision (i.e., almost all of the documents retrieved were responsive documents). But that methodology would certainly leave very many responsive documents not retrieved (low recall). Requesting parties tend to be interested in recall (completeness). Producing parties tend to be interested in precision (cost reduction).
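As a rough illustration of how the two measures are calculated, the following Python sketch works through a single-term search like the one described above. All of the counts are hypothetical, and the function names are chosen here for illustration only.

```python
def precision(responsive_retrieved: int, total_retrieved: int) -> float:
    """Share of the retrieved documents that are actually responsive."""
    return responsive_retrieved / total_retrieved if total_retrieved else 0.0


def recall(responsive_retrieved: int, total_responsive: int) -> float:
    """Share of all responsive documents in the data set that were retrieved."""
    return responsive_retrieved / total_responsive if total_responsive else 0.0


# Hypothetical data set: 10,000 documents, 1,000 of them responsive.
# A single search term retrieves 200 documents, 180 of them responsive.
p = precision(responsive_retrieved=180, total_retrieved=200)
r = recall(responsive_retrieved=180, total_responsive=1_000)

print(f"Precision: {p:.0%}")  # prints "Precision: 90%"
print(f"Recall:    {r:.0%}")  # prints "Recall:    18%"
```

With these assumed numbers the search looks very accurate (90% precision) while leaving 820 of the 1,000 responsive documents behind (18% recall).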
Can the presenters explore the idea that almost all of these metrics are on a “curve” – precision falls (sometimes fast) as recall increases; recall plateaus (sometimes uncomfortably early) if review is for several concepts at once.
PAYNE & SIX: Someone smarter than me will have to say definitively, but I really don’t know if you can calculate recall effectively with multiple concepts. And yes, you can get 100% recall by just selecting the entire data universe—your precision will be terrible. Ideally, with an effective review, you’d have close to 100% recall (getting all the good stuff) and also 100% precision (keeping out all the junk). But that’s not realistic under any model, and each case is going to be different. Some parties may be willing to wade through more junk to ensure they are getting everything they need. Other parties may want a more precise set. It’s going to be a case-specific question.
ROBERTS & CLARK: There is definitely a sweet-spot in the trade-off between precision and recall. There is another metric known as “F1” that attempts to give an indication of this; it is the harmonic mean of precision and recall. Let your machine do the math, but the math is not as scary as it looks here, where it is explained in its Wikipedia page: https://en.wikipedia.org/wiki/F1_score
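The “curve” in the question and the F1 metric can be seen together in a small Python sketch. It assumes a ranked list of ten documents, ordered from the model’s highest score to its lowest, with a hypothetical responsiveness call for each. The list and the helper names are made up for illustration; a real project would work from sampled and coded documents.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def sweep_cutoffs(ranked_labels: list[bool]) -> list[tuple[int, float, float, float]]:
    """For each review depth, return (depth, precision, recall, F1).

    ranked_labels is ordered from highest to lowest predicted score;
    True means the document is actually responsive.
    """
    total_responsive = sum(ranked_labels)
    rows = []
    found = 0
    for depth, is_responsive in enumerate(ranked_labels, start=1):
        found += is_responsive
        p = found / depth
        r = found / total_responsive if total_responsive else 0.0
        rows.append((depth, p, r, f1_score(p, r)))
    return rows


# Hypothetical ranking: 5 of 10 documents are responsive.
ranking = [True, True, True, False, True, False, False, True, False, False]
for depth, p, r, f1 in sweep_cutoffs(ranking):
    print(f"top {depth:2d}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy ranking, precision starts at 1.00 and drifts down as the cutoff moves deeper, while recall climbs to 1.00. Taking every document gives full recall, but precision falls to the prevalence of the set (0.50). F1 peaks at a depth of five, where precision and recall are both 0.80.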
For recall/precision are you only concerned with the results of the machine learning algorithm or can you calculate these metrics to validate the efficacy of your overall workflow? Sometimes it makes sense to use more than one methodology.
PAYNE & SIX: Overall.
ROBERTS & CLARK: Yes, it should be overall or “end-to-end”, but oftentimes a producing party can never truly calculate overall recall if they use a culling methodology that actually discards information from the data set. It makes it impossible for them to sample from the entire data set.
Is the sample the elusion sample taken at the end, also known as the validation sample, after review is completed or proposed to be completed?
PAYNE & SIX: I assume most practitioners would be doing in-process testing, but that would not be for the report card. The report-card sampling would be final validation only.
ROBERTS & CLARK: The best outcomes have mid-course assessments (including elusion testing) established in the workflow.
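For readers unfamiliar with elusion testing, here is a minimal Python sketch of one common way an elusion sample is turned into a recall estimate. It assumes a simple random sample drawn from the null set (the documents not slated for production) and accurate reviewer coding, and it returns a point estimate only, with no allowance for sampling error. The function name and every figure are hypothetical.

```python
def elusion_recall_estimate(
    responsive_found: int,
    null_set_size: int,
    sample_size: int,
    responsive_in_sample: int,
) -> tuple[float, float]:
    """Return (elusion_rate, estimated_recall) from a null-set sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= responsive_in_sample <= sample_size:
        raise ValueError("responsive_in_sample must be between 0 and sample_size")

    # Elusion rate: share of the sampled null-set documents that were responsive.
    elusion_rate = responsive_in_sample / sample_size
    # Project that rate onto the whole null set to estimate what was missed.
    estimated_missed = elusion_rate * null_set_size
    denominator = responsive_found + estimated_missed
    estimated_recall = responsive_found / denominator if denominator else 0.0
    return elusion_rate, estimated_recall


# Hypothetical review: 9,000 responsive documents found, 90,000 documents
# left in the null set, and 15 responsive documents in a 1,500-document sample.
rate, est_recall = elusion_recall_estimate(9_000, 90_000, 1_500, 15)
print(f"Elusion rate:     {rate:.1%}")        # prints "Elusion rate:     1.0%"
print(f"Estimated recall: {est_recall:.1%}")  # prints "Estimated recall: 90.9%"
```

Running the same calculation mid-course and again at final validation is one way to see whether additional review is still turning up meaningful numbers of responsive documents.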
Shouldn’t any producing party be required to test the null set, disclose any marginally relevant documents, and engage in a dialogue as to whether additional search refinement is required?
PAYNE & SIX: Maybe? The idea of the report card is to allow both parties to have objective metrics and drive dialogue that way. And they may agree, or the court may order, that any responsive documents in the sample set be produced and reviewed further. But I don’t think there’s a one-size-fits-all answer to this question.
ROBERTS & CLARK: Any responsive document should obviously be produced, regardless of where in the workflow it’s found. However, TAR does not necessarily try to leave behind documents of diminished relevance. It only leaves behind documents that it predicts its human trainer is less likely to tag as being “responsive.” So very highly relevant documents can have just so-so predictive rankings, and vice-versa. This is one of the least understood notions about TAR.
My understanding is that it’s unwise to make promises about the recall rate a producing party will achieve when you don’t know at the outset what the data set contains. I think it’s a worrisome approach.
PAYNE & SIX: Agree completely.
ROBERTS & CLARK: Yep. This is why it should be an iterative approach undertaken by reasonable people.
Could you say that effective/passing grade TAR is dependent to some extent on the application? Is improvement of AI expected to improve reliability/report card score?
PAYNE: I don’t know, ask me in 10 years when we’re all driving flying cars.
ROBERTS & CLARK: I have less confidence now with the proliferation of “TAR” features in a lot of platforms. The original applications were facing tremendous scrutiny and had quality features cooked into them that tempered the way in which the machine learned to avoid bias and dead-ends. New applications may be more “juiced up” with more of an emphasis on precision as opposed to recall integrity. Theoretically, a solid validation procedure would compensate for this, but these platforms remain largely unregulated and without objective measurements of accuracy. The core math engines of an AI application are in the public domain; most Information Science grad students could build some type of crude TAR application in their mom’s garage.
What value would go into the cell that corresponds to precision horizontally and sampling method vertically?
PAYNE & SIX: That would be a text-based answer describing the sampling method for selecting the set of documents designed to test for precision. It comes from the responsive set, not the null set.
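One sampling method that could be described in that cell is a simple random sample of the responsive set. The Python sketch below shows the idea under stated assumptions: the document IDs, sample size, seed, and reviewer coding are all hypothetical, and the margin of error uses a basic normal approximation rather than any method the panelists endorsed.

```python
import math
import random


def sample_for_precision(responsive_ids: list[int], sample_size: int, seed: int) -> list[int]:
    """Draw a reproducible simple random sample from the responsive set."""
    rng = random.Random(seed)
    return rng.sample(responsive_ids, min(sample_size, len(responsive_ids)))


def precision_estimate(sample_codes: list[bool], z: float = 1.96) -> tuple[float, float]:
    """Return (estimated precision, margin of error) from reviewer coding.

    sample_codes holds one True/False call per sampled document;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    n = len(sample_codes)
    if n == 0:
        raise ValueError("sample is empty")
    p = sum(sample_codes) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Hypothetical responsive set of 20,000 documents; sample 400 of them.
responsive_ids = list(range(1, 20_001))
sample = sample_for_precision(responsive_ids, sample_size=400, seed=2020)

# Hypothetical coding: reviewers confirm 340 of the 400 sampled documents.
codes = [True] * 340 + [False] * 60
p, margin = precision_estimate(codes)
print(f"Estimated precision: {p:.1%} (+/- {margin:.1%})")
```

Recording the seed and sample size alongside the result is one way to make the text-based description in the cell reproducible.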
Marginally relevant documents could be thousands more to deal with additionally. When do you know where to draw the line?
PAYNE & SIX: That’s a question that every review has to wrestle with, regardless of the methodology used. You have to have a strong, defensible approach to determining responsiveness, and you have to train your people/computer thoroughly.
ROBERTS & CLARK: It is the stuff of probabilities and likelihoods, and not absolutes. Which is why “defensibility” is an easier lift if the requesting party had a seat at the table when methodology is being designed. Just sayin’.
Can you use TAR as an ECA strategy to fine tune key words and document demands?
ROBERTS & CLARK: Yes! The uses of TAR workflows to exploit evidence are limited only by your creativity and imagination.
Don’t TAR and the use of the report card (or any technical analysis of outcomes) place an inordinate amount of pressure on judges, many of whom are not knowledgeable about such technical analysis?
HON. JAMES C. FRANCIS IV: Sure, it creates pressure, but pressure to develop technical competence is not a bad thing. My concern would be with judges who don’t appreciate the limits of their knowledge and who might, for example, assume that the failure to meet a pre-set recall rate necessarily demonstrates that the producing party conducted an inadequate search or, worse, acted in bad faith. Judges need to understand that there are complexities behind the simple numbers on a report card and to be prepared to address them.
PAYNE & SIX: That would be a great question for Judge Joe Brown in Nashville, who is semi-retired and became a folklore hero of the eDiscovery world for his frequent use of animal metaphors in TAR-related rulings. I don’t get the sense that he ever thought he’d preside over TAR-related litigation, but he did and we have the cougar/raccoon/horse stories to prove it. The truth is that TAR is headed for judges no matter what. The report card is designed to give everyone—judges included—an objective framework to cling to.
ROBERTS & CLARK: I love the notion of the report card. As a complete substitute for transparency and collaboration, not so much.
Thank you again to our presenters for taking the time not only to present on the webinar but also to answer these questions here on the ACEDS blog.