Predictive Coding Comes of Age

by Quinn Emanuel Urquhart & Sullivan, LLP

So-called “predictive coding” (using a small number of manually coded documents to analyze and predict appropriate coding for a much larger set of documents) has become a hot topic in e-discovery. This past year brought the first reported judicial decisions explicitly authorizing the practice. 2012 also saw some of the first disputes concerning the appropriate methodologies for this technique.

In coming years, the use of predictive coding will continue to grow as litigants seek to limit discovery costs. Judges may also continue to endorse the practice, even incorporating it into model e-discovery orders. But early adopters should proceed with caution; the practice is likely to generate many disputes as acceptable methodologies and best practices are established.

The Evolution of Computer-Assisted Document Review
As companies have moved away from paper file systems and toward electronically stored information (ESI), the number of documents that must be collected and reviewed in civil litigation has skyrocketed. A number of technologies have been used to handle this explosion in discoverable information. Predictive coding is the latest technical evolution for reviewing and producing large data sets.

Manual Review: Not long ago, manual, linear, “eyes-on-the-page” analysis was the predominant method of document review. The process started with collecting documents that were potentially responsive to formal requests for production. The data collections, especially in complex civil litigation, often contained millions of pages. A small army of junior associates, contract attorneys, and even paralegals would then mobilize to manually review the documents for responsiveness, privilege, and confidentiality.

Although many still consider manual review to be the “gold standard,” it is rife with performance and quality shortcomings. Analysts estimate that even at a maximum review speed of about 100 documents per hour, a reviewer must decide relevance, responsiveness, privilege, or confidentiality in an average of 36 seconds per document. See Nicholas M. Pace and Laura Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery (RAND Corporation 2012) (hereafter “Pace & Zakaras”). As a result, the document review in a large case could take thousands of man-hours. This significant expenditure of time and money does not come with a guarantee of accuracy; studies suggest that up to 95% of reviewer disagreement is the result of human error and not simply close questions of relevance. See Maura R. Grossman & Gordon V. Cormack, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error? 9 (ICAIL 2011 / DESI IV: Workshop on Setting Standards for Searching Elec. Stored Info. in Discovery, Research Paper).

Keyword Search: Keyword searching is a rudimentary form of computer assistance that narrows the scope and number of documents for further manual review. In a typical keyword search, the producing party runs a set of keywords against emails and other electronic documents to identify a smaller set of documents to be manually reviewed for responsiveness. Typically, multiple keywords can be combined using Boolean relationships. Keyword searching offers performance improvements over manual searching, and is highly common in modern e-discovery. Courts have explicitly endorsed the practice and have even incorporated keyword restrictions and search terms into model orders for e-discovery. See, e.g., Federal Circuit’s Model E-Discovery Order for Patent Cases (proposing that email productions occur using “five search terms per custodian”).

Yet keyword searching is also rife with shortcomings. Keyword searches are frequently overinclusive and underinclusive; search terms fail to capture many relevant documents, while simultaneously generating many false positives. When search terms turn out to be more common than expected in a document set, keyword searching will return a high number of documents that contain the keyword but have no possible relevance to the case—forcing the producing party to use expensive manual review to find truly relevant documents. A poorly chosen keyword often returns more “junk” than responsive documents. For that reason, great care must be taken by the producing party to identify appropriate keywords, often with the assistance of the document custodians themselves. Creativity must be employed to ensure that common synonyms, misspellings, acronyms, and abbreviations are included and keywords likely to generate false positives are excluded.
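The over- and underinclusiveness described above can be made concrete with two standard information-retrieval measures, precision and recall. The following is a minimal sketch only: the five documents, their responsiveness labels, the search terms, and the function names are all hypothetical and chosen purely for illustration.

```python
# Toy illustration: a Boolean keyword search scored against known answers.
# The documents, labels, and search terms below are hypothetical.

documents = {
    1: ("pricing agreement with competitor on containerboard", True),
    2: ("lunch order: board room, noon", False),
    3: ("we should align on prices before the call", True),
    4: ("price list for office supplies", False),
    5: ("competitor pricing memo attached", True),
}

def keyword_hits(docs, any_of, none_of=()):
    """Return ids of documents containing any term in any_of and no term in none_of."""
    hits = set()
    for doc_id, (text, _is_responsive) in docs.items():
        words = set(text.lower().replace(":", " ").replace(",", " ").split())
        if words & set(any_of) and not words & set(none_of):
            hits.add(doc_id)
    return hits

def precision_recall(docs, hits):
    """Precision: share of hits that are responsive.
    Recall: share of responsive documents that were hit."""
    responsive = {doc_id for doc_id, (_text, is_resp) in docs.items() if is_resp}
    true_positives = len(hits & responsive)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(responsive) if responsive else 0.0
    return precision, recall

hits = keyword_hits(documents, any_of={"pricing", "price"})
precision, recall = precision_recall(documents, hits)
```

In this toy set the search returns documents 1, 4, and 5. Document 4 is a false positive (an office-supply price list), and document 3 is missed because “prices” was not among the chosen terms, so both precision and recall come out at two-thirds.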

Predictive Coding: Predictive coding is the latest evolution of computer-assisted document searching. As with manual and keyword searching, the process begins by collecting a corpus of potentially responsive documents from the client. Next, attorneys review a small set of randomly selected documents to identify a “seed set” of documents that clearly fit, or clearly do not fit, the desired document categories. The predictive coding software then uses the “seed” documents to create a template for screening new documents. Some systems produce a simple yes/no, while others assign a score (for example, on a 0 to 100 basis) relating to responsiveness or privilege. Attorneys then audit the identified documents to validate their relevance, responsiveness, or privilege. The computer uses the attorneys’ audit results to modify its search algorithm. The search algorithm is repeatedly audited and rerun until the system’s predictions and the reviewers’ audits sufficiently coincide. Typically, the senior lawyer (or team) needs to review only a few thousand documents to train the computer, at which point the system has learned enough to make confident predictions about the relevance of a much larger data set, potentially millions of documents.
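The train, audit, and retrain cycle described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a description of any commercial product: the classifier is a simple naive Bayes word model chosen for brevity, the 0 to 100 score is a logistic transform of its log-odds, the 90% agreement target is arbitrary, and attorney_review is a hypothetical stand-in for a human reviewer's coding decision.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(coded_docs):
    """Fit a tiny naive Bayes model from (text, is_responsive) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_responsive in coded_docs:
        doc_counts[is_responsive] += 1
        word_counts[is_responsive].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def score(model, text):
    """Return a 0-100 responsiveness score for one document."""
    word_counts, doc_counts, vocab = model
    total_true = sum(word_counts[True].values())
    total_false = sum(word_counts[False].values())
    # Start from the smoothed prior odds, then add evidence word by word.
    log_odds = math.log((doc_counts[True] + 1) / (doc_counts[False] + 1))
    for word in tokenize(text):
        if word in vocab:
            p_true = (word_counts[True][word] + 1) / (total_true + len(vocab))
            p_false = (word_counts[False][word] + 1) / (total_false + len(vocab))
            log_odds += math.log(p_true / p_false)
    return 100 / (1 + math.exp(-log_odds))

def train_until_stable(seed_set, audit_batches, attorney_review, target_agreement=0.9):
    """Audit one batch per round; retrain until predictions and audits coincide."""
    coded = list(seed_set)
    model = train(coded)
    rounds = 0
    for batch in audit_batches:
        rounds += 1
        audited = [(text, attorney_review(text)) for text in batch]
        agreed = sum((score(model, text) >= 50) == label for text, label in audited)
        if agreed / len(audited) >= target_agreement:
            break
        coded.extend(audited)  # feed the audit results back into training
        model = train(coded)
    return model, rounds

def attorney_review(text):
    # Hypothetical stand-in for a human reviewer's coding decision.
    return any(word in {"pricing", "competitor", "rebate"} for word in tokenize(text))

seed_set = [
    ("competitor pricing agreement", True),
    ("pricing call with competitor", True),
    ("lunch menu friday", False),
    ("office party friday", False),
]
audit_batches = [
    ["competitor pricing memo", "rebate schedule friday", "office lunch menu"],
    ["rebate terms for competitor", "party menu"],
]
model, rounds = train_until_stable(seed_set, audit_batches, attorney_review)
```

In this toy run the first audit exposes a miss (the seed set contained no “rebate” documents), the audited batch is added to the training data, and the retrained model agrees with the reviewer on the second batch. Real systems differ in the model used, the sampling method, and the stopping rule.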

Once a predictive model is generated, there are several ways the review might proceed. In the context of a review for relevance and responsiveness, one option might be to assume that all documents with scores above a particular threshold can be classified safely as responsive, while all those with scores below a particular threshold can be safely classified as not responsive. Only those documents with scores in the middle would require eyes-on review. Another option would be to perform eyes-on review of only those documents exceeding a particular score in order to confirm the application’s decisions, while dropping the remainder from all further work. Forgoing manual review altogether is also a possibility, though likely not advisable, given the potential for unexpected error. As these examples illustrate, the umbrella term “predictive coding” can be used to describe a number of different ways that predictions are used and applied. The individuals supervising the review must pick appropriate cut-off points and use their best judgment as to whether and how humans will review and refine codes that are automatically applied.
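The first two options above amount to simple routing rules applied to each document's score. The sketch below assumes a 0 to 100 score; the cut-off values (80, 20, and 50), the document identifiers, and the function names are hypothetical, and choosing real cut-offs remains a matter for the supervising attorneys.

```python
def triage(scored_docs, upper=80, lower=20):
    """Option 1: auto-classify the extremes, send the middle band to eyes-on review."""
    queues = {"responsive": [], "manual_review": [], "not_responsive": []}
    for doc_id, doc_score in scored_docs:
        if doc_score >= upper:
            queues["responsive"].append(doc_id)
        elif doc_score <= lower:
            queues["not_responsive"].append(doc_id)
        else:
            queues["manual_review"].append(doc_id)
    return queues

def confirm_only(scored_docs, cutoff=50):
    """Option 2: eyes-on review only for documents exceeding the cutoff; drop the rest."""
    to_review = [doc_id for doc_id, doc_score in scored_docs if doc_score > cutoff]
    dropped = [doc_id for doc_id, doc_score in scored_docs if doc_score <= cutoff]
    return to_review, dropped

# Hypothetical scores produced by a predictive model.
scored = [("DOC-001", 95), ("DOC-002", 55), ("DOC-003", 8), ("DOC-004", 80), ("DOC-005", 21)]

queues = triage(scored)
to_review, dropped = confirm_only(scored)
```

With these illustrative cut-offs, the first approach leaves only DOC-002 and DOC-005 for manual review, while the second sends three documents to reviewers and drops two without any human look.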

Used carefully, predictive coding has the potential to offer significant performance and cost benefits, without compromising accuracy. Litigants are already touting the cost-saving potential; some defendants have claimed predictive coding would reduce time for production and review from ten man-years to less than two man-weeks, and would cost roughly 1% of the cost of human review. See Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1431215 (Cir. Ct. Loudoun Cty. Va. 2012). As to accuracy, predictive coding has not been shown to be any less accurate than traditional manual review. (Pace & Zakaras, pp. 61-66.) Some studies suggest that predictive coding identifies at least as many documents of interest as traditional eyes-on review, with about the same level of inconsistency, and may in fact offer more accurate review for responsiveness than most manual reviews. (Pace & Zakaras, p. xviii.) Actual cost savings will depend on a number of factors, including the size of the document set, challenges to the predictive coding methodology, and the document review methodology against which predictive coding is compared. Used in the right circumstances, however, the cost-saving potential of predictive coding is obvious.

Recent Decisions
While keyword searching has been the most frequently used choice of computer-assisted document review and searching, a small handful of recent cases have considered the use of predictive coding. As courts become more familiar with the practice, some are explicitly endorsing and recommending the practice.

Global Aerospace may be the first case actually ordering the use of predictive coding. Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1431215 (Cir. Ct. Loudoun Cty. Va. 2012). The defendants argued that, with more than 2 million documents to review, it would take reviewers more than 20,000 hours to perform the task, or 10 man-years of billable time. Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1419842 (Va. Cir. Ct. April 9, 2012). But with predictive coding, it would take less than two weeks at a cost of roughly 1/100 that of manual human review. Id. Having heard arguments, the Court ordered that Defendants could proceed with the use of predictive coding for processing and production of ESI. Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1431215 (Va. Cir. Ct. April 23, 2012).
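The defendants' figures can be checked with simple arithmetic. The sketch below treats the “more than” figures as round lower bounds; the variable names are illustrative, and the 2,000-hour man-year is an inference from the numbers quoted, not a figure stated in the briefing.

```python
# Round lower-bound figures from the defendants' argument.
documents = 2_000_000
manual_review_hours = 20_000
man_years_claimed = 10

# Implied review pace and time per document.
docs_per_hour = documents / manual_review_hours        # 100.0
seconds_per_document = 3600 / docs_per_hour            # 36.0

# Implied billable hours in one man-year.
hours_per_man_year = manual_review_hours / man_years_claimed  # 2000.0

# Predictive coding was claimed to cost roughly 1/100 of manual review.
relative_cost = 1 / 100
relative_savings = 1 - relative_cost                   # 0.99
```

The implied pace of 100 documents per hour, or 36 seconds per document, matches the maximum manual review speed cited earlier in this article, which suggests the estimate assumed reviewers working at full speed.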

Global Aerospace stopped short of an unqualified approval of predictive coding. For example, predictive coding cannot work effectively if a representative corpus is not used for the initial training. The Global Aerospace court noted that the receiving party was free to challenge the completeness of the contents of the production and the manner in which predictive coding was used for new documents. Id.

In Moore v. Publicis, perhaps the most significant judicial decision on predictive coding to date, the Southern District of New York (Magistrate Judge Peck) held that “computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350, 2012 WL 607412 (S.D.N.Y. 2012). The Court reasoned that computer-assisted review complied with the doctrine of proportionality of Federal Rule of Civil Procedure 26(b)(2)(C), and that predictive coding was an acceptable form of computer-assisted review. Id. at *12 (“…computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review.”)

As courts have endorsed the voluntary use of predictive coding, parties have also sought to compel their adversaries to use the technique. In Kleen Products, Defendants sought to use keyword search-term processing, in which they had already invested much time and effort; but Plaintiffs moved to compel the use of predictive coding, arguing that keyword search methods were inadequate and flawed. Kleen Products, LLC v. Packaging Corp. of America, No. 10-C5711, Dkt. 412 (N.D. Ill. Sept. 28, 2012). The Court held evidentiary hearings in February and March 2012, during which it urged the parties to reach a compromise: for example, adopting Defendants’ keyword-based approach, but refining or supplementing terms and review procedures to meet Plaintiffs’ concerns. Ultimately, the parties reached agreement before the Court ruled on the motion to compel. But Kleen illustrates that, going forward, disputes over keyword search terms may extend far beyond the sufficiency of specific terms. Parties may challenge the notion of keyword searching itself, perhaps using the availability of predictive coding as leverage to obtain significant concessions on proposed keywords.

A recent case management order in In re: Actos provides further insight into the predictive coding processes that parties are likely to agree to and courts are likely to approve. In re: Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299, Dkt. 1539 (W.D. La. July 27, 2012). The agreed-upon order in Actos allows each side to nominate three reviewers to work collaboratively to code the seed set of documents. The extremely detailed protocol contains numerous levels of sampling and review, as well as meet-and-confer check points throughout the procedure, including one regarding the relevance threshold that would trigger manual review by the producing party.

Predictive Coding Done with Care
Litigants interested in utilizing predictive coding should keep several principles from these cases in mind. First and foremost, the producing party should attempt to gain the receiving party’s consent to the use of predictive coding. The greater the transparency offered into the procedure, the less likely that the receiving party will successfully move to compel an alternative document production methodology later in the case. An agreement regarding the basic methodology and the custodians from whom documents will be collected is recommended. Moreover, using jointly appointed reviewers for the document training set may ease concerns with the process.

Second, the producing party should negotiate a “claw-back provision” that will allow recovery of documents that are improperly produced as a result of the predictive coding methodology. These could include documents that are irrelevant, privileged, or that should be, but were not, marked as confidential under a protective order. Such a provision is especially important if any portion of the documents marked responsive by the predictive coding methodology will not be manually reviewed.

Third, great care should be taken in preparing the initial “seed set” of documents that will be used to program the predictive coding algorithm. If the producing party does not actually involve the receiving party in the selection of the seed set, the producing party should be prepared to disclose the entire seed set to the receiving party and the court, which may raise work-product protection concerns. It is also important that the persons reviewing the initial seed set have a strong grasp of the issues in the case. Because of the importance of the initial seed set, it is critical that persons reviewing the seed set make accurate decisions; any errors in the seed set will become systemic throughout the larger review.

Fourth, the producing party should consider whether it is appropriate to use different seed sets for different custodians. For example, in a patent case, responsive documents that are held by an engineer may look very different from responsive documents held by an employee in the marketing or finance departments.

Fifth, the producing party should work closely with its e-vendor to ensure that the methodology is statistically justifiable. This includes ensuring that the seed set is drawn at random from the document collection, that the seed set is sufficiently large, and that the confidence interval and confidence level are either agreed upon between the parties or statistically justifiable.
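To give a sense of what “sufficiently large” can mean, the textbook sample-size formula for estimating a proportion ties the sample size to the chosen confidence level and margin of error (half the width of the confidence interval). The sketch below is an illustration only: it uses the standard normal approximation with the conservative assumption that the true proportion is 50%, the function name and parameters are hypothetical, and the right statistical design for any particular matter should be settled with the vendor or a statistician.

```python
import math
from statistics import NormalDist

def sample_size(confidence_level, margin_of_error, population=None, proportion=0.5):
    """Documents to sample so that an estimated proportion falls within
    +/- margin_of_error at the given confidence level (normal approximation)."""
    # Two-sided z-value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        # Finite population correction: smaller collections need fewer samples.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

n_95_2 = sample_size(0.95, 0.02)                       # unlimited population
n_95_5 = sample_size(0.95, 0.05)
n_small = sample_size(0.95, 0.05, population=1_000)    # small collection
n_large = sample_size(0.95, 0.02, population=2_000_000)
```

Under these assumptions, a 95% confidence level with a margin of error of plus or minus 2% calls for about 2,400 randomly drawn documents whether the collection holds two million documents or twenty million, which is consistent with the observation that only a few thousand documents need attorney review.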

Potential Stumbling Blocks and Pitfalls of Predictive Coding
Litigants planning to use predictive coding should be aware of potential pitfalls that could render the practice more costly or less appropriate than manual review or keyword-driven review. For example, predictive coding may be inappropriate in a case that does not involve a sufficiently large body of documents. If the receiving party is dissatisfied with the results of the predictive coding, the producing party may face a motion to compel a more traditional document review methodology, thereby eliminating any cost savings. The danger of such a motion is especially high now, when predictive coding is in its earliest stages and best practices have not yet been developed. Where the corpus of documents contains highly sensitive information, a full manual review of any documents automatically selected for production may also be required to reduce the likelihood of damaging disclosure. This may entail significantly greater expense than keyword-driven reviews. Finally, predictive coding is not presently suitable for files that are not primarily text-based, such as video or audio files, necessitating the continued manual review of those materials.

As the amount of electronically stored information held by companies continues to grow at an exponential pace, widespread dissatisfaction with traditional manual and keyword review will likely lead to even greater use of predictive coding in 2013. This transition will offer cost savings for some, and headaches for others. As predictive coding grows, so too will litigation concerning predictive coding’s appropriate use and methodology. But the potential for significant cost savings is undeniable for large-scale reviews. Cost-conscious litigants in document-intensive cases would be wise to consider predictive coding as one tool to rein in growing e-discovery costs.

DISCLAIMER: Because of the generality of this update, the information provided herein may not be applicable in all situations and should not be acted upon without specific legal advice based on particular situations.

© Quinn Emanuel Urquhart & Sullivan, LLP | Attorney Advertising
