Exploring the Inclusion of eDiscovery-Centric Resources in the Google C4 Dataset: A Highly Selective Search

[EDRM Editor’s Note: This article was first published here on April 26, 2023, and EDRM is grateful to Rob Robinson, editor and managing director of ComplexDiscovery, for permission to republish.]

[ComplexDiscovery Editor’s Background Note: The impact of organizations and entities on the output of Large Language Models (LLMs) can be more significant than one might initially anticipate. In some instances, specific resources within an industry can considerably influence how LLMs process and respond to information. One example of this influence can be observed by examining the Google C4 Dataset and searching it for the domains of 55 eDiscovery-centric websites. While this exploration offers only a snapshot drawn from a non-comprehensive list, it may provide valuable context for those evaluating the impact of such resources on LLMs, and it highlights tools that can help users better understand the content populating LLMs. This deeper understanding can, in turn, shed light on how selected eDiscovery resources may shape the knowledge and responses generated by LLMs – a role that may be more significant (or less significant) than one might think.]

Industry Backgrounder

Exploring the Inclusion of eDiscovery-Centric Resources in the Google C4 Dataset: A Highly Selective Search

ComplexDiscovery*

Large language models, such as those developed by Google and OpenAI, are becoming increasingly sophisticated and pervasive across industries. One application of these models is in the eDiscovery ecosystem, which spans touchpoints ranging from cybersecurity and information governance to legal discovery. This article explores, at a high level, the inclusion of selected eDiscovery-centric resources in the Google C4 Dataset and discusses why understanding this exploration may benefit professionals working in the eDiscovery ecosystem.

Google’s C4 Dataset and its Relevance to eDiscovery

Understanding the Google C4 Dataset

Google’s C4 (Colossal Clean Crawled Corpus) project aims to create a comprehensive and diverse dataset for training large language models. The dataset is built from web pages crawled by the Common Crawl project and includes a diverse range of content in multiple languages. Google’s C4 Dataset serves as an essential foundation for developing more accurate and sophisticated language models that can understand and generate human-like text.

The C4 dataset from Google contains approximately 750GB of cleaned text data derived from Common Crawl web pages. This large-scale dataset is utilized for training and improving large language models, such as those based on the GPT architecture.
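
To see concretely what this cleaned corpus contains, the short sketch below streams a handful of records from a publicly hosted copy of C4. The Hugging Face "datasets" library and the "allenai/c4" dataset name are assumptions not mentioned in this article; this is a minimal sketch for inspection, not the tooling used for the analysis described here.

```python
# A minimal sketch of inspecting the C4 corpus, assuming the Hugging Face
# "datasets" library and the publicly hosted "allenai/c4" mirror (neither is
# referenced in the article). Streaming avoids downloading the full ~750GB.
from datasets import load_dataset

c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4_stream):
    # Each record carries the cleaned page text plus its source URL and timestamp.
    print(record["url"])
    print(record["text"][:200])
    if i == 4:  # inspect only the first five records
        break
```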

Common Crawl is a nonprofit, open-data initiative that crawls and archives publicly available web content. This vast repository of web-crawled data is invaluable for training large language models, as it provides a diverse and extensive source of text in multiple languages. The Common Crawl project supplies the raw web pages from which the C4 Dataset is derived, underpinning its quality and usefulness for AI research.
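
For readers who want to check whether a given domain was captured in a Common Crawl snapshot at all, a hedged sketch against the public Common Crawl CDX index API follows. The crawl label "CC-MAIN-2023-14" and the count_captures helper are illustrative assumptions; any published crawl label can be substituted.

```python
# A hedged sketch of checking whether a domain appears in a Common Crawl snapshot,
# using the public CDX index API at index.commoncrawl.org. The crawl label
# "CC-MAIN-2023-14" is an illustrative assumption; substitute any published crawl.
import urllib.error
import urllib.parse
import urllib.request

def count_captures(domain: str, crawl: str = "CC-MAIN-2023-14", limit: int = 50) -> int:
    """Return how many captures (up to `limit`) the CDX index reports for a domain."""
    query = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json", "limit": limit})
    index_url = f"https://index.commoncrawl.org/{crawl}-index?{query}"
    try:
        with urllib.request.urlopen(index_url) as response:
            # The index returns newline-delimited JSON, one record per archived capture.
            lines = response.read().decode("utf-8").splitlines()
    except urllib.error.HTTPError:
        return 0  # the index responds with an HTTP error when nothing matches
    return sum(1 for line in lines if line.strip())

print(count_captures("edrm.net"))
```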

The role of large language models in eDiscovery

Large language models can potentially revolutionize the eDiscovery process by automating tasks ranging from document review to review reporting. These models can analyze vast amounts of data quickly and efficiently, identify relevant information, and generate insightful summaries or responses. As a result, they can save time, reduce costs, and improve the accuracy of eDiscovery outcomes.
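
As a simple illustration of the kind of automation described above, the sketch below asks a large language model to flag whether a single document appears responsive to a discovery request. This is a minimal, hypothetical example using the OpenAI Python client, which is not referenced in this article; the model name, prompt, and is_responsive helper are illustrative assumptions, and any production review workflow would add sampling, validation, and defensible quality control.

```python
# A minimal, hypothetical sketch of LLM-assisted first-pass relevance screening.
# The OpenAI client, model name, prompt wording, and the is_responsive helper are
# illustrative assumptions, not a validated or defensible review protocol.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def is_responsive(document_text: str, request_description: str) -> bool:
    """Ask the model whether a document appears responsive to a discovery request."""
    prompt = (
        "You are assisting with a first-pass document review.\n"
        f"Discovery request: {request_description}\n\n"
        f"Document:\n{document_text[:4000]}\n\n"
        "Answer with a single word: RESPONSIVE or NOT_RESPONSIVE."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return not answer.startswith("NOT")

print(is_responsive(
    "Email thread discussing the 2023 update to the data retention policy.",
    "All communications concerning data retention policies in 2023.",
))
```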

Inclusion of eDiscovery-centric resources in the C4 Dataset

The presence of eDiscovery resources in the C4 Dataset is crucial for ensuring the accuracy and relevance of large language model outputs in the eDiscovery context. By training on high-quality eDiscovery resources, the models can better understand the domain-specific language, concepts, and best practices, leading to more reliable and valuable results for eDiscovery professionals.

ComplexDiscovery’s Non-Comprehensive List of eDiscovery Resources and Its Significance

Introduction to ComplexDiscovery’s resource listing

On March 9, 2023, ComplexDiscovery published a non-comprehensive list of potentially helpful eDiscovery-centric resources. These resources, ranging from analyst and research firms to industry associations and blogs, were designed to serve as a simple starting point for individuals seeking information related to eDiscovery. 

Selection of resources from ComplexDiscovery’s list for analysis

Given the manageable size of this resource listing and the direct or indirect relevance of each listed resource to the eDiscovery ecosystem, ComplexDiscovery created a truncated listing from an initial grouping of 100+ resources and used the top-level domain names of those resources to search the C4 Dataset. This truncation, which included removing duplicate domains where multiple resources shared the same domain and removing resources not yet available at the time of the Google C4 Dataset snapshot, resulted in a list of 55 resource domains.

Top-level domain name searches against the C4 Dataset

The objective of searching the top-level domain names of the selected resources within the C4 Dataset was to explore how a targeted snapshot of eDiscovery resources is represented in the dataset. This information may help gauge how these resources contribute to the training of Google’s large language models and, by extension, to the way those models respond to inquiries and prompts related to eDiscovery.
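
A rough sense of how such a domain-level tally could be reproduced is sketched below. It assumes the same "allenai/c4" streaming setup shown earlier, uses simple whitespace word counts as a stand-in for tokens, and lists only a few of the 55 domains; the Washington Post analysis used its own tokenization over the full corpus, so absolute figures will differ.

```python
# A rough sketch of approximating domain-level representation in C4, assuming the
# same "allenai/c4" streaming setup as above. Whitespace word counts stand in for
# tokens, and only a few of the 55 domains are listed; the Washington Post analysis
# used its own tokenizer over the full corpus, so absolute figures will differ.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset

EDISCOVERY_DOMAINS = {"edrm.net", "complexdiscovery.com", "aceds.org"}  # illustrative subset

def registered_domain(url: str) -> str:
    """Normalize a record URL to a bare domain (drops a leading 'www.')."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

token_counts = Counter()
c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4_stream):
    domain = registered_domain(record["url"])
    if domain in EDISCOVERY_DOMAINS:
        token_counts[domain] += len(record["text"].split())
    if i >= 1_000_000:  # sample a slice of the corpus; a full pass is far larger
        break

for domain, tokens in token_counts.most_common():
    print(f"{domain}: ~{tokens:,} approximate tokens in the sampled slice")
```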

The results of the top-level domain name searches for the 55 eDiscovery-centric resources are provided in the following table, as extracted from the C4 Dataset search tool featured in the Washington Post article titled “Inside the Secret List of Websites That Make AI Like ChatGPT Sound Smart.” The data is reported by dataset rank, token count (rounded), and percentage of all tokens. The aggregated results below illustrate the prevalence of content from these selected resources in the C4 Dataset.


Table: Selected eDiscovery Resources and the C4 Dataset. The dataset can be searched using the tool featured in the Washington Post article referenced above.

Resource Category (ComplexDiscovery) | Resource | Domain Searched | Rank | Tokens (Rounded) | Percent of All Tokens
Analyst, Research, and Review Firms | G2 | G2.com | 152 | 16,000,000 | 0.01%
Analyst, Research, and Review Firms | Capterra | Capterra.com | 216 | 13,000,000 | 0.008%
News, Announcement, and Commentary Resources | Lexology | Lexology.com | 519 | 8,100,000 | 0.005%
Analyst, Research, and Review Firms | Software Advice | SoftwareAdvice.com | 730 | 6,300,000 | 0.004%
Associations, Consortiums, and Groups | IAPP (International Association of Privacy Professionals) | IAPP.org | 5,236 | 1,900,000 | 0.001%
News, Announcement, and Commentary Resources | JD Supra | JDSupra.com | 5,274 | 1,800,000 | 0.001%
News, Announcement, and Commentary Resources | Legaltech News | Law.com | 5,898 | 1,700,000 | 0.001%
Information and Research Resources | NIST (National Institute of Standards and Technology) | NIST.gov | 5,920 | 1,700,000 | 0.001%
Analyst, Research, and Review Firms | TrustRadius | TrustRadius.com | 6,958 | 1,500,000 | 0.001%
Information and Research Resources | Cybersecurity Legal Task Force (American Bar Association) | AmericanBar.org | 8,266 | 1,300,000 | 0.0009%
Information and Research Resources | FTC Premerger Notification Program (Federal Trade Commission) | FTC.gov | 10,959 | 1,100,000 | 0.0007%
Analyst, Research, and Review Firms | Gartner | Gartner.com | 19,166 | 720,000 | 0.0005%
Industry Blogs | eDiscovery Team (Ralph Losey) | E-DiscoveryTeam.com | 29,362 | 530,000 | 0.0003%
Analyst, Research, and Review Firms | IDC | IDC.com | 41,812 | 400,000 | 0.0003%
Analyst, Research, and Review Firms | Forrester | Forrester.com | 42,218 | 400,000 | 0.0003%
News, Announcement, and Commentary Resources | LawSites | LawSitesblog.com | 63,769 | 290,000 | 0.0002%
Analyst, Research, and Review Firms | Chambers and Partners | Chambers.com | 77,729 | 250,000 | 0.0002%
Industry Blogs | Artificial Lawyer (Richard Tromans) | ArtificialLawyer.com | 85,162 | 230,000 | 0.0001%
Educational Training and Resources | E-Discovery Team Training | e-DiscoveryTeamTraining.com | 93,748 | 210,000 | 0.0001%
News, Announcement, and Commentary Resources | LexBlog | LexBlog.com | 110,534 | 180,000 | 0.0001%
News, Announcement, and Commentary Resources | LegalIT Insider | LegalTechnology.com | 122,034 | 170,000 | 0.0001%
eDiscovery Provider Websites | Relativity | Relativity.com | 145,664 | 150,000 | 0.00009%
Industry Blogs | eDisclosure Information Project (Chris Dale) | ChrisDaleOxford.com | 187,731 | 120,000 | 0.00008%
News, Announcement, and Commentary Resources | Legal IT Professionals | LegalITProfessionals.com | 220,976 | 100,000 | 0.00007%
Information and Research Resources | ENISA (European Union Agency for Cybersecurity) | ENISA.Europa.eu | 271,149 | 85,000 | 0.00005%
Associations, Consortiums, and Groups | EDRM (Electronic Discovery Reference Model) | EDRM.net | 293,316 | 79,000 | 0.00005%
eDiscovery Provider Websites | IPRO | IPROTech.com | 299,993 | 77,000 | 0.00005%
Associations, Consortiums, and Groups | Women in eDiscovery | WomenineDiscovery.org | 303,379 | 77,000 | 0.00005%
eDiscovery Provider Websites | Nuix | Nuix.com | 323,733 | 72,000 | 0.00005%
eDiscovery Provider Websites | Epiq | EpiqGlobal.com | 387,082 | 61,000 | 0.00004%
Analyst, Research, and Review Firms | ComplexDiscovery | ComplexDiscovery.com | 445,248 | 53,000 | 0.00003%
Associations, Consortiums, and Groups | ACEDS (Association of Certified E-Discovery Specialists) | ACEDS.org | 470,275 | 50,000 | 0.00003%
Industry Blogs | Hanzo Blog (Hanzo) | Hanzo.co | 486,348 | 49,000 | 0.00003%
eDiscovery Provider Websites | Exterro | Exterro.com | 508,502 | 46,000 | 0.00003%
Associations, Consortiums, and Groups | The Sedona Conference (TSC) | TheSedonaConference.org | 508,617 | 46,000 | 0.00003%
Industry Blogs | Ball In Your Court (Craig Ball) | CraigBall.net | 602,359 | 39,000 | 0.00002%
eDiscovery Provider Websites | Disco | CSDisco.com | 747,835 | 31,000 | 0.00002%
eDiscovery Provider Websites | HaystackID | HaystackID.com | 763,781 | 30,000 | 0.00002%
Information and Research Resources | International Cyber Law in Practice: Interactive Toolkit (NATO CCDCOE) | CCDCOE.org | 818,082 | 28,000 | 0.00002%
eDiscovery Provider Websites | Logikcull | Logikcull.com | 838,778 | 27,000 | 0.00002%
eDiscovery Provider Websites | Lexbe | Lexbe.com | 894,973 | 26,000 | 0.00002%
Associations, Consortiums, and Groups | ILTA (International Legal Technology Association) | ILTAnet.org | 929,143 | 24,000 | 0.00002%
eDiscovery Provider Websites | Lighthouse | LighthouseGlobal.com | 1,049,929 | 21,000 | 0.00001%
eDiscovery Provider Websites | KLDiscovery | KLDiscovery.com | 1,064,262 | 21,000 | 0.00001%
Information and Research Resources | GDPR (General Data Protection Regulation) (European Union) | GDPR.eu | 1,089,043 | 20,000 | 0.00001%
Associations, Consortiums, and Groups | CLOC (Corporate Legal Operations Consortium) | CLOC.org | 1,200,575 | 18,000 | 0.00001%
Industry Blogs | Ride the Lightning (Sharon Nelson) | SenseiEnt.com | 1,222,763 | 18,000 | 0.00001%
Information and Research Resources | EDPB (European Data Protection Board) | EDPB.Europa.eu | 1,306,894 | 17,000 | 0.00001%
Associations, Consortiums, and Groups | ARMA International | Arma.org | 1,321,946 | 16,000 | 0.00001%
Industry Blogs | The Cowen Group (David Cowen) | CowenGroup.com | 1,637,480 | 13,000 | 0.000008%
Industry Blogs | eDiscovery Assistant Blog (Kelly Twigger) | eDiscoveryAssistant.com | 1,757,035 | 12,000 | 0.000007%
Educational Training and Resources | Nordic Institute for Interoperability Solutions | NIIS.org | 2,609,572 | 7,000 | 0.000004%
Industry Blogs | Reveal Blog (George Socha and Cat Casey) | RevealData.com | 5,437,005 | 2,100 | 0.000001%
Associations, Consortiums, and Groups | GICLI (The Government Investigations & Civil Litigation Institute) | GICLI.org | 10,772,422 | 330 | 0.0000002%
eDiscovery Provider Websites | L2 Services | L2Services.net | 13,335,285 | 110 | 0.00000007%


Source: ComplexDiscovery and the Washington Post
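
As a quick arithmetic check on the figures above, the sketch below totals the reported token counts and percentages across the 55 rows. It assumes the table has been saved as pipe-delimited text in a hypothetical file named "c4_ediscovery_domains.txt" with the same six columns; because the reported values are rounded, the totals are approximate.

```python
# A quick arithmetic check over the table above, assuming its rows were saved as
# pipe-delimited text in a hypothetical file named "c4_ediscovery_domains.txt"
# with the same six columns. Reported values are rounded, so totals are approximate.
total_tokens = 0
total_percent = 0.0

with open("c4_ediscovery_domains.txt", encoding="utf-8") as handle:
    next(handle)  # skip the header row
    for line in handle:
        columns = [column.strip() for column in line.split("|")]
        if len(columns) != 6 or not columns[3]:
            continue  # skip blank or malformed lines
        total_tokens += int(columns[4].replace(",", ""))
        total_percent += float(columns[5].rstrip("%"))

print(f"Approximate total tokens across the 55 domains: {total_tokens:,}")
print(f"Approximate combined share of all C4 tokens: {total_percent:.4f}%")
```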


Implications of eDiscovery Resource Representation in the C4 Dataset

Identifying potential biases and limitations

By analyzing the representation of eDiscovery resources in the C4 Dataset, professionals in the eDiscovery ecosystem can identify potential biases and limitations in the data used to train large language models. This knowledge may enable them to make more informed decisions about the reliability and applicability of AI-generated outputs in their work.

Enhancing the quality and diversity of data used to train large language models

Understanding the inclusion of eDiscovery resources in the C4 Dataset can also help researchers and developers improve the quality and diversity of data used to train large language models. By incorporating a more comprehensive range of eDiscovery-centric resources, models may become better equipped to generate more accurate and relevant responses in the eDiscovery context.

Addressing the needs of cybersecurity, information governance, and legal discovery professionals

By exploring the eDiscovery resources represented in the C4 Dataset, developers can better understand the needs of cybersecurity, information governance, and legal discovery professionals. This insight may allow them to fine-tune large language models to better address the unique challenges and requirements of the eDiscovery ecosystem, ultimately leading to more useful AI-generated outputs for these professionals.

Encouraging transparency in AI development

Highlighting the inclusion of eDiscovery-centric resources in the C4 Dataset emphasizes the importance of transparency in AI development. By understanding the data sources used to train large language models, professionals in the eDiscovery ecosystem may be better able to evaluate the reliability of AI-generated outputs and make more informed decisions about their adoption and integration into their work and workflows.

Conclusion

This high-level exploration of selected eDiscovery-centric resources in the Google C4 Dataset has meaningful implications for professionals in the eDiscovery ecosystem. Analyzing the representation of selected resources in the dataset may help identify potential biases and limitations, enhance the quality and diversity of data used to train large language models, and encourage transparency in AI development. It may also highlight, with context, resources that have more influence than one might think on shaping LLM-driven answers to prompts and queries. As large language models continue to evolve and become more integrated into the eDiscovery ecosystem, understanding their data sources and potential limitations will be crucial to ensuring their successful application and adoption.

*Assisted by GAI and LLM Technologies


Source: ComplexDiscovery
