The Importance of a Consultant-Led Term Translation Approach in eDiscovery
When it comes to eDiscovery, term translation isn't just about converting words from one language into another. It's a highly specialized process that requires a nuanced understanding of both linguistics and technical search operations.
Every term translation has two dimensions:
- The linguistic elements of the translation.
- The "technical translation" that runs alongside the linguistic one.
The art lies in selecting the right translation; the science is in correctly preparing the linguistic elements and embedding them within the appropriate technical syntax. Both are essential for ensuring eDiscovery searches perform as intended—but the latter is often overlooked. Let's dive into how that oversight happens, why each component matters, and finally take a closer look at the technical specifics.
Why the “Technical” Translation Matters
Here's the catch: linguists, while experts in language, aren’t trained in the ins and outs of eDiscovery search operators. In fact, most linguists have never encountered Boolean syntax, proximity operators, or the constraints imposed by tokenization and index structures—all of which are fundamental to how terms are interpreted by a system.
By way of example at the most basic level: if you handed your partner, parent, or adult children a list of English eDiscovery search terms and asked them to QC and update it, their fluency in English would provide little advantage when it comes to the technical elements. This puts them in the same boat as linguists. And while that’s perfectly understandable (training in eDiscovery search takes years), it doesn’t change the fact that these operators make or break the search accuracy and defensibility.
Search operators govern how linguistic elements are interpreted by the search index. This is a crucial point because they dictate how variations in tense, plurality, and derivational forms are matched against the indexed content. The positioning of a wildcard, or the structure of a proximity clause, can radically change what documents are returned. When applied incorrectly, even the most accurate linguistic translation will yield poor recall or massive overcollection—both of which compromise the integrity of the review. This makes the second dimension of the translation process—the technical translation—critical to the success of the search process.
A Real-World Example: Wildcards Gone Askew
Let’s talk about wildcards—a common feature in many search queries. In English, wildcards are often placed to isolate the root of a term, stripping away tense, plurality, adverbial, and derivational affixes. Sounds pretty straightforward, right?
Well, when wildcarded terms are translated, the challenge arises because you’re essentially asking a linguist to translate half a word. This frequently scrambles the translated term, because the frame of reference doesn’t align with how linguists typically think about language. Linguists commonly mis-shave word forms (taking too few or too many characters off) or don’t shave the infinitive forms at all. Mis-shaving terms means misplaced wildcards, which leads to search terms that go wildly off-target, often in spectacular fashion.
In practice, linguists frequently misplace wildcards by applying them to the infinitive form of a verb, an error that gravely impacts search accuracy. While this approach typically works in English, where the infinitive closely resembles the verb root, it fails in many European languages where the infinitive is structurally distinct from the conjugated forms, which are the key forms being targeted in search.
Take French, for instance. Infinitive verbs typically end in -ER, -IR, or -RE (e.g., emprunter, to borrow). These endings are dropped and replaced with various conjugated forms depending on tense, mood, and subject. Applying a wildcard directly to the infinitive (e.g., emprunter*) yields minimal results because it doesn’t reflect how the verb actually appears in real-world text.
Example:
ENGLISH: borrow* ✅ captures borrow, borrowing, borrowed, borrower, etc.
FRENCH: emprunter* ❌ returns few hits
FRENCH: emprunt* ✅ captures conjugated forms (emprunte, emprunté, empruntons, empruntez, empruntaient) and derived nouns
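To make the difference concrete, here is a minimal Python sketch that mimics a trailing wildcard with a regular expression and checks which word forms each term would capture. The helper names (wildcard_to_regex, captured) are hypothetical, and a real search index applies its own tokenization rules, so treat this as an illustration rather than a model of any platform.

```python
import re

def wildcard_to_regex(term: str) -> re.Pattern:
    """Turn a search term with an optional trailing * into a regex."""
    if term.endswith("*"):
        # Trailing wildcard: the stem followed by any run of word characters.
        return re.compile(re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(re.escape(term), re.IGNORECASE)

def captured(term: str, forms: list[str]) -> list[str]:
    """Return the word forms that the term matches in full."""
    pattern = wildcard_to_regex(term)
    return [form for form in forms if pattern.fullmatch(form)]

# Conjugated forms from the example, plus the derived noun emprunteur.
FRENCH_FORMS = ["emprunte", "emprunté", "empruntons", "empruntez",
                "empruntaient", "emprunteur"]

print(captured("emprunter*", FRENCH_FORMS))  # []
print(captured("emprunt*", FRENCH_FORMS))    # all six forms
```

The wildcard on the infinitive matches none of the forms, because every one of them diverges from "emprunter" before the final "r".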
Compounding the wildcard issue, English verbs and their associated nouns frequently share a common root (borrow ↔ borrower), but in other languages, this alignment may not exist. In such cases, a single wildcard will not cover both verb and noun forms.
French maintains a shared root:
emprunter (v) ↔ emprunteur (n) → emprunt* covers both
In contrast, here are two examples where the languages diverge:
PORTUGUESE:
- Verb (to borrow): pedir emprestado (a verb phrase, so it is not wildcardable as-is)
- Noun (borrower): mutuário → No shared root; separate search terms are required to capture nouns and verbs
ITALIAN:
- Verb (to borrow): prendere in prestito (another verb phrase, so it is not wildcardable as-is)
- Noun (borrower): mutuatario → Again, no morphological link between the verb and noun
These subtle but critical differences across languages necessitate term expansion and linguistic alignment. Search term translations must be arranged not just to reflect surface-level translations, but also to account for underlying morphological and syntactic structures.
The Need for Precision: Dropping Wildcards and Adjusting Proximity Distance
In some cases, the best translation drops the wildcard altogether and explicitly lists all the full-form variants. This makes the translation more accurate by minimizing false hits. False positives are expensive to review, and they are especially common when wildcards are used in mixed-language datasets. Just this week, we fine-tuned a single term in a large dataset by dropping a wildcard in the string and explicitly defining the 15 or so variants. That small tweak in translation kept 76,000 false hits from being promoted to review. That’s a lot of wasted effort and review cost saved!
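As a rough sketch of what "dropping the wildcard" looks like mechanically, the Python snippet below joins a list of full-form variants into one OR clause. The function name explicit_or_clause is hypothetical, the variant list is illustrative only (choosing the right variants is the linguistic work), and quoting multi-word phrases is an assumption about the target platform's syntax.

```python
def explicit_or_clause(variants: list[str]) -> str:
    """Join full-form variants into one OR clause, quoting multi-word phrases."""
    unique = []
    for variant in variants:
        cleaned = variant.strip()
        # Drop blanks and duplicates while keeping the original order.
        if cleaned and cleaned not in unique:
            unique.append(cleaned)
    quoted = [f'"{v}"' if " " in v else v for v in unique]
    return "(" + " OR ".join(quoted) + ")"

print(explicit_or_clause(["emprunte", "emprunté", "empruntons", "emprunte"]))
# (emprunte OR emprunté OR empruntons)
```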
Another example is adjusting proximity distances when working across languages. English tends to be more concise than French: 1,000 English words translate to roughly 1,200 words in French, an expansion of about 20%. So, when working with proximity operators (say, W/8 in English), the distance needs to expand to W/10 in French (8 × 1.2 = 9.6, rounded up) to maintain the same performance. Every language has a specific uplift (or reduction) that’s necessary for harmonising proximity performance, and that’s before even getting into CJK languages, which require an additional layer of character-based tokenisation proximity distance uplift (a topic for a future post).
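The proximity arithmetic can be sketched in a few lines of Python. Only the French figure (120%) comes from the text above; the table name UPLIFT_PERCENT, the function names, and the choice to round up rather than to the nearest integer are assumptions made for illustration.

```python
import re

# Expansion relative to English, in percent. Only the French figure comes from
# the article; any other language would need its own measured value.
UPLIFT_PERCENT = {"fr": 120}

def scale_distance(distance: int, percent: int) -> int:
    """Scale a proximity distance, rounding up (integer arithmetic avoids float error)."""
    return (distance * percent + 99) // 100

def scale_proximity(query: str, language: str) -> str:
    """Rewrite every W/n operator in the query using the language's uplift."""
    percent = UPLIFT_PERCENT[language]
    return re.sub(
        r"\bW/(\d+)\b",
        lambda m: f"W/{scale_distance(int(m.group(1)), percent)}",
        query,
    )

print(scale_proximity("contrat W/8 résiliation", "fr"))  # contrat W/10 résiliation
```

Rounding up errs on the side of recall: a slightly wider window is usually cheaper than a missed document, though the right trade-off depends on the matter.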
A Case Study: Misplaced Translation of "Attorney-Client"
Take a recent case we were brought in to remediate: a previous eDiscovery vendor inadvertently disclosed a significant volume of privileged material after a linguist provided a literal French translation of the phrase “attorney-client” as "avocat-client." Because this type of term translation falls well outside a linguist’s typical experience, the linguist simply wasn’t aware of what was actually required—and the linguist-led approach ultimately fell well short of what the circumstance demanded. That translation did not account for the legal terminology or the nuances of French morphology and the technical artefacts required to address them.
The correct translation is: (avocat W/3 (client OR "secret professionnel"))
This formulation preserves the semantic intent by accounting for French legal terminology and the specific morphological patterns of the language. It is what was needed to capture the privileged documents that the literal translation missed and that were consequently produced. As you can see, shifting from the simple literal translation "avocat-client" to the far more linguistically and technically complex expression (avocat W/3 (client OR "secret professionnel")) requires expertise well beyond that of a typical linguist.
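A toy Python matcher shows why the literal phrase misses text that the proximity clause catches. The sample sentences are invented, the helper names are hypothetical, and distance is counted here as the gap between starting token positions, which is an assumption: real platforms each define W/n in their own way.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters (so l'avocat -> l, avocat)."""
    return re.findall(r"\w+", text.lower())

def phrase_positions(tokens: list[str], phrase: str) -> list[int]:
    """Start positions where the phrase's words occur consecutively."""
    words = tokenize(phrase)
    if not words:
        return []
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def within(text: str, anchor: str, alternatives: list[str], distance: int) -> bool:
    """True if the anchor lies within `distance` tokens of any alternative."""
    tokens = tokenize(text)
    anchor_hits = phrase_positions(tokens, anchor)
    other_hits = [p for alt in alternatives for p in phrase_positions(tokens, alt)]
    return any(abs(a - b) <= distance for a in anchor_hits for b in other_hits)

docs = [
    "Échange confidentiel entre l'avocat et son client",
    "Couvert par le secret professionnel, avocat en copie",
]
for doc in docs:
    literal = bool(phrase_positions(tokenize(doc), "avocat-client"))
    proximity = within(doc, "avocat", ["client", "secret professionnel"], 3)
    print(literal, proximity)  # False True for both sample sentences
```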
Other Important Technical Adjustments
The technical translation process also involves several other considerations, such as:
- Restructuring search terms to account for subject–verb–object (SVO) order differences is often necessary when working across languages. Variations in syntactic structure can significantly impact the precision of term matching, particularly in languages where canonical word order diverges from English norms.
- English has SVO order: "The lawyer (S) filed (V) the motion (O)."
- Japanese uses SOV: "Lawyer (S) motion (O) filed (V)."
- Classical Arabic often prefers VSO: "Filed (V) the lawyer (S) the motion (O)."
As with the "avocat-client" example above, it's not hard to imagine VSO/SVO/SOV shifts between languages requiring a term to be restructured into a proximity clause to accommodate structural linguistic differences. Circumstances like these underscore why a consultant-led approach is critical for accurate and effective translation.
- Integrating English and foreign language terms within a single search clause.
A common industry practice is running the English terms while translations are in progress, then running the translated terms once they’re available. While seemingly efficient, this approach creates linguistic silos, treating English and non-English content as separate streams. The result? Multilingual documents, which contain a mix of languages, fall between the two sets of terms and get missed entirely.
Search terms like ("United States" AND Senat*) OR ("Estados Unidos" AND Senad*) may seem sufficient on the surface, but in practice they miss multilingual documents, because each clause only matches when both concepts appear in the same language. An integrated clause, ("United States" OR "Estados Unidos") AND (Senat* OR Senad*), hits cross-lingual references to the same conceptual entities.
This highlights the need for linguistically harmonized and cross-language-aware search logic, especially in datasets containing bilingual or mixed-language content.
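The gap between siloed and integrated logic can be illustrated with a small Python sketch. The has helper is a hypothetical stand-in for an index lookup, and the mixed-language sentence is invented for the example.

```python
import re

def has(text: str, term: str) -> bool:
    """True if the term (with optional trailing *) occurs as a whole word or phrase."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def siloed(text: str) -> bool:
    """English-only clause OR Spanish-only clause: concepts must share a language."""
    return (has(text, "United States") and has(text, "Senat*")) or \
           (has(text, "Estados Unidos") and has(text, "Senad*"))

def integrated(text: str) -> bool:
    """Each concept may appear in either language."""
    return (has(text, "United States") or has(text, "Estados Unidos")) and \
           (has(text, "Senat*") or has(text, "Senad*"))

mixed = "El Senado aprobó el acuerdo con United States"
print(siloed(mixed), integrated(mixed))  # False True
```

A monolingual document matches under both approaches; only the mixed-language one separates them.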
- Platform-specific syntax: Different eDiscovery platforms have their own syntax rules. Terms translated for Relativity, when used in Nuix, can have unintended consequences. These are often invisible until it’s too late, such as in a courtroom or on a call with the DOJ, when someone realizes key documents were never captured because the translation logic didn’t match the search index rules.
In Relativity: Telefonica captures both Telefonica and Telefónica. That’s because Relativity’s indexes use accent folding—meaning the index normalizes accented and unaccented characters, treating them as equivalent during search.
In Nuix: The behaviour depends on how the environment is configured. Most Nuix instances are set to preserve diacritics, meaning Telefonica and Telefónica are treated as distinct terms. As a result, searching Telefonica will miss documents containing the accented Telefónica and vice versa.
This scenario must specifically be accounted for with:
(Telefonica OR Telefónica)
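For indexes that preserve diacritics, the paired clause can be generated rather than typed by hand. The Python sketch below uses the standard unicodedata module to fold accents. The function names are hypothetical, and it only works in one direction: it can strip accents from an accented term, but it cannot guess where accents belong in an unaccented one.

```python
import unicodedata

def fold_accents(term: str) -> str:
    """Strip combining marks, e.g. Telefónica -> Telefonica."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def diacritic_safe_clause(term: str) -> str:
    """Return (folded OR accented) when the two differ, else the term unchanged."""
    accented = unicodedata.normalize("NFC", term)
    folded = fold_accents(accented)
    if folded == accented:
        return term
    return f"({folded} OR {accented})"

print(diacritic_safe_clause("Telefónica"))  # (Telefonica OR Telefónica)
```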
Why Linguist-Led Term Translations Fall Short
Here’s the kicker: Term translations produced by most language service providers—even when guided by detailed instructions from firm case teams or litigation support teams—still broadly miss the eDiscovery context. Linguist-led translation approaches overlook the above-described technical translation necessary for accurate search.
Best practice for search term translation sees a consultant lead the charge. A consultant-led approach blends linguistic expertise with deep technical knowledge of eDiscovery search operators in non-English content. Consultants bring an understanding of higher-order considerations involved in translating search terms that shape the review pool. This know-how ensures translations land exactly where they need to for the task at hand.
In Conclusion
Term translation in eDiscovery is not just about converting words. It requires a delicate balance of linguistic precision and technical expertise. Addressing both dimensions in concert ensures that search terms hit the intended targets, maximizing the accuracy and defensibility of your review process. It’s a complex art, and one in which mastery is essential.
So, the next time you’re faced with a multilingual eDiscovery project, remember: getting your terms right isn't just about knowing a language—it's about understanding the language of search.