Legal professionals and computer scientists have joined forces to introduce LegalBench, a collaborative legal reasoning benchmark that could revolutionize law practice. The benchmark is designed to test the reasoning abilities of large language models (LLMs) such as GPT-3 and Jurassic. Unlike most benchmarks, LegalBench is an ongoing project that anyone can contribute to. The paper outlines an initial set of tasks and their purposes, but the benchmark's full potential will only emerge as the legal and machine learning communities contribute to the project. Its goal is not only to test how well an LLM can identify and analyze legal cases, but also to evaluate its performance on the other responsibilities of a lawyer.

The prime focus of LegalBench is not on whether computational systems should replace lawyers but rather on determining the degree to which these systems can execute tasks that require legal reasoning. In this way, they would augment, educate, or assist lawyers, not replace them. 

The LegalBench benchmark is split into two different types of tasks: IRAC reasoning and non-IRAC reasoning. Issue, Rule, Application, and Conclusion, or IRAC for short, is a framework American legal scholars use to structure legal reasoning. Specifically, it splits the reasoning process into four steps:

  • Issue:

    • The 'Issue' is the legal question or issue that needs to be resolved.

    • It is the problem or question arising from a case's facts.

  • Rule:

    • The 'Rule' refers to the legal rule or principle that applies to the issue.

    • This can be a statute, a case that sets a precedent, or a regulation.

  • Application:

    • The 'Application' involves applying the rule to the facts of the case or situation.

    • It involves analyzing the situation to determine whether the rule applies.

  • Conclusion:

    • The 'Conclusion' is where the analyst states their conclusion based on the analysis.

    • It involves summarizing the issue, the rule, and the analysis and then stating the final conclusion.

For example, in a case involving a client accused of stealing a bike:

  • Issue: Did the client commit theft?

  • Rule: The legal definition of theft is taking someone else's property without their consent and with the intention to deprive them of it permanently.

  • Application: The client borrowed the bike from a friend with permission and intended to return it after using it. Therefore, the client did not take the bike without consent and did not intend to deprive the friend of the bike permanently.

  • Conclusion: The client did not commit theft as he had the owner's consent to use the bike and did not intend to deprive the owner of it permanently.
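
Because IRAC imposes a fixed structure on the analysis, it maps naturally onto a prompt template. The snippet below is a minimal sketch of how the bike scenario above might be posed to an LLM; the `call_llm` helper is a hypothetical stand-in for whatever completion API is being used, not part of LegalBench.

```python
# Minimal sketch: wrapping a fact pattern in an IRAC-style prompt.
# `call_llm` is a hypothetical helper standing in for whichever LLM API you use.

IRAC_TEMPLATE = """You are a legal analyst. Analyze the facts below using the IRAC framework.

Facts: {facts}

Issue: What legal question do the facts raise?
Rule: State the legal rule that governs the issue.
Application: Apply the rule to the facts.
Conclusion: State the outcome in one sentence."""

facts = (
    "A client is accused of stealing a bike. The client borrowed the bike "
    "from a friend with permission and intended to return it after use."
)

prompt = IRAC_TEMPLATE.format(facts=facts)
# response = call_llm(prompt)  # e.g., a GPT-3 or Jurassic completion call
print(prompt)
```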

However, a typical lawyer has responsibilities beyond what IRAC outlines; such tasks include client counseling, negotiation, drafting legal documents, analyzing contracts, and much more. The LegalBench benchmark attempts to cover these tasks as well.

LegalBench employs the IRAC framework to categorize and evaluate various legal tasks:

  • Issue tasks: These assess the ability of LLMs to accurately identify legal issues, focusing on whether the model accurately identifies the issue without bringing up unrelated issues.

  • Rule tasks: These test the LLMs' capacity to remember and state the relevant legal rules, emphasizing whether the model accurately states all rule components without including language that is not part of the actual rule.

  • Application tasks: These examine the LLMs' proficiency in applying the relevant legal rules to the specific facts of a situation, with the key criterion being whether the model can produce a legally coherent explanation of its answer. Chain-of-thought prompts have been found to be effective in obtaining such explanations (see the sketch after this list).

  • Conclusion tasks: These test the LLMs' ability to reach the correct legal outcome based on the previous analysis, with the sole criterion being whether the model correctly determines the outcome (usually a simple yes/no) without needing to provide an explanation.
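
To make the application/conclusion distinction concrete, here is a minimal sketch of a chain-of-thought prompt for an application-style question; `call_llm` is again a hypothetical helper, and the rule wording is the standard hearsay definition rather than text from the benchmark. For a conclusion task, the same prompt would simply ask for a Yes or No with no explanation.

```python
# Sketch of a chain-of-thought prompt for an application-style task.
# `call_llm` is a hypothetical helper for whichever LLM API is in use.

def application_prompt(rule: str, facts: str) -> str:
    return (
        f"Rule: {rule}\n"
        f"Facts: {facts}\n"
        "Question: Does the rule apply to these facts?\n"
        "Let's think step by step, then give a final Yes or No answer."
    )

prompt = application_prompt(
    rule="Hearsay is an out-of-court statement offered to prove the truth of the matter asserted.",
    facts="At trial, a witness repeats what a bystander told her about the crash.",
)
# explanation = call_llm(prompt)
print(prompt)
```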

Moreover, LegalBench also includes non-IRAC tasks, referred to as “classification tasks.” While these tasks do not necessitate IRAC-style reasoning, they are vital for two reasons. First, they are common in the legal profession and demand skills that are uniquely associated with lawyers. Second, legal practitioners have identified these tasks as areas where computational tools could replace or support legal professionals. Hence, gauging the modeling approaches' performance in these areas is crucial.

As mentioned previously, LegalBench is an ongoing project, with community contributors creating additional tasks according to the benchmark's guidelines. The paper introduced 44 initial tasks, grouped into 8 families:

CUAD (Classification)

Objective: Classify different types of clauses in contracts from the EDGAR database.

Details: LegalBench refashioned the CUAD dataset to differentiate between 32 distinct clause types, challenging LLMs to identify target clauses and distinguish them from randomly sampled clauses.
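
One common way to frame this kind of task is a few-shot classification prompt. The sketch below is purely illustrative: the clause texts and labels are invented placeholders, not examples drawn from the actual CUAD data.

```python
# Illustrative few-shot prompt for clause-type classification.
# The clauses and labels below are invented placeholders, not real CUAD examples.

EXAMPLES = [
    ("Either party may terminate this Agreement upon 30 days' written notice.", "Termination"),
    ("This Agreement shall be governed by the laws of the State of Delaware.", "Governing Law"),
]

def clause_prompt(clause: str) -> str:
    shots = "\n\n".join(f"Clause: {text}\nType: {label}" for text, label in EXAMPLES)
    return f"{shots}\n\nClause: {clause}\nType:"

print(clause_prompt("Neither party shall be liable for delays caused by events beyond its reasonable control."))
```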

Rule QA (Rule)

Objective: Respond to questions related to specific legal rules.

Details: This dataset consists of 50 questions that ask the LLM about particular legal rules. These questions typically involve restating a rule, pinpointing where it's codified, or listing the factors integral to the rule.

Abercrombie (Application and Conclusion)

Objective: Evaluate the distinctiveness of a product name or "mark".

Details: This task tests the Abercrombie distinctiveness spectrum, which has five levels of product name distinctiveness. The LLM must predict the distinctiveness level of a given product-mark pair.
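
For reference, the five Abercrombie levels run from least to most distinctive: generic, descriptive, suggestive, arbitrary, and fanciful. The sketch below shows one way such a prediction could be requested; the prompt wording is illustrative, not the exact phrasing used by LegalBench.

```python
# The five Abercrombie distinctiveness levels, ordered from least to most distinctive.
ABERCROMBIE_LEVELS = ["generic", "descriptive", "suggestive", "arbitrary", "fanciful"]

def abercrombie_prompt(product: str, mark: str) -> str:
    # Illustrative zero-shot prompt; not LegalBench's exact wording.
    return (
        f"Product: {product}\n"
        f"Mark: {mark}\n"
        f"Question: Which level of the Abercrombie spectrum best describes this mark: "
        f"{', '.join(ABERCROMBIE_LEVELS)}?\n"
        "Answer:"
    )

print(abercrombie_prompt(product="a brand of running shoes", mark="Swift"))
```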

Hearsay (Application and Conclusion)

Objective: Determine the admissibility of out-of-court statements during a trial.

Details: LegalBench crafted 95 samples that require the LLM to classify each sample based on whether it qualifies as hearsay, examining how statements relate to a factual argument.

Diversity Jurisdiction (Application and Conclusion)

Objective: Assess if state claims can be brought to a federal court.

Details: For a state claim to be eligible, it must involve citizens from different states and meet certain monetary thresholds. Through varied templates and generated sentences, the LLM must decide if diversity jurisdiction is valid.
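
Because the rule reduces to two concrete conditions, it can be written down as a short function, which is roughly the check the LLM must reason through in prose. The sketch below assumes the standard federal requirements: complete diversity of citizenship and an amount in controversy above $75,000.

```python
# Sketch of the two conditions behind diversity jurisdiction, assuming the
# standard federal requirements: complete diversity of citizenship and an
# amount in controversy exceeding $75,000.

def has_diversity_jurisdiction(
    plaintiff_states: set[str],
    defendant_states: set[str],
    amount_in_controversy: float,
) -> bool:
    complete_diversity = plaintiff_states.isdisjoint(defendant_states)
    return complete_diversity and amount_in_controversy > 75_000

print(has_diversity_jurisdiction({"Texas"}, {"Ohio"}, 90_000))   # True
print(has_diversity_jurisdiction({"Texas"}, {"Texas"}, 90_000))  # False: same state
```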

Personal Jurisdiction (Application and Conclusion)

Objective: Determine a court's authority over a defendant for a legal claim.

Details: The LLM evaluates the defendant's interactions with a forum state and verifies if a claim is related to those interactions. There are 50 manually generated scenarios challenging the model to justify personal jurisdiction.

PROA (Application and Conclusion)

Objective: Decide if a statute grants a private right of action.

Details: A private right of action (PROA) permits individuals to take legal action to enforce their rights. The LLM assesses 95 different statutes and determines whether they include a PROA.

Intra-Rule Distinguishing (Issue)

Objective: Identify the correct task name for given fact patterns.

Details: As a prototype for issue-spotting, the LLM is presented with fact patterns from the Hearsay, Personal Jurisdiction, and Abercrombie tasks. The challenge is to map each fact pattern to its corresponding task correctly.

For an up-to-date set of tasks and categories, visit LegalBench's GitHub page. The project can also be found on HuggingFace.
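
If you want to experiment with the tasks directly, the snippet below is a hedged sketch of loading a single task with the HuggingFace datasets library; the repository path and task name shown are assumptions, so confirm the exact identifiers on the GitHub and HuggingFace pages above.

```python
# Hedged sketch: loading one LegalBench task with the HuggingFace `datasets` library.
# The dataset path and task name below are assumptions; verify them on the
# LegalBench GitHub / HuggingFace pages.
from datasets import load_dataset

ds = load_dataset("nguha/legalbench", "abercrombie")  # assumed identifiers
print(ds)                      # inspect the available splits
first_split = next(iter(ds.values()))
print(first_split[0])          # look at one example
```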

Outlooks and Implications

Following its initial tests, LegalBench observed that LLMs generally perform best at classification tasks and worst at application tasks. It also found that model size correlates strongly with performance: the larger the model, the better the results. Significant performance variance was observed even among models with the same architecture, attributed to differences in pretraining regimes and instruction finetuning. In general, LLMs perform better when tasks are reframed as information-extraction or relation-extraction problems. Continued study and testing in new domains such as law should therefore help reveal where models succeed, where they fall short, and how prompting can make them more effective.
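
For conclusion-style tasks, where the expected output is usually a simple yes or no, scoring can be as simple as normalized exact-match accuracy. The helper below is a generic sketch of that idea, not LegalBench's official evaluation code.

```python
# Generic sketch of exact-match accuracy for yes/no conclusion tasks;
# not LegalBench's official evaluation code.

def _normalize(answer: str) -> str:
    return answer.strip().lower().rstrip(".")

def exact_match_accuracy(predictions: list[str], labels: list[str]) -> float:
    correct = sum(_normalize(p) == _normalize(l) for p, l in zip(predictions, labels))
    return correct / len(labels)

print(exact_match_accuracy(["Yes", "no.", "Yes"], ["yes", "No", "No"]))  # ~0.67
```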

Overall, LegalBench will be attractive to a wide range of communities. Legal practitioners can use the benchmark to determine where LLMs may be integrated into existing workflows to improve outcomes for clients. Meanwhile, judges, regulators, and other legal institutions can use it to assess model performance without compromising professional ethics. Finally, computer scientists may benefit from studying the performance of these models in new areas, where unique tasks and distinct lexical properties offer fresh insights.
