How accurate is legal AI, really? What the benchmarks show in 2026

A Stanford study found purpose-built legal AI hallucinates on 17 to 33 percent of queries. Here is what the evidence says about which tasks are safer.

By Caleb Mercer · 10 min read

Legal AI is heavily marketed to law firms today. At least two major vendors use "hallucination-free" language to sell their software. However, independent researchers have tested these platforms. They found that these claims do not match real-world performance. The gap between what vendors promise and what legal AI actually delivers is wide.

To evaluate whether you can trust legal AI, you need objective data. Independent benchmarks carry unusual weight here because this category has almost no user reviews (see Why Almost No Legal AI Tool Has Reviews (And How to Vet One Anyway)). Without that user feedback, attorneys must rely on empirical testing.

Fortunately, independent data is available. The primary benchmarks include the Vals AI legal benchmark, first published in February 2025 with a follow-up in October 2025. We also have the landmark Stanford RegLab hallucination study from May 2024. Together, these studies provide the closest thing the legal tech market has to peer review on accuracy.

The February 2025 Vals AI benchmark findings

The first major independent benchmark was the Vals AI VLAIR study in February 2025. This study evaluated four legal AI tools: CoCounsel Legal, Harvey, Vincent AI, and Oliver. The researchers tested the tools across seven specific legal tasks. These tasks included data extraction, document Q&A, document summarization, redlining, transcript analysis, chronology generation, and EDGAR research.

According to the Vals AI February 2025 report, CoCounsel Legal averaged 79.5% accuracy across the four tasks on which it was evaluated, including 77.2% on the document summarization task. Harvey participated in six of the tasks and earned the top score on five.

However, AI did not beat human experts in every category. Human lawyers outperformed every single AI tool on redlining tasks. These February 2025 figures show that while AI was highly capable, it was not perfect. Keep in mind that these are earlier figures. The field has moved forward since these tests were conducted.

The October 2025 Vals AI research benchmark

In October 2025, Vals AI released a second benchmark. This round tested three legal AI tools: Alexi, CounselStack, and Midpage. The researchers also tested ChatGPT and compared all of them against a professional lawyer baseline. The test used 210 research questions across nine different task types.

The results revealed a tight cluster. Human lawyers averaged 71% accuracy. ChatGPT scored 80%. The three specialized legal AI tools scored between 78% and 81%, putting all of them within two points of the ChatGPT baseline and above the lawyer baseline.

However, there is a critical caveat to this October benchmark. According to a Legal IT Insider report, the market's largest commercial research platforms, Westlaw AI and Lexis+ AI, did not participate. This means we do not know how the largest research databases would have scored against this specific test set.

The October 2025 benchmark used weighted criteria to evaluate performance. Accuracy accounted for 50% of each tool's score. Authoritativeness, which requires reliable source citations, accounted for 40%. The final 10% was based on appropriateness.
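As a rough illustration of how a 50/40/10 weighting combines into a single score, here is a minimal Python sketch. The weights come from the benchmark description above. The function name and the sample sub-scores are hypothetical and are not figures from the Vals report.

```python
# Criterion weights as described for the October 2025 benchmark.
WEIGHTS = {"accuracy": 0.50, "authoritativeness": 0.40, "appropriateness": 0.10}


def weighted_score(subscores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-100) into one weighted score."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical sub-scores, invented for illustration only.
example = {"accuracy": 80.0, "authoritativeness": 70.0, "appropriateness": 90.0}
print(round(weighted_score(example), 1))  # 0.5*80 + 0.4*70 + 0.1*90 = 77.0
```

Under a weighting like this, a tool with strong accuracy but weaker source citations can land several points below its accuracy figure alone.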

Together, the two Vals benchmarks show that AI can match or modestly exceed human lawyer baselines on structured, closed-universe research tasks. However, the cohorts in both studies were small, and the largest market players were missing from the second round. Treat these benchmarks as a partial snapshot of the industry rather than a definitive ranking.

The Stanford RegLab hallucination study

While benchmarks show solid baseline performance, the risk of errors remains high. The most significant issue is hallucination. The Stanford RegLab study, published in May 2024, directly addressed this problem. The study was titled "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools."

Stanford researchers tested three commercial legal AI products and GPT-4 on 202 legal queries. Legal experts scored the responses by hand. The three commercial products were Lexis+ AI, Ask Practical Law AI, and Westlaw AI-Assisted Research.

The findings contradicted vendor marketing claims. According to the Stanford RegLab study, the measured hallucination rates were:

  • Lexis+ AI: 17%
  • Westlaw AI-Assisted Research: 33%
  • GPT-4: 43%

A follow-up study run in June 2024 confirmed these findings. The second run showed that Westlaw AI-Assisted Research hallucinated at roughly double the rate of the LexisNexis product.

These error rates are notable when compared to vendor statements. LexisNexis had publicly claimed "100% hallucination-free linked legal citations." Thomson Reuters claimed its tools "avoid hallucinations." The Stanford researchers came to a direct conclusion, stating that "providers' claims are overstated."

Why RAG reduces but does not solve hallucinations

Many vendors use retrieval-augmented generation (RAG) to reduce these errors. RAG grounds the AI model by having it retrieve real legal documents from a database and answer from them, instead of relying only on its training memory. This likely explains much of the gap between the 17% hallucination rate measured for Lexis+ AI and the 43% rate measured for GPT-4.

However, RAG does not eliminate the problem. The underlying AI model must still interpret, summarize, and extract information from the retrieved documents. During this step, the model still introduces errors. RAG is a major step forward, but it is not a complete solution.

What legal hallucination looks like in practice

To understand these risks, you must know what hallucination actually means in legal work. It is not just a general error. It shows up in four distinct ways:

  • Fabricated citations: The AI cites a case that does not exist. It may also invent a docket number that returns no real results.
  • Mischaracterized holdings: The case exists, but the AI describes a legal holding that is opposite to or different from what the court decided.
  • Inapplicable authority: The AI cites a real case, but the case is from the wrong jurisdiction. It might also cite a case that was overruled by a controlling decision.
  • Invented chronology facts: In timeline and transcript tasks, the AI inserts dates or events that never occurred in the source files.

The real-world consequences

These percentages translate to real-world consequences. An analysis by AI Law Librarians in February 2026 documented nearly 1,000 cases where lawyers or self-represented litigants submitted AI-generated hallucinations to courts. These are not theoretical issues. They are active risks that occur when error rates hover in the 17% to 43% range.
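To see why per-query rates in this range matter across a whole matter, consider the chance that at least one answer in a batch contains a hallucination. The sketch below assumes every query fails independently at the same rate, which is a simplification: real queries are correlated, and your practice area may differ from the study's query set. The function name is hypothetical.

```python
def chance_of_at_least_one(rate: float, queries: int) -> float:
    """Probability of at least one hallucination in a batch of queries.

    Uses 1 - (1 - rate) ** queries, which assumes independent queries
    that all share the same per-query hallucination rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if queries < 0:
        raise ValueError("queries must not be negative")
    return 1.0 - (1.0 - rate) ** queries


# Per-query rates reported in the Stanford RegLab study.
for label, rate in [("17%", 0.17), ("33%", 0.33), ("43%", 0.43)]:
    p = chance_of_at_least_one(rate, queries=10)
    print(f"{label} per query, 10 queries: {p:.0%} chance of at least one")
```

Under those assumptions, even the lowest measured rate of 17% gives roughly an 84% chance that a batch of ten research queries contains at least one hallucinated answer.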

Accuracy varies by task

To manage these risks, you must look at specific task categories. Accuracy is not uniform. Some tasks are inherently safer than others.

Low-risk task categories

Lower-risk tasks usually involve pattern-matching against a closed set of documents. In these tasks, errors are usually omissions rather than completely fabricated information.

  • Document extraction from structured sources: This involves pulling defined fields like dates, parties, and dollar amounts from files. Tools like Supio use this for medical record review, while Everlaw uses it for e-discovery. The AI matches clear patterns, and attorneys can verify the fields directly against the source documents. For a deeper look at discovery tools, see our guide on the Best AI eDiscovery Platforms for Law Firms (2026).
  • First-draft summarization with human review: The February 2025 Vals benchmark put CoCounsel Legal at 77.2% on document summarization. When the AI summarizes a closed document, the more typical failure is leaving out an important detail rather than inventing new facts. This is safe if an attorney reviews the summary against the original file.
  • Template-driven form completion: Filling out known fields in a standardized form is a low-risk task. If you use a fixed template and verify the final document, the risk remains low.

High-risk task categories

Higher-risk tasks require the AI to generate new text, interpret legal meaning, or find external sources. These are the areas where the Stanford hallucination data applies.

  • Legal research with citation generation: This is the most sensitive area of AI use. With measured hallucination rates between 17% and 33% for the research tools Stanford tested, you should not use tools like Paxton AI or Lexis+ with Protege without manual verification. Unsupervised use is not defensible. You can read more about options in this space in our comparison of the Best AI Legal Research Tools for Law Firms (2026).
  • Contract review and redlining: The February 2025 Vals benchmark showed that lawyers still outperform AI on redlining tasks. Tools like Spellbook are excellent for finding missed clauses or suggesting standard language, but they cannot replace attorney review. You can compare draft assistants in our guide to the Best AI Contract Review & Drafting Tools for Lawyers (2026).
  • Chronology and timeline generation: Chronology tools can sometimes insert events that do not exist. Even demand letter engines like EvenUp require careful validation of all dates and medical facts against the actual medical records.
  • Novel legal reasoning: AI struggles with multi-step analysis, circuit splits, or unique fact patterns. No current benchmark supports using AI for unsupervised legal analysis in complex cases.

How to test a tool's accuracy yourself

You should not rely solely on vendor demonstrations or general benchmarks. You need to test accuracy on your own matters. This methodology is distinct from standard security reviews or vendor due diligence. It focuses entirely on measuring error rates.

  1. Run the tool on closed matters. Choose 10 to 20 cases that your firm has already resolved. Ask the AI to research the exact legal questions you previously litigated. Compare the AI's output to your actual legal work. This reveals the tool's true error rate in your specific practice area.
  2. Spot-check the citations it produces. Take 20 consecutive citations generated by the tool. Manually verify three things for each one: that the case exists, that it says what the AI claims, and that it remains good law. Do this before you use the tool for active client work.
  3. Build a fixed test set. Select 20 representative documents from your practice. Use a mix of simple and complex files. Run the same prompt on all 20 files and score the results by hand. Track your scores across different tasks. Use this same test set to evaluate the tool again after any major software update.
  4. Compare tools on identical inputs. If you are choosing between two options, do not rely on pre-made vendor demos. Upload the same 5 to 10 documents to both tools. Run identical prompts and score them using the exact same criteria.
  5. Set an explicit error-rate ceiling before you start. Decide on the maximum error rate you will accept before you begin testing. For example, you might accept a 5% error rate on a first-draft summary under attorney review. You might require a 0% error rate for unsupervised citations. Write these limits down and hold the tools to them.
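The hand-scoring in the steps above can be tallied with a short script. This is a minimal sketch, not a validated methodology: the record layout, the names CitationCheck, error_rate, and wilson_interval, the 5% threshold, and the sample data are all hypothetical. The Wilson score interval is included as one standard way to show how much uncertainty a 20-item sample carries.

```python
import math
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """One hand-verified citation (hypothetical record layout)."""
    exists: bool             # the cited case can actually be found
    says_what_claimed: bool  # the holding matches the AI's description
    good_law: bool           # the case has not been overruled

    @property
    def passed(self) -> bool:
        return self.exists and self.says_what_claimed and self.good_law


def error_rate(checks: list[CitationCheck]) -> float:
    """Share of citations that failed at least one of the three checks."""
    if not checks:
        raise ValueError("no citations were scored")
    return sum(1 for c in checks if not c.passed) / len(checks)


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Invented sample: 20 citations, 3 of which failed a check.
checks = [CitationCheck(True, True, True)] * 17 + [
    CitationCheck(False, False, False),  # fabricated case
    CitationCheck(True, False, True),    # mischaracterized holding
    CitationCheck(True, True, False),    # no longer good law
]

MAX_ERROR_RATE = 0.05  # example limit, written down before testing

rate = error_rate(checks)
failures = sum(1 for c in checks if not c.passed)
low, high = wilson_interval(failures, len(checks))
print(f"observed error rate: {rate:.0%} (95% interval {low:.0%} to {high:.0%})")
print("within limit" if rate <= MAX_ERROR_RATE else "exceeds limit")
```

With 3 failures in 20 citations, the interval runs from roughly 5% to 36%. A sample this small can flag a bad tool, but it cannot confirm a very low error rate, so a clean result on 20 items is a reason to keep testing rather than to stop verifying.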

Evaluating accuracy is only one part of the process. For a broader look at how to handle vendor selection when public information is scarce, read our guide on Why Almost No Legal AI Tool Has Reviews (And How to Vet One Anyway).

FAQ

Does legal AI hallucinate?

Yes. Independent testing shows that purpose-built legal tools hallucinate. In the 2024 Stanford study, specialized tools had hallucination rates of 17% and 33%. General models like GPT-4 recorded a higher rate of 43%. These hallucinations can include fabricated case law or invented dates.

Is legal AI accurate enough to rely on?

It depends on how you use it. For structured data extraction or draft summaries under human supervision, the technology is reliable enough to use, provided an attorney checks the output against the source. For legal research or contract redlining, you cannot rely on the output without independent verification. You should treat legal AI as a starting draft, not a finished product.

Are "hallucination-free" legal AI claims true?

No. Independent testing contradicts these claims. When Stanford researchers tested tools marketed with "hallucination-free" language, the tools still hallucinated on 17% to 33% of queries. The study concluded that these vendor claims are overstated.

Which legal AI tasks are safest to automate?

The safest tasks are structured data extraction and first-draft summarization. These tasks rely on a closed set of files. The most dangerous tasks are legal research, citation generation, and unsupervised contract redlining. These tasks require external information and complex legal interpretation.

The Bottom Line

The available benchmark data shows that verification is not optional. AI tools are highly capable, but they are not independent legal experts. The Stanford and Vals studies show that even the most advanced tools make errors. You must build human review into every AI workflow.

None of this means legal AI is useless. The benchmarks show that AI can match or exceed lawyer baselines on structured research tasks. It is a powerful assistant for document review, extraction, and initial drafting. For solo practitioners, using these tools under proper supervision is a viable way to increase efficiency, and our guide on Legal AI for Solo & Small Law Firms: A Buyer's Guide covers tailored options.

No single tool has achieved zero errors. The key is to understand the strengths and weaknesses of each platform. For a broader overview of how to vet a tool when public information is scarce, see our guide on Why Almost No Legal AI Tool Has Reviews (And How to Vet One Anyway).