GPT-5 Tops Harvey’s BigLaw Bench Eval – Artificial Lawyer

August 8, 2025


As AL shared last night, Harvey – and other companies – have had early access to GPT-5. The genAI pioneer has analysed the new LLM’s outputs and marked it as the best-performing OpenAI model using its ‘BigLaw Bench’ AI evaluation system. It scored 89.22% overall.

The company launched BigLaw Bench (see AL article) last year to help with gauging the quality of genAI responses, in particular relative to how a lawyer would expect an acceptable response to read.

As they explained at the time – ‘Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:

Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model’s response based on specific criteria essential for effective task completion.
Source Reliability: Assesses the model’s ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.
Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps (e.g. hallucinations).
Those scores are then expressed as percentages.’
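To make that scoring description concrete, here is a minimal Python sketch. Harvey has not published the exact formula in the passage quoted above, so this rests on stated assumptions: net points are divided by the maximum positive points available, and a net-negative result is floored at 0%. The `Criterion` structure, the function name and the sample rubric lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, not Harvey's schema)."""
    description: str
    points: float    # positive for a requirement, negative for a penalty
    triggered: bool  # True if the requirement was met / the error was made


def rubric_score(criteria: list[Criterion]) -> float:
    """Return net points earned as a percentage of the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.points for c in criteria if c.triggered)
    # Assumption: a net-negative result is floored at 0%.
    return max(0.0, earned / max_points) * 100


# Hypothetical rubric for a single task.
example = [
    Criterion("Identifies the governing clause", 4, True),
    Criterion("Explains the notice period", 3, True),
    Criterion("Flags the indemnity carve-out", 3, False),
    Criterion("Cites a non-existent case (hallucination)", -2, True),
]
print(f"{rubric_score(example):.1f}%")  # prints "50.0%"
```

In this made-up case the response earns 7 of 10 available points and loses 2 for a hallucinated citation, which leaves 5 of 10.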

And below is the chart they have provided. As you can see, GPT-5 scored 89.22%, a notable improvement of around five percentage points on the next closest result shown, another OpenAI model, o3, at 84.13%. (Note: Harvey uses other companies’ models, not just OpenAI, but those are not shown here.)
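The gap between the two scores can be read in more than one way, and a few lines of Python show the difference. Only the two scores come from Harvey's data; the variable names and the 'shortfall' framing are illustrative. The absolute gap is about 5.09 percentage points, roughly 6% relative to o3's score, and about a third of the distance o3 still had to go to reach 100%.

```python
gpt5, o3 = 89.22, 84.13  # BigLaw Bench scores reported by Harvey

point_gain = gpt5 - o3                         # absolute gap, in percentage points
relative_gain = point_gain / o3 * 100          # gain relative to o3's own score
shortfall_cut = point_gain / (100 - o3) * 100  # share of o3's remaining gap to 100% closed

print(f"{point_gain:.2f} points")
print(f"{relative_gain:.2f}% relative gain")
print(f"{shortfall_cut:.2f}% of o3's shortfall closed")
```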

[Chart: BigLaw Bench scores] *Harvey data, August 2025.*

Moreover, this is really starting to get close to ‘last mile’ territory.

That is, the closer we get to output where lawyers can say ‘yep, that’s fine, let it through’, the harder each further gain becomes.

Getting to ‘it’s kind of right, but needs some work to get to the level I want’ is relatively easy for many LLMs. But getting up to 90% and then into that massive last mile on the journey to 99% is a totally different experience.

But, we are moving in the right direction. Plus, these outputs will improve as Harvey (and other legal tech companies) applies refinement, system prompting, and orchestration with related data.

Which raises the question: can we ever get to 99.9% on BigLaw Bench? Probably not for some years yet, but eventually…? Why not? It goes back to the Waymo analogy this site has used a few times now: getting to the level of success where people just go with it is incredibly hard in a super-complex, unstructured environment but, as Waymo showed, it can be done with enough time and investment.

Will new genAI models get much better? It’s hard to say. There will be incremental improvements for sure. But, bigger steps may come from other strategies, such as improving the verification layer.

Either way, we are making progress, and at an incredible pace. In three years we have gone from scepticism about AI to a majority of large law firms, and their clients, engaging deeply with the technology. And central to this change is the performance of the models. If those LLMs didn’t deliver, then the lawyers would not be so enthusiastic about the current wave of legal AI tools.

—

Right, what else?

In Harvey’s blog post on the new model, they also added some details about their own plans on how to leverage GPT-5:

‘Integrated into Harvey’s systems, these baseline capabilities can be leveraged to enable more powerful use cases in the document drafting and complex research domains. GPT-5 is also the first orchestration model that appears capable of combining these tasks—allowing for a single agent to both collaborate with a user on the research and produce the finished work product.

For example, on a task like: ‘Identify if any of these internal guidance documents are inconsistent with current regulation, we operate in the United States and the European Union’ . . . GPT-5 can be used to orchestrate agents that:

Review the internal documents to identify relevant trends to search for;
Find recent changes in global regulation;
Perform a comprehensive review of any gaps between the two; and
Draft a memo of recommendations of how to best update your internal guidance to stay aligned with the new regulatory environment.
All while prompting the user as needed for additional context to ensure it reaches the goal as expected.

Coupled with our recently-announced data partnerships with LexisNexis and iManage, Harvey is now able to see the full picture – public and proprietary – before it acts. With GPT-5’s substantially improved tool-use and drafting capabilities, we can now build a deeply integrated AI system that reasons over an organization’s internal data and leverages trusted third-party content in real-time.

Building an Intelligent Coworker

Complex matters don’t unfold linearly; they advance dynamically through iteration, and in close collaboration with internal and external stakeholders. With GPT-5, and our product and data ingredients in place, Harvey’s north star of creating an intelligent coworker comes into focus.’
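The four-step example in Harvey's post can be pictured as a simple pipeline. The Python sketch below is purely illustrative: every function is a hypothetical stub returning canned values, none of it reflects Harvey's or OpenAI's actual APIs, and a real system would call a model or a research tool at each step.

```python
from typing import Callable

# Hypothetical callback type: takes a question, returns the user's answers.
AskUser = Callable[[str], list[str]]


def extract_topics(internal_docs: list[str]) -> list[str]:
    # Stub: treat the first word of each document as its topic.
    return sorted({doc.split()[0].lower() for doc in internal_docs if doc.strip()})


def find_regulatory_changes(topics: list[str], jurisdictions: list[str]) -> dict[str, list[str]]:
    # Stub: a real agent would query a legal research source here.
    return {t: [f"placeholder update on {t} ({j})" for j in jurisdictions] for t in topics}


def find_gaps(changes: dict[str, list[str]]) -> list[str]:
    # Stub: flag every change as a potential gap for a human to review.
    return [
        f"{topic}: check guidance against {change}"
        for topic, changes_for_topic in changes.items()
        for change in changes_for_topic
    ]


def draft_memo(gaps: list[str]) -> str:
    # Stub: format the gaps as a short recommendations memo.
    return "Recommended updates:\n" + "\n".join(f"- {g}" for g in gaps)


def orchestrate(internal_docs: list[str], ask_user: AskUser) -> str:
    # Prompt the user for missing context before running the steps in order.
    jurisdictions = ask_user("Which jurisdictions do you operate in?")
    topics = extract_topics(internal_docs)
    changes = find_regulatory_changes(topics, jurisdictions)
    gaps = find_gaps(changes)
    return draft_memo(gaps)


memo = orchestrate(
    ["Privacy guidance, 2023 edition", "Retention schedule for client files"],
    ask_user=lambda question: ["United States", "European Union"],
)
print(memo)
```

The point of the shape is that one orchestrator owns the whole flow, including the step where it asks the user for context, rather than the user stitching separate tools together.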

—

You can find more about Harvey and read the original post here. Thanks to CEO Winston Weinberg and team for sharing.

