The Matter Is the Benchmark

Most legal AI benchmarks ask whether an agent can complete a task. A live M&A matter asks a harder question: can agents help the deal move?

by Firmwork Research

What 425 workflow runs and 6,685 completed tool outputs reveal about AI-native M&A execution in a live transaction

A transaction does not move because a lawyer receives one good AI answer.

It moves when hundreds of small pieces of matter context turn into reviewed work product at the right time: a markup returned before a client call, an appendix list corrected before signing, a lease package revised before closing, a late operating issue resolved before it blocks closing.

That is why isolated legal AI benchmarks miss part of the question. They can test whether an agent completes a task. A live M&A matter tests whether agents can help the deal move.

We analyzed the operating trace of one live Spanish M&A transaction supported by Firmwork. The public-safe trace includes:

  • 1,619 matter emails and communications;
  • 99 formal email threads;
  • 121 matter attachments;
  • 425 project-linked Firmwork workflow runs across a roughly eight-week matter window;
  • 6,685 completed tool outputs in the refreshed database, aligned with the normalized 6,648-call parser;
  • 755 live-period change sets;
  • 2,402 workspace files;
  • 5,760 workspace events;
  • document-like outputs grouped into 41 delivery packages.

The result was not a single "AI drafted an SPA" story. Firmwork supported work across SPA, APA, leases, signing, closing, appendices, side letters, corporate books, research, VDR/DD review and final addenda.

More importantly, the matter trace lets us measure something closer to operational leverage: how often requests became work product, where in the deal work concentrated, what kinds of artifacts were produced, how much review survived, and where the system still failed.

The unit of analysis is not the prompt. It is the matter.

In this article, a delivery package means a cluster of one or more document-like outputs returned together in response to a practical matter request. One package may be a single SPA markup, a closing agenda, a lease review pack, or a set of assignment documents delivered as one work product. We use packages because M&A work is often delivered in bundles, and counting individual files can overstate volume.
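
As a rough illustration of why packages rather than files are the counting unit, the sketch below groups document-like outputs by the delivery they were returned in. The `Output` record, the `delivery_id` field and the file names are hypothetical, chosen for this example; they are not Firmwork's schema or the matter's data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    filename: str
    delivery_id: str  # hypothetical: identifies one return to the legal team


def group_into_packages(outputs):
    """Cluster outputs that were returned together into delivery packages."""
    packages = defaultdict(list)
    for out in outputs:
        packages[out.delivery_id].append(out.filename)
    return dict(packages)


# Invented example: one single-file package and one three-file lease pack.
outputs = [
    Output("spa_markup.docx", "delivery-001"),
    Output("lease_a.docx", "delivery-002"),
    Output("lease_b.docx", "delivery-002"),
    Output("lease_index.xlsx", "delivery-002"),
]
packages = group_into_packages(outputs)
print(len(outputs), "files ->", len(packages), "packages")
```

Counting the four files individually would double the apparent volume relative to the two returns the legal team actually received.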

Matter-system trace

Operational evidence exported from one live M&A transaction

Figure 1. Matter-system trace. The evidence surface is a matter operating system trace, not a chat transcript. Note: Workflow runs, tool calls, workspace files, events, VFS activity, documents, and review artifacts exported from one live M&A matter.

Why M&A is hard for agents

M&A work is not one long document task. It is a moving system.

At any point in a transaction, the relevant context may live in:

  • an SPA markup;
  • an email from counterparty counsel;
  • an Excel appendix;
  • an informal instruction from the legal team;
  • a folder of VDR materials;
  • a previous issue list;
  • a client instruction from three days earlier;
  • a closing agenda that changed twice since yesterday.

The hard part is not only reasoning over long documents. It is knowing what matters now.

There is another problem that matters even more in practice: carrying issues forward. A risk identified in diligence may reappear in the SPA, then in a side letter, then in a closing condition, then in a last-minute email. If the system treats each request as a fresh prompt, the issue gets rediscovered instead of managed. A useful M&A agent has to remember open issues, understand how they evolve, and keep them attached to the matter until they are resolved or deliberately dropped.

Early in a deal, the work is often about building context: reviewing diligence materials, understanding the perimeter, identifying open issues and turning source documents into a usable matter state.

In the middle, the work becomes coordination: integrating comments, producing markups, preparing signing documents, organizing appendices, and translating negotiated positions into work product.

Near closing, the work becomes compression: leases, waivers, corporate books, certificates, addenda, signatures, missing attachments, counterparty versions, last-minute factual corrections.

This is where a matter-level agent system is different from a drafting interface. A drafting interface starts with a document. A matter system starts with the file, the channels, the state and the deadline.

Dataset

We studied one live transaction through four layers of evidence.

The communication layer contains 1,619 matter emails and communications across the live matter window, including 99 formal email threads, 121 attachments and 16 communication prompts reviewed for qualitative context. We treat this as the intake surface of the matter: the stream of requests, updates, follow-ups, source documents and late-breaking factual context that normally drives a transaction team.

The Firmwork execution layer contains 425 project-linked workflow runs across a roughly eight-week matter window. The refreshed database records 6,685 completed tool outputs, aligned with the normalized 6,648-call parser, with a median nonzero run depth of 12 tool calls and a maximum observed run depth of 110 tool calls. That matters because the system was not operating as a chat wrapper. It was repeatedly reading, searching, writing, editing, validating, routing models and packaging work against a matter file.

The matter-system layer contains 2,402 workspace files, 1,771 project documents, 5,760 workspace events, 898 change proposals, 755 change sets and 1,392 VFS branch entries. This is the infrastructure surface: not only what the model said, but what the system touched, versioned, proposed, routed and reviewed.

The work-product layer contains document-like outputs grouped into 41 delivery packages. This is the reviewable surface of the matter: the artifacts and proposed changes that lawyers could inspect, accept, reject, supersede or use in downstream deal work.

This is not a controlled benchmark. It is an observational trace. That makes it messier, but also more useful for understanding legal work in production.

Benchmarks tell us whether agents pass tasks. Matter traces tell us whether agents fit the way legal work actually happens.

Method

We did not count every model response as value.

Instead, we reconstructed the matter around eight operating metrics.

  1. Matter-system trace.

We measured the operational surface of the matter: workflow runs, completed tool outputs, workspace files, workspace events, VFS branches, change proposals and change sets.

  2. Agent execution depth.

We measured tool calls per run to distinguish shallow chat usage from multi-step file, document, email and validation work.

  3. Model routing.

We grouped AI usage ledger rows by transaction work episode and model tier to see which work pulled frontier reasoning models, which work used frontier-standard models, and which work shifted toward fast low-cost or open and alternative model families.

The current export records observed model usage, cost and token volume. It does not reliably distinguish auto-router selection from direct model selection, so we treat routing claims as observed model-mix economics rather than audited policy attribution.

  4. Phase and factory pressure.

We divided the matter into phases and measured runs, completed tool outputs, change sets, workspace versions and AI cost by phase.

  5. Request-to-artifact conversion.

We grouped document-like outputs into delivery packages and matched those packages to prior request signals in the matter communications. A delivery package is the transaction-level unit of work product: a coherent return to the legal team, whether it contains one file or several related files.

The package ledger is a traceability proxy, not a causality proof, but it lets the matter be audited at the level lawyers actually experience: request, work episode, reviewed artifact, delivery.

  6. Workstream coverage.

We classified delivered packages and run labels across SPA, APA, leases, signing, closing, appendices, corporate books, side letters and related workstreams.

  7. Review survival.

We measured whether Firmwork change sets were accepted, auto-accepted, rejected, superseded or still pending.

  8. Economic metering and capital allocation.

We measured AI cost, token usage and contribution-adjusted Return on Token to ask how much reviewed matter movement each unit of AI consumption bought.

The goal was to answer one question:

Did Firmwork install a production loop around the matter?

Result 1: Work concentrated around signing and closing, not only initial drafting

The most obvious assumption would be that AI value clusters around the first SPA draft or markup.

That is not what the trace shows.

The work concentrated in the operational middle and late parts of the deal, where transaction teams usually fight version drift, appendix control, closing documents, lease details, counterparty changes and factual corrections.

The signing architecture, side letter and annexes phase in the middle weeks of the matter had 119 runs, 1,548 completed tool outputs, 164 change sets and 434 VFS branch entries. The peak signing, APA and appendices phase added 83 runs and 1,335 completed tool outputs. The leases, waivers and closing execution phase in the later closing weeks had 92 runs, 1,797 completed tool outputs, 278 change sets, 250 workspace versions and 411 VFS branch entries.

This matters for product design. The most valuable M&A agent is not only a better first-draft generator. It has to support the ugly middle and late parts of a transaction, where many small artifacts and factual changes have to converge quickly.

Where work concentrated across the deal

Completed tool outputs by transaction phase

Figure 2. Phase workload. Agent work peaked around signing preparation and closing execution, not only first drafting. Note: Completed tool outputs by matter phase, using the refreshed blog data.

Result 2: The agent loop had real depth

The more important system metric is not how long a single run took. It is how much operational work happened inside the run.

Across the project-linked matter window, the median tool-using run contained 12 tool calls. The p75 run contained 28 tool calls. The p90 run contained 52 tool calls. The deepest observed run contained 110 tool calls.
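
For readers who want to reproduce this kind of depth summary on their own run logs, a minimal sketch follows. It assumes a plain list of tool-call counts per run and uses inclusive percentile interpolation; the article does not state which percentile method was used, and the sample values are invented rather than taken from the matter.

```python
import statistics


def depth_summary(tool_calls_per_run):
    """Summarize execution depth over runs that made at least one tool call."""
    depths = sorted(d for d in tool_calls_per_run if d > 0)
    # quantiles(n=100) returns 99 cut points; index 74 is p75, index 89 is p90.
    cuts = statistics.quantiles(depths, n=100, method="inclusive")
    return {
        "runs": len(depths),
        "median": statistics.median(depths),
        "p75": cuts[74],
        "p90": cuts[89],
        "max": depths[-1],
    }


# Invented sample, not the matter's data: twelve runs, two with no tool calls.
sample = [0, 3, 5, 8, 12, 12, 15, 28, 40, 52, 110, 0]
summary = depth_summary(sample)
print(summary)
```

Excluding zero-call runs matters here: the reported median is a median over tool-using runs, so chat-only runs should not pull it down.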

That is a different usage pattern from a drafting chatbot. These runs were often doing the work of a junior transaction team member: reading files, searching emails, extracting facts, writing drafts, editing DOCX files, checking outputs, updating state and producing packages.

The top tool-heavy runs were not abstract demos. They were live deal tasks: filling lease placeholders from VDR and web sources, preparing a side letter from diligence findings, reading company accounts, investigating sublease issues, adapting lease contracts and building closing documents.

This matters because a live matter does not reward isolated reasoning. It rewards repeatable execution against changing source material.

Agent execution depth

Tool calls per tool-using run

Figure 3. Agent execution depth. The median tool-using run was a multi-step loop, with a long tail of deep execution. Note: Depth distribution across tool-using workflow runs.

Result 3: The final act matters

The refreshed database changed the article's shape.

The earlier local export made the final addendum and assignment-document packages look weakly linked to Firmwork runs because the run export stopped before the matter did. The live database shows 32 workflow runs, 44 AI usage rows, 40 change sets, 22 accepted or auto-accepted change sets, and 670 completed tool outputs in the final closing days.

That does not prove causality for every delivered file. It does show that the closing window was still an active execution period, not merely a human-only tail after the agent trace ended.

Result 4: Model routing changed with the work

The model trace shows something more interesting than one provider winning the matter.

The matter did not behave like a single-model benchmark. Frontier models carried high-judgment drafting and negotiation work. Faster lower-cost or alternative models became more prominent during high-volume execution, especially as the system had more matter state and review context to operate against.

In the completed AI usage ledger, an earlier frontier-heavy week was almost entirely frontier and reasoning spend: USD 193.13 over 172.5M raw tokens. By a later high-volume closing week, fast or lower-cost models carried 100.1M raw tokens for USD 45.48, while frontier models handled 14.0M raw tokens for USD 29.32. In the final closing week, frontier spend rose again for final addenda and higher-judgment closing work.
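
The unit economics implied by those figures can be worked out directly. The sketch below uses only the spend and raw-token numbers quoted above. The blended closing-week figure is an illustrative assumption: it covers just the two tiers quoted, and it is computed on raw tokens, so it will not necessarily match a chart built on token-equivalents.

```python
def usd_per_million(cost_usd, tokens_millions):
    """Cost in USD per 1M raw tokens."""
    return cost_usd / tokens_millions


# Figures quoted in the article (raw tokens, in millions).
frontier_heavy_week = usd_per_million(193.13, 172.5)
closing_fast_tier = usd_per_million(45.48, 100.1)
closing_frontier_tier = usd_per_million(29.32, 14.0)

# Blend of the two closing-week tiers quoted above; any tier the article
# does not quote for that week is not included.
closing_blended = usd_per_million(45.48 + 29.32, 100.1 + 14.0)
fast_token_share = 100.1 / (100.1 + 14.0)

for label, value in [
    ("frontier-heavy week", frontier_heavy_week),
    ("closing week, fast tier", closing_fast_tier),
    ("closing week, frontier tier", closing_frontier_tier),
    ("closing week, blended", closing_blended),
]:
    print(f"{label}: USD {value:.2f} per 1M raw tokens")
print(f"fast-tier share of quoted closing-week tokens: {fast_token_share:.1%}")
```

On these numbers the frontier tier cost more per token in the closing week than in the frontier-heavy week, yet the blended rate was lower because most of the volume ran on the fast tier.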

This is an early routing analysis, not an audited model policy. But it points to a more important benchmark than raw model quality: which parts of a transaction actually need frontier reasoning, and which parts need reliable execution at scale.

AI spend vs token usage by model tier

Bars are weekly AI spend. The line is token-equivalent usage.

Figure 4. AI spend vs token usage. As the matter moved into high-volume execution, token volume shifted toward cheaper model tiers. Note: Observed model-mix economics, not audited auto-router attribution.

Cost per token after the routing mix shifted

Bars are USD per 1M token-equivalents. The line is fast/open token share.

Figure 5. Cost per token after routing shifted. The cost per 1M token-equivalent units fell as fast lower-cost models took more volume. Note: Weekly cost-per-token view generated from the model-spend evidence CSV.

Model routing by work type

Share of matched AI spend by model tier across transaction work episodes

Figure 6. Model routing by work type. Different workstreams pulled different model tiers rather than one model carrying the whole matter. Note: Grouped by transaction work episode and model tier.

Result 5: The closing factory was the strongest operating signal

The most impressive part of the trace is the closing lease and employee-notification workstream.

The comparison is unusually clean: the earlier SPA, side letter and closing-agenda phase and the closing lease factory phase each had 56 runs. Across those equal run counts, the closing factory phase produced 1,421 tool calls (versus 891), 216 change sets (versus 77) and 240 workspace versions (versus 119), while attributed run cost was lower.
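
Because both phases had the same number of runs, the comparison reduces to simple per-run rates and multiples. The sketch below recomputes them from the counts quoted above; the dictionary keys and helper names are illustrative only.

```python
RUNS = 56  # both phases had 56 runs

earlier_phase = {"tool_calls": 891, "change_sets": 77, "workspace_versions": 119}
closing_factory = {"tool_calls": 1421, "change_sets": 216, "workspace_versions": 240}


def per_run(phase, runs=RUNS):
    """Average count per run for each metric in a phase."""
    return {metric: count / runs for metric, count in phase.items()}


def ratio(later, earlier):
    """How many times larger each metric is in the later phase."""
    return {metric: later[metric] / earlier[metric] for metric in later}


intensity = ratio(closing_factory, earlier_phase)
for metric, multiple in intensity.items():
    print(
        f"{metric}: {per_run(closing_factory)[metric]:.1f} per run "
        f"vs {per_run(earlier_phase)[metric]:.1f} ({multiple:.2f}x)"
    )
```

The largest multiple is in change sets, which is the metric closest to reviewable work product rather than raw activity.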

That is where Firmwork looked least like a drafting assistant and most like an M&A execution layer. The work included lease model generation, repeated factual checks, a multi-contract lease package, indexes, employee notification letters, closing agenda work, sublease discovery, ancillary contract reclassification and final correction loops.

The pattern is exactly what a deal team needs near closing: many related artifacts, repeated source facts, small but dangerous errors, and a need to converge quickly under review.

Closing factory intensity

Closing phase versus same-size earlier drafting phase

Figure 7. Closing factory intensity. The closing factory produced more tool calls, change sets, and workspace versions than an earlier equal-run drafting phase. Note: Comparison of two 56-run phases from the normalized evidence pack.

The broader delivery trace still matters. Firmwork produced document-like outputs grouped into 41 delivery packages. But request-linkage is a supporting metric. The sharper public claim is that Firmwork created a repeatable execution loop when the matter got operationally dense.

Result 6: Legal artifact review survival

Legal agent work should not be judged by whether a model produced text. It should be judged by whether the system produces reviewable work product under human control.

Across the live matter window, the database records 755 change sets. Of those, 573 were accepted or auto-accepted, 93 were rejected, 66 were superseded and 23 remained pending. Overall review survival was 75.9 percent.

The more interesting cut is by file kind. DOCX is the core legal artifact surface, and DOCX change sets survived review at a much higher rate: 299 accepted or auto-accepted out of 336 DOCX change sets, or 89.0 percent.

This is not a claim that 89.0 percent of DOCX legal work was independently correct. It is a workflow metric: the proposed legal-artifact changes survived the platform review process.
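
The survival figures are plain ratios over the change-set counts reported above, and a short sketch makes the denominator explicit. The dictionary keys are illustrative names, not database columns.

```python
def survival_rate(survived, total):
    """Share of change sets that were accepted or auto-accepted."""
    return survived / total


all_change_sets = {
    "accepted_or_auto_accepted": 573,
    "rejected": 93,
    "superseded": 66,
    "pending": 23,
}
total = sum(all_change_sets.values())  # 755, matching the reported total

overall = survival_rate(all_change_sets["accepted_or_auto_accepted"], total)
docx = survival_rate(299, 336)

print(f"overall survival: {overall:.1%}")
print(f"DOCX survival: {docx:.1%}")
```

The reported 75.9 percent only reproduces when pending and superseded change sets stay in the denominator, so the metric is conservative with respect to work still awaiting review.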

For legal work, that distinction matters. The product is not autonomous output. The product is a controlled loop where agents produce, lawyers review, and the matter advances.

Legal artifact review survival

Accepted or auto-accepted change sets by file kind

Figure 8. Legal artifact review survival. DOCX and XLSX change sets survived review at materially higher rates than scratch-work surfaces. Note: The refreshed live DB keeps DOCX survival directionally strong at 89.0%.

Result 7: Return on Token is the capital allocation meter

Return on Token is the amount of fee-weighted legal work product produced per unit of AI consumption. There are two useful forms:

  1. Value per AI dollar: EUR of fee-weighted work product per USD of AI cost.
  2. Value per token: EUR of fee-weighted work product per 1M token-equivalent units.

The closest financial analogy is Return on Invested Capital. ROIC is useful because it does not only ask whether a company generates profit. It asks whether the next unit of capital deployed into the operating system produces enough return to justify the investment.

Return on Token should do the same for agentic legal work. The question is not whether a matter used many tokens, or whether the tokens were cheap. It is whether the next unit of AI consumption turns into reviewable work product, reduced lawyer load, faster closing execution or better matter control.

Gross ROT is intentionally simple, but it is too generous as a standalone claim because it attributes all matter value to AI. A more conservative contribution-adjusted model assigns only a scenario share of fee-weighted value to Firmwork and keeps human legal involvement visible.

On the completed matter, gross ROT was EUR 53.88 per USD of AI cost and EUR 53.34 per 1M token-equivalent units. Under 60 percent, 75 percent and 90 percent Firmwork contribution scenarios, contribution-adjusted ROT ranged from EUR 32.33 to EUR 48.49 per USD of AI cost, and from EUR 32.01 to EUR 48.01 per 1M token-equivalent units.
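
The scenario range can be reproduced approximately by scaling gross ROT by the contribution share. The sketch below assumes that simple multiplicative adjustment, which is consistent with the reported per-USD range. It starts from the already-rounded gross figures, so a result can differ from a reported value by about a cent.

```python
GROSS_ROT_PER_USD = 53.88       # EUR of fee-weighted work product per USD of AI cost
GROSS_ROT_PER_M_TOKENS = 53.34  # EUR per 1M token-equivalent units


def contribution_adjusted(gross_rot, contribution_share):
    """Scale gross ROT by the share of value attributed to Firmwork (assumed linear)."""
    if not 0.0 <= contribution_share <= 1.0:
        raise ValueError("contribution_share must be between 0 and 1")
    return gross_rot * contribution_share


for share in (0.60, 0.75, 0.90):
    per_usd = contribution_adjusted(GROSS_ROT_PER_USD, share)
    per_tokens = contribution_adjusted(GROSS_ROT_PER_M_TOKENS, share)
    print(
        f"{share:.0%} contribution: EUR {per_usd:.2f} per USD, "
        f"EUR {per_tokens:.2f} per 1M token-equivalents"
    )
```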

Return on Token is therefore best understood as an economic metering layer on top of matter throughput, not as a replacement for quality or legal review.

Contribution-adjusted return on token

Matter-level economic meter under Firmwork contribution scenarios

Figure 9. Contribution-adjusted return on token. Return on token is strongest when treated as an economic meter on top of supervised matter throughput. Note: Scenario model under 60%, 75%, and 90% Firmwork contribution assumptions.

The metric is still early. It needs better run-level token attribution, cleaner workstream allocation, a sharper human contribution model and review-adjusted value weights. But it is directionally important because it lets an AI-native legal system connect three things that are usually separate: matter value, agent consumption and reviewable work product.

The better question is not "can tokens produce legal value?" It is: when a matter becomes executable, how much legal work product does each unit of AI consumption move through the system?

For operators, the most useful version will be marginal ROT. If the system spends the next 1M token-equivalent units on a workstream, does it create enough additional reviewed output to clear the hurdle rate? Does frontier reasoning improve survival or reduce senior review time enough to justify its cost? Or should the work route to faster lower-cost models because the matter state and review loop are already strong?

That is where ROT becomes a management metric rather than a retrospective chart. It gives an AI-native legal system a way to rank token spend across workstreams, models and phases of the deal.

Behavior analysis: what good agent work looked like

The strongest runs were not simple drafting runs.

They looked more like associate work under pressure:

  • gather the relevant matter context;
  • inspect the source documents;
  • produce a draft or markup;
  • validate it against the record;
  • revise after checks;
  • package the artifact;
  • return it through the channel where the legal team is already working.

This is consistent with what legal-agent research increasingly shows: better legal agents do not simply produce more. They retrieve more selectively, validate after drafting, ground work product against source documents, and revise before delivery.

In this matter, the high-value pattern was not "one prompt, one answer." It was a tool loop around the matter file.

The product lesson is clear: Firmwork should make the loop explicit.

Failure modes

The trace also shows where the system was not good enough.

The main issues were not abstract model intelligence. They were workflow and document-system issues: DOCX and markup fidelity, latest-version resolution, appendix and folder completeness, attachment delivery plumbing, source provenance, VFS visibility and audit-grade spend attribution.

These are not side issues. In M&A, they are the work.

If the agent drafts well but the wrong version is used, the matter does not move. If an appendix list is incomplete, the deal team cannot rely on it. If a document exists in the workspace but is missing from the email thread, the delivery failed.

That is why the matter operating system matters as much as the model.

What this changes

The first generation of legal AI was organized around assistance: ask a question, summarize a document, draft a clause.

M&A work needs something else: a system where the matter file is persistent, channels feed structured work episodes, agents operate against matter state, work product is versioned and reviewable, outputs return through existing legal workflows, and the full trace can be measured.

This is the core Firmwork thesis:

M&A agents need a matter operating system, not just a drafting interface.

What comes next

The next step is not just another research post. It is a better M&A application layer on top of agents.

The model layer will improve. Larger context windows, better reasoning, cheaper inference and stronger specialized models will all help. But this matter trace suggests that much of the remaining leverage sits above the model, in the algorithms and product primitives that make agents useful inside a transaction.

The most important primitive is issue tracking. M&A work depends on carrying issues forward: identifying them, assigning them to workstreams, connecting them to documents, noticing when new evidence changes them, and making sure they appear again when the legal team needs to act. A matter agent should not simply answer a question about an issue. It should keep the issue alive until closing.
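
One way to picture this primitive is as a small record that stays attached to the matter and accumulates links as the issue resurfaces. The sketch below is purely illustrative: the `Issue` class, its fields and the example documents are hypothetical and do not describe an existing Firmwork data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class IssueStatus(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    DROPPED = "dropped"  # deliberately dropped, with a recorded reason


@dataclass
class Issue:
    title: str
    workstream: str
    status: IssueStatus = IssueStatus.OPEN
    linked_documents: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def carry_forward(self, document, note):
        """Attach the issue to another document instead of rediscovering it."""
        self.linked_documents.append(document)
        self.history.append(note)

    def close(self, status, reason):
        """Move the issue to a terminal status and record why."""
        if status is IssueStatus.OPEN:
            raise ValueError("close() needs a terminal status")
        self.status = status
        self.history.append(reason)


def open_issues(issues):
    """Issues that must still surface when the legal team needs to act."""
    return [issue for issue in issues if issue.status is IssueStatus.OPEN]


# Invented example: a diligence finding that reappears in later documents.
sublease = Issue("Undisclosed sublease", workstream="leases")
sublease.carry_forward("SPA markup", "flagged in diligence, raised in SPA")
sublease.carry_forward("side letter", "addressed through side letter wording")
waiver = Issue("Missing landlord waiver", workstream="closing")
waiver.close(IssueStatus.RESOLVED, "waiver received before closing")
print([issue.title for issue in open_issues([sublease, waiver])])
```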

The second primitive is matter memory. The system needs a persistent matter graph: parties, documents, clauses, dates, open points, source materials, work product, decisions and unresolved questions. That graph should feed retrieval, drafting, validation and review.

The third primitive is routing. Different parts of a deal require different agent behavior. Diligence review is not the same as SPA markup. Appendix organization is not the same as lease adaptation. Closing execution is not the same as legal research. This matter trace already shows why this matters: frontier models carried more of the high-judgment drafting and negotiation work, while the closing factory could push much more volume through fast low-cost models once the matter state and review loop were in place. A useful system should make that routing explicit across models, tools and specialized workflows based on phase, workstream, risk and artifact type.

The routing layer should also have an economic objective: maximize marginal Return on Token subject to review survival and legal risk. That means routing is not only a model-selection problem. It is capital allocation for compute inside a matter.
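
A minimal sketch of that objective follows: choose the option with the highest expected Return on Token among those that clear a review-survival floor, and decline to spend if nothing clears the hurdle rate. Every name and number below is hypothetical. Legal risk is modeled only indirectly, by raising the survival floor for riskier work, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteOption:
    tier: str
    expected_value_eur: float  # expected fee-weighted value of reviewed output
    expected_cost_usd: float   # must be positive
    expected_survival: float   # expected share of change sets surviving review


def expected_rot(option):
    """Expected EUR of reviewed work product per USD of AI cost."""
    return option.expected_value_eur / option.expected_cost_usd


def choose_route(options, min_survival, hurdle_rot):
    """Pick the highest-ROT option that meets the survival floor and the hurdle."""
    eligible = [o for o in options if o.expected_survival >= min_survival]
    if not eligible:
        return None
    best = max(eligible, key=expected_rot)
    return best if expected_rot(best) >= hurdle_rot else None


# Invented estimates for one work episode.
options = [
    RouteOption("frontier", 400.0, 12.0, 0.92),
    RouteOption("fast", 350.0, 4.0, 0.88),
    RouteOption("open", 300.0, 2.0, 0.70),
]
routine = choose_route(options, min_survival=0.85, hurdle_rot=20.0)
high_risk = choose_route(options, min_survival=0.90, hurdle_rot=20.0)
print(routine.tier, high_risk.tier)
```

With these invented inputs the cheapest tier has the best raw ROT but fails the survival floor, so routine work routes to the fast tier and higher-risk work to the frontier tier.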

The fourth primitive is provenance. Every material output should carry source lineage: what documents, communications, prior state and assumptions grounded the result. Without provenance, the lawyer has to re-audit the work manually.
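
As a sketch of what source lineage could look like in practice, the example below attaches source references to each output and flags outputs that have none. The `SourceRef` and `WorkProduct` types and the sample entries are hypothetical illustrations, not an existing Firmwork structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    kind: str       # e.g. "document", "communication", "prior_state", "assumption"
    reference: str  # human-readable pointer to the source


@dataclass
class WorkProduct:
    name: str
    sources: list = field(default_factory=list)


def missing_provenance(outputs):
    """Names of outputs a lawyer would have to re-audit manually."""
    return [output.name for output in outputs if not output.sources]


# Invented example: one grounded output and one without lineage.
lease_pack = WorkProduct(
    "lease review pack",
    sources=[
        SourceRef("document", "VDR lease folder"),
        SourceRef("communication", "counterparty counsel email"),
    ],
)
agenda = WorkProduct("closing agenda")
print(missing_provenance([lease_pack, agenda]))
```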

The fifth primitive is review memory. The system should learn from accepted, rejected and superseded changes. Review should not be a terminal UI state. It should become training signal for the matter and for the firm's way of working.

That is the product direction: not agents instead of an M&A workflow, but agents embedded inside an M&A operating layer.

Method notes

This is an observational trace of one live matter, not a randomized benchmark.

Request-to-delivery matching is a traceability proxy, not proof that every run caused every delivered artifact.

Review survival is a workflow metric, not an independent legal correctness score.

The current DB ledger distinguishes completed production spend from failed or cancelled attempts; economics should be read as directional until run-level allocation is fully audited.

Conclusion

The matter is the benchmark.

In this transaction, Firmwork was not just used to produce isolated documents. It operated across a live M&A matter: 1,619 matter emails and communications, 425 workflow runs, 6,685 completed tool outputs, 2,402 workspace files, 5,760 workspace events, 41 delivery packages and 755 change sets.

The evidence suggests a new way to measure legal agents in transactional work.

Do not start with token ROI as a vanity metric. Start with matter movement, then measure whether each additional unit of AI consumption turns into reviewed work product at an attractive Return on Token.

Can the system turn scattered context into reviewed work product? Can it do that across workstreams? Can it survive closing pressure? Can lawyers control it? Can the trace be measured?

If yes, the value is not just faster drafting.

The value is installed execution capacity.