TaxBench by Rivet

01

Pass^k (pronounced "pass power k") estimates the probability that an agent would succeed on all k independent attempts. This is useful for evaluating consistency and reliability in agent performance.

02

Nicholas Sadjoli, Tim Siefken, Atin Ghosh, Yifan Mai, Daniel Dahlmeier. 2025. Optimization before Evaluation: Evaluation with Unoptimized Prompts Can be Misleading.
‍https://aclanthology.org/2025.acl-industry.44/

Tax workflows are fundamentally broken. Context is constantly lost across emails, PDFs, and legacy systems, which means the same questions get asked, the same data gets reprocessed, and the same mistakes repeat.

When we started Rivet, we built our core platform to fix this at the root. It unifies client document collection with our internal system of record — tracking returns, workflows, entity structures, and every task required to get a filing across the finish line. This gives our accountants and (and AI agents) full visibility into each client’s situation and turns tax prep into a continuous, system-driven workflow instead of a series of disconnected steps.

But once that orchestration layer exists, a new question emerges: what can you actually trust AI to do? We can assign tasks to humans — people we hire, train, trust, and learn with — or to agents that operate ruthlessly and (hopefully) correctly.

Twitter would have you believe the answer is obvious. “AI will do your taxes. Fire your CPA. Praise our Lord and Savior, Sam Altman.” In practice, it isn’t. Models can appear highly capable in demos, yet fail under repetition, edge cases, or real client data. The gap between what works once and what works reliably is still too significant in tax to believe agents can do this all today.

TaxBench is how we measure that gap.

To measure models capabilities properly, we track both:

First-attempt accuracy (pass@1)— does the model get the correct answer on the first try?
Reliability (pass^5) — does the model get the correct answer 5 times in a row? *

These capture two very different things. A model might get the right answer once (which is definitely great!), but fail on the next attempt, or the one after that. In practice, that’s not usable. Clients don’t give you five chances to get it right, or know when the answer you’ve given them is wrong until correction notices start to arrive in the mail from the IRS.

Our evaluation includes recent frontier models from OpenAI, Anthropic, Google, and xAI with knowledge cut-offs ranging between early 2024 and late 2025. A small percentage of the questions—particularly those involving the 2025 tax year—require up-to-date information, so we enable internet access and expect models to locate and reference primary sources online as part of their reasoning, just like a human accountant would need to do.

We focused primarily on income tax questions in this benchmark, reflecting our current practice area. While many benchmarks in this space publish their underlying datasets for replicability, we are unable to do so here for two reasons. First, many questions are derived from real client workflows and involve tax knowledge and calculations that have not been fully anonymized. Second, a subset of tasks specifically evaluate data retrieval capabilities and require access to our internal systems via a secure tool harness (e.g., retrieval-augmented generation over proprietary data sources). As a result, the benchmark cannot be cleanly separated from the environment in which it is evaluated.

To balance transparency with these constraints, we plan to release a synthetic dataset in a future iteration that captures the structure of these tasks without exposing sensitive data. In addition, we intend to expand coverage beyond income tax to more prominently include payroll and sales tax scenarios, enabling a more comprehensive, multi-domain evaluation of tax agent performance.

03

New York Consolidated Laws, Tax Law - TAX § 612. New York adjusted gross income of a resident individual
‍https://codes.findlaw.com/ny/tax-law/tax-sect-612.html

RivetBench evaluates AI systems the way they are actually used in tax workflows — by accountants, during filing season, in the context of real client work. Not in isolation, and not on synthetic tasks. These benchmarks are drawn directly from the work our accountants perform every day.

The questions in RivetBench are not trivial either. They represent the types of problems a senior preparer would need to stop and research, not something answered off the top of their head. These types of questions are structured into three categories:

Tax knowledge — Can an AI model find the correct tax rules and apply them correctly to reason through a question posed by a client? Example:

Sample

“I am an individual who sold QSBS-eligilble shares on 6/1/24. I live/lived full-time in NY. The shares qualify for QSBS federally. Am I allowed to claim QSBS at the state level to avoid paying capital gains for this sale? Answer with either Yes or No, no other text.”

The correct answer is Yes (NY conforms to the federal QSBS treatment⁰³). Most models answer No.

Most models failed this question (and broadly, seem to consistently struggle with state/city/local tax questions). The consequences for these failures aren’t theoretical.

In this case, a real client was convinced they needed to move out of New York to qualify for QSBS ahead of an impending M&A transaction. They were already planning the move — literally pricing out moving trucks — after seeing multiple AI answers all confidently saying New York does not conform to federal QSBS rules. They even showed us the receipts: three different tabs of AI chatbots, all saying the same thing. Those answers were wrong.

New York does conform. But the models were consistent, confident, and incorrect — which made them way more convincing than they had any right to be.

Tax calculations — Can an AI model correctly apply tax rules and formulas to produce accurate figures? Example:

Sample

“A taxpayer paid $15,000 in childcare expenses for their 4-year-old. They have an AGI of $28050. What credit do they receive under the Child and Dependent Care Credit? Respond with only the maximum credit as an integer (eg. 505)”

We highlight this question because models often get tripped up on credit limits, phaseouts, and what actually qualifies — even when the structure of the problem is straightforward.

The correct answer is $840.

On the surface, this looks simple: take the expenses, apply the credit percentage, return a number. But the Child and Dependent Care Credit has caps, income-based phaseouts, and specific rules governing what portion of those expenses are actually eligible.

In this case, two things matter: eligible expenses are capped at $3,000 for one child, and the applicable credit rate is 28%, based on the taxpayer’s AGI of $28,050 (not the more commonly cited 20% rate used higher income levels). Models need to correctly apply both constraints.

In practice, many don’t. They correctly cap the $15,000 down to $3,000, but then default to the most common rate — 20% — and return a clean, confident number. It looks reasonable, but it’s wrong.

Data retrieval — Can an AI model correctly find and extract the requested information, given access to our internal systems? (For these data retrieval questions, we employed a model harness equipped with tools that enable secure retrieval-augmented generation (RAG) over internal data sources.)

Sample

“Find the taxpayers 2023 return. What was their aggregate capital loss carryforward reported for the year (that’ll be used on the 2024 return)? Respond with an integer ONLY (eg. -10000000). You are allowed to respond "I cannot find this based on the documents available" if it's truly unknown.”

We highlight this question because it requires more than simple retrieval — it’s a multi-step workflow that’s required for every 1040 (personal tax return) that we prepare.

The model needs to find the correct 2023 return, navigate to the appropriate form (Schedule D), and identify the specific line that corresponds to aggregate short-term loss carryforward. This is exactly the kind of document traversal and needle-finding accountants do every day preparing returns in season.

Models often fail along the way. They’ll retrieve the wrong return, stop at the wrong form, or pull a nearby number that looks right but isn’t. In some cases, they skip the retrieval step entirely and return a plausible answer without ever grounding it in the underlying document.

The modern tax team for your company & you.

Corporate and personal tax preparation, R&D credits, & year-round tax advisory — backed by 30+ ex–Big Four accountants and a platform that gets out of your way.

Your submission has been received!

Oops! Something went wrong while submitting the form.

Our experiments show that strong single-shot benchmark scores do not imply reliable production behavior. When we evaluate repeated consistency, performance drops sharply across models, often into low double digits or single digits. Top pass@1 scores of 84.2% / 77.2% / 71.9% collapse to pass^5 of 42.3% / 27.0% / 19.2% — meaning even the strongest models fail at least once across 5 attempts on a majority of questions in every category. This gap is especially clear on our data retrieval tasks, the category that most closely matches real accounting work. No model clears 50% for pass^5 in any category.

Newer or larger models do not always win, and ranking can flip by category. Even within top families, version upgrades can underperform prior versions on domain-specific workloads. On Tax Knowledge, Grok 4.1 Fast Reasoning (Nov 2025) outperformed the newer Grok 4.2 Reasoning (Mar 2026) by 5.9 percentage points (42.3% vs 36.4%); on Data Retrieval, Claude Opus 4.6 (Feb 2026) beat Claude Opus 4.7 (Apr 2026) by 8.1 percentage points (19.2% vs 11.1%).

Overall, Claude Opus 4.6 leads on retrieval, the most operationally relevant category, and Anthropic models stay in the top tier elsewhere. But the bigger conclusion is industry-wide: reliability remains the bottleneck. We argue for reliability-first evaluation and have set internal objectives to improve these metrics across all relevant tax categories.

“Almost always right” isn’t good enough for tax

At a high level, the job reduces to two outputs: answering questions and producing tax returns. But both are downstream of the same core workflow — interpreting client-provided facts and information (which are often incomplete, incorrect, or poorly understood), and applying tax rules that are fragmented across statutes, guidance, and institutional knowledge.

Every return and answer for a client is built from a chain: identifying the right documents and relevant facts, extracting the key values, reasoning through how they apply, and carrying those results through into an output (an email, or directly inputted into the tax software). If an agent can’t follow that chain — if it can’t find the carryforward losses, know when it needs to research state-level rules that don’t conform federally, or recognize when tax law has changed (e.g., whether Section 174 expenses can be fully capitalized under new guidance) — it can’t do the job.

And this chain compounds. It’s not one decision — it’s dozens. If you make 10 judgment calls for a single client, and a system is 99% accurate on each step (which would be extraordinary — and current models are nowhere close), that implies roughly a 10% chance of at least one mistake on that return. At scale, that means making material errors for a meaningful portion of clients.

50% accuracy is useless. 75% is unusable. Even 99% isn’t good enough.

In tax, “almost always right” is just another way of saying “wrong often enough to get fired.”

So startups have shifted their focus to document ingestion — OCRing a W-2, a K-1, or a consolidated 1099 and importing it into a return. But that’s one small step in the chain, and it’s been solved for years. Tools like GruntWorx (bundled with DrakeTax, the most popular professional tax prep software in the US) and CCH Axcess Scan already handle this reliably.

It’s worth highlighting the work of our friends at ColumnTax (recently acquired by Aiwyn), who introduced TaxCalcBench last year — a benchmark focused on evaluating whether models can correctly compute tax returns given structured inputs and access to a tax calculation engine. Their approach isolates the calculation layer of tax prep: given the right inputs (a W-2, a 1099), can a model produce the correct return as an XML file that can be e-filed with the IRS? It’s a hard problem, and their results show that even frontier models struggle to reliably apply tax rules and computations at scale — with the best-performing model only reaching the low-30% range on strict correctness.

We intentionally took a different path. Instead of starting with clean, structured inputs, we evaluate models on the full workflow that precedes calculation: finding the right documents, interpreting messy client data or fact patterns, applying the correct rules, and only then producing an answer or output. In practice, this is where most of an accountant’s time is spent. The downstream step of converting correct inputs into an e-filed return has been solved for years by platforms like Column, Drake, CCH Axcess, and others. The hard part isn’t computation once everything is clean. It’s getting to the point where the inputs are actually correct.

The winner in this market won’t be the company with the flashiest demo or the best OCR. It will be the one operating at scale, running a real accounting practice, with real clients, real edge cases, and real consequences, and pushing AI directly into those workflows where it can be tested, measured, and understood.

The only way to know what actually works is to run these models in production — supervised and validated by experts — where every mistake matters, and where “almost always right” isn’t good enough. And to pair that with a robust orchestration layer that can route work intelligently, verify outputs, catch failures, and ensure that what ultimately reaches the client meets the standard.

That’s what we’re building at Rivet – the orchestration layer required to deploy AI in real tax workflows, so that as models improve, we’re ready to use them safely, reliably, and at scale.

TaxBench

Our methodology

Our unique approach

Results

“Almost always right” isn’t good enough for tax

About us