Tax workflows are fundamentally broken. Context is constantly lost across emails, PDFs, and legacy systems, which means the same questions get asked, the same data gets reprocessed, and the same mistakes repeat.
When we started Rivet, we built our core platform to fix this at the root. It unifies client document collection with our internal system of record — tracking returns, workflows, entity structures, and every task required to get a filing across the finish line. This gives our accountants (and AI agents) full visibility into each client’s situation and turns tax prep into a continuous, system-driven workflow instead of a series of disconnected steps.
But once that orchestration layer exists, a new question emerges: what can you actually trust AI to do? We can assign tasks to humans — people we hire, train, trust, and learn with — or to agents that operate ruthlessly and (hopefully) correctly.
Twitter would have you believe the answer is obvious. “AI will do your taxes. Fire your CPA. Praise our Lord and Savior, Sam Altman.” In practice, it isn’t. Models can appear highly capable in demos, yet fail under repetition, edge cases, or real client data. The gap between what works once and what works reliably is still too wide in tax to believe agents can do it all today.
TaxBench is how we measure that gap.
To measure model capabilities properly, we track both:
- First-attempt accuracy (pass@1) — does the model get the correct answer on the first try?
- Reliability (pass^5) — does the model get the correct answer 5 times in a row? *
These capture two very different things. A model might get the right answer once (which is definitely great!), but fail on the next attempt, or the one after that. In practice, that’s not usable. Clients don’t give you five chances to get it right, and they won’t know the answer you’ve given them is wrong until correction notices from the IRS start arriving in the mail.
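To make the distinction concrete, here is a minimal sketch of how both metrics can be estimated from repeated runs, assuming n sampled attempts per question with c of them correct. The pass@k form is the standard unbiased estimator from Chen et al. (2021); the pass^k form mirrors it for the "all k attempts correct" case. Function names are ours for illustration, not necessarily how our harness computes them internally.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled attempts is correct), given c correct out of n."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws; a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled attempts are correct), given c correct out of n."""
    if c < k:
        return 0.0  # too few successes to fill all k draws
    return comb(c, k) / comb(n, k)

# With n = k = 5, pass^5 reduces to "5 correct in a row":
print(pass_at_k(5, 3, 1))   # 0.6 -> first-attempt accuracy
print(pass_hat_k(5, 5, 5))  # 1.0 -> all five attempts correct
print(pass_hat_k(5, 4, 5))  # 0.0 -> a single failure sinks reliability
```

Note how unforgiving pass^5 is by design: a model that answers correctly 4 times out of 5 scores 0.8 on pass@1 but 0.0 on pass^5. That asymmetry is the whole point.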
Our evaluation includes recent frontier models from OpenAI, Anthropic, Google, and xAI, with knowledge cut-offs ranging from early 2024 to late 2025. A small percentage of the questions—particularly those involving the 2025 tax year—require up-to-date information, so we enable internet access and expect models to locate and reference primary sources online as part of their reasoning, just as a human accountant would.
We focused primarily on income tax questions in this benchmark, reflecting our current practice area. While many benchmarks in this space publish their underlying datasets for replicability, we are unable to do so here for two reasons. First, many questions are derived from real client workflows and involve tax knowledge and calculations that have not been fully anonymized. Second, a subset of tasks specifically evaluates data retrieval capabilities and requires access to our internal systems via a secure tool harness (e.g., retrieval-augmented generation over proprietary data sources). As a result, the benchmark cannot be cleanly separated from the environment in which it is evaluated.
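To give a sense of why the harness matters, here is a hypothetical sketch of the kind of retrieval tool these tasks depend on. The names, schema, and signature are illustrative assumptions, not our actual interface.

```python
# Hypothetical tool interface; all names and fields are illustrative,
# not Rivet's real harness.
from dataclasses import dataclass

@dataclass
class DocumentHit:
    doc_id: str   # internal document identifier
    snippet: str  # excerpt relevant to the query
    source: str   # e.g., "K-1", "W-2", "prior-year return"

def search_client_documents(client_id: str, query: str, top_k: int = 5) -> list[DocumentHit]:
    """Retrieval-augmented lookup over a client's document store.
    In the real benchmark, calls like this hit internal systems behind
    a secure harness, which is why these tasks cannot run outside
    that environment."""
    ...
```

Because answers depend on what such calls return, releasing the questions alone would not let anyone reproduce the scores.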
To balance transparency with these constraints, we plan to release a synthetic dataset in a future iteration that captures the structure of these tasks without exposing sensitive data. In addition, we intend to expand coverage beyond income tax to more prominently include payroll and sales tax scenarios, enabling a more comprehensive, multi-domain evaluation of tax agent performance.

