TaxBench is now live and accepting submissions!
TaxBench
The Intelligence Consortium
Stanford University
Carnegie Mellon University
TaxBench is steered by the Tax AI Consortium, a body of The Intelligence Consortium.
Introduction
Benchmarks are essential for tracking large language model (LLM) capabilities and improving models through reinforcement learning paradigms. Despite their importance, specialized domains such as taxation remain critically understudied, though they demand unique reasoning across legal interpretations, mathematical calculations, and regulatory contexts. We introduce TaxBench, the definitive benchmark for evaluating LLMs on tax-related tasks. TaxBench features a comprehensive assessment framework with hyper-granular grading rubrics that enable consistent evaluation by both humans and LLMs. Our evaluation of leading models reveals significant capability gaps in tax reasoning, establishing clear targets for future development. To maintain benchmark relevance in this evolving domain, we establish the Tax AI Consortium as a governing body to oversee submissions and regular updates. TaxBench provides the standard for measuring LLM tax proficiency, driving innovation at the intersection of artificial intelligence and taxation.
Dataset
TaxBench is a comprehensive evaluation framework comprising expert-crafted questions and detailed grading rubrics spanning diverse tax domains, including personal taxation, corporate taxation, international tax law, estate planning, tax procedure, and specialized areas such as cryptocurrency taxation and nonprofit tax compliance.
Sample Questions
International Tax
Question:
USCO is a company based in Texas. It is the parent company of CANCO, a subsidiary based in Canada. USCO produces alcoholic beverages in the United States. There are two brands manufactured and sold by USCO, and they are both sold in 6-pack configurations. USCO Light is sold at USD 6 per 6-pack, while USCO Premium is sold at USD 8 per 6-pack.
Historically, USCO has sold its products through independent distributors in both the United States and Canada, because regulations in both countries did not allow manufacturers of alcoholic beverages to be involved in distribution.
Recent regulatory changes in Canada have removed these restrictions, and now USCO can distribute its beverages in the Canadian market through a related party distributor. It has now set up CANCO in Canada, to be the primary distributor in the Canadian market, in Toronto, British Columbia, and Alberta.
USCO plans to continue distributing in all other Canadian markets through independent distributors as it has historically.
From a US perspective, USCO management wants to know whether it can rely on the price at which it sells beverages to independent distributors when pricing products sold to CANCO. Note: USCO has not yet finalized the detailed contractual terms with CANCO, including volume commitments, credit terms, and marketing responsibilities.
A sample of the diverse and challenging questions in TaxBench.
Results
Performance by Domain
All Domains
1. Accordance AI: 79.5%
2. Claude Opus 4.7 (high reasoning, web search): 69.1%
3. Claude Opus 4.8 (high reasoning, web search): 67.8%
4. GPT 5.5 (high reasoning, web search): 57.0%
5. DeepSeek V4 Pro: 48.9%
6. Gemini 3 Flash (web search): 47.6%
7. Co-Pilot (GPT-4.1) (web search): 39.4%
Head-to-Head Win Rate
Accordance AI vs Co-Pilot (GPT-4.1): Accordance AI wins 87%, ties 11%, Co-Pilot (GPT-4.1) wins 2%.

Accordance AI wins 87% of all head-to-head questions against Co-Pilot (GPT-4.1) (11% tied).

Excluding ties: Accordance AI 97% · Co-Pilot (GPT-4.1) 3%
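
The "excluding ties" figure is each side's wins divided by the number of decisive (non-tied) questions. The sketch below shows that arithmetic. The function name and the idea of working from per-question counts are assumptions for illustration; the page publishes only rounded percentages, so recomputing from 87/11/2 gives roughly 97.8% and may differ from the published 97% by a point.

```python
from typing import Optional


def win_rates(wins_a: int, ties: int, wins_b: int) -> dict[str, Optional[float]]:
    """Head-to-head shares, with and without tied questions (hypothetical helper)."""
    total = wins_a + ties + wins_b
    if total == 0:
        raise ValueError("at least one compared question is required")
    decisive = wins_a + wins_b  # questions where one side clearly won
    return {
        "a_overall": wins_a / total,
        "tie_share": ties / total,
        "b_overall": wins_b / total,
        # Undefined when every question is a tie, so report None instead.
        "a_excluding_ties": wins_a / decisive if decisive else None,
        "b_excluding_ties": wins_b / decisive if decisive else None,
    }


if __name__ == "__main__":
    # Rounded percentages from the page, used as stand-ins for real counts.
    rates = win_rates(87, 11, 2)
    print(f"A excluding ties: {rates['a_excluding_ties']:.1%}")
```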

Example Evaluation
SALT (State & Local)
Question:
Since users on our app or website don't purchase anything or enter into contracts with us, and their activity is limited to browsing or engagement, wouldn't that mean their presence alone can't trigger nexus for state tax purposes?
Sample Answer (Accordance AI)
grounded in 7 sources

No — the premise that nexus requires purchases or contracts with in-state users is not correct. Whether browsing-only users create nexus depends on the tax type and how the business monetizes those users. The risk is lower than for a seller, but it is not zero, and in several states it is affirmatively present.

The Constitutional Baseline: Physical Presence Is Not Required, and Virtual Presence Counts

*South Dakota v. Wayfair* (2018) eliminated the physical-presence requirement: "Physical presence is not necessary to create a substantial nexus" [1]; nexus exists when a taxpayer "avails itself of the substantial privilege of carrying on business" in the jurisdiction [1]. The Court specifically addressed the exact fact pattern at issue here: "a company with a website accessible in South Dakota may be said to have a physical presence in the State via the customers' computers. A website may leave cookies saved to the customers' hard drives, or customers may download the company's app onto their phones." [1] The Court also noted that Massachusetts proposed defining physical presence to include apps downloaded by in-state residents and cookies placed on in-state browsers, and that Ohio adopted a similar standard at Ohio Rev. Code §5741.01(I)(2)(i) [1]. So the argument that mere browsing/engagement is constitutionally incapable of creating nexus is foreclosed: cookies and app downloads on in-state devices are recognized contacts.

Sales/Use Tax: Browsing Users Alone Generally Don't Trigger Collection Obligations

Post-*Wayfair* economic nexus statutes are framed around sales and transactions into the state, not user counts:

| Regime | Trigger | Effect of browsing-only users |
| --- | --- | --- |
| South Dakota model (upheld in *Wayfair*) | >$100,000 of goods/services delivered into the state or 200+ separate transactions annually [1] | No sales into the state → threshold never met |
| Illinois remote retailer (86 Ill. Adm. Code 150.201) | $100,000+ cumulative gross receipts or 200+ separate transactions with Illinois purchasers [2] | Same: measured by sales, not visits |
| Click-through nexus statutes (e.g., N.Y. Tax Law §1101(b)(8)(vi); Illinois) | In-state persons compensated for referring customers, e.g., Illinois presumes nexus where tracked referrals generate >$10,000 in sales over four quarters [1][2] | Requires a referral contract; passive browsing doesn't qualify |

If the company makes no taxable sales into a state, there is nothing to collect and the dollar/transaction thresholds are never satisfied. Caveat: in cookie/app-presence states (Ohio-style statutes [1]), in-state software presence can establish nexus as a legal matter, though it is consequential only once the company has taxable sales to collect on.
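
As an illustration of why browsing-only users never reach these thresholds, the check can be sketched as a small predicate over sales and transaction counts. This is a minimal sketch, not part of the cited sources: the class and function names are hypothetical, and the dollar and transaction figures are simply those quoted for the South Dakota model and the Illinois remote-retailer rule, which should be verified against current statutes before any real use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EconomicNexusRule:
    """Sales/use tax economic nexus thresholds for one regime (illustrative)."""
    sales_threshold: float      # dollars of sales into the state per year
    transaction_threshold: int  # separate transactions into the state per year
    sales_inclusive: bool       # True for "$X+" (>=), False for ">$X" (strictly >)


# Figures as quoted in the answer text; treated here as assumptions.
SOUTH_DAKOTA_MODEL = EconomicNexusRule(100_000, 200, sales_inclusive=False)
ILLINOIS_REMOTE_RETAILER = EconomicNexusRule(100_000, 200, sales_inclusive=True)


def meets_economic_nexus(rule: EconomicNexusRule,
                         annual_sales: float,
                         annual_transactions: int) -> bool:
    """Return True if either the dollar or the transaction threshold is met."""
    if rule.sales_inclusive:
        sales_met = annual_sales >= rule.sales_threshold
    else:
        sales_met = annual_sales > rule.sales_threshold
    return sales_met or annual_transactions >= rule.transaction_threshold


if __name__ == "__main__":
    # A platform whose in-state users only browse has no sales or transactions.
    print(meets_economic_nexus(SOUTH_DAKOTA_MODEL, 0.0, 0))
```

Visitor counts never enter the function: only sales and transactions do, which is the point the table makes.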

Income/Franchise Tax: User Engagement Can Create Nexus If It Is Monetized

This is where the "users don't buy anything" framing breaks down. Most ad-supported or data-driven businesses do derive receipts from their users' in-state activity — just from advertisers and data buyers rather than from the users themselves — and states source those receipts to the users' location:

  • Factor-presence statutes. Alabama §40-18-31.2(b) establishes substantial nexus when in-state property exceeds $50,000, payroll exceeds $50,000, sales exceed $500,000, or any factor exceeds 25% of the total [3], and counts toward the sales threshold receipts from "services, intangibles, and digital products for primary use by a purchaser known to the seller to be in this state" [3]. Michigan imposes substantial nexus on a taxpayer that has physical presence for more than one day or actively solicits and has $350,000+ in gross receipts sourced to the state [4].
  • Market-based sourcing of ad revenue. Oregon's sourcing rule assigns advertising receipts to the state "to the extent that the audience for the advertising is in Oregon," and its Example 20 directly covers an out-of-state web company paid per view or click: those receipts are Oregon receipts to the extent the viewers are in Oregon [5]. In-state browsers therefore pull the company's advertising revenue into the state's sales factor, which is exactly what factor-presence thresholds measure.

So if the platform earns advertising, data-licensing, or sponsorship revenue attributable to an audience in a state, that audience's browsing is what sources the receipts there, and sufficient receipts create income tax nexus without any user paying anything.
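
The two steps above, sourcing ad receipts by audience and then testing factor presence, can be sketched together. This is a minimal sketch under stated assumptions: the function names are hypothetical, proportional allocation by audience share is one simple reading of "to the extent that the audience is in the state", and the thresholds are the Alabama-style figures quoted above rather than a statement of current law.

```python
def source_ad_receipts(total_ad_receipts: float,
                       in_state_audience: int,
                       total_audience: int) -> float:
    """Assign ad receipts to a state in proportion to the audience located there.

    Proportional allocation is an assumption made for illustration.
    """
    if total_audience <= 0:
        raise ValueError("total_audience must be positive")
    return total_ad_receipts * in_state_audience / total_audience


def factor_presence_nexus(in_state: dict[str, float],
                          everywhere: dict[str, float]) -> bool:
    """Alabama-style test: any dollar threshold exceeded, or any factor > 25% of total."""
    dollar_thresholds = {"property": 50_000, "payroll": 50_000, "sales": 500_000}
    for factor, threshold in dollar_thresholds.items():
        amount = in_state.get(factor, 0.0)
        total = everywhere.get(factor, 0.0)
        if amount > threshold:
            return True
        if total > 0 and amount / total > 0.25:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical platform: $4M of ad revenue, 15% of its audience in the state,
    # and no in-state property or payroll.
    sourced = source_ad_receipts(4_000_000, 150_000, 1_000_000)
    nexus = factor_presence_nexus(
        in_state={"property": 0, "payroll": 0, "sales": sourced},
        everywhere={"property": 200_000, "payroll": 900_000, "sales": 4_000_000},
    )
    print(sourced, nexus)
```

In this hypothetical, no user pays anything, yet the receipts sourced to the state exceed the sales threshold on audience share alone.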

P.L. 86-272 Offers Little Protection Here

P.L. 86-272 (15 U.S.C. §§381–384) immunizes only the solicitation of orders for tangible personal property; services, intangibles, licensing, and digital products are outside its protection [6]. For internet activities, New Jersey's regulation (N.J.A.C. 18:7-1.9A, following the MTC's revised statement) draws the line precisely on the browsing/cookie issue:

  • Protected: cookies used only ancillary to soliciting orders for tangible goods; a searchable e-commerce website; posting FAQs [7].
  • Unprotected (nexus-creating): placing cookies on in-state users' devices that gather market or product data assembled for sale to data brokers or third parties [7]; selling targeted advertising based on in-state users' location and cookie data [7]; contracting with in-state customers for subscriptions [7].

A free-to-use, engagement-driven platform that monetizes user data or attention is squarely in the unprotected column. Note also that even where P.L. 86-272 does apply, it preempts only net income taxes (Alabama §40-18-31.2(e) confirms protected taxpayers escape the income tax despite exceeding factor thresholds [3]); it does not shield against gross-receipts taxes (Washington B&O, Oregon CAT, Ohio CAT), minimum/franchise taxes [7], or sales/use tax collection duties.

Conclusion

For sales/use tax, browsing-only users do not by themselves trigger a collection obligation, because economic nexus thresholds are measured by sales and transactions into the state [1][2], though cookie/app-presence statutes mean the company may technically have nexus that becomes relevant the moment it makes a taxable sale [1]. For income, franchise, and gross-receipts taxes, the conclusion does not hold if the company monetizes those users: advertising and data receipts are sourced to the states where the audience sits [5], factor-presence statutes convert those sourced receipts into substantial nexus at thresholds as low as $350,000–$500,000 [3][4], and cookie-based data gathering and targeted-ad sales defeat P.L. 86-272 immunity [7]. The accurate framing is not "no purchases, no nexus"; it is "no in-state-sourced revenue, no nexus." A revenue model built on user engagement (ads, data sales, subscriptions sold elsewhere but consumed in-state) sources revenue to users' states and can create nexus on the strength of browsing activity alone.

Sources Cited
[1] SCOTUS Opinion
[2] § 150.201: General Definitions
[3] § 40-18-31.2: Factor Presence Nexus Standard for Business Activity
[4] § 206.621: Nexus; "actively solicits" and "physical presence" defined
[5] § 150-314-0435: Sales Factor; Sales Other Than Sales of Tangible Pe…
[6] Rule 8.9: Public Law 86-272 (15 U.S.C. §§ 381-384) - Solicitation D…
[7] § 18:7-1.9A: Internet activities conducted by a business
Discussion
Humanity's Last Tax Evaluation
TaxBench is designed to serve as the definitive, and last necessary, benchmark for tax reasoning capabilities in AI systems. Unlike static benchmarks that quickly become outdated in dynamic domains, TaxBench is a living benchmark: the Tax AI Consortium's commitment to regular updates and expert validation keeps it relevant as tax regulations and policies evolve. Its comprehensive coverage across personal, corporate, international, and specialized tax areas, combined with hyper-granular evaluation rubrics, addresses the full spectrum of tax reasoning challenges, and its adaptable framework accommodates future regulatory changes and emerging tax concepts through the Consortium's governance structure. This approach eliminates the need for fragmented, specialized benchmarks that would make progress tracking difficult. By establishing a single, authoritative standard with built-in mechanisms for evolution, we create a sustainable framework for measuring, comparing, and advancing AI capabilities in tax reasoning for years to come, regardless of how tax systems or AI technologies develop.
Impact
TaxBench represents a significant advancement in domain-specific AI evaluation with implications extending beyond academic research. By improving AI tax reasoning capabilities, we can democratize access to tax expertise, potentially benefiting millions of individuals and businesses who cannot afford professional consultation. Enhanced tax reasoning in AI systems could dramatically reduce compliance burdens, which currently cost taxpayers billions annually in administrative expenses and professional fees. For policymakers, better AI tax reasoning enables more sophisticated analysis of proposed regulations, allowing for better understanding of distributional effects and unintended consequences. These advancements could ultimately contribute to more equitable tax systems by ensuring all taxpayers—regardless of resources—can properly navigate complex tax codes and receive the benefits and protections to which they are entitled.