Enterprises Move Beyond Token Counts to Measure AI

By LDS Team
Relevance Score: 6.8

PYMNTS reports, citing the Financial Times, that some Amazon engineering teams used the company's internal AI tool MeshClaw to inflate token-usage leaderboards this spring; the Financial Times described the behaviour as "tokenmaxxing" and quoted an employee saying, "Managers are looking at it." PYMNTS also reports similar behaviour at Meta. PYMNTS argues that raw token consumption is a poor proxy for business value and that enterprise finance teams face unpredictable costs as spending shifts to per-call model billing. PYMNTS cites Salesforce as a case study: the outlet reports that Salesforce introduced $2-per-conversation pricing for Agentforce in late 2024 and logged 5,000 Agentforce deals in the first two quarters under that model, of which only 3,000 were paid. PYMNTS describes Salesforce's current unit, the Agentic Work Unit (AWU), as a move to measure discrete tasks completed by agents rather than tokens.

What happened

PYMNTS, citing the Financial Times, reports that internal leaderboards tracking token consumption prompted some employees at Amazon to use the company's internal AI tool MeshClaw to delegate tasks to agents and boost token totals, a practice the Financial Times calls "tokenmaxxing." PYMNTS reports similar incentives at Meta, and quotes an FT-sourced employee: "Managers are looking at it." PYMNTS frames tokens as a poor metric for value and attributes billing and forecasting difficulties to token-based consumption models. PYMNTS reports that Salesforce tested $2-per-conversation pricing for Agentforce in late 2024, logging 5,000 deals in its first two quarters under that model, of which only 3,000 were paid, and that Salesforce currently measures work in Agentic Work Units (AWUs).

Editorial analysis - technical context

Metrics that reward raw model calls or token volume put optimisation pressure on users. A common pattern is that when consumption is the measured KPI, teams prioritize activity that increases the tracked metric rather than downstream outcomes. The technical side effects include inefficient prompting, agentic workflow leakage, and chains of tool calls that are harder to audit.

Industry context

For finance and procurement teams, token-based billing converts engineering choices into direct cost drivers. Organizations shifting from fixed-license SaaS to per-call model economics face forecasting gaps and procurement friction, as PYMNTS reports in its Salesforce example. Vendors are experimenting with unit definitions that better align cost with discrete business outcomes, as illustrated by Salesforce's move to AWUs.

What to watch

Monitor whether more vendors publish alternative billing units (task- or outcome-based metrics), whether industry consortia recommend standard usage units, and whether enterprises adopt internal guardrails to separate valuable agent outcomes from metric-driven noise. For practitioners: track both token consumption and outcome metrics when evaluating agentic systems.
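As a minimal sketch of tracking both signals side by side, the Python below aggregates token consumption and completed tasks per team and derives tokens per completed task. The `AgentRun` record, the `summarize` helper, the team names, and all figures are hypothetical and chosen only for illustration; none of them come from PYMNTS, Amazon, or Salesforce.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One agent run (hypothetical record shape, not a vendor schema)."""
    team: str
    tokens: int            # total tokens consumed by the run
    task_completed: bool   # whether the run produced an accepted outcome


def summarize(runs):
    """Return per-team token totals, completed tasks, and tokens per completed task."""
    totals = {}
    for run in runs:
        entry = totals.setdefault(run.team, {"tokens": 0, "completed": 0})
        entry["tokens"] += run.tokens
        entry["completed"] += int(run.task_completed)
    for entry in totals.values():
        done = entry["completed"]
        # None marks teams that consumed tokens without completing any task.
        entry["tokens_per_completed_task"] = entry["tokens"] / done if done else None
    return totals


# Made-up sample data: team_b tops a token leaderboard but completes fewer tasks.
runs = [
    AgentRun("team_a", 120_000, True),
    AgentRun("team_a", 80_000, True),
    AgentRun("team_b", 900_000, True),
    AgentRun("team_b", 600_000, False),
]

summary = summarize(runs)
for team, stats in sorted(summary.items()):
    print(team, stats)
```

Ranking teams on `tokens` alone would put team_b first in this sample, while the per-task ratio shows it spending far more tokens per accepted outcome. That gap is the one a token-only leaderboard hides.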

Key Points

  • 1. WHAT: Internal leaderboards tied to token consumption drove token-inflating behaviour, reducing correlation with business value.
  • 2. WHY: Token-based pricing shifts technical decisions into finance territory, creating forecasting and auditing challenges for enterprises.
  • 3. SO WHAT: Vendors and buyers are exploring alternative units, like task-based AWUs, to align billing with discrete agent outcomes.

Scoring Rationale

The story matters to practitioners who manage cost, procurement, and observability for AI systems. It highlights real enterprise pain with token-based billing and vendor responses, but it is not a frontier technical breakthrough.
