Kerrison and Clyburn Examine LLM Performance Evaluations

According to InfoQ, Legare Kerrison and Cedric Clyburn of Red Hat presented practical methods for evaluating and optimising LLM inference. They discussed resource and cost tradeoffs for common workloads including Retrieval Augmented Generation (RAG), highlighted evaluation metrics such as Requests Per Second (RPS), Time to First Token (TTFT), and Inter-Token Latency (ITL), and argued that public leaderboards and generic benchmarks are of limited use for most business problems, emphasising a three-way tradeoff between model quality, responsiveness, and cost. They also predicted that 2026 will be the year of LLM evaluations.
What happened
According to InfoQ, Kerrison and Clyburn examined resource requirements and cost implications for workloads such as RAG and highlighted evaluation metrics including RPS, TTFT, and ITL. The speakers described a three-way tradeoff among model quality, responsiveness (latency), and cost, and noted that public leaderboards and generic benchmarks do not capture many unique enterprise use cases. Per InfoQ, they framed 2023 as the year of LLMs, 2024 as the year of RAG, and 2025 as the year of model fine-tuning and AI agents, and suggested 2026 will be about LLM evaluations.
Editorial analysis - technical context
Companies deploying LLMs commonly balance latency, accuracy, and cost, and the tradeoff triangle described in the talk mirrors observed production constraints across the industry. For RAG-style applications, practitioners should expect evaluation complexity to increase because performance depends on both the retrieval index and the generative model, not the model alone. The metrics named by the presenters (RPS, TTFT, and ITL) together cover throughput and perceived responsiveness; measuring all three is necessary to capture both user experience and operational cost. Observed patterns in similar deployments show that microbenchmark leaders often fail to reflect application-level latencies introduced by retrieval, network calls, or batching strategies.
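As an illustration of how TTFT and ITL are typically derived from a streaming response, the following minimal sketch times a token stream. The stream_tokens generator is a hypothetical stand-in for whatever streaming client a serving stack exposes (for example, an OpenAI-compatible endpoint with stream=True) and is simulated here so the snippet runs on its own.

```python
# Minimal sketch: measuring Time to First Token (TTFT) and mean Inter-Token
# Latency (ITL) for a streaming LLM response. stream_tokens() is a simulated
# stand-in; replace it with a real streaming client call in practice.
import random
import time
from typing import Dict, Iterable, Iterator


def stream_tokens() -> Iterator[str]:
    """Simulated token stream: a queueing/prefill delay, then per-token decode delays."""
    time.sleep(0.25)  # pretend queueing + prefill
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(random.uniform(0.02, 0.05))  # pretend decode step
        yield tok


def measure_stream(tokens: Iterable[str]) -> Dict[str, float]:
    """Consume a token stream and report TTFT, mean ITL, token count, and total time."""
    start = time.perf_counter()
    first_token_at = None
    prev = None
    gaps = []  # inter-token gaps in seconds
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        else:
            gaps.append(now - prev)
        prev = now
        count += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": (first_token_at - start) if first_token_at is not None else None,
        "itl_s": sum(gaps) / len(gaps) if gaps else None,
        "tokens": count,
        "total_s": total,
    }


if __name__ == "__main__":
    print(measure_stream(stream_tokens()))
```

Under load, the same measurement would be repeated across many concurrent requests so that TTFT and ITL distributions (not just means) can be compared against throughput in requests per second.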
Context and significance
Industry context: The InfoQ coverage places the talk within a broader shift toward application-centric LLM evaluation. As models and toolchains diversify, off-the-shelf leaderboards remain useful for baseline comparison but insufficient for systems that combine retrieval, prompting strategies, and downstream postprocessing. For teams engineering production LLM systems, this means more investment in workload-specific measurement and tooling rather than relying solely on public benchmarks.
What to watch
Observers should track the emergence of standardised application-level benchmarks that incorporate retrieval latency, index freshness, and multi-step agent workflows. Watch for tooling that automates measurement of TTFT and ITL under realistic batching and concurrency, and for community patterns around cost-normalised quality metrics. Also monitor whether vendors and open-source projects publish RAG-aware benchmark suites that make tradeoffs between throughput, latency, and cost explicit.
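As a sketch of what a cost-normalised quality metric might look like, the snippet below divides an evaluation score by the dollar cost of the tokens used to run the evaluation. The metric definition and the per-token prices are illustrative assumptions, not a published standard.

```python
# Illustrative cost-normalised quality metric: eval score per dollar spent.
# The formula and the example prices are assumptions for illustration only.
def cost_normalised_quality(eval_score: float,
                            prompt_tokens: int,
                            completion_tokens: int,
                            usd_per_1m_prompt: float,
                            usd_per_1m_completion: float) -> float:
    """Return eval_score divided by the USD cost of the tokens consumed."""
    cost_usd = (prompt_tokens * usd_per_1m_prompt
                + completion_tokens * usd_per_1m_completion) / 1_000_000
    return eval_score / cost_usd if cost_usd else float("inf")


# Example: 0.82 accuracy on an internal RAG eval, 4M prompt + 1M completion tokens,
# at hypothetical prices of $0.50 / $1.50 per million tokens.
print(cost_normalised_quality(0.82, 4_000_000, 1_000_000, 0.50, 1.50))
```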
Scoring rationale
The piece highlights practical, practitioner-focused measurement topics that matter for production LLM deployments. It is notable for operations and engineering teams but does not introduce a new model or major platform change.