Multimodal LLMs Automate Product Retrieval Evaluation

A new research paper presents "Retrieve, Annotate, Evaluate, Repeat", a multimodal LLM (MLLM) framework for large-scale product retrieval evaluation, validated on 20,000 query-product pairs and deployed at Zalando. The method first generates query-specific annotation guidelines, then has a model label each query-product pair against them. Using OpenAI models (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo), it reaches human-comparable accuracy while cutting cost by up to 1,000× and assessing 20,000 pairs in about 20 minutes. Results are reported for English and German.
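The two-stage flow described above (one guideline per query, then one annotation per query-product pair) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Product` type, the prompt wording, the `stub_llm` stand-in for a real model call, and the choice of per-query precision as the metric are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function mapping a prompt to the model's text reply.
LLMFn = Callable[[str], str]


@dataclass(frozen=True)
class Product:
    product_id: str
    description: str
    image_url: str  # a multimodal model would receive the image; the stub ignores it


def generate_guideline(query: str, llm: LLMFn) -> str:
    """Stage 1: ask the model for annotation guidelines specific to this query."""
    return llm(f"Write annotation guidelines for the search query: {query!r}")


def annotate(query: str, product: Product, guideline: str, llm: LLMFn) -> bool:
    """Stage 2: ask the model to judge one query-product pair against the guideline."""
    prompt = (
        f"Guidelines: {guideline}\n"
        f"Query: {query}\n"
        f"Product: {product.description}\n"
        f"Image: {product.image_url}\n"
        "Answer 'relevant' or 'irrelevant'."
    )
    return llm(prompt).strip().lower() == "relevant"


def evaluate(retrieved: dict[str, list[Product]], llm: LLMFn) -> dict[str, float]:
    """Return the fraction of retrieved products judged relevant, per query."""
    scores: dict[str, float] = {}
    for query, products in retrieved.items():
        guideline = generate_guideline(query, llm)  # generated once, reused per pair
        labels = [annotate(query, p, guideline, llm) for p in products]
        scores[query] = sum(labels) / len(labels) if labels else 0.0
    return scores


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model call (assumption, for illustration)."""
    if prompt.startswith("Write annotation guidelines"):
        return "Relevant products match the item type and colour named in the query."
    product_line = next(
        line for line in prompt.splitlines() if line.startswith("Product: ")
    )
    return "relevant" if "red dress" in product_line.lower() else "irrelevant"


retrieved = {
    "red dress": [
        Product("p1", "Red dress, midi length", "https://example.com/p1.jpg"),
        Product("p2", "Blue jeans, slim fit", "https://example.com/p2.jpg"),
    ]
}
scores = evaluate(retrieved, stub_llm)
```

In a real setup, `stub_llm` would be replaced by a call to a multimodal model that also receives the product image. Because each pair is annotated independently, the calls could run concurrently.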
Key Points
- Demonstrates an MLLM-based framework that evaluates 20,000 query-product pairs with human-comparable accuracy
- Shows cost and speed benefits: assessments are up to ~1,000× cheaper, and 20,000 pairs take about 20 minutes
- Enables scalable multilingual evaluation, freeing humans for complex edge cases and error analysis
Scoring Rationale
High practical impact from real-world deployment and strong efficiency gains; limited novelty, since it is an application rather than a new model.

