Zalando Deploys LLM-As-Judge For Search Quality Assurance

Zalando published a research paper in 2024 and, in 2025, applied an LLM-as-a-judge framework to evaluate search relevance proactively. The system combines NER-based query clustering, LLM translation, and visual and textual context to score results at scale for new markets, including Luxembourg, Portugal, and Greece. This approach automates pre-launch QA, reduces manual annotation, and enables reproducible re-evaluation after fixes.
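To make the scoring step concrete, here is a minimal Python sketch of an LLM-as-a-judge loop. It is an illustration under stated assumptions, not Zalando's implementation: the `Product` fields, the prompt wording, the 0-to-1 scale, and the `keyword_judge` stand-in (used in place of a real LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Product:
    title: str
    image_caption: str  # hypothetical textual stand-in for the visual context


# A judge maps a prompt to a relevance score in [0, 1].
# In a real pipeline this would wrap an LLM call; that part is assumed here.
Judge = Callable[[str], float]


def build_prompt(query: str, product: Product) -> str:
    """Combine the query with text and visual context into one judge prompt."""
    return (
        "Rate from 0 to 1 how relevant the product is to the search query.\n"
        f"Query: {query}\n"
        f"Product title: {product.title}\n"
        f"Image description: {product.image_caption}\n"
    )


def score_results(query: str, results: list[Product], judge: Judge) -> float:
    """Return the mean judged relevance of a result list (0.0 if it is empty)."""
    if not results:
        return 0.0
    return mean(judge(build_prompt(query, p)) for p in results)


def keyword_judge(prompt: str) -> float:
    """Toy judge: 1.0 if any query word appears in the product text, else 0.0."""
    lines = prompt.splitlines()
    query_words = lines[1].removeprefix("Query: ").lower().split()
    product_words = set(" ".join(lines[2:]).lower().split())
    return 1.0 if any(w in product_words for w in query_words) else 0.0
```

Because the judge is passed in as a plain function, the same queries can be re-scored after a fix by calling `score_results` again with unchanged inputs, which is one simple way to keep re-evaluation reproducible.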
Key Points
- Implements LLM-as-a-judge to score semantic relevance of search results across languages and modalities.
- Automates test generation using NER clustering and LLM translation to cover diverse search intents.
- Enables proactive pre-launch QA for Luxembourg, Portugal, and Greece, shortening debugging and verification cycles.
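The test-generation idea in the points above can be sketched as follows. This is a simplified illustration, not the published method: it assumes NER labels are already attached to each query, groups queries that share the same label set, picks a fixed number of representatives per group, and passes them through a `translate` callable that stands in for an LLM translation step. All names here are hypothetical.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical NER output: a query paired with the entity labels found in it.
TaggedQuery = tuple[str, frozenset[str]]


def cluster_by_entities(tagged: list[TaggedQuery]) -> dict[frozenset[str], list[str]]:
    """Group queries that share exactly the same set of entity labels."""
    clusters: dict[frozenset[str], list[str]] = defaultdict(list)
    for query, labels in tagged:
        clusters[labels].append(query)
    return dict(clusters)


def build_test_set(
    tagged: list[TaggedQuery],
    translate: Callable[[str], str],
    per_cluster: int = 1,
) -> list[str]:
    """Pick representatives from each cluster and translate them.

    Clusters and queries are sorted so the output is deterministic,
    which keeps later re-runs comparable.
    """
    clusters = cluster_by_entities(tagged)
    tests: list[str] = []
    for labels in sorted(clusters, key=sorted):
        for query in sorted(clusters[labels])[:per_cluster]:
            tests.append(translate(query))
    return tests
```

Sampling per cluster rather than from the raw query log is what gives coverage across search intents: a rare label combination still contributes at least one test query.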
Scoring Rationale
Strong practical impact from an official Zalando deployment and reproducible pipelines, but limited academic novelty compared with foundational LLM research.

