Fine-R1 Delivers Few-Shot Fine-Grained Visual Recognition

Researchers posted an arXiv preprint on Feb 7, 2026, introducing Fine-R1, a multimodal large language model (MLLM) tailored for fine-grained visual recognition and trained with Chain-of-Thought supervised fine-tuning and Triplet Augmented Policy Optimization. With only 4-shot training, the model reportedly outperforms general MLLMs and contrastive CLIP models on both seen and unseen sub-categories, with improved robustness to intra-class variance and stronger discriminative ability. Code is available.
Scoring Rationale
Strong methodological novelty and few-shot results drive the score; it is limited by reliance on a single arXiv preprint that has not been peer reviewed.
