DistilViT Uses Prefix Conditioning With LoRA
On Dec. 16, 2025, the PDF.js alt-text project adopts a prefix-conditioning plus LoRA architecture for DistilViT to generate concise alt text. The team freezes a SigLIP vision encoder and the base decoder, training only a projection head and ~2.2M LoRA parameters (of ≈221M total), and reports five times faster training with a 1–2% CLIP score difference versus cross-attention models. The approach simplifies deployment and targets on-device inference.
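To make the prefix-conditioning idea concrete, here is a minimal NumPy sketch rather than the project's actual code. It assumes the frozen vision encoder yields one feature vector per image patch, and that a trainable linear projection maps those vectors into the decoder's embedding space, where they are prepended to the caption token embeddings. All sizes and names (`NUM_PATCHES`, `proj_w`, `build_decoder_input`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only; not DistilViT's real dimensions.
NUM_PATCHES, VISION_DIM = 16, 32   # output shape of the frozen vision encoder
DECODER_DIM, VOCAB = 24, 100       # frozen decoder embedding width and vocab size

# Frozen pieces: stand-ins for the vision encoder output and decoder embedding table.
image_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))
token_embedding = rng.normal(size=(VOCAB, DECODER_DIM))

# Trainable piece: a linear projection head from vision space to decoder space.
proj_w = rng.normal(scale=0.02, size=(VISION_DIM, DECODER_DIM))
proj_b = np.zeros(DECODER_DIM)

def build_decoder_input(features, token_ids):
    """Prepend projected image features to the caption token embeddings."""
    prefix = features @ proj_w + proj_b      # (NUM_PATCHES, DECODER_DIM)
    text = token_embedding[token_ids]        # (len(token_ids), DECODER_DIM)
    return np.concatenate([prefix, text], axis=0)

caption_ids = np.array([1, 5, 9, 2])
decoder_input = build_decoder_input(image_features, caption_ids)
print(decoder_input.shape)  # (20, 24)
```

Because the image enters the decoder as ordinary input positions, the decoder's existing self-attention handles it and no cross-attention layers need to be added, which is the simplification the brief describes.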
Key Points
- Adopts prefix-conditioning plus LoRA, freezing the vision encoder and base decoder and training ≈2.2M parameters
- Reduces architectural complexity, avoids cross-attention injection, and cuts GPU memory and export friction
- Enables five times faster training and simpler ONNX deployment for on-device alt-text models
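The small trainable-parameter count follows from how LoRA works: each adapted weight matrix stays frozen and only a low-rank update is learned. The sketch below illustrates that mechanism on a single linear layer in NumPy; the layer sizes, rank, and scaling factor are hypothetical and are not taken from the project. At the scale reported above, ≈2.2M trainable out of ≈221M total is roughly 1% of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, rank, and scaling, for illustration only.
D_IN, D_OUT, RANK, ALPHA = 64, 64, 4, 8

W = rng.normal(size=(D_OUT, D_IN))             # frozen base weight
A = rng.normal(scale=0.01, size=(RANK, D_IN))  # trainable down-projection
B = np.zeros((D_OUT, RANK))                    # trainable up-projection, zero at init

def lora_linear(x):
    """Frozen linear layer plus a scaled low-rank update B @ A."""
    return x @ W.T + (ALPHA / RANK) * (x @ A.T @ B.T)

x = rng.normal(size=(3, D_IN))
base_out = x @ W.T
lora_out = lora_linear(x)

trainable = A.size + B.size   # RANK * (D_IN + D_OUT)
frozen = W.size               # D_IN * D_OUT
print(trainable, frozen)      # 512 4096
```

With `B` initialised to zero, the adapted layer starts out identical to the frozen one, so training begins from the base decoder's behaviour and only the two small matrices receive gradients.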
Scoring Rationale
Practical, reproducible engineering pattern demonstrating five times faster training and simple deployment; limited by single-project results and a modest CLIP score trade-off.