LLM Embeddings Enable Variant-Level Genomic Representations

In work published on April 1, 2026, researchers present a systematic framework that uses large language model (LLM) embeddings to represent genetic variants across the human genome. They generate embeddings for the 8.9 billion possible variants at three scales: 1.5 million HapMap3/MEGA variants, 90 million imputed UK Biobank variants, and the full set of roughly 9 billion variants. The embeddings come from OpenAI's text-embedding-3-large and open-source Qwen3 models. Baseline tests show high predictive accuracy, and the authors report embedding-augmented polygenic risk score prediction on UK Biobank data. The resources are publicly available on Hugging Face.
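To make "embedding-augmented polygenic risk score" concrete, here is a minimal sketch of one way per-variant embeddings could be combined with a standard score. It is not the authors' method. The feature construction, the ridge model, and every name in it (`standard_prs`, `embedding_features`, `fit_ridge`) are assumptions made for illustration, and random vectors stand in for the released embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_variants, emb_dim = 200, 50, 8

# Hypothetical inputs: genotype dosages (0/1/2), GWAS effect sizes, and
# per-variant embeddings. Random vectors stand in for real embeddings here.
genotypes = rng.integers(0, 3, size=(n_people, n_variants)).astype(float)
effect_sizes = rng.normal(0.0, 0.1, size=n_variants)
variant_embeddings = rng.normal(size=(n_variants, emb_dim))


def standard_prs(G, beta):
    """Classic polygenic risk score: effect-size-weighted sum of dosages."""
    return G @ beta


def embedding_features(G, beta, E):
    """Assumed augmentation: weight each variant's embedding by its effect
    size, then sum over the variants a person carries.
    Shapes: (people, variants) @ (variants, emb_dim) -> (people, emb_dim)."""
    return G @ (beta[:, None] * E)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(X1.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)


def predict(X, weights):
    return np.column_stack([np.ones(len(X)), X]) @ weights


prs = standard_prs(genotypes, effect_sizes)
features = np.column_stack(
    [prs, embedding_features(genotypes, effect_sizes, variant_embeddings)]
)

# Simulated phenotype, for demonstration only.
phenotype = prs + rng.normal(0.0, 0.5, size=n_people)

weights = fit_ridge(features, phenotype)
predicted = predict(features, weights)
```

With a one-dimensional embedding of all ones, `embedding_features` reduces to the standard score, so the augmented model contains the classic PRS as a special case. Whether the extra dimensions help on real data would have to be checked on held-out individuals.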
Scoring Rationale
This is a large-scale, novel contribution: it produces embeddings for billions of variants, demonstrates predictive utility, and releases the resources publicly, all of which increase actionability. The score is boosted for scope and usability but reduced slightly because the work is an arXiv preprint rather than peer-reviewed.
Sources
- [2509.20702] Incorporating LLM Embeddings for Variation Across the Human Genome (arxiv.org)
