Research, embeddings, genomics, UK Biobank, OpenAI

LLM Embeddings Enable Variant-Level Genomic Representations

April 1, 2026

Relevance Score: 9.1

On April 1, 2026, researchers published a systematic framework that uses large language model (LLM) embeddings to represent genetic variants across the human genome. They generated embeddings for 8.9 billion possible variants at three scales: 1.5 million HapMap3/MEGA variants, 90 million imputed UK Biobank variants, and the full set of roughly 9 billion variants. The embeddings come from OpenAI's text-embedding-3-large and open-source Qwen3 models. Baseline tests show high predictive accuracy and demonstrate embedding-augmented polygenic risk score prediction on UK Biobank. The resources are publicly available on Hugging Face.
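To make the idea of a variant-level embedding concrete, here is a minimal Python sketch of the general pattern: describe each variant as text, then map that text to a vector. Everything in it is an assumption for illustration. The summary does not state how the authors textualize variants, so the `variant_to_text` template is hypothetical, and `toy_embed` is a deterministic hash-based stand-in, not text-embedding-3-large or Qwen3.

```python
import hashlib
import math
from typing import Callable, Dict, List, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def variant_to_text(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Hypothetical text template; the source does not specify the real one."""
    return (
        f"Variant on chromosome {chrom} at position {pos}: "
        f"reference allele {ref}, alternate allele {alt}."
    )


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Deterministic stand-in for a real embedding model (assumption).

    Hashes the text with SHA-256, centers the first `dim` bytes around
    zero, and scales the result to unit length. `dim` must be <= 32.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def embed_variants(
    variants: List[Variant],
    embed: Callable[[str], List[float]] = toy_embed,
) -> Dict[Variant, List[float]]:
    """Build a lookup table from variant to embedding vector."""
    return {v: embed(variant_to_text(*v)) for v in variants}


# Two made-up variants, used only to exercise the functions.
variants = [("1", 12345, "A", "G"), ("2", 67890, "C", "T")]
table = embed_variants(variants)
```

Because `embed` is a parameter, the stand-in can be swapped for a call to an actual embedding model without changing the rest of the sketch. The resulting table is the kind of per-variant feature that a downstream model, such as a polygenic risk score predictor, could consume.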

Scoring Rationale

This is a large-scale, novel contribution: it produces embeddings for billions of variants, demonstrates predictive utility, and comes with a public release that increases actionability. The score is boosted for scope and usability but limited slightly because the work is an arXiv preprint rather than peer-reviewed.
