QVAC Releases Genesis II Educational Dataset

QVAC, Tether Data’s AI research division, has released Genesis II, which adds 107 billion tokens to its open-source synthetic dataset for AI pre-training. The dataset now totals 148 billion tokens across 19 education-focused domains and introduces Option-Level Reasoning, a method for teaching models to reason through multiple-choice questions. It is available under a Creative Commons license on Hugging Face to support open research and local model development.
Key Points
- Adds 107 billion synthetic tokens, bringing the total to 148 billion across 19 education-focused domains.
- Introduces Option-Level Reasoning to teach multiple-choice reasoning and extend the failure analysis from Genesis I.
- Enables open research and local pre-training through a Creative Commons release on Hugging Face.
Scoring Rationale
High novelty and direct usability, thanks to the Creative Commons release and the Option-Level Reasoning method, offset by limited generality due to the education-focused scope.

