
m6A-FORM introduces foundation model for m6A biology

June 11, 2026 | By LDS Team

Relevance Score: 6.8

A June 10, 2026 arXiv preprint from University of Pittsburgh researchers (Ting-He Zhang et al.) introduces m6A-FORM, a transformer foundation model pretrained on nearly 25 million RNA sequence windows from 143 human MeRIP-seq studies to predict N6-methyladenosine (m6A) sites and their regulatory function. Its fine-tuned site predictor reaches a PR-AUC of 0.635, a gain of at least 0.14 over the prior best method (DeepSRAMP), while running about 10.7 times faster than standard per-adenosine inference. Task-specific versions of the model also predict binding for 19 m6A regulator proteins and identify 19,631 tissue-conserved m6A sites across 24 human tissues that correlate with reduced gene expression. The preprint has not yet been peer-reviewed, and the reviewed sections do not state a public code or data release.

The more interesting result here is not the headline accuracy number but the 10.7-times inference speedup. By reformulating m6A site prediction as sequence labeling over an entire MeRIP-seq peak instead of scoring each candidate adenosine separately, the authors address an accuracy problem and a scaling problem at once. That reformulation matters more for practical genome-wide deployment than a few points of PR-AUC.

What happened

A June 10, 2026 arXiv preprint, "m6A-FORM: An m6A-focused Foundation Model for Decoding m6A Regulatory Function" by Ting-He Zhang, Sumin Jo, Shou-Jiang Gao, and Yufei Huang (University of Pittsburgh / UPMC Hillman Cancer Center), presents a transformer pretrained via masked language modeling on 24,909,934 RNA sequence windows derived from 22,548,379 MeRIP-seq peaks across 143 human studies. The team fine-tuned a site-prediction variant, m6A-FORM-sites, on a high-confidence set of 131,320 single-nucleotide m6A sites built by intersecting m6A-Atlas v2.0 and GLORI annotations, reformulating prediction as a sequence-labeling task over 3-mers within each MeRIP-seq peak rather than scoring adenosines one at a time. This version reached a PR-AUC of 0.635 and ROC-AUC of 0.988, a gain of at least 0.14 PR-AUC over the previous best model, DeepSRAMP, and ran roughly 10.7 times faster than the conventional adenosine-centered approach in the paper's own ablation. A second fine-tuned variant, m6A-FORM-RWEBind, predicts binding for 19 m6A regulator proteins (13 readers, 5 writers, 1 eraser) using CLIP-seq data from POSTAR3, improving median PR-AUC by 0.09 over baseline RBP-binding predictors iDeepS and RNAProt. A third variant, m6A-FORM-decay, predicts YTHDF2-associated mRNA decay. Applied across 67 MeRIP-seq datasets from 24 human tissues, the model identified 19,631 tissue-conserved m6A sites showing higher methylation, more predicted YTHDF2 binding, and a negative correlation with gene expression (Spearman correlation -0.45) compared with 241,000 infrequent sites.
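The difference between the two framings described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' code: the helper names, the stride-1 overlapping 3-mer tokenization, the rule that a 3-mer is labeled by its center base, and the example peak sequence are all assumptions made for this sketch.

```python
from typing import List, Tuple


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a peak into overlapping k-mers (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def token_labels(seq: str, m6a_positions: List[int], k: int = 3) -> List[int]:
    """Label a k-mer 1 if its center base is an annotated m6A site (assumed rule)."""
    center = k // 2
    sites = set(m6a_positions)
    return [1 if (i + center) in sites else 0 for i in range(len(seq) - k + 1)]


def adenosine_windows(seq: str, flank: int = 2) -> List[Tuple[int, str]]:
    """Per-adenosine framing: one window, hence one model call, per candidate A."""
    windows = []
    for i, base in enumerate(seq):
        if base == "A":
            lo, hi = max(0, i - flank), min(len(seq), i + flank + 1)
            windows.append((i, seq[lo:hi]))
    return windows


# Hypothetical 16-nt peak with two annotated sites (0-based positions).
peak = "GGACUAGACAUGGACU"
sites = [2, 13]

tokens = kmer_tokens(peak)            # one token sequence for the whole peak
labels = token_labels(peak, sites)    # one label per token, predicted in one pass
windows = adenosine_windows(peak)     # one separate input per adenosine

print(len(tokens), sum(labels), len(windows))
```

In the peak-level framing the model sees the whole peak once and emits a label for every token, while the per-adenosine framing needs a separate input for each candidate adenosine. That difference in the number of model calls is the intuition behind the reported speedup, although the preprint's actual tokenization and batching may differ from this sketch.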

Technical context

The paper's central methodological move is treating MeRIP-seq peaks, not individual adenosines, as the unit of representation. Pretraining on peak-derived sequences gives the model an enrichment-informed prior that narrows the search space and, per the authors' ablation, drives most of the PR-AUC gain: the pretrained and non-pretrained baselines reach similar ROC-AUC but diverge sharply on PR-AUC, the harder metric under real class imbalance. The same representation is reused across three fine-tuning tasks (site calling, regulator-binding prediction, and decay prediction) rather than training separate models from scratch. This is the now-standard foundation-model pattern, applied here to a genomics problem with unusually sparse, peak-structured labels.
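Why two models can look alike on ROC-AUC yet differ on PR-AUC is easier to see with a toy example. The sketch below uses only the Python standard library and synthetic data that is not from the preprint: 10 positives among 1,000 candidates, with positives placed at every fifth rank in the top 50. Average precision is used here as one common estimator of PR-AUC; the preprint may compute the area differently.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision at each positive's rank.

    Assumes distinct scores and at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic ranking: rank 1 has the highest score; positives sit at ranks 5, 10, ..., 50.
ranks = range(1, 1001)
scores = [(1000 - r) / 1000 for r in ranks]
labels = [1 if (r % 5 == 0 and r <= 50) else 0 for r in ranks]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC = {auc:.3f}, average precision = {ap:.3f}")
```

On this toy ranking the ROC-AUC comes out near 0.978 while average precision is 0.2: with 99 negatives per positive, a handful of negatives ranked above each positive barely moves ROC-AUC but cuts precision sharply. That is the sense in which PR-AUC is the harder metric when true sites are rare.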

For practitioners

For computational biology teams, the reusable-encoder design and the concrete 10.7x inference speedup are the most transferable elements. They are worth evaluating even before independent replication, since the same peak-as-context reformulation could apply to other epitranscriptomic marks (m5C, pseudouridine) that face the same inefficiency of scoring one candidate nucleotide at a time. As with any single, not-yet-peer-reviewed preprint, treat the specific PR-AUC and enrichment statistics as reported pending replication. The reviewed sections do not state whether code, trained weights, or processed datasets will be released, which would be the key gate for independent verification and reuse.

What to watch

Watch for a code or model-weights release, or a peer-reviewed publication, that would let other labs reproduce the PR-AUC 0.635 / ROC-AUC 0.988 site-prediction result and the 10.7x speedup claim, and for independent benchmarking against DeepSRAMP and other adenosine-centered predictors on held-out GLORI sites.

Key Points

1. m6A-FORM reformulates m6A site prediction as peak-level sequence labeling, reaching PR-AUC 0.635 while running about 10.7 times faster than conventional per-adenosine inference.
2. The same pretrained encoder is fine-tuned for three tasks (site calling, regulator-binding prediction, and YTHDF2-linked decay prediction) rather than training a separate model from scratch for each.
3. Applying the model across 24 human tissues identified 19,631 conserved m6A sites linked to higher methylation and reduced gene expression.

Scoring Rationale

Directly verified against the full preprint text (not just the abstract), confirming precise, non-inflated figures for pretraining scale, benchmark performance, and the notable 10.7x inference speedup from the peak-level reformulation. Genuinely useful methodological contribution for computational biology practitioners, but remains a single not-yet-peer-reviewed preprint with no stated code or data release, capping it just above the notable threshold pending independent replication.

Sources

Public references used for this report.

arxiv.org: "m6A-FORM: An m6A-focused Foundation Model for Decoding m6A Regulatory Function"
