Amino-acid-aware kmers improve TCR repertoire classification
Per the University of Cambridge repository entry for the accepted paper (Hannah et al., 2026), the authors introduce a T-cell receptor (TCR) repertoire representation that encodes amino-acid similarity inside kmers. The paper reports training XGBoost and logistic regression classifiers on repertoires that include samples from patients with coeliac disease and donors with prior cytomegalovirus (CMV) infection, and finds that XGBoost models outperform logistic regression in test performance (Hannah et al., 2026; bioRxiv preprint). The authors also report that a reduced alphabet derived from BLOSUM62 yields slightly stronger XGBoost test performance than alternative kmer encodings. The paper highlights a core challenge: repertoire datasets contain orders of magnitude fewer labeled samples than sequences, which motivates similarity-aware representations (Hannah et al., 2026).
What happened
Per the University of Cambridge repository entry for the accepted paper (Hannah et al., 2026), the authors propose a T-cell receptor (TCR) repertoire representation that incorporates amino-acid similarity into kmer features. The manuscript reports experiments on repertoire datasets that include coeliac disease and prior cytomegalovirus (CMV) infection samples, and compares XGBoost and logistic regression classifiers trained on those representations. The authors report that XGBoost models outperform logistic regression in test performance and that a reduced alphabet based on BLOSUM62 produced slightly stronger XGBoost test results than other kmer encodings (Hannah et al., 2026; bioRxiv preprint).
Technical details
Per the paper's abstract, the representation modifies standard kmer counts by grouping or encoding amino acids according to similarity, so that non-identical but similar kmers can contribute shared information. The work frames this as a compact, computationally efficient alternative to alignment-heavy or embedding-based approaches for small labeled repertoires. The experiments pair these kmer features with XGBoost and logistic regression classifiers and evaluate discrimination between disease and control repertoires. The repository entry emphasizes the limits of dataset scale: there are far fewer labeled samples than total TCR sequences (Hannah et al., 2026).
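To make the grouping idea concrete, here is a minimal sketch of reduced-alphabet kmer counting, assuming a simple many-to-one letter mapping. The grouping in EXAMPLE_GROUPS, the function names, and the choice of k=3 are illustrative assumptions; they are not taken from the paper, which may define its BLOSUM62-derived alphabet and kmer length differently.

```python
from collections import Counter

# Hypothetical similarity groups, for illustration only.
# This is NOT the reduced alphabet used by Hannah et al.
EXAMPLE_GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

# Map each amino acid to a representative letter (the first member of its group).
REDUCE = {aa: group[0] for group in EXAMPLE_GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a sequence in the reduced alphabet; unknown symbols become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def kmer_counts(seq, k=3):
    """Count overlapping kmers of length k in a single sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repertoire_features(cdr3s, k=3, reduced=True):
    """Sum kmer counts over every sequence in one repertoire."""
    total = Counter()
    for seq in cdr3s:
        total.update(kmer_counts(reduce_sequence(seq) if reduced else seq, k))
    return total


# Toy sequences invented for this example (not real repertoire data).
toy_repertoire = ["CASSL", "CASSI"]
raw = repertoire_features(toy_repertoire, reduced=False)
red = repertoire_features(toy_repertoire, reduced=True)
```

On the toy input, the raw encoding produces four distinct kmers because "SSL" and "SSI" are counted separately, while the reduced encoding merges them into one feature and yields three. That merging is the mechanism by which non-identical kmers can share signal; whether it helps a classifier depends on the grouping and the data.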
Editorial analysis
Companies and research groups working on repertoire classification often face the same imbalance: many sequences per donor but relatively few donors with confident phenotype labels. As a general pattern, similarity-aware, reduced-alphabet encodings such as BLOSUM-derived groupings can raise the signal-to-noise ratio for tree-based learners on small-sample problems, because they pool biologically similar variants and reduce feature sparsity. For practitioners: XGBoost's stronger performance here is consistent with the common finding that gradient-boosted trees capture non-linear feature interactions that simple linear models miss, especially when engineered features encode domain structure.
Context and significance
Editorial analysis: This paper sits at the intersection of immunoinformatics and ML feature engineering rather than proposing a new neural architecture. For the biomarker and diagnostics community, modest gains from similarity-aware kmers matter because they can be implemented with lower compute and annotated-data requirements than large embedding models. For ML practitioners, the result reinforces that domain-informed feature transformations and ensemble learners remain competitive baselines in low-label regimes.
What to watch
For practitioners: replication on larger, independent cohorts and head-to-head comparisons with sequence-embedding approaches (for example, pretrained protein-language models) will clarify how broadly BLOSUM-based reductions transfer. Observers should also watch for open code or datasets accompanying the final publication, and for metrics beyond overall test accuracy (per-class sensitivity, calibration, and robustness to sampling variability) to assess clinical utility. Finally, methodological extensions that combine similarity-aware kmers with modern embeddings or multiple-instance learning could determine whether the reported gains scale with sample size.
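The metrics mentioned above can be computed with very little code. The following is a minimal sketch of per-class sensitivity and specificity plus the Brier score, which is one common summary related to calibration. The function names, the 0.5 decision threshold, and the toy labels and probabilities are assumptions made for illustration; none of these numbers come from the paper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy values invented for this example (not results from the paper).
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.6]

# Assumed decision threshold of 0.5; a real study would justify this choice.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

sens, spec = sensitivity_specificity(y_true, y_pred)
brier = brier_score(y_true, y_prob)
```

Reporting sensitivity and specificity separately matters when cases and controls are unevenly represented, since overall accuracy can look acceptable while one class is mostly misclassified. A lower Brier score indicates probabilities that sit closer to the observed labels.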
Scoring Rationale
This is a method-focused paper with incremental but practical improvements for TCR repertoire classification, relevant to ML practitioners in bioinformatics. It is not a paradigm shift, but it offers useful baselines for low-label problems. The paper dates from April 2026 and is therefore not brand-new, which reduces its immediacy.

