Inspired by DeepSeek Engram, Gengram Boosts Genomic Foundation Models by Up to 22.6%
The Genos team introduces Gengram, a 20‑million‑parameter plug‑in that stores 1‑6‑mer embeddings in a hash memory, uses local window aggregation and gated writing, and delivers up to 22.6% performance gains across multiple genomic tasks while accelerating training.
Background
Genomic foundation models (GFMs) decode DNA sequences to infer cellular functions and organism development. Existing Transformer-based GFMs depend on massive pre-training and dense computation, which limits efficiency and hampers the detection of motif-driven functional elements.
Gengram Design
Gengram is a lightweight conditional memory plug‑in (~20 M parameters) that stores all k‑mers of length 1–6 in a hash table with static keys (the k‑mer strings) and learnable embedding values. During a forward pass, a sliding window (W = 21 bp) retrieves every k‑mer appearing in the window, aggregates embeddings per k, concatenates across k, and passes the result through a gate‑controlled module. The gated output is written into the residual stream before the attention block, allowing selective injection of motif evidence.
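To make the retrieval step concrete, here is a minimal sketch of how all 1-6-mers in a 21-bp window can be looked up in the hash memory and combined. It assumes mean aggregation per k and illustrative dimensions; the released implementation may differ.

```python
# Minimal sketch (not the official implementation): enumerate all 1-6-mers
# inside a 21-bp window and look up their learnable embeddings from a hash table.
import itertools
import torch
import torch.nn as nn

BASES = "ATCG"
K_MAX, WINDOW, DIM = 6, 21, 64  # window size from the paper; embedding dim is illustrative

# Static keys: every possible k-mer for k = 1..6 (4 + 16 + ... + 4096 = 5460 entries)
kmer_to_idx = {}
for k in range(1, K_MAX + 1):
    for kmer in itertools.product(BASES, repeat=k):
        kmer_to_idx["".join(kmer)] = len(kmer_to_idx)

# Learnable values: one embedding row per k-mer
memory = nn.Embedding(len(kmer_to_idx), DIM)

def retrieve(window: str) -> torch.Tensor:
    """Aggregate per k (mean over occurrences, an assumption), then concatenate across k."""
    per_k = []
    for k in range(1, K_MAX + 1):
        idx = [kmer_to_idx[window[i:i + k]]
               for i in range(len(window) - k + 1)
               if set(window[i:i + k]) <= set(BASES)]   # skip k-mers containing N
        emb = memory(torch.tensor(idx)).mean(dim=0) if idx else torch.zeros(DIM)
        per_k.append(emb)
    return torch.cat(per_k)            # shape: (K_MAX * DIM,)

print(retrieve("ATCGATTTTTTAGCGTACGTA").shape)  # torch.Size([384])
```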
Local Window Aggregation
The window size was searched empirically; a 21‑bp window achieved the best validation performance. This length corresponds to two turns of the DNA helix (≈10.5 bp per turn), aligning bases that face the same side of the helix and facilitating phase‑consistent motif aggregation.
Training Data
The pre-training corpus consists of 145 high-quality haplotype-resolved assemblies covering human (HPRC v2, GRCh38, CHM13) and non-human primates (NCBI RefSeq). Sequences are one-hot encoded over the vocabulary {A, T, C, G, N, <EOD>}. Three token mixes support ablation and full pre-training:
50 B tokens at 8k context length (ablation)
200 B tokens at 8k context length (formal pre-train, 10 B tokens)
100 B tokens at 32k context length (formal pre-train, 10 B tokens)
Each mix maintains a 1:1 human‑to‑non‑human ratio.
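For reference, the three mixes can be written down as a small configuration table. The dataclass and field names below are purely illustrative, not taken from the released code.

```python
# Hypothetical sketch of the three token mixes described above.
from dataclasses import dataclass

@dataclass
class TokenMix:
    tokens: str            # total training tokens
    context_len: int       # context length in tokens (single-base tokenization)
    purpose: str
    human_ratio: float = 0.5   # 1:1 human-to-non-human primate split

MIXES = [
    TokenMix("50B",  8192,  "ablation"),
    TokenMix("200B", 8192,  "formal pre-train"),
    TokenMix("100B", 32768, "formal pre-train"),
]
```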
Architecture Details
Hash Memory Construction: For each k = 1…6, a hash table stores static keys (k-mer strings) and learnable embedding values.
Retrieval: All k-mers within the current window are mapped to their table entries.
Aggregation: Embeddings are first aggregated per k, then concatenated across k.
Gating: A gate-controlled module decides whether to write the aggregated motif evidence into the residual stream, enabling selective activation in functional regions.
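The gated write can be sketched as a small PyTorch module. The sigmoid gate, the concatenation-based gate input, and the additive update below are assumptions for illustration, not the paper's exact parameterization; the motif dimension of 384 matches the 6 × 64 concatenated evidence from the earlier sketch.

```python
# Hedged sketch of a gated write into the residual stream before attention.
import torch
import torch.nn as nn

class GatedMotifWrite(nn.Module):
    def __init__(self, motif_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(motif_dim, hidden_dim)              # project concatenated k-mer evidence
        self.gate = nn.Linear(motif_dim + hidden_dim, hidden_dim) # gate conditioned on evidence + state

    def forward(self, hidden: torch.Tensor, motif: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim) residual stream entering the attention block
        # motif:  (batch, seq, motif_dim) aggregated k-mer evidence for each position's window
        g = torch.sigmoid(self.gate(torch.cat([motif, hidden], dim=-1)))
        return hidden + g * self.proj(motif)                      # selective, position-wise write

writer = GatedMotifWrite(motif_dim=384, hidden_dim=768)
h = torch.randn(2, 128, 768)
m = torch.randn(2, 128, 384)
print(writer(h, m).shape)   # torch.Size([2, 128, 768])
```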
Evaluation
Eighteen representative datasets spanning five task categories were used: genomic structure understanding, gene regulation prediction, epigenetic profiling, variant-effect & clinical impact, and evolutionary analysis. Experiments were run at both 8k and 32k context lengths.
Splice-site prediction AUC improved from 0.776 to 0.901, a 16.1% relative gain.
H3K36me3 epigenetic prediction AUC improved from 0.656 to 0.804, a 22.6% relative gain.
Compared with large DNA language models such as Evo2, NTv3, and GENERATOR‑3B, Gengram‑enhanced models achieve comparable or superior results while using far fewer training tokens and parameters, demonstrating strong data‑efficiency.
Training Acceleration Analysis
KL‑divergence diagnostics (LogitLens‑KL) were applied to quantify layer‑wise prediction‑readiness. After integrating Gengram, shallow layers exhibit faster KL decay, indicating earlier stabilization of useful supervision signals and smoother optimization trajectories, which translates into faster convergence.
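A LogitLens-style KL diagnostic of this kind can be approximated by projecting each layer's hidden states through the final norm and unembedding and comparing against the final-layer distribution. The sketch below is a generic implementation of that idea; function and argument names are assumptions, not the paper's code.

```python
# Illustrative LogitLens-KL: how far is each layer's "early prediction"
# from the model's final predictive distribution?
import torch
import torch.nn.functional as F

def logitlens_kl(hidden_per_layer, final_norm, unembed):
    """hidden_per_layer: list of (batch, seq, d) tensors, one per layer."""
    with torch.no_grad():
        final_logp = F.log_softmax(unembed(final_norm(hidden_per_layer[-1])), dim=-1)
        kls = []
        for h in hidden_per_layer:
            layer_logp = F.log_softmax(unembed(final_norm(h)), dim=-1)
            # KL(final || layer), averaged per batch element
            kl = F.kl_div(layer_logp, final_logp, log_target=True, reduction="batchmean")
            kls.append(kl.item())
    return kls  # faster decay across shallow layers suggests earlier prediction-readiness

# Toy usage with random hidden states and a dummy norm/unembedding
layers = [torch.randn(2, 16, 32) for _ in range(4)]
print(logitlens_kl(layers, torch.nn.LayerNorm(32), torch.nn.Linear(32, 6)))
```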
Motif Memory Mechanism
Visualization of residual‑write intensities shows sparse, high‑contrast peaks aligned with functional regions such as TATA‑box promoters, low‑complexity poly‑T tracts, and exon boundaries. This pattern suggests that Gengram selectively captures decisive local evidence rather than injecting uniform information.
The mechanism can be described as “on‑demand retrieval → selective write → structured alignment,” allowing the model to rely less on implicit memory from massive data and more on explicit, interpretable motif storage.
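One simple way to reproduce such a visualization, assuming a gating module shaped like the GatedMotifWrite sketch above, is to record the L2 norm of the gated write vector at each position; variable names below are illustrative.

```python
# Hedged sketch: quantify per-position "residual-write intensity" as the norm
# of the gated write vector that gets added to the residual stream.
import torch

def write_intensity(writer, hidden: torch.Tensor, motif: torch.Tensor) -> torch.Tensor:
    g = torch.sigmoid(writer.gate(torch.cat([motif, hidden], dim=-1)))
    delta = g * writer.proj(motif)   # what gets added to the residual stream
    return delta.norm(dim=-1)        # (batch, seq): peaks should align with motif-rich regions
```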
Conclusion
Gengram demonstrates that a conditional k‑mer memory plug‑in can substantially improve genomic foundation models, delivering higher performance, faster training, and better interpretability without altering the base architecture. The approach highlights a shift from ever‑larger models toward smarter, modular designs for genomics.
Paper: https://arxiv.org/abs/2601.22203
Code: https://github.com/BGI-HangzhouAI/Gengram
Model weights: https://huggingface.co/BGI-HangzhouAI/Gengram