Inspired by DeepSeek Engram, Gengram Boosts Genomic Foundation Models by Up to 22.6%

The Genos team introduces Gengram, a 20‑million‑parameter plug‑in that stores 1‑6‑mer embeddings in a hash memory, uses local window aggregation and gated writing, and delivers up to 22.6% performance gains across multiple genomic tasks while accelerating training.


Background

Genomic foundation models (GFMs) decode DNA sequences to infer cellular functions and organism development. Existing Transformer‑based GFMs depend on massive pre‑training and dense computation, which limits efficiency and hampers motif‑driven functional element detection.

Gengram Design

Gengram is a lightweight conditional memory plug‑in (~20 M parameters) that stores all k‑mers of length 1–6 in a hash table with static keys (the k‑mer strings) and learnable embedding values. During a forward pass, a sliding window (W = 21 bp) retrieves every k‑mer appearing in the window, aggregates embeddings per k, concatenates across k, and passes the result through a gate‑controlled module. The gated output is written into the residual stream before the attention block, allowing selective injection of motif evidence.
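To make the retrieval step concrete, here is a minimal Python sketch of window‑local k‑mer enumeration and table lookup; the base‑4 indexing scheme, the function names, and the skipping of N‑containing k‑mers are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of Gengram-style k-mer retrieval; the indexing scheme is an assumption.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer: str) -> int:
    """Base-4 encoding gives every k-mer over {A,C,G,T} a unique slot,
    i.e. a perfect hash into a table with 4**k entries."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + BASES[base]
    return idx

def window_kmers(seq: str, center: int, w: int = 21, k_max: int = 6):
    """Yield (k, table_index) for every 1..k_max-mer inside the w-bp
    window centered on `center`; k-mers containing N are skipped here
    as a simplifying assumption."""
    lo = max(0, center - w // 2)
    hi = min(len(seq), center + w // 2 + 1)
    window = seq[lo:hi]
    for k in range(1, k_max + 1):
        for i in range(len(window) - k + 1):
            kmer = window[i : i + k]
            if all(base in BASES for base in kmer):
                yield k, kmer_index(kmer)
```

With W = 21 a full window yields 21 + 20 + … + 16 = 111 lookups per position, which is why static keys matter: retrieval is pure indexing, with all learning confined to the embedding values.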

Local Window Aggregation

The window size was searched empirically; a 21‑bp window achieved the best validation performance. This length corresponds to two turns of the DNA helix (≈10.5 bp per turn), aligning bases that face the same side of the helix and facilitating phase‑consistent motif aggregation.

Training Data

The pre‑training corpus consists of 145 high‑quality haplotype‑resolved assemblies covering human (HPRC v2, GRCh38, CHM13) and non‑human primates (NCBI RefSeq). Sequences are one‑hot encoded over the six‑symbol vocabulary {A, T, C, G, N, <EOD>} (a minimal tokenizer sketch follows the list below). Three token mixes support ablation and full pre‑training:

50 B tokens at 8 k context length (ablation)

200 B tokens at 8 k context length (formal pre‑train, 10 B tokens)

100 B tokens at 32 k context length (formal pre‑train, 10 B tokens)

Each mix maintains a 1:1 human‑to‑non‑human ratio.
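As an illustration of the encoding, a minimal single‑nucleotide tokenizer over the stated vocabulary might look as follows; the id assignment and the <EOD> placement are assumptions made for the example.

```python
# Illustrative single-nucleotide tokenizer; id order is an assumption.
VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3, "N": 4, "<EOD>": 5}

def encode(sequence: str) -> list[int]:
    """One token id per base; unrecognized characters fall back to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]

def encode_corpus(documents: list[str]) -> list[int]:
    """Concatenate assemblies into one token stream, marking each
    document boundary with <EOD>."""
    ids: list[int] = []
    for doc in documents:
        ids.extend(encode(doc))
        ids.append(VOCAB["<EOD>"])
    return ids
```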

Architecture Details

Hash Memory Construction: For each k = 1…6, a hash table stores static keys (the k‑mer strings) and learnable embedding values.

Retrieval: All k‑mers within the current window are mapped to their table entries.

Aggregation: Embeddings are first aggregated per k, then concatenated across k.

Gating: A gate‑controlled module decides whether to write the aggregated motif evidence into the residual stream, enabling selective activation in functional regions; a sketch combining all four steps follows this list.
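Putting the four steps together, a minimal PyTorch sketch of the module might look like this; the hidden dimensions, mean aggregation, linear fusion layer, and sigmoid gate are assumptions where the article leaves details unspecified.

```python
import torch
import torch.nn as nn

class GengramMemory(nn.Module):
    """Hedged sketch of the conditional k-mer memory: per-k embedding
    tables (static keys, learnable values), window aggregation, and a
    gated write into the residual stream before the attention block."""

    def __init__(self, d_model: int, k_max: int = 6):
        super().__init__()
        # One table per k; 4**k rows cover all k-mers over {A,C,G,T}.
        self.tables = nn.ModuleList(
            [nn.Embedding(4**k, d_model) for k in range(1, k_max + 1)]
        )
        self.fuse = nn.Linear(k_max * d_model, d_model)  # fuse concatenated per-k summaries
        self.gate = nn.Linear(d_model, 1)                # scalar write gate per position

    def forward(self, kmer_ids: list, hidden: torch.Tensor) -> torch.Tensor:
        # kmer_ids[k-1]: (batch, seq, n_k) indices of the k-mers in each
        # position's 21-bp window, precomputed as in the retrieval sketch.
        per_k = []
        for k, table in enumerate(self.tables, start=1):
            emb = table(kmer_ids[k - 1])      # (batch, seq, n_k, d_model)
            per_k.append(emb.mean(dim=2))     # aggregate within the window, per k
        motif = self.fuse(torch.cat(per_k, dim=-1))   # concatenate across k, then fuse
        g = torch.sigmoid(self.gate(hidden))          # decide whether/how much to write
        return hidden + g * motif                     # gated write into the residual stream
```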

Evaluation

Eighteen representative datasets spanning five task categories were used: genomic structure understanding, gene regulation prediction, epigenetic profiling, variant‑effect & clinical impact, and evolutionary analysis. Experiments were run with both 8 k and 32 k context lengths.

Splice‑site prediction AUC improved from 0.776 to 0.901, a relative gain of 16.1%.

H3K36me3 epigenetic prediction AUC improved from 0.656 to 0.804, a relative gain of 22.6% (both gains are verified in the quick check below).
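A quick check confirms that both headline numbers are relative, not absolute, AUC gains:

```python
# Sanity check: the reported gains are relative AUC improvements.
def rel_gain(before: float, after: float) -> float:
    return (after - before) / before * 100

print(f"splice-site: {rel_gain(0.776, 0.901):.1f}%")  # 16.1%
print(f"H3K36me3:    {rel_gain(0.656, 0.804):.1f}%")  # 22.6%
```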

Compared with large DNA language models such as Evo2, NTv3, and GENERATOR‑3B, Gengram‑enhanced models achieve comparable or superior results while using far fewer training tokens and parameters, demonstrating strong data‑efficiency.

Training Acceleration Analysis

KL‑divergence diagnostics (LogitLens‑KL) were applied to quantify layer‑wise prediction‑readiness. After integrating Gengram, shallow layers exhibit faster KL decay, indicating earlier stabilization of useful supervision signals and smoother optimization trajectories, which translates into faster convergence.
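The article does not give the exact formulation, but a logit‑lens KL sweep is typically implemented along these lines; taking the final layer's distribution as the reference is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def logitlens_kl(hidden_per_layer: list, unembed: torch.Tensor) -> list:
    """Project each layer's hidden states through the final unembedding
    matrix (the "logit lens") and compute KL(final || layer); faster
    decay in shallow layers signals earlier prediction-readiness."""
    log_p = F.log_softmax(hidden_per_layer[-1] @ unembed, dim=-1)
    kls = []
    for h in hidden_per_layer:
        log_q = F.log_softmax(h @ unembed, dim=-1)
        # per-token KL(final || layer), averaged over batch and positions
        kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
        kls.append(kl.item())
    return kls
```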

Motif Memory Mechanism

Visualization of residual‑write intensities shows sparse, high‑contrast peaks aligned with functional regions such as TATA‑box promoters, low‑complexity poly‑T tracts, and exon boundaries. This pattern suggests that Gengram selectively captures decisive local evidence rather than injecting uniform information.

The mechanism can be described as “on‑demand retrieval → selective write → structured alignment,” allowing the model to rely less on implicit memory from massive data and more on explicit, interpretable motif storage.
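For readers reproducing the figure, here is a sketch of the residual‑write visualization, assuming per‑position gate values exported from a module like the GengramMemory sketch above; the annotation coordinates are placeholders.

```python
import matplotlib.pyplot as plt

def plot_write_intensity(gate_values, annotations=None):
    """Plot per-position write-gate intensity along a sequence; under the
    paper's findings, sparse high peaks should align with motifs such as
    TATA boxes, poly-T tracts, and exon boundaries."""
    plt.figure(figsize=(10, 2))
    plt.plot(gate_values, lw=0.8)
    for label, (start, end) in (annotations or {}).items():
        plt.axvspan(start, end, alpha=0.2)            # shade the annotated motif
        plt.text(start, max(gate_values), label, fontsize=8)
    plt.xlabel("position (bp)")
    plt.ylabel("write gate")
    plt.tight_layout()
    plt.show()

# Example with placeholder coordinates:
# plot_write_intensity(gates, {"TATA box": (1040, 1048)})
```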

Conclusion

Gengram demonstrates that a conditional k‑mer memory plug‑in can substantially improve genomic foundation models, delivering higher performance, faster training, and better interpretability without altering the base architecture. The approach highlights a shift from ever‑larger models toward smarter, modular designs for genomics.

Paper: https://arxiv.org/abs/2601.22203

Code: https://github.com/BGI-HangzhouAI/Gengram

Model weights: https://huggingface.co/BGI-HangzhouAI/Gengram

Tags: Transformer, performance improvement, AI genomics, Gengram, Genomic Engram, k-mer memory, motif retrieval
Written by HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.