MIT's Pichia-CLM model learns yeast DNA language, boosting protein yield up to 3‑fold
An MIT research team introduced Pichia-CLM, a GRU‑based language model trained on a ~27,000-pair Pichia pastoris dataset to optimize codon usage, and demonstrated across six proteins that it consistently outperforms four commercial codon‑optimization tools, delivering up to a three‑fold increase in heterologous protein secretion.
An MIT research team proposed Pichia-CLM, a deep‑learning language model that treats yeast DNA as a language to improve codon optimization for the industrial host Komagataella phaffii (Pichia pastoris). The model was trained on approximately 27,000 amino‑acid/coding‑sequence pairs collected from two Pichia strains (CBS7435 and GS115) and additional genome annotations.
Dataset construction: The sequences were tokenized with <START>, <END>, and <PAD> markers and split into training (80%) and test (20%) sets. No explicit optimization objective was injected, allowing the model to learn the host's natural expression preferences.
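A minimal sketch of this preparation step, assuming one plausible data layout (the marker names come from the article; the pairing format, maximum length, and function names are illustrative, not the paper's code):

```python
import random

START, END, PAD = "<START>", "<END>", "<PAD>"

def tokenize_pair(aa_seq, cds, max_codons=500):
    """Tokenize one amino-acid/coding-sequence pair (illustrative layout)."""
    aa_tokens = [START] + list(aa_seq) + [END]
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    codon_tokens = [START] + codons + [END]
    # Pad the codon side to a fixed length for batching.
    codon_tokens += [PAD] * (max_codons + 2 - len(codon_tokens))
    return aa_tokens, codon_tokens

def split_dataset(pairs, test_frac=0.2, seed=0):
    """Shuffle and split into training (80%) and test (20%) sets."""
    pairs = pairs[:]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_frac))
    return pairs[:cut], pairs[cut:]

aa, codons = tokenize_pair("MKT", "ATGAAAACT", max_codons=5)
train, test = split_dataset([("MKT", "ATGAAAACT")] * 10)
```

Note that nothing beyond the sequence pair is fed to the model, matching the article's point that no explicit optimization objective is injected.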
Model architecture: Pichia-CLM uses a GRU‑based encoder–decoder network. GRU was chosen over the Transformer because, on the ~27,000-sequence dataset, Transformers added unnecessary complexity while the GRU achieved comparable performance at lower computational cost.
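For readers unfamiliar with the GRU recurrence at the heart of such an encoder, here is a single GRU step in plain NumPy (dimensions and weights are illustrative; the paper's actual layer sizes were found by hyper‑parameter search, and a real implementation would use a framework layer such as `torch.nn.GRU`):

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate hidden state
    return (1 - z) * h + z * h_tilde           # interpolated new state

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                              # toy dimensions
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0 else
          rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for _ in range(5):                             # encode a 5-token toy sequence
    h = gru_cell(rng.standard_normal(d_in), h, *params)
```

The gated interpolation is what lets the encoder carry context along a coding sequence with far fewer parameters than attention over all positions, which is the cost argument the authors make against a Transformer at this dataset size.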
Training process: Early stopping was applied using a validation set (20% of the training data). Hyper‑parameters such as the amino‑acid embedding dimension, codon embedding dimension, encoder unit count, and decoder layer sizes were tuned via Bayesian optimization.
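The early-stopping logic can be sketched as follows; the patience value and function names are assumptions for illustration, not details from the paper:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss): stop once validation loss has not
    improved for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, best

# Validation loss improves through epoch 2, then plateaus.
epoch, loss = early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.9])
# → (2, 0.7)
```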
Experimental validation: Six proteins of varying complexity (hGH, hGCSF, VHH nanobody 3B2, an engineered SARS‑CoV‑2 RBD variant, HSA, and Trastuzumab) were expressed in Pichia pastoris using constructs designed by Pichia‑CLM and by four commercial codon‑optimization tools (Azenta, IDT, GenScript, Thermo Fisher). Two metrics were used: BestTiter (the number of proteins for which a tool's construct achieved the highest titer) and Aggregated Score (the sum of a tool's normalized titers across all proteins). Pichia‑CLM achieved the highest titer for five of the six proteins and outperformed all commercial tools on both metrics, with up to a three‑fold increase for HSA and a ~25% improvement for hGH and hGCSF.
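The two metrics can be computed as in the sketch below. The titer numbers are made up, and per‑protein normalization by the best titer is an assumption (the article says "normalized titers" without specifying the normalizer):

```python
def score_tools(titers):
    """titers: {protein: {tool: titer}} -> (best_titer_counts, agg_scores).

    BestTiter counts how often each tool wins a protein outright;
    Aggregated Score sums each tool's titers normalized per protein.
    """
    best_counts, agg = {}, {}
    for protein, by_tool in titers.items():
        top = max(by_tool.values())
        winner = max(by_tool, key=by_tool.get)
        best_counts[winner] = best_counts.get(winner, 0) + 1
        for tool, t in by_tool.items():
            agg[tool] = agg.get(tool, 0.0) + t / top
    return best_counts, agg

fake_titers = {  # illustrative values only, not the paper's data
    "hGH": {"Pichia-CLM": 125, "ToolA": 100},
    "HSA": {"Pichia-CLM": 300, "ToolA": 100},
}
best, agg = score_tools(fake_titers)
# best → {"Pichia-CLM": 2}
```

A summed normalized score rewards a tool for being near the best on every protein, not just for winning outright, which is why the article reports both metrics.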
Sequence‑feature analysis: The study examined correlations between traditional codon‑usage‑bias (CUB) metrics and protein yield, finding only weak and inconsistent relationships (e.g., a maximum positive correlation of 0.43 for CFD on HSA). Pichia‑CLM‑designed constructs contained no predicted negative cis‑regulatory elements, whereas commercial designs frequently did (e.g., GenScript had one such element in 3 of 6 proteins; Azenta and IDT had 3–4 elements in at least one protein). In a broader benchmark of 52 biotechnology‑related proteins, 75% of Pichia‑CLM sequences lacked negative elements, compared with higher incidences among the commercial designs.
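The kind of correlation check described above reduces to a Pearson correlation between a CUB metric and measured yield; a minimal implementation (with made-up values, and no claim that the paper used exactly this code) is:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear toy data gives r = 1.0; the study's real
# metric-vs-yield correlations topped out around 0.43.
r = pearson([0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0])
```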
The authors conclude that Pichia‑CLM not only generates high‑yield constructs but also captures meaningful genetic sequence features, offering a robust, data‑driven alternative to traditional codon‑optimization methods.
HyperAI Super Neural