How ChartMoE Uses Sparse MoE to Master Chart Understanding and Preserve General Knowledge
ChartMoE, an oral paper at ICLR 2025, introduces a multi‑stage alignment training pipeline and a diversified MoE Connector that dramatically improves chart comprehension while maintaining performance on general multimodal tasks, backed by extensive data construction, training recipes, and thorough evaluations.
ChartMoE is a collaboration between IDEA, Tsinghua University, Peking University, and HKUST (Guangzhou); it was accepted as an oral paper at ICLR 2025, a distinction given to only about 1.8% of submissions.
Research Motivation and Contributions
The goal is not to increase model capacity but to explore how a sparse Mixture-of-Experts (MoE) structure, whose experts are aligned through different tasks, can improve chart understanding on downstream tasks while preserving performance on generic multimodal tasks.
Instead of random initialization or co-upcycling (copying the same pretrained connector into every expert), ChartMoE initializes experts from diverse alignment tasks, increasing the heterogeneity and interpretability of the learned visual representations.
Model Overview
ChartMoE builds on the InternLM-XComposer2 foundation and adds a MoE Connector composed of multiple experts. It supports advanced chart capabilities such as understanding, redrawing, editing, highlighting important elements, and chart-type conversion. Through a multi-stage training process, the model aligns each chart to three structured formats (table, JSON, and Python code), forming a (Chart, Table, JSON, Code) quadruple that yields richer visual representations.
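To make the architecture concrete, here is a minimal sketch of a sparse MoE connector that routes each visual token to a few expert MLPs before projecting it into the LLM's embedding space. The expert count, top-k value, and hidden dimensions below are illustrative assumptions, not ChartMoE's exact configuration.

import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    """Illustrative sparse MoE connector: routes each visual token to top-k expert MLPs.
    Dimensions, expert count, and top_k are assumptions for illustration."""
    def __init__(self, vit_dim=1024, llm_dim=4096, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(vit_dim, num_experts)  # learnable router over experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])

    def forward(self, visual_tokens):                      # (batch, seq, vit_dim)
        logits = self.router(visual_tokens)                # (batch, seq, num_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over selected experts
        llm_dim = self.experts[0][-1].out_features
        out = torch.zeros(*visual_tokens.shape[:2], llm_dim, device=visual_tokens.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(visual_tokens[mask])
        return out                                         # tokens projected into the LLM embedding space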
Data Construction
Starting from open chart datasets (ChartQA, PlotQA, ChartY), the authors defined JSON keys for each chart type and populated them via random generation or GPT‑based methods. These JSON entries were then inserted into predefined Python code templates to generate corresponding charts, forming a (Chart, Table, JSON, Code) quadruple. This process produced roughly 900 k aligned samples, referred to as ChartMoE‑Align.
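A rough sketch of how one such quadruple could be assembled from a JSON spec and a predefined code template follows; the key names, the single bar-chart template, and the table layout are assumptions for illustration, not the actual ChartMoE-Align schema.

import json

# Hypothetical JSON spec for one chart; real ChartMoE-Align keys may differ.
spec = {
    "title": "Quarterly Profit",
    "x_label": "Quarter",
    "y_label": "Profit (M$)",
    "x": ["Q1", "Q2", "Q3", "Q4"],
    "y": [5, 7, 9, 11],
}

# Predefined Python code template: JSON values are substituted in to render the chart.
code_template = """\
import matplotlib.pyplot as plt
plt.bar({x!r}, {y!r})
plt.title({title!r})
plt.xlabel({x_label!r})
plt.ylabel({y_label!r})
plt.savefig("chart.png")
"""

code = code_template.format(**spec)
exec(code)  # renders chart.png (requires matplotlib)

# Flatten the same spec into a simple table string to complete the quadruple.
table = "| " + spec["x_label"] + " | " + spec["y_label"] + " |\n" + \
        "\n".join(f"| {x} | {y} |" for x, y in zip(spec["x"], spec["y"]))

quadruple = {"chart": "chart.png", "table": table, "json": json.dumps(spec), "code": code}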
Training Recipes
Multi-stage alignment training on ChartMoE-Align (≈500 k Table, 200 k JSON, 100 k Code), in which only the MLP Connector is trained on each alignment task before the full MoE Connector is assembled (a sketch of this assembly appears after the recipe list).
Broad knowledge learning using the MMC‑Instruct dataset, which contains many chart‑related tasks (e.g., chart summarization). This stage trains the MoE Connector—especially the learnable router—and applies LoRA to the LLM.
Chart‑domain supervised fine‑tuning (SFT) with ChartQA and ChartGemma.
Program‑of‑Thought (PoT): the model outputs Python code to solve problems, improving accuracy on calculation‑heavy queries. Example:
profits = [5, 7, 9, 1, 11, -3]
print(max(profits) - min(profits))
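The first stage above can be pictured as follows: separate MLP connectors are aligned on table, JSON, and code data, then copied in as expert initializations when the MoE Connector is assembled, leaving the router to be learned in the broad-knowledge stage. This sketch reuses the MoEConnector class from the Model Overview section; the function name and the reuse of the base model's original connector as a general-purpose expert are illustrative assumptions.

def build_moe_connector(base_mlp, table_mlp, json_mlp, code_mlp, vit_dim=1024, llm_dim=4096):
    # Assemble the MoE connector by copying separately aligned MLP connectors
    # into the expert slots; base_mlp is the original general-purpose connector.
    # Each source connector is assumed to share the expert architecture.
    moe = MoEConnector(vit_dim=vit_dim, llm_dim=llm_dim, num_experts=4, top_k=2)
    for expert, source in zip(moe.experts, [base_mlp, table_mlp, json_mlp, code_mlp]):
        expert.load_state_dict(source.state_dict())
    return moe  # the router starts from scratch and is trained in the next stage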
Expert Analysis
Visualization of expert selection shows that background tokens predominantly choose generic experts, data points and graphical edges favor the code expert, while textual elements such as titles, axis labels, and legends tend to select table/JSON experts. This distribution aligns with the intended specialization of each expert type.
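As a sketch of how such a visualization could be produced (assuming the MoEConnector from the earlier sketch and a 24x24 patch grid, both illustrative choices), one can take the router's top-1 expert index per visual token and reshape it onto the image patch grid:

import torch

@torch.no_grad()
def expert_assignment_map(moe, visual_tokens, grid=(24, 24)):
    # Top-1 expert index per visual token, reshaped onto the patch grid for plotting.
    # Assumes visual_tokens has shape (1, seq, vit_dim) with seq == grid[0] * grid[1].
    logits = moe.router(visual_tokens)        # (1, seq, num_experts)
    top1 = logits.argmax(dim=-1).squeeze(0)   # (seq,)
    return top1.reshape(grid).cpu().numpy()

# Example usage: color each patch by its assigned expert.
# import matplotlib.pyplot as plt
# assignment = expert_assignment_map(moe, visual_tokens)
# plt.imshow(assignment, cmap="tab10"); plt.colorbar(); plt.show()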
Performance Evaluation
In the general domain, ChartMoE was evaluated on the MME and MMBench benchmarks. Compared with the baseline InternLM-XComposer2 and a model fine-tuned directly on chart data, ChartMoE exhibits minimal forgetting on generic tasks and even improves in several sub-domains.
In the chart domain, benchmarks such as ChartQA, ChartBench, ChartFC, and ChartCheck show that ChartMoE consistently outperforms the baseline, with especially large improvements over the directly SFT‑trained model.
Conclusions
From a representation standpoint, heterogeneous MoE experts provide more diverse and comprehensive visual features, leading to superior downstream performance.
From a knowledge perspective, the sparse MoE acts as an implicit regularizer, significantly boosting task performance while mitigating forgetting on generic tasks.
ChartMoE is positioned as a pioneering work that encourages further research into sparse structures for multimodal large language models on downstream tasks.