Reduce Memory by 75% Using D‑CHAG’s Cross‑Channel Hierarchical Aggregation
Researchers at Oak Ridge National Laboratory introduced D‑CHAG, a distributed cross‑channel hierarchical aggregation method for training massive multi‑channel foundation models. On up to 1,024 AMD GPUs, it cuts memory consumption by up to 75% and more than doubles throughput, as demonstrated on hyperspectral imaging and weather‑forecasting workloads.
Problem
Vision‑based scientific foundation models require tokenizing high‑dimensional, multi‑channel data. Tokenization and cross‑channel aggregation incur large compute and memory costs, and existing parallelism strategies (tensor, sequence, or data parallelism) do not fully mitigate these bottlenecks.
Distributed Cross‑Channel Hierarchical Aggregation (D‑CHAG)
D‑CHAG combines two orthogonal techniques:
Distributed tokenization: each tensor‑parallel (TP) rank tokenizes only a subset of the input channels. After tokenization, a single AllGather collects the partial token sets so that cross‑attention can be applied across channels.
Hierarchical cross‑channel aggregation: tokens are aggregated in a multi‑level tree. Each TP rank runs a partial‑channel aggregation module on its local tokens, and a final cross‑attention step merges the results. During back‑propagation, only the gradients for the locally held channels are gathered, eliminating extra communication.
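The two techniques above can be sketched in a few lines of plain Python, with TP ranks simulated as loop iterations and the AllGather as a list concatenation. Every name here is illustrative rather than taken from the D‑CHAG codebase, and a simple mean stands in for the learned aggregation layers:

```python
# Single-process sketch of D-CHAG's channel sharding and tree aggregation.
# Ranks are simulated as loop iterations; "all-gather" is a concatenation.

def shard_channels(channels, world_size):
    """Split the channel list into one contiguous shard per TP rank."""
    per_rank = (len(channels) + world_size - 1) // world_size
    return [channels[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

def tokenize(channel):
    """Stand-in tokenizer: one scalar token per channel."""
    return float(channel)

def local_aggregate(tokens, depth):
    """Hierarchical (tree) aggregation: pairwise-merge tokens `depth` times.
    A mean stands in for the learned linear/cross-attention merge."""
    for _ in range(depth):
        if len(tokens) <= 1:
            break
        tokens = [sum(pair) / len(pair)
                  for pair in (tokens[i:i + 2] for i in range(0, len(tokens), 2))]
    return tokens

def dchag_forward(channels, world_size, depth):
    shards = shard_channels(channels, world_size)
    # Each rank tokenizes and partially aggregates only its own channels...
    partial = [local_aggregate([tokenize(c) for c in shard], depth)
               for shard in shards]
    # ...then a single all-gather collects the reduced partial token sets.
    gathered = [t for rank_tokens in partial for t in rank_tokens]
    # Final merge (stand-in for the last cross-attention step).
    return sum(gathered) / len(gathered)

print(dchag_forward(list(range(8)), world_size=4, depth=1))  # 3.5
```

Because each rank reduces its local tokens before the gather, the data volume crossing ranks shrinks with hierarchy depth, which is where the memory and communication savings come from.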
Datasets for Evaluation
Hyperspectral plant images: 494 poplar images from ORNL’s Advanced Plant Phenotyping Lab, each with 500 spectral channels spanning 400–900 nm.
ERA5 weather reanalysis: 80 input channels (5 atmospheric variables across more than 10 pressure levels, plus 3 surface variables). The original 0.25° resolution (721 × 1440) is re‑gridded to 5.625° (32 × 64) using the xESMF package with bilinear interpolation.
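For illustration, a minimal pure‑Python bilinear resampler, a toy stand‑in for the xESMF regridder (which additionally handles spherical coordinates and conservative weighting), looks like this:

```python
# Toy bilinear regridding on a plain 2D array, standing in for the
# xESMF bilinear regridder used to coarsen ERA5 fields.

def bilinear_regrid(src, out_h, out_w):
    """Resample a 2D field (list of lists) to (out_h, out_w) with
    bilinear interpolation over the source grid points."""
    in_h, in_w = len(src), len(src[0])
    out = []
    for i in range(out_h):
        # Map the output index to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = min(int(y), in_h - 2) if in_h > 1 else 0
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = min(int(x), in_w - 2) if in_w > 1 else 0
            fx = x - x0
            # Weighted mix of the four surrounding source values.
            v = (src[y0][x0] * (1 - fy) * (1 - fx)
                 + src[y0][x0 + 1] * (1 - fy) * fx
                 + src[y0 + 1][x0] * fy * (1 - fx)
                 + src[y0 + 1][x0 + 1] * fy * fx)
            row.append(v)
        out.append(row)
    return out

# A linear ramp is reproduced exactly at the corners after downsampling.
field = [[r + c for c in range(8)] for r in range(8)]
coarse = bilinear_regrid(field, 4, 4)
print(coarse[0][0], coarse[3][3])  # 0.0 14.0
```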
Implementation Variants
D‑CHAG‑L (linear layer): hierarchical aggregation uses linear layers, giving low memory overhead and suitability for very high channel counts.
D‑CHAG‑C (cross‑attention layer): replaces the linear layers with cross‑attention, incurring higher compute cost but delivering larger gains for ultra‑large models or extremely high channel counts.
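The contrast between the two variants can be shown with scalar tokens; real D‑CHAG layers operate on embedding vectors with learned weights, so everything below is a toy illustration:

```python
import math

# Toy contrast between the two aggregation modules. Tokens are scalars
# and the linear weights are fixed for readability.

def linear_merge(tokens, weights):
    """D-CHAG-L style: a fixed linear combination of channel tokens."""
    return sum(w * t for w, t in zip(weights, tokens))

def cross_attention_merge(query, tokens):
    """D-CHAG-C style: merge weights depend on the inputs via a softmax
    over query-token similarity (scalar dot products here)."""
    scores = [query * t for t in tokens]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * t for e, t in zip(exps, tokens))

tokens = [1.0, 2.0, 3.0]
print(linear_merge(tokens, [0.5, 0.25, 0.25]))   # 1.75
print(cross_attention_merge(1.0, tokens))
```

The linear merge is a fixed weighted sum per token, while the cross‑attention merge pays for a softmax over all tokens but adapts its weighting to the inputs, mirroring the article's note that D‑CHAG‑C costs more compute yet helps at very large model or channel scales.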
Performance Results on Frontier (AMD GPUs)
Peak memory usage reduced by up to 75% compared with pure TP.
Sustained throughput increased by more than 2× when scaling to 1,024 GPUs.
For 512‑channel data, a single cross‑attention layer is slightly slower than the baseline; for 1,024‑channel data it improves throughput by ~60%.
Increasing hierarchy depth (Tree0 → Tree2) yields noticeable gains for 512 channels, while performance for 1,024 channels remains stable.
The linear‑layer variant (D‑CHAG‑L) achieves the best overall speed‑up; the optimal configuration is D‑CHAG‑L‑Tree0 (one aggregation layer).
The cross‑attention variant (D‑CHAG‑C) shows modest gains on two GPUs and ~60% improvement on eight GPUs.
Scaling with Model Size
7 B‑parameter models: linear aggregation gives a 30–70% speed‑up; cross‑attention gives 10–60%.
15 B‑parameter models: 20–50% speed‑up.
26 B‑parameter models: 10–30% speed‑up.
With D‑CHAG, a 26 B model using 512 channels fits in less than 80% of available memory, whereas pure TP cannot train a 26 B model even with 256 channels.
Application‑Level Evaluation
Hyperspectral mask prediction: training‑loss curves for a single‑GPU baseline and D‑CHAG (run on two GPUs) overlap, indicating comparable convergence.
ERA5 weather forecasting: D‑CHAG matches the baseline loss and shows negligible differences in RMSE, MSE, and weighted anomaly correlation coefficient (wACC) for 7‑, 14‑, and 30‑day forecasts.
Relation to Vision Transformers
Vision Transformers (ViT) treat images as sequences of patch tokens; for multi‑channel scientific data, the token count grows with the number of channels, which is exactly the cost D‑CHAG reduces. D‑CHAG therefore extends the scalability of ViT‑based foundation models to scientific workloads with many input channels.
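A quick back‑of‑envelope calculation shows why the channel count dominates: with per‑channel patch tokens, the ViT sequence length scales linearly in the number of channels. Patch size 16 and the 500‑channel hyperspectral case are used purely for illustration:

```python
# Token count for a channel-wise ViT: each channel contributes its own
# grid of patch tokens, so sequence length grows linearly in channels.

def vit_token_count(height, width, channels, patch):
    tokens_per_channel = (height // patch) * (width // patch)
    return channels * tokens_per_channel

# An RGB image vs. a 500-channel hyperspectral cube at the same resolution.
print(vit_token_count(224, 224, 3, 16))    # 588
print(vit_token_count(224, 224, 500, 16))  # 98000
```

Since self‑attention cost grows quadratically in sequence length, a two‑orders‑of‑magnitude jump in tokens is what makes channel sharding and hierarchical aggregation necessary.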
Reference
Distributed Cross‑Channel Hierarchical Aggregation for Foundation Models, SC25. DOI: 10.1145/3712285.3759870 (https://dl.acm.org/doi/10.1145/3712285.3759870)