Can Table Modeling Scale? Rethinking Tree Models in the Age of Massive Compute

The article examines how the dramatic increase in GPU compute power, illustrated by a single H100 GPU delivering roughly the throughput of 200 CPU-based Hadoop instances, challenges the dominance of tree-based models for structured data; it presents scaling-law experiments with KMLP and FOUND, and argues that pre-training can redefine the balance between compute, data, and algorithms.


Compute imbalance: A single H100 GPU (FP16) provides roughly 200 times the compute of a 96‑core CPU Hadoop instance, highlighting a massive shift in raw processing power.
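As a rough sanity check on that figure, the back-of-the-envelope comparison below puts datasheet-style peak numbers side by side. The H100 figure is the FP16 tensor-core peak with structured sparsity; the CPU figure assumes a 96-core AVX-512 machine at 2.6 GHz with one FMA unit kept busy per core. Both are illustrative assumptions, not numbers taken from the article.

```python
# Back-of-the-envelope check of the ~200x compute gap (illustrative figures).
H100_FP16_TFLOPS = 1979            # H100 SXM FP16 tensor-core peak with sparsity (~990 dense)
CPU_CORES = 96                     # assumed 96-core Hadoop worker
CPU_GHZ = 2.6                      # assumed sustained clock
FLOPS_PER_CORE_PER_CYCLE = 32      # assumed: one AVX-512 FMA unit, FP32 (16 lanes x 2 ops)

cpu_tflops = CPU_CORES * CPU_GHZ * FLOPS_PER_CORE_PER_CYCLE / 1000
print(f"CPU instance peak ~ {cpu_tflops:.1f} TFLOPS")
print(f"GPU / CPU ratio   ~ {H100_FP16_TFLOPS / cpu_tflops:.0f}x")
# With these assumptions the ratio lands in the low hundreds, the same order
# of magnitude as the roughly 200x figure quoted above.
```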

The "Bitter Lesson": Following Richard Sutton’s observation, methods that scale with compute tend to outpace handcrafted, domain‑specific solutions. Large language models exemplify this, yet many high‑value industries still rely on tree models such as XGBoost and Random Forest for structured data tasks.

Question posed: With a 200‑fold compute gap, can GPU‑scale parallelism and pre‑training be introduced to structured‑data modeling to rebalance the three core factors—compute, data, and algorithms?

Work 1 – Table-data pre-training (KMLP): The Zhejiang-Ant AIforData team built KMLP (Kolmogorov-Arnold Network with gated MLP), a hybrid architecture that uses a shallow KAN for learned feature engineering and gMLP as the backbone. On a real-world credit-scoring dataset of 2 billion samples, KMLP consistently outperformed traditional GBDT models, and the performance gap widened as data volume grew, demonstrating a clear scaling law for tabular data.
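The article does not give implementation details, but the described combination of a shallow KAN front end and a gMLP backbone can be sketched roughly as below in PyTorch. The Gaussian-RBF basis, token layout, depth, and widths in SimpleKANLayer, GMLPBlock, and KMLPSketch are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleKANLayer(nn.Module):
    """Simplified Kolmogorov-Arnold layer: each output is a learned sum of
    per-input univariate functions, parameterised with a small Gaussian-RBF
    basis. The exact basis used in KMLP is an assumption here."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(in_dim, out_dim, n_basis))
        self.skip = nn.Linear(in_dim, out_dim)      # linear residual path for stability

    def forward(self, x):                           # x: (batch, in_dim)
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)  # (batch, in_dim, n_basis)
        return torch.einsum("bif,iof->bo", phi, self.coef) + self.skip(x)

class GMLPBlock(nn.Module):
    """Minimal gMLP block: channel projection plus a spatial gating unit
    that mixes information across the feature tokens."""
    def __init__(self, n_tokens, d_model, d_ffn):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.gate_norm = nn.LayerNorm(d_ffn // 2)
        self.spatial = nn.Linear(n_tokens, n_tokens)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                           # x: (batch, n_tokens, d_model)
        res = x
        u, v = F.gelu(self.proj_in(self.norm(x))).chunk(2, dim=-1)
        v = self.spatial(self.gate_norm(v).transpose(1, 2)).transpose(1, 2)
        return self.proj_out(u * v) + res

class KMLPSketch(nn.Module):
    """Shallow KAN front end for learned feature engineering, gMLP backbone,
    linear head for a binary target (e.g. credit default)."""
    def __init__(self, n_features, n_tokens=16, d_model=64, depth=4):
        super().__init__()
        self.kan = self_kan = SimpleKANLayer(n_features, n_tokens * d_model)
        self.blocks = nn.ModuleList(
            [GMLPBlock(n_tokens, d_model, d_ffn=4 * d_model) for _ in range(depth)]
        )
        self.head = nn.Linear(d_model, 1)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, x):                           # x: (batch, n_features)
        h = self.kan(x).view(-1, self.n_tokens, self.d_model)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h.mean(dim=1)).squeeze(-1)  # one logit per row
```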

Key advantages of KMLP: It overcomes GBDT’s distributed‑computing bottleneck on massive datasets and eliminates reliance on manual feature engineering by learning feature representations end‑to‑end.

Work 2 – Sequential-data pre-training (FOUND): The FOUND framework (Transferable and Forecastable User Targeting Foundation Model) targets heterogeneous user-behavior sequences and structured data across internet platforms. By aligning compressed sequence embeddings with semantically derived textual descriptions via contrastive learning, FOUND improves cross-domain transferability and predictive power, delivering gains in over 50 business scenarios.
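The alignment step described here resembles a CLIP-style contrastive objective. A minimal sketch of such a symmetric InfoNCE loss between sequence and text embeddings is shown below, assuming the rows of the two batches are paired so that matching pairs sit on the diagonal of the similarity matrix; it is a stand-in for FOUND's actual loss, which the article does not spell out.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(seq_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning compressed behaviour-sequence embeddings
    with embeddings of their textual descriptions (assumed CLIP-style stand-in).
    seq_emb, text_emb: (batch, dim) tensors; row i of each describes the same user."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = seq_emb @ text_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(seq_emb.size(0), device=seq_emb.device)
    # Matching pairs sit on the diagonal; classify in both directions and average.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```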

Scaling law for sequences: Experiments show that while input sequence length or user count is modest, performance improves roughly linearly even as the length (in days) or the number of users grows exponentially. Once either becomes very large, the gains plateau, creating a bottleneck. Compressing sequences with RQ-VAE increases their information density and delays the plateau, a phenomenon the authors call the "Densing Law".
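A minimal sketch of the residual quantisation idea behind RQ-VAE is shown below: each stage quantises the residual left by the previous one, so a long embedding collapses into a handful of integer codes. Codebook sizes and the number of stages here are illustrative assumptions.

```python
import torch

def residual_quantize(z, codebooks):
    """Residual quantisation in the RQ-VAE style: each stage quantises the
    residual left by the previous stage.
    z: (batch, dim) embeddings; codebooks: list of (codebook_size, dim) tensors."""
    residual = z
    codes, quantized = [], torch.zeros_like(z)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)          # (batch, codebook_size) distances
        idx = dists.argmin(dim=-1)                 # nearest codeword = token ID for this stage
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = residual - cb[idx]
    return torch.stack(codes, dim=-1), quantized   # (batch, n_stages), (batch, dim)
```

For example, with four codebooks of 256 entries each, a 128-dimensional event embedding is represented by four one-byte token IDs, which is the sense in which compression raises information density.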

Unified compression‑driven scaling: By converting multi‑source user sequences into semantic token IDs using MRQ‑VAE, the team achieved further performance gains on 80 % of real‑world benchmarks, with successful deployment in digital finance, payment security, recommendation, and advertising.
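One plausible way to consume such semantic token IDs downstream is sketched below: each quantisation stage gets its own embedding table, the stage embeddings are summed into one vector per event, and a small Transformer encoder models the user's event sequence. This wiring (SemanticTokenEncoder) is an assumption for illustration, not the exact MRQ-VAE or FOUND design.

```python
import torch
import torch.nn as nn

class SemanticTokenEncoder(nn.Module):
    """Consumes semantic token IDs from residual quantisation: per-stage
    embedding tables are summed per event, then a Transformer encoder scores
    the sequence. Sizes are illustrative assumptions."""
    def __init__(self, n_stages=4, codebook_size=256, d_model=128, n_layers=2):
        super().__init__()
        self.stage_emb = nn.ModuleList(
            [nn.Embedding(codebook_size, d_model) for _ in range(n_stages)]
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, codes):                      # codes: (batch, seq_len, n_stages) int64
        tok = sum(emb(codes[..., i]) for i, emb in enumerate(self.stage_emb))
        h = self.encoder(tok)                      # (batch, seq_len, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)  # one logit per user
```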

Conclusion and outlook: The massive increase in available compute shows that structured-data modeling can adopt scaling-law-driven pre-training, just as NLP and CV have. The authors assert that the era dominated by handcrafted feature engineering and tree-model tuning is ending, and that a "large-model moment" for structured data is imminent.

Tags: GPU, scaling law, pretraining, Structured Data, FOUND, KMLP, table modeling
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
