Why Kriging and Gaussian Process Regression Share a Math Framework Yet Perform So Differently
This article benchmarks Kriging variants, Gaussian Process Regression (GPR), and several machine‑learning baselines on the SPE9 3‑D permeability dataset. It shows why GPR is markedly more accurate than Kriging here despite their shared covariance‑kernel foundation, and explains why negative R² scores occur.
Dataset: SPE9
SPE9 is a classic benchmark consisting of a synthetic yet realistic 3‑D permeability grid used for testing simulators and modeling workflows. The task is treated as a spatial interpolation problem with normalized (x, y, z) coordinates as inputs and permeability as the target. A total of 1,500 points are sampled and evaluated three ways: a single train/test split, 5‑fold cross‑validation, and 20‑fold cross‑validation.
Compared Models
Ordinary Kriging (OK): predicts values as a weighted average of nearby points using a variogram. Two configurations are tested: automatic variogram model selection (spherical, exponential, Gaussian, linear) and a fixed spherical model.
Universal Kriging: adds a polynomial drift (linear or quadratic) to capture large‑scale trends.
Nested Variogram Kriging: combines two covariance structures, e.g.
gamma(h) = nugget + ps1 * structure_1(h, range_1) + ps2 * structure_2(h, range_2)
This mirrors the GPR practice of summing kernels such as RBF(short_length_scale) + RBF(long_length_scale). It is implemented in Rust with a generalized Levenberg‑Marquardt optimizer, and the spherical+Gaussian and spherical+spherical combinations are tested.
Gaussian Process Regression (GPR): uses scikit‑learn's GaussianProcessRegressor. The kernel hyper‑parameters are optimized by maximizing the marginal likelihood; each evaluation factorizes the N×N covariance matrix, which gives O(N³) training complexity.
Random Forest Regressor and Regression Kriging: a non‑spatial baseline, and a hybrid that first fits the large‑scale trend with a machine‑learning model and then applies Kriging to the residuals.
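To make the nested variogram formula concrete, here is a minimal NumPy sketch of a spherical + Gaussian combination. It is not the article's Rust implementation: the function names, the example parameter values, and the factor 3 in the Gaussian structure (one common "practical range" convention) are assumptions chosen for illustration.

```python
import numpy as np

def spherical(h, range_):
    """Spherical structure, normalized to rise from 0 to 1 at h = range_."""
    r = np.clip(np.asarray(h, dtype=float) / range_, 0.0, 1.0)
    return 1.5 * r - 0.5 * r**3

def gaussian(h, range_):
    """Gaussian structure, normalized to approach 1 as h grows.

    The factor 3 is an assumed practical-range convention.
    """
    h = np.asarray(h, dtype=float)
    return 1.0 - np.exp(-3.0 * (h / range_) ** 2)

def nested_variogram(h, nugget, ps1, range_1, ps2, range_2):
    """gamma(h) = nugget + ps1 * spherical(h, range_1) + ps2 * gaussian(h, range_2)."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + ps1 * spherical(h, range_1) + ps2 * gaussian(h, range_2)
    # By convention the variogram is 0 at zero lag; the nugget applies for h > 0.
    return np.where(h > 0, gamma, 0.0)

# Hypothetical parameters: a short-range and a long-range structure.
lags = np.array([0.0, 0.05, 0.2, 1.0])
gamma = nested_variogram(lags, nugget=0.03, ps1=0.5, range_1=0.1,
                         ps2=0.4, range_2=0.8)
print(gamma)
```

The short‑range structure saturates quickly while the long‑range one keeps rising, which is the variogram analogue of summing a short and a long RBF kernel.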
First Round – Single 1,200/300 Split
GPR achieves a clear advantage with R² = 0.646, while Kriging variants range from 0.18 to 0.24.
GPR training is ~1,000× slower and prediction ~200× slower than Kriging.
Nested variogram Kriging narrows the gap (R² improves from 0.177 to 0.244) with negligible computational cost.
Universal Kriging’s polynomial drift adds little benefit (R² 0.177 → 0.19) but increases runtime 30–100×.
Why GPR Wins
The fitted kernel in the first round is:
0.96² · RBF([0.294, 2.91e-05, 0.0183])
+ 0.072² · Matern([4.47e-05, 3.08e+04, 4.53e+04], nu=1.5)
+ WhiteKernel(0.0293)
The Matérn component has extremely large length scales in the y and z dimensions (≈30,000–45,000), effectively representing a near‑linear large‑scale trend. GPR therefore discovers a short‑range RBF structure in the x direction combined with a smooth, almost linear regional trend, matching the true permeability field.
Nested variogram Kriging attempts the same decomposition but is limited to bounded covariance shapes (spherical, exponential, Gaussian) and cannot express the near‑linear large‑scale trend as cleanly. Moreover, Kriging works in two separate steps: a variogram model is first fitted to the empirical variogram curve, and the fitted model is then used for prediction. GPR instead optimizes all kernel hyper‑parameters jointly by maximizing the marginal likelihood, which leads to higher accuracy here.
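The joint optimization can be sketched with scikit‑learn as follows. This is an illustration of a summed ARD kernel of the same shape as the one reported above, fitted on synthetic data; the data, the initial hyper‑parameter values, and the settings normalize_y and n_restarts_optimizer are assumptions, not the article's configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ConstantKernel, Matern, WhiteKernel,
)

# Synthetic stand-in for the data: short-range structure in x plus a
# near-linear trend in y and z (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(150, 3))
y = (np.sin(8.0 * X[:, 0]) + 0.5 * X[:, 1] + 0.3 * X[:, 2]
     + 0.05 * rng.normal(size=150))

# One length scale per axis (ARD) in both the RBF and the Matern term.
kernel = (
    ConstantKernel(1.0) * RBF(length_scale=[0.3, 0.3, 0.3])
    + ConstantKernel(0.1) * Matern(length_scale=[1.0, 1.0, 1.0], nu=1.5)
    + WhiteKernel(noise_level=0.01)
)

gpr = GaussianProcessRegressor(
    kernel=kernel, normalize_y=True, n_restarts_optimizer=1, random_state=0
)
gpr.fit(X, y)  # all hyper-parameters are tuned jointly via marginal likelihood

print(gpr.kernel_)                          # fitted kernel
print(gpr.log_marginal_likelihood_value_)   # objective at the optimum
mean, std = gpr.predict(X[:5], return_std=True)
```

Inspecting gpr.kernel_ after fitting is how a decomposition like the one above can be read off: very large fitted length scales on an axis suggest a smooth trend along it.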
Anisotropy Scaling Attempt
Scaling coordinates by the variogram range per axis to correct geometric anisotropy degrades performance (R² drops to –0.03 to 0.07) because the extreme scaling makes most point pairs appear “infinitely far,” breaking the spatial correlation that Kriging relies on.
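The mechanism can be illustrated with a small NumPy experiment that counts how many point pairs remain within the variogram range after per‑axis rescaling. The random coordinates and the per‑axis ranges are hypothetical values, with one axis deliberately given a very small range; they are not the ranges fitted in the article.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(200, 3))  # normalized (x, y, z)

# Hypothetical per-axis variogram ranges; the y range is extremely small.
ranges = np.array([0.3, 1e-4, 0.02])

def fraction_correlated(points, unit_range):
    """Share of distinct point pairs separated by less than the range."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)  # each pair counted once
    return float((dist[upper] < unit_range).mean())

# Isotropic case: a single range of 0.3 on the original coordinates.
before = fraction_correlated(coords, unit_range=0.3)
# Rescaled case: dividing by the per-axis range makes the range equal to 1.
after = fraction_correlated(coords / ranges, unit_range=1.0)

print(f"pairs within range before scaling: {before:.3%}")
print(f"pairs within range after scaling:  {after:.3%}")
```

With a bounded variogram, pairs beyond the range contribute no correlation, so a near‑zero share after scaling leaves Kriging with little more than the local mean to work with.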
Second Round – Cross‑Validation
Five‑fold CV (mean ± std) uses GPR with a simpler ARD RBF + white‑noise kernel (optimizer restarted twice). It achieves mean R² = 0.342, lower than the single‑split result, which indicates sensitivity to the test split.
Two issues emerge:
GPR’s accuracy is unstable (std = 0.374); one fold yields negative R² (–0.351) while another reaches 0.694.
Nested variogram Kriging is the most stable spatial method (std = 0.050) and never produces negative R², whereas ordinary and universal Kriging show negative values in some folds.
Twenty‑fold CV (75 test points per fold) makes the pattern clearer: all methods except GPR have negative mean R², meaning they perform worse than a naïve mean‑permeability baseline. GPR remains the only method consistently above the baseline (mean R² = 0.497, lowest variance = 0.236). Nested variogram Kriging is the “least bad” spatial method, with mean R² close to zero.
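A comparison against the mean baseline under 20‑fold CV can be sketched with scikit‑learn as below. The data are synthetic, and a random forest stands in for the models under test because Kriging is not part of scikit‑learn; both are assumptions for illustration, and the printed numbers are unrelated to the article's results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for 1,500 sampled points with normalized coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1500, 3))
y = np.sin(6.0 * X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1500)

# 1,500 points / 20 folds = 75 test points per fold.
cv = KFold(n_splits=20, shuffle=True, random_state=0)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in models.items()
}

for name, s in scores.items():
    print(f"{name}: mean R2 = {s.mean():.3f}, std = {s.std():.3f}, "
          f"negative folds = {(s < 0).sum()}")
```

The mean baseline scores at or slightly below zero on every fold, because the training mean never beats the test fold's own mean; any model whose mean R² falls below that line is doing worse than predicting a constant.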
Understanding Negative R²
R² is defined as R² = 1 - (SS_res / SS_tot), where SS_res = Σ(y_true - y_pred)² and SS_tot = Σ(y_true - ȳ)². If the model’s residual sum of squares exceeds the total sum of squares, the ratio SS_res / SS_tot exceeds 1 and R² becomes negative. This is not a bug; it indicates that the model predicts worse than simply using the mean of the target.
The usual guarantee that R² lies in [0, 1] holds only for ordinary least‑squares regression with an intercept, evaluated on its own training data. It does not hold for out‑of‑sample test sets or for non‑linear spatial models like Kriging and GPR.
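A tiny worked example makes the sign behavior easy to check by hand. The helper r2 and the toy values below are made up for illustration; the function follows the definition given above.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the definition above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # mean 2.5, SS_tot = 5.0

# Predicting the mean everywhere: SS_res = SS_tot, so R^2 = 0.
r2_mean = r2(y_true, np.full(4, 2.5))

# Close predictions: SS_res = 0.10, so R^2 = 1 - 0.10 / 5.0 = 0.98.
r2_good = r2(y_true, np.array([1.1, 1.9, 3.2, 3.8]))

# Reversed predictions: SS_res = 20, so R^2 = 1 - 20 / 5 = -3.
r2_bad = r2(y_true, np.array([4.0, 3.0, 2.0, 1.0]))

print(r2_mean, r2_good, r2_bad)
```

The reversed predictions have the right range and the right mean, yet score far below zero, which is the same situation as a spatial model that places highs and lows in the wrong locations.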
When to Choose Kriging vs. GPR
Choose GPR when:
The dataset has at most a few thousand points (the O(N³) training cost is acceptable).
Accuracy is more important than latency.
Time for hyper‑parameter search can be afforded (each fit takes tens of seconds; 4,000 points may require >11 min).
Automatic discovery of multi‑scale structure via kernel composition is desired.
Choose (nested) Kriging when:
Scoring 50 k–1 M+ points is required (exact GPR does not scale to this size).
Millisecond‑level prediction speed is needed.
Some loss of accuracy is acceptable but the most stable spatial method is required.
Domain knowledge indicates multiple spatial scales; nested variogram Kriging captures part of the multi‑scale information at negligible cost.
In practice, use Kriging (especially the nested variant) for large grids, and reserve GPR for small, high‑interest regions or calibration tasks where its higher accuracy justifies the computational expense.
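The size thresholds above follow from simple scaling arithmetic, sketched below. The helper names are hypothetical, the cubic ratio ignores constant factors and the number of optimizer iterations, and the memory figure assumes a dense float64 covariance matrix.

```python
def cubic_cost_ratio(n_ref, n_new):
    """Relative cost of one O(N^3) factorization when going from n_ref to n_new."""
    return (n_new / n_ref) ** 3

def kernel_matrix_gib(n):
    """Memory for a dense n x n float64 covariance matrix, in GiB."""
    return n * n * 8 / 2**30

# Going from 1,200 to 4,000 training points costs about 37x more per fit.
print(f"{cubic_cost_ratio(1200, 4000):.1f}x")

# The dense covariance matrix alone grows quadratically with N.
for n in (4_000, 50_000, 1_000_000):
    print(f"N = {n:>9,}: {kernel_matrix_gib(n):,.2f} GiB")
```

A factor of roughly 37 is in line with a fit that takes tens of seconds at 1,200 points growing to more than ten minutes at 4,000, and the memory figures show why exact GPR is impractical for grids of 50 k points or more.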
Conclusion
Kriging trades roughly 0.47 R² units for a 1,000× speedup. Under 20‑fold CV, all Kriging variants and the random‑forest baselines have negative average R², while GPR consistently exceeds the mean‑permeability baseline. Nested variogram Kriging mitigates failure without fully closing the gap to GPR. The key take‑away is to always perform cross‑validation and compare R² against a simple mean baseline; a single train‑test split can misleadingly favor a fast but fragile model.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DeepHub IMBA
