How to Evaluate Time Series Forecasting Models: Baselines, MAPE & MSE

This lesson explains how to set baseline reference models and how to use MAPE and MSE as quantitative evaluation metrics, then applies them to quarterly EPS data and a simulated random walk to show the limits of predictability in non‑stationary series.

xkx's Tech General Store

Learning objectives – Before building complex models, establish a baseline (historical mean, last value, seasonal naive) and quantify performance with MAPE and MSE. These tools replace subjective visual checks.

Baseline models – Historical mean suits long‑term stable series; last value captures strong inertia; seasonal naive reuses the same quarter from previous years. The baseline acts as a "passing line" for any ARIMA or deep‑learning model.
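The three baselines can be sketched as one small helper. This is a minimal illustration, not code from the original lesson: the function name `baseline_forecasts`, its parameters, and the toy numbers are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def baseline_forecasts(train: pd.Series, horizon: int, season_length: int) -> pd.DataFrame:
    """Return the three baseline forecasts for the next `horizon` steps (hypothetical helper)."""
    # Seasonal naive: repeat the last full season as often as needed, then trim.
    last_season = train.iloc[-season_length:].to_numpy()
    repeats = int(np.ceil(horizon / season_length))
    seasonal = np.tile(last_season, repeats)[:horizon]
    return pd.DataFrame({
        "historical_mean": np.full(horizon, train.mean()),   # flat line at the training mean
        "last_value": np.full(horizon, train.iloc[-1]),      # flat line at the final observation
        "seasonal_naive": seasonal,                          # same season, one cycle earlier
    })

# Toy quarterly series with made-up numbers (two years of data).
toy = pd.Series([1.0, 2.0, 3.0, 2.0, 2.0, 3.0, 4.0, 3.0])
fc = baseline_forecasts(toy, horizon=6, season_length=4)
print(fc)
```

With `season_length=4` the helper matches the quarterly setting used in Case 1. Asking for six steps shows how the seasonal naive forecast wraps around into a second cycle.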

Evaluation metrics

MAPE (Mean Absolute Percentage Error) expresses deviation as a percentage of the actual value, so it is scale‑independent and easy to compare across business metrics. Smaller values indicate better accuracy; a common rule of thumb treats MAPE < 10 % as excellent. Because every error is divided by the actual value, MAPE is undefined when an actual value is zero and becomes unstable when actual values approach zero.
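Written out, MAPE is the mean of |actual − forecast| / |actual|, multiplied by 100. The sketch below implements that definition with NumPy and shows the near‑zero problem on made‑up numbers. The helper name `mape` and the sample values are illustrative assumptions, not part of the original lesson.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, returned in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# Both forecasts miss every point by exactly 1.0; only the scale of the actuals differs.
mape_large = mape([100.0, 200.0], [101.0, 201.0])   # actuals far from zero
mape_small = mape([0.01, 200.0], [1.01, 201.0])     # one actual close to zero

print(f"MAPE, actuals far from zero: {mape_large:.2f}%")
print(f"MAPE, one actual near zero:  {mape_small:.2f}%")
```

The absolute error is identical in both cases, yet a single near‑zero actual dominates the second result. Note that `sklearn.metrics.mean_absolute_percentage_error` returns a fraction rather than a percentage, which is why the Case 1 code multiplies it by 100.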

MSE (Mean Squared Error) squares errors before averaging, which makes it sensitive to large deviations and lets it handle zero actuals. Its drawback is that it is expressed in the squared unit of the original data, so it is harder to read directly.
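The effect of squaring is easy to see on a small made‑up example. In this sketch, two forecasts have the same total absolute error, but one spreads it evenly and the other concentrates it in a single point. The helper `mse` and all numbers are assumptions chosen for illustration.

```python
import numpy as np

def mse(actual, forecast):
    """Mean squared error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean((actual - forecast) ** 2)

# The first actual is zero: MAPE would divide by it, MSE has no such problem.
actual = np.array([0.0, 10.0, 10.0, 10.0])
even_errors = np.array([2.0, 12.0, 12.0, 12.0])     # every forecast is off by 2
one_big_error = np.array([0.0, 10.0, 10.0, 18.0])   # a single forecast is off by 8

mae_even = np.mean(np.abs(actual - even_errors))
mae_big = np.mean(np.abs(actual - one_big_error))
mse_even = mse(actual, even_errors)
mse_big = mse(actual, one_big_error)

print(f"MAE: even={mae_even:.1f}, one big error={mae_big:.1f}")
print(f"MSE: even={mse_even:.1f}, one big error={mse_big:.1f}")
# Taking the square root returns the error to the original unit.
print(f"RMSE: even={np.sqrt(mse_even):.1f}, one big error={np.sqrt(mse_big):.1f}")
```

The mean absolute error cannot tell the two forecasts apart, while MSE penalizes the single large miss four times as heavily.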

Case 1: Quarterly EPS baseline comparison

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf

# EPS data
eps_values = np.array([0.71, 0.63, 0.85, 0.44, 0.61, 0.69, 0.92, 0.55,
    0.72, 0.77, 0.92, 0.60, 0.83, 0.80, 1.00, 0.77,
    0.92, 1.00, 1.24, 1.00, 1.16, 1.30, 1.45, 1.25,
    1.26, 1.38, 1.86, 1.56, 1.53, 1.59, 1.83, 1.86,
    1.53, 2.07, 2.34, 2.25, 2.16, 2.43, 2.70, 2.25,
    2.79, 3.42, 3.69, 3.60, 3.60, 4.32, 4.32, 4.05,
    4.86, 5.04, 5.04, 4.41, 5.58, 5.85, 6.57, 5.31,
    6.03, 6.39, 6.93, 5.85, 6.93, 7.74, 7.83, 6.12,
    7.74, 8.91, 8.28, 6.84, 9.54, 10.26, 9.54, 8.73,
    11.88, 12.06, 12.15, 8.91, 14.04, 12.96, 14.85, 9.99,
    16.20, 14.67, 16.02, 11.61])

eps_index = pd.period_range("1960Q1", periods=len(eps_values), freq="Q").to_timestamp()
eps = pd.Series(eps_values, index=eps_index, name="EPS")
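# Hold out the last four quarters as the test set
train = eps.iloc[:80]
test = eps.iloc[80:84]

# One column per baseline, aligned to the test index
forecasts = pd.DataFrame(index=test.index)
forecasts["historical_mean"] = train.mean()
forecasts["last_value"] = train.iloc[-1]
forecasts["seasonal_naive"] = train.iloc[-4:].to_numpy()
forecasts["actual"] = test

# Score each baseline against the held-out quarters
rows = []
for name in ["historical_mean", "last_value", "seasonal_naive"]:
    rows.append({
        "baseline": name,
        # sklearn returns MAPE as a fraction, so convert to percent
        "MAPE": mean_absolute_percentage_error(test, forecasts[name]) * 100,
        "MSE": mean_squared_error(test, forecasts[name])
    })
metrics = pd.DataFrame(rows).set_index("baseline")
print(metrics)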

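The resulting metrics show that the seasonal naive method achieves the lowest MAPE (≈ 11.56 %) and MSE (≈ 2.90). The historical mean performs worst (MAPE ≈ 70 %) because the series trends strongly upward, so the long‑run average sits far below the most recent values. Plotting the forecasts against the actual values confirms the quarterly seasonal pattern in the EPS data.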

Case 2: Random walk simulation

rng = np.random.default_rng(42)
steps = rng.standard_normal(1000)
steps[0] = 0
random_walk = pd.Series(np.cumsum(steps), name="random_walk")
plt.figure(figsize=(12, 4.8))
plt.plot(random_walk, color="#2563eb", linewidth=1.2)
plt.title("Simulated random walk")
plt.xlabel("Time step")
plt.ylabel("Value")
plt.grid(True, alpha=0.25)
plt.show()

The ACF of the raw series decays slowly, which points to non‑stationarity. The Augmented Dickey‑Fuller (ADF) test returns p = 0.5787 > 0.05, so the null hypothesis of a unit root cannot be rejected.

After first‑order differencing:

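# First-order difference recovers the individual steps of the walk
diff_series = pd.Series(np.diff(random_walk), name="diff")

# ACF of the differenced series
fig, ax = plt.subplots(figsize=(12, 4))
plot_acf(diff_series, lags=30, ax=ax)
ax.set_title("ACF after first-order differencing (white-noise signature)")
plt.grid(True, alpha=0.2)
plt.show()

# ADF test on the differenced series
result_diff = adfuller(diff_series)
print(f"ADF p-value after differencing: {result_diff[1]:.4f}")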

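The differenced series yields p ≈ 0.0000 < 0.05, so the unit‑root null is rejected and the series is stationary. Its ACF shows no significant correlation at non‑zero lags, which is the signature of white noise. Because the increments of a random walk carry no usable structure, its future path is inherently unpredictable.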

Case 3: Forecasting a random walk with its last observed value

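# Use the first 800 points for training and the last 200 for testing
train_rw = random_walk.iloc[:800]
test_rw = random_walk.iloc[800:]

# Last-value baseline: repeat the final training observation over the whole test window
pred_value = train_rw.iloc[-1]
pred_rw = pd.Series(np.full(len(test_rw), pred_value), index=test_rw.index)
mse_rw = mean_squared_error(test_rw, pred_rw)
print(f"Random walk test-set MSE: {mse_rw:.4f}")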

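The test‑set MSE is 88.58. For a random walk with independent zero‑mean steps, the last observed value is the best forecast of any future value in the mean‑squared‑error sense, because the expected sum of the future steps is zero. The error is large not because the baseline is poor but because uncertainty accumulates with every step; even sophisticated ML/DL models struggle to beat it.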
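A small Monte Carlo experiment shows how the error accumulates. With unit‑variance steps, the last‑value forecast error at horizon h is the sum of h independent steps, so its expected squared error is h. Averaged over a 200‑step window that gives (1 + 200) / 2 = 100.5, the same order of magnitude as the single‑path result above. The path count, seed, and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, horizon = 5000, 200

# Each row is one possible future: 200 unit-variance steps after the last observation.
future_steps = rng.standard_normal((n_paths, horizon))

# The last-value forecast error at horizon h is the sum of the next h steps.
errors = np.cumsum(future_steps, axis=1)

# Average the squared error across paths, giving one value per horizon h = 1..200.
mse_by_horizon = np.mean(errors ** 2, axis=0)
mse_over_window = mse_by_horizon.mean()

print(f"MSE at h=1:   {mse_by_horizon[0]:.2f}")
print(f"MSE at h=200: {mse_by_horizon[-1]:.2f}")
print(f"Average MSE over the 200-step window: {mse_over_window:.2f}")
```

The estimates should land near 1, 200, and 100.5 respectively, with some sampling noise. A single realization, such as the one in the lesson, can deviate from the window average considerably.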

Key takeaways

Baseline models (historical mean, last value, seasonal naive) provide a necessary performance floor for any forecasting project.

MAPE offers an intuitive, scale‑independent error measure; MSE captures large deviations and works with zero values. Using both gives a balanced assessment.

Random walks are non‑stationary; after differencing they become white noise, confirming the theoretical prediction limit. Simple baselines often outperform complex models on such data.
