Predicting Tomorrow’s Weather with Random Forests: A European City Case Study
Using detailed meteorological records from 18 European cities between 2000 and 2010, this article demonstrates how random forest regression and comprehensive data preprocessing can forecast daily precipitation, evaluate model performance, and compare climatic patterns across cities, highlighting both strengths and limitations of the approach.
How can we predict tomorrow's weather? Farmers, travelers, and everyday people all rely on accurate forecasts, making weather prediction a blend of science, culture, economy, and social life.
Scientists combine observations from weather stations and satellites with data science to anticipate future conditions.
Using detailed records from 18 European cities, we explore how random forest algorithms can map tomorrow's sky.
Data Collection, Selection, and Processing
The original meteorological data were retrieved from the ECA&D project [1], which provides daily observations from European and Mediterranean stations. We selected 18 cities with daily data available between 2000 and 2010: Basel (Switzerland); Budapest (Hungary); Dresden, Düsseldorf, Kassel, and Munich (Germany); De Bilt and Maastricht (Netherlands); Heathrow (UK); Ljubljana (Slovenia); Malmö and Stockholm (Sweden); Montélimar, Perpignan, and Tours (France); Oslo (Norway); Rome (Italy); and Sonnblick (Austria).
Only the 2000‑2010 period was kept, yielding 3,654 daily records, each covering all 18 locations. The dataset contains variables such as mean temperature, maximum temperature, minimum temperature, cloud cover, wind speed, gust, humidity, pressure, global radiation, precipitation, and sunshine duration.
After collection, basic cleaning removed columns in which more than 5% of entries were invalid (marked “-9999”). In columns with 5% or fewer invalid entries, the invalid values were replaced by the column mean, leaving 165 variables for 3,654 days. Units were converted to more intuitive scales (temperature in °C, wind speed in m/s, humidity as a fraction of 100 %, pressure in units of 1000 hPa, radiation in units of 100 W/m², precipitation in cm, sunshine in hours) to facilitate machine‑learning modeling without additional standardization.
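The cleaning rule described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the function name `clean_station_table` and the toy column names are hypothetical, and the original preprocessing script is not shown in this article.

```python
import numpy as np
import pandas as pd

SENTINEL = -9999           # invalid-entry marker mentioned in the text
MAX_INVALID_SHARE = 0.05   # columns above this share of invalid entries are dropped


def clean_station_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop columns with too many invalid entries, mean-impute the rest."""
    # Treat the sentinel as a missing value
    data = raw.replace(SENTINEL, np.nan)
    # Share of invalid entries per column
    invalid_share = data.isna().mean()
    # Keep only columns at or below the threshold
    kept = data.loc[:, invalid_share <= MAX_INVALID_SHARE]
    # Replace the remaining gaps with the column mean
    return kept.fillna(kept.mean(numeric_only=True))


# Toy example with hypothetical column names
toy = pd.DataFrame({
    "A_temp_mean": [10.0] * 38 + [-9999, 12.0],   # 2.5% invalid: kept and imputed
    "A_wind_speed": [3.0] * 30 + [-9999] * 10,    # 25% invalid: dropped
})
cleaned = clean_station_table(toy)
print(cleaned.columns.tolist())
print(cleaned["A_temp_mean"].iloc[38])
```

Mean imputation is simple, but it flattens any seasonal signal in the filled gaps; for long gaps, interpolation in time may be a better fit.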
Physical Units of Variables
Original units:
CC: cloud cover (eighths)
DD: wind direction (degrees)
FG: wind speed (0.1 m/s)
FX: gust (0.1 m/s)
HU: humidity (1 %)
PP: sea‑level pressure (0.1 hPa)
QQ: global radiation (W/m²)
RR: precipitation (0.1 mm)
SS: sunshine duration (0.1 h)
TG: mean temperature (0.1 °C)
TN: minimum temperature (0.1 °C)
TX: maximum temperature (0.1 °C)
Converted units:
FG, FX: 1 m/s
HU: 100 % (values between 0 and 1)
PP: 1000 hPa
QQ: 100 W/m²
RR: 10 mm (1 cm)
SS: 1 h
TG, TN, TX: 1 °C
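As a worked example of the conversion, the sketch below multiplies each variable by the factor implied by the two unit lists. The `SCALE` dictionary, the `convert_units` helper, and the raw values (back-calculated from the first Basel row shown in the sample output) are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Multipliers from the original ECA&D units to the converted units listed above
SCALE = {
    "FG": 0.1,      # 0.1 m/s  -> 1 m/s
    "FX": 0.1,      # 0.1 m/s  -> 1 m/s
    "HU": 0.01,     # 1 %      -> 100 %
    "PP": 0.0001,   # 0.1 hPa  -> 1000 hPa
    "QQ": 0.01,     # W/m²     -> 100 W/m²
    "RR": 0.01,     # 0.1 mm   -> 10 mm (1 cm)
    "SS": 0.1,      # 0.1 h    -> 1 h
    "TG": 0.1,      # 0.1 °C   -> 1 °C
    "TN": 0.1,
    "TX": 0.1,
}


def convert_units(raw: pd.DataFrame) -> pd.DataFrame:
    """Rescale every known variable column; leave other columns untouched."""
    converted = raw.copy()
    for code, factor in SCALE.items():
        if code in converted.columns:
            converted[code] = converted[code] * factor
    return converted


# Hypothetical raw values chosen to reproduce the first Basel row of the dataset
raw_row = pd.DataFrame({"HU": [89], "PP": [10286], "RR": [3], "TG": [29]})
converted = convert_units(raw_row)
print(converted.round(4))
```

A raw pressure reading of 10286 (in 0.1 hPa) becomes 1.0286, which corresponds to 1028.6 hPa.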
<code># Import required libraries
import pandas as pd
# Load dataset
file_path = "data/weather_prediction_dataset.csv"
weather_data = pd.read_csv(file_path)
# Show first five rows
weather_data.head()
</code>Sample of the first five rows:
<code> DATE MONTH BASEL_cloud_cover BASEL_humidity BASEL_pressure \
0 2000-01-01 1 8 0.89 1.0286
1 2000-01-02 1 8 0.87 1.0318
2 2000-01-03 1 5 0.81 1.0314
3 2000-01-04 1 7 0.79 1.0262
4 2000-01-05 1 5 0.90 1.0246
BASEL_global_radiation BASEL_precipitation BASEL_sunshine \
0 0.20 0.03 0.0
1 0.25 0.00 0.0
2 0.50 0.00 3.7
3 0.63 0.35 6.9
4 0.51 0.07 3.7
BASEL_temp_mean BASEL_temp_min ... BASEL_humidity_lag_3
0 2.9 1.6 ... NaN
1 3.6 2.7 ... NaN
2 2.2 0.1 ... NaN
3 3.9 0.5 ... 0.89
4 6.0 3.8 ... 0.87
BASEL_pressure_lag_1 BASEL_pressure_lag_2 BASEL_pressure_lag_3 \
0 NaN NaN NaN
1 1.0286 NaN NaN
2 1.0318 1.0286 NaN
3 1.0314 1.0318 1.0286
4 1.0262 1.0314 1.0318
BASEL_cloud_cover_lag_1 BASEL_cloud_cover_lag_2 BASEL_cloud_cover_lag_3 \
0 NaN NaN NaN
1 8.0 NaN NaN
2 8.0 8.0 NaN
3 5.0 8.0 8.0
4 7.0 5.0 8.0
BASEL_precipitation_lag_1 BASEL_precipitation_lag_2 \
0 NaN NaN
1 0.03 NaN
2 0.00 0.03
3 0.00 0.00
4 0.35 0.00
BASEL_precipitation_lag_3
0 NaN
1 NaN
2 NaN
3 0.03
4 0.00
[5 rows x 180 columns]
</code>Data Integrity Analysis
First we check for missing values and outliers.
<code># Check missing values
missing_data_summary = weather_data.isnull().sum()
# Show columns with missing values (if any)
missing_columns = missing_data_summary[missing_data_summary > 0]
missing_columns
</code> <code>Series([], dtype: int64)</code>No missing values were found. Next, we examine outliers for a few representative variables.
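One common way to screen for outliers is the interquartile-range (IQR) rule. The sketch below is an assumption about how such a check could look, not the analysis used for this article; the helper `iqr_outlier_counts` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(data: pd.DataFrame, columns, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = data[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((data[col] < lower) | (data[col] > upper)).sum())
    return pd.Series(counts, name="n_outliers")


# Synthetic stand-in data: a roughly normal temperature and a mostly dry precipitation series
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp_mean": rng.normal(10, 7, 1000),
    "precipitation": np.append(np.zeros(990), np.full(10, 5.0)),
})
outlier_counts = iqr_outlier_counts(demo, ["temp_mean", "precipitation"])
print(outlier_counts)
```

On the real table, the call would be along the lines of `iqr_outlier_counts(weather_data, ['BASEL_temp_mean', 'BASEL_precipitation', 'BASEL_pressure'])`. For zero-heavy variables such as precipitation, the IQR can collapse to zero, so the rule flags every wet day; those counts describe the distribution rather than measurement errors.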
Seasonality and Trend Analysis
We analyze Basel’s temperature and precipitation seasonality and trends.
<code>import matplotlib.pyplot as plt
# Convert DATE column to datetime
weather_data['DATE'] = pd.to_datetime(weather_data['DATE'], format='%Y%m%d')
# Select Basel temperature and precipitation
basel_temp_mean = weather_data[['DATE', 'BASEL_temp_mean']]
basel_precipitation = weather_data[['DATE', 'BASEL_precipitation']]
# Plot temperature time series
plt.figure(figsize=(12, 6))
plt.plot(basel_temp_mean['DATE'], basel_temp_mean['BASEL_temp_mean'], label='Mean Temperature (°C)')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.title('Mean Temperature Trend in Basel (Switzerland)')
plt.legend()
plt.show()
# Plot precipitation time series
plt.figure(figsize=(12, 6))
plt.plot(basel_precipitation['DATE'], basel_precipitation['BASEL_precipitation'], label='Precipitation (cm)')
plt.xlabel('Date')
plt.ylabel('Precipitation (cm)')
plt.title('Precipitation Trend in Basel (Switzerland)')
plt.legend()
plt.show()
</code>Figures: time series of Basel's daily mean temperature (°C) and precipitation (cm), 2000‑2010.
Observations:
Average temperature shows clear seasonal cycles, rising in summer and falling in winter, with a relatively stable long‑term trend.
Precipitation also exhibits seasonal variation, though less pronounced, and remains stable over the decade.
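These visual impressions can be backed by numbers: averaging by calendar month exposes the seasonal cycle, and averaging by year gives a rough view of the long-term trend. The sketch below uses a synthetic temperature series; the helper name `monthly_and_annual_means` is hypothetical.

```python
import numpy as np
import pandas as pd


def monthly_and_annual_means(frame: pd.DataFrame, date_col: str, value_col: str):
    """Return (mean by calendar month, mean by year) for one variable."""
    dates = pd.to_datetime(frame[date_col])
    monthly = frame[value_col].groupby(dates.dt.month).mean()
    annual = frame[value_col].groupby(dates.dt.year).mean()
    return monthly, annual


# Synthetic stand-in for a temperature series: a pure annual cycle around 10 °C
dates = pd.date_range("2000-01-01", "2009-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
temps = 10 + 9 * np.sin(2 * np.pi * (day_of_year - 105) / 365.25)
demo = pd.DataFrame({"DATE": dates, "temp_mean": temps})

monthly, annual = monthly_and_annual_means(demo, "DATE", "temp_mean")
print(monthly.round(1))   # seasonal cycle: one value per calendar month
print(annual.round(2))    # long-term behaviour: one value per year
```

With the article's data, the same call on `weather_data` with `'DATE'` and `'BASEL_temp_mean'` (or `'BASEL_precipitation'`) would give the monthly and yearly averages behind the two plots.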
City‑wise Comparison
We compare average temperature and humidity across all 18 cities for 2000‑2010.
<code># Extract city names from the '_temp_mean' columns; matching on the suffix keeps
# city prefixes that themselves contain an underscore intact and skips lag columns
suffix = '_temp_mean'
cities = sorted(col[:-len(suffix)] for col in weather_data.columns if col.endswith(suffix))
# Compute average temperature and humidity per city
city_avg_temp_humidity = []
for city in cities:
    avg_temp = weather_data[f'{city}_temp_mean'].mean()
    humidity_col = f'{city}_humidity'
    # Guard against cities without a humidity column
    avg_humidity = weather_data[humidity_col].mean() if humidity_col in weather_data.columns else float('nan')
    city_avg_temp_humidity.append((city, avg_temp, avg_humidity))
city_avg_temp_humidity_df = pd.DataFrame(city_avg_temp_humidity, columns=['City', 'Avg_Temperature (°C)', 'Avg_Humidity'])
city_avg_temp_humidity_df = city_avg_temp_humidity_df.sort_values(by='Avg_Temperature (°C)', ascending=False)
city_avg_temp_humidity_df
</code>Resulting table (image):
Temperature: Rome and Perpignan have the highest averages; Sonnblick (Austria) has the lowest.
Humidity: Sonnblick shows the highest average humidity, while Perpignan has the lowest.
Machine Learning Model
We use Basel’s data to predict the next day’s precipitation.
Select target variable and features – precipitation as target; temperature, humidity, pressure, and cloud cover as features.
Create lag features – the values of each variable on the previous three days.
Split data – training and test sets (80/20), kept in chronological order.
Model selection and training – we fit a random forest regressor.
Prediction and evaluation – assess performance on the test set using MAE and RMSE.
<code>from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Choose target and features
target_variable = 'BASEL_precipitation'
features = ['BASEL_temp_mean', 'BASEL_humidity', 'BASEL_pressure', 'BASEL_cloud_cover']
# Create lag features (1‑3 days)
lag_days = 3
for feature in features + [target_variable]:
    for lag in range(1, lag_days + 1):
        weather_data[f'{feature}_lag_{lag}'] = weather_data[feature].shift(lag)
# Assemble feature matrix and target vector
selected_features = [f'{feature}_lag_{lag}' for feature in features for lag in range(1, lag_days + 1)]
# Same-day values are included too, so the model sees day-t conditions when
# estimating day-t precipitation; drop this line for a strict next-day forecast
selected_features += features
# Skip the first rows, whose lag values are undefined
X = weather_data[selected_features][lag_days:]
y = weather_data[target_variable][lag_days:]
# Train‑test split (no shuffling for time series)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
# Standardize features; fit the scaler on the training portion only so that
# test-set statistics do not leak into training (tree models do not require scaling)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)
# Show first few rows of the prepared data
X_train[:5], y_train[:5]
</code>We then train a Random Forest regressor.
<code>from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)
# Predictions
y_train_pred = rf_regressor.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
y_test_pred = rf_regressor.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mae, train_rmse, test_mae, test_rmse
</code>Performance:
Training MAE: 0.0865 cm
Training RMSE: 0.1651 cm
Test MAE: 0.2450 cm
Test RMSE: 0.4743 cm
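The test MAE is nearly three times the training MAE, which points to over‑fitting. This could be mitigated by hyper‑parameter tuning, feature engineering, or regularization (for a random forest, for example, limiting tree depth or requiring more samples per leaf).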
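A hedged sketch of what such tuning could look like: a small grid search over tree depth and leaf size, scored with time-ordered cross-validation so that each fold validates on data later than its training portion. The synthetic data and the parameter grid are assumptions for illustration only; on the real data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the training data (non-negative target, like precipitation)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = np.clip(X_demo[:, 0] + 0.5 * rng.normal(size=300), 0, None)

# Illustrative grid: shallower trees and larger leaves act as regularization
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),        # folds respect chronological order
    scoring="neg_mean_absolute_error",     # scikit-learn maximizes, hence the sign
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_mae = -search.best_score_
print(best_params, round(best_cv_mae, 4))
```

`TimeSeriesSplit` is used instead of shuffled K-fold because shuffling would let the model train on days that come after the ones it is validated on.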
<code># Visualize actual vs. predicted precipitation on the test set
plt.figure(figsize=(12, 6))
plt.plot(weather_data['DATE'][lag_days:].iloc[-len(y_test):], y_test, label='Actual Precipitation (cm)')
plt.plot(weather_data['DATE'][lag_days:].iloc[-len(y_test):], y_test_pred, label='Predicted Precipitation (cm)', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Precipitation (cm)')
plt.title('Predicted vs Actual Precipitation in Basel (Switzerland)')
plt.legend()
plt.show()
</code>The plot shows that while the model captures general precipitation trends, noticeable deviations remain for certain days.
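To put error figures like these in context, it helps to compare them with naive baselines. The sketch below computes the MAE of two such baselines: always predicting the training mean, and persistence (predicting the previous day's value). The helper `baseline_maes` and the toy numbers are hypothetical; no baseline results are reported for the article's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def baseline_maes(y_train, y_test) -> dict:
    """MAE of two naive forecasts on the test period."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    # Climatology-style baseline: always predict the training mean
    mean_pred = np.full_like(y_test, y_train.mean())
    # Persistence baseline: predict the previous day's observed value
    persistence_pred = np.concatenate(([y_train[-1]], y_test[:-1]))
    return {
        "mean": mean_absolute_error(y_test, mean_pred),
        "persistence": mean_absolute_error(y_test, persistence_pred),
    }


# Toy precipitation values in cm
demo_baselines = baseline_maes([0.0, 0.2, 0.0, 0.4], [0.0, 0.3, 0.0])
print(demo_baselines)
```

Calling `baseline_maes(y_train, y_test)` with the article's split would show whether the random forest's test MAE actually improves on these simple references.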
Conclusion
We selected Basel’s daily precipitation as the prediction target and used related meteorological variables as features.
A Random Forest regression model was trained and evaluated, achieving modest accuracy on the test set (MAE 0.2450 cm, RMSE 0.4743 cm).
Performance gaps indicate the need for further feature engineering, hyper‑parameter optimization, or alternative modeling approaches.
References:
[1] Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th‑century surface air temperature and precipitation series for the European Climate Assessment. Int. J. Climatol., 22, 1441‑1453.
[2] https://www.kaggle.com/datasets/thedevastator/weather-prediction
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".