How Tsinghua’s Big Data Initiative Boosted Refinery Energy Forecasts with GRU
The Tsinghua University Big Data Capability Project applied GRU‑based deep learning, pulse‑event encoding, and advanced feature engineering to transform discrete refinery energy data into continuous sequences, achieving prediction accuracies of 84.2%, 82.7% and 81.6% for fuel gas, medium‑pressure and low‑pressure steam respectively.
Background
The project was carried out by a graduate student from Tsinghua University’s School of Software in collaboration with Sinopec Engineering Construction Co. The goal was to develop a data‑driven model for predicting energy consumption and carbon emissions of refinery units, supporting China’s “dual‑carbon” targets.
Problem Statement
Conventional thermodynamic or mechanistic models are computationally intensive and cannot be validated in real time. Existing machine‑learning approaches often focus only on prediction accuracy and ignore the identification and interpretation of key influencing factors, which is essential for industrial decision‑making.
Data Preprocessing – Pulse‑Event Encoding
Measured energy data were recorded as discrete integer steps because of instrument precision limits, producing a staircase‑like series unsuitable for linear or simple time‑series models. To address this, a pulse‑event encoding based on stop‑time theory was introduced. The method converts each integer jump into an interval‑equivalent pulse event where the length of the interval reflects the frequency of the underlying physical event. By mapping the sparse stepwise series to a continuous curve enriched with temporal dynamics, the data become more amenable to deep‑learning models.
```python
# Pulse-event encoding (runnable sketch). The interval formula
# 1 / |delta| is an illustrative assumption standing in for the
# project's stop-time mapping.
import numpy as np

def pulse_event_encode(raw_series, timestamps):
    pulse_times, intervals = [], []
    for i in range(1, len(raw_series)):
        delta = raw_series[i] - raw_series[i - 1]
        if delta != 0:
            # interval length inversely proportional to event frequency
            intervals.append(1.0 / abs(delta))
            pulse_times.append(timestamps[i])
    # Interpolate the sparse pulse events onto the full time grid
    return np.interp(timestamps, pulse_times, intervals)
```

Feature Engineering
Performed clustering on the continuous signals to group similar operating regimes.
Calculated pairwise time‑series similarity using Dynamic Time Warping (DTW) to quantify shape similarity between clusters.
Applied Principal Component Analysis (PCA) on the DTW distance matrix to obtain a low‑dimensional representation.
Conducted Pearson correlation between each principal component and target energy variables; retained the top 20 % of features with the highest absolute correlation for model input.
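The steps above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the series are split into operating-regime windows, implements classic DTW directly, uses an SVD-based PCA, and caps the number of components at an arbitrary five; the top-20 % retention ratio follows the text.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def pca_transform(X, n_comp):
    """Project rows of X onto their first n_comp principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_comp].T

def select_components(windows, window_targets, keep_ratio=0.2):
    """PCA on the pairwise DTW matrix; keep the components whose
    absolute Pearson correlation with the target is in the top share."""
    k = len(windows)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(windows[i], windows[j])
    n_comp = min(k, 5)  # illustrative cap on retained dimensions
    coords = pca_transform(dist, n_comp)
    # nan_to_num guards against zero-variance components
    corr = np.nan_to_num([abs(np.corrcoef(coords[:, c], window_targets)[0, 1])
                          for c in range(n_comp)])
    n_keep = max(1, int(round(keep_ratio * n_comp)))
    keep = np.argsort(corr)[::-1][:n_keep]
    return coords[:, keep]
```

The O(n·m) DTW here is fine for short windows; production pipelines typically use a banded or fast approximation.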
Model Construction and Comparison
Four model families were benchmarked on the engineered feature set:
Tree‑based algorithms (e.g., Random Forest, Gradient Boosting).
Multi‑Layer Perceptron (MLP).
Gated Recurrent Unit (GRU) networks.
Transformer‑based sequence models.
The GRU architecture was selected because its gating mechanisms (reset and update gates) effectively capture long‑term dependencies while mitigating the vanishing‑gradient problem of standard RNNs. The final GRU configuration was:
Input dimension: number of selected features (≈ 0.2 × original feature count).
Hidden units: 128.
Layers: 2 stacked GRU layers.
Dropout: 0.2 between layers.
Optimizer: Adam (learning rate = 0.001).
Loss function: Mean Squared Error (MSE).
Training epochs: 100 with early stopping (patience = 10).
Batch size: 64.
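To make the gating mechanism concrete, here is a single GRU time step written out in plain NumPy. This is a pedagogical sketch with random weights and bias terms omitted, not the trained 2-layer, 128-unit network described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step for input x (d,) and previous hidden state h (k,)."""
    z = sigmoid(Wz @ x + Uz @ h)              # update gate: how much to refresh
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate: how much past to use
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_cand         # gated blend of old and new
```

Because the update gate can keep `z` near zero, the old state passes through almost unchanged, which is what lets the GRU carry information across long horizons without the vanishing gradients of a plain RNN.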
Results
On a held‑out test set, the GRU model achieved the following prediction accuracies (coefficient of determination, R²) for three key refinery streams:
Fuel gas: 84.2 % (R² = 0.842).
Medium‑pressure steam: 82.7 % (R² = 0.827).
Low‑pressure steam: 81.6 % (R² = 0.816).
Additional error metrics (MAE, RMSE) confirmed consistent performance across all targets, demonstrating the model’s robustness for real‑time energy forecasting.
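The reported figures follow the standard definitions of these metrics; a minimal helper (the exact evaluation code is not from the source):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, MAE and RMSE for a vector of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(resid)),
        "rmse": np.sqrt(np.mean(resid ** 2)),
    }
```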
Conclusions and Contributions
Introduced a pulse‑event encoding technique that transforms discrete, instrument‑limited measurements into continuous, information‑rich time series.
Combined DTW‑based similarity assessment with PCA to sparsify high‑dimensional sensor data and isolate the most influential factors (top 20 %).
Developed a GRU‑based predictive framework that outperforms traditional tree models, MLPs, and Transformer variants in capturing long‑range temporal dependencies of refinery energy consumption.
Validated the approach on real industrial data, achieving > 80 % prediction accuracy for multiple energy streams, thereby providing a practical tool for refinery energy management and carbon‑emission reduction.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
