What We Learned from Winning 3rd Place in China’s 2025 Big Data Challenge

The Dalian University team’s third‑place finish in the 2025 China University Computer Competition’s Big Data Challenge revealed key lessons about data cleaning, focused feature engineering, the power of simple robust models like Random Forest, custom evaluation metrics, and the indispensable role of tight teamwork in data science projects.

Data Party THU

Sep 10, 2025

Data Cleaning

The raw competition dataset contained numerous anomalies: negative price values, inconsistent decimal precision, and heterogeneous formatting of timestamps and identifiers. The team first treated negative prices as invalid, converting them to NaN and filling the gaps by median imputation. Next, they standardized numeric precision to two decimal places using round(value, 2) and unified date strings with pd.to_datetime. Categorical fields were stripped of whitespace and encoded consistently. These steps turned a chaotic source into a clean, analysis‑ready table.
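The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the team's actual code: the column names (date, symbol, price), the helper name clean_prices, and the toy rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a raw price table (hypothetical columns)."""
    out = df.copy()

    # Negative prices are invalid: mark them as missing, then impute the median.
    out["price"] = out["price"].where(out["price"] >= 0, np.nan)
    out["price"] = out["price"].fillna(out["price"].median())

    # Standardize numeric precision to two decimal places.
    out["price"] = out["price"].round(2)

    # Unify date strings into a single datetime dtype.
    out["date"] = pd.to_datetime(out["date"])

    # Strip stray whitespace from categorical identifiers.
    out["symbol"] = out["symbol"].str.strip()
    return out


# Toy data for illustration only.
raw = pd.DataFrame({
    "date": ["2025-01-02", "2025-01-03", "2025-01-06", "2025-01-07"],
    "symbol": [" AAA", "AAA ", "AAA", " AAA "],
    "price": [10.123, -1.0, 10.456, 10.789],
})
clean = clean_prices(raw)
print(clean)
```

The toy dates share one format. If the real timestamps mix several formats, pd.to_datetime on pandas 2.0 or later may need format="mixed" to parse them.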

Feature Engineering

Initially the pipeline generated dozens of technical indicators (e.g., MACD, Bollinger Bands, stochastic oscillators). After observing diminishing returns and overfitting, the team performed correlation analysis and feature importance ranking (using a preliminary Random Forest) to isolate the most predictive signals. The final feature set included:

Simple moving averages (windows of 5 and 10)

Relative Strength Index (RSI)

Trading volume and its 7‑day rolling mean

Price momentum (percentage change over 1, 3, 5 days)

Bid‑ask spread

This “less‑is‑more” approach reduced dimensionality and improved model interpretability.
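As a rough sketch, the final feature set could be computed with pandas as shown below. The column names (close, volume, bid, ask), the helper name build_features, and the 14‑period RSI window are assumptions for illustration; the source does not state the RSI window or the exact RSI variant the team used. The version here averages gains and losses with simple rolling means.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame, rsi_window: int = 14) -> pd.DataFrame:
    """Compute the compact feature set from a price table (hypothetical columns)."""
    feats = pd.DataFrame(index=df.index)
    close = df["close"]

    # Simple moving averages over 5 and 10 periods.
    for w in (5, 10):
        feats[f"sma_{w}"] = close.rolling(w).mean()

    # RSI from simple rolling means of gains and losses (window is an assumption).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    feats["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Trading volume and its 7-period rolling mean.
    feats["volume"] = df["volume"]
    feats["volume_ma_7"] = df["volume"].rolling(7).mean()

    # Price momentum: percentage change over 1, 3 and 5 periods.
    for d in (1, 3, 5):
        feats[f"momentum_{d}"] = close.pct_change(d)

    # Bid-ask spread.
    feats["spread"] = df["ask"] - df["bid"]
    return feats


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
n = 30
close = pd.Series(100 + rng.normal(0, 1, n).cumsum())
demo = pd.DataFrame({
    "close": close,
    "volume": rng.integers(1_000, 5_000, n),
    "bid": close - 0.05,
    "ask": close + 0.05,
})
features = build_features(demo)
print(features.tail())
```

The leading rows of each rolling feature are NaN until the window fills, so they would normally be dropped before training.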

Model Selection

The team evaluated three families of models:

Ensemble methods (Random Forest, Gradient Boosting, XGBoost)

Mixture‑of‑Experts (MoE) architectures

Recurrent neural networks (LSTM)

Under 5‑fold cross‑validation on the cleaned data, Random Forest gave the most stable validation performance, with an average R² ≈ 0.68 and the lowest variance across folds. The final hyper‑parameters were:

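from sklearn.ensemble import RandomForestRegressor

# max_depth, min_samples_split and min_samples_leaf are scikit-learn's defaults
rf = RandomForestRegressor(
    n_estimators=500,      # number of trees in the forest
    max_depth=None,        # no depth limit on individual trees
    min_samples_split=2,   # minimum samples needed to split a node
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    random_state=42,       # fixed seed for reproducible results
    n_jobs=-1              # use all available CPU cores
)
# X_train, y_train: training features and target from the cleaned data
rf.fit(X_train, y_train)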

More complex models did not yield significant gains and required substantially longer training time, leading to the strategic pivot toward the simpler, robust Random Forest.
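A comparison of this kind can be sketched with scikit‑learn's cross_val_score. The example below is only an illustration under stated assumptions: the data is synthetic, only two of the candidate models are included, and the tree count is reduced so it runs quickly. It does not reproduce the team's reported R² of about 0.68.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the engineered features and target (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)

# Two candidate models; n_estimators is lowered to keep the demo fast.
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# 5-fold cross-validation, scoring each fold by R^2.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    # Mean R^2 measures accuracy; the standard deviation measures fold-to-fold stability.
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The source does not say how the folds were formed. For time‑ordered price data, scikit‑learn's TimeSeriesSplit is a common alternative to shuffled K‑fold because it avoids training on observations that come after the validation period.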

Custom Evaluation Metric

Mean Squared Error (MSE) alone does not reflect the practical utility of stock‑price predictions for traders. The team designed a composite metric that balances hit‑rate for the top‑10 predicted returns with ranking alignment:

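import numpy as np
from scipy.stats import spearmanr

# Assumes pred_rank / true_rank hold each stock's rank by predicted / realized return (1 = highest)
pred_rank, true_rank = np.asarray(pred_rank), np.asarray(true_rank)
# Top-10 hit rate: share of the predicted top 10 that are also in the true top 10
hits = np.sum((pred_rank <= 10) & (true_rank <= 10)) / 10
# Spearman rank correlation
rank_corr, _ = spearmanr(pred_rank, true_rank)
# Composite score (equal weight)
custom_score = 0.5 * hits + 0.5 * rank_corr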

This metric better captures both absolute performance and the relative ordering of predictions, aligning evaluation with real‑world trading decisions.
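For reuse, the composite metric could be wrapped in a function that starts from raw predicted and realized returns rather than precomputed ranks. This is a sketch under assumptions: the function name composite_score and the parameters k and w_hit are hypothetical, and the hit rate is taken to be the overlap between the predicted and realized top‑k sets.

```python
import numpy as np
from scipy.stats import spearmanr


def composite_score(pred_returns, true_returns, k=10, w_hit=0.5):
    """Weighted mix of top-k hit rate and Spearman rank correlation."""
    pred = np.asarray(pred_returns, dtype=float)
    true = np.asarray(true_returns, dtype=float)

    # Indices of the k largest predicted and realized returns.
    pred_top = set(np.argsort(pred)[-k:])
    true_top = set(np.argsort(true)[-k:])

    # Share of the predicted top k that is also in the realized top k.
    hit_rate = len(pred_top & true_top) / k

    # Agreement between the two orderings over all stocks.
    rank_corr, _ = spearmanr(pred, true)
    return w_hit * hit_rate + (1 - w_hit) * rank_corr


# Two extreme cases on 20 hypothetical stocks.
true = np.arange(20.0)
perfect = composite_score(true, true)     # identical ordering
reversed_ = composite_score(-true, true)  # exactly inverted ordering
print(perfect, reversed_)
```

With equal weights the score ranges from -0.5 for a fully inverted ranking (hit rate 0, correlation -1) to 1.0 for a perfect one, which is worth keeping in mind when comparing scores across weightings.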

Key Takeaways for Practitioners

Invest substantial effort in data cleaning; subtle anomalies can dominate model error.

Prioritize feature relevance over quantity; correlation and importance analysis help prune redundant indicators.

Simple, well‑tuned models (e.g., Random Forest) often outperform sophisticated architectures when data volume is limited.

Design evaluation metrics that reflect the end‑user’s objectives rather than relying on generic loss functions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
