What We Learned from Winning 3rd Place in China’s 2025 Big Data Challenge
The Dalian University team’s third‑place finish in the 2025 China University Computer Competition’s Big Data Challenge revealed key lessons about data cleaning, focused feature engineering, the power of simple robust models like Random Forest, custom evaluation metrics, and the indispensable role of tight teamwork in data science projects.
Data Cleaning
The raw competition dataset contained numerous anomalies: negative price values, truncated decimal precision, and heterogeneous formatting of timestamps and identifiers. The team first flagged negative prices as invalid, converted them to NaN, and filled the gaps by median imputation. Next, they standardized numeric precision to two decimal places using round(value, 2) and unified date strings with pd.to_datetime. Categorical fields were stripped of whitespace and encoded consistently. These steps turned a chaotic source into a clean, analysis‑ready table.
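The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the team's code: the column names (date, stock_id, price) and the helper clean_prices are hypothetical, and the median is taken over the whole column for simplicity.

```python
import numpy as np
import pandas as pd


def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw price table; all column names are hypothetical."""
    out = df.copy()
    # Treat negative prices as missing, then fill them with the column median
    out["price"] = out["price"].where(out["price"] >= 0, np.nan)
    out["price"] = out["price"].fillna(out["price"].median())
    # Standardize numeric precision to two decimal places
    out["price"] = out["price"].round(2)
    # Parse date strings into a single datetime dtype
    out["date"] = pd.to_datetime(out["date"])
    # Strip whitespace from identifiers and encode them as one category set
    out["stock_id"] = out["stock_id"].astype(str).str.strip().astype("category")
    return out


# Tiny made-up sample with one negative price and untidy identifiers
raw = pd.DataFrame(
    {
        "date": ["2025-01-02", "2025-01-03", "2025-01-06", "2025-01-07"],
        "stock_id": [" A001", "A001 ", "A001", " A001 "],
        "price": [10.123, -1.0, 10.456, 10.789],
    }
)
clean = clean_prices(raw)
print(clean)
```

In a multi-stock table, a per-stock median (for example via groupby) would probably be a better fill value than the global one used here.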
Feature Engineering
Initially the pipeline generated dozens of technical indicators (e.g., MACD, Bollinger Bands, stochastic oscillators). After observing diminishing returns and overfitting, the team performed correlation analysis and feature importance ranking (using a preliminary Random Forest) to isolate the most predictive signals. The final feature set included:
Simple moving average (window = 5, 10)
Relative Strength Index (RSI)
Trading volume and its 7‑day rolling mean
Price momentum (percentage change over 1, 3, 5 days)
Bid‑ask spread
This “less‑is‑more” approach reduced dimensionality and improved model interpretability.
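As a rough sketch, the reduced feature set listed above could be computed with pandas as shown here. The column names (close, volume, bid, ask), the helper build_features, and the 14-day RSI window are assumptions for illustration; the source does not state the RSI window or the exact formulas the team used.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame, rsi_window: int = 14) -> pd.DataFrame:
    """Compute the reduced feature set for one stock, sorted by date.

    Assumed (hypothetical) input columns: close, volume, bid, ask.
    """
    feats = pd.DataFrame(index=df.index)
    close = df["close"]
    # Simple moving averages over 5 and 10 days
    for w in (5, 10):
        feats[f"sma_{w}"] = close.rolling(w).mean()
    # RSI from simple rolling means of gains and losses (one common variant)
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    feats["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    # Trading volume and its 7-day rolling mean
    feats["volume"] = df["volume"]
    feats["volume_ma_7"] = df["volume"].rolling(7).mean()
    # Price momentum: percentage change over 1, 3 and 5 days
    for k in (1, 3, 5):
        feats[f"momentum_{k}"] = close.pct_change(k)
    # Bid-ask spread
    feats["spread"] = df["ask"] - df["bid"]
    return feats


# Synthetic random-walk prices, used only to exercise the function
rng = np.random.default_rng(0)
n = 30
close = pd.Series(100 + rng.normal(0, 1, n).cumsum())
demo = pd.DataFrame(
    {
        "close": close,
        "volume": rng.integers(1_000, 5_000, n),
        "bid": close - 0.05,
        "ask": close + 0.05,
    }
)
features = build_features(demo).dropna()
print(features.head())
```

The leading rows are dropped because the rolling windows need a full history before they produce a value.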
Model Selection
The team evaluated three families of models:
Ensemble methods (Random Forest, Gradient Boosting, XGBoost)
Mixture‑of‑Experts (MOE) architectures
Recurrent neural networks (LSTM)
Under 5‑fold cross‑validation on the cleaned data, Random Forest gave the most stable validation results, with an average R² ≈ 0.68 and the lowest variance across folds. The final hyper‑parameters were:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)
More complex models did not yield significant gains and required substantially longer training time, leading to the strategic pivot toward the simpler, robust Random Forest.
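A comparison of this kind can be sketched with scikit-learn's cross_val_score, reporting both the mean R² and its spread across folds. The data below is synthetic (make_regression), the candidate list is shortened to two models, and the KFold settings are assumptions; the source does not say which splitter or settings the team used.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the cleaned feature matrix and target
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Two illustrative candidates; smaller forests keep the example fast
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    # Mean R2 reflects accuracy; the standard deviation reflects fold-to-fold stability
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

For time-ordered data such as stock prices, sklearn.model_selection.TimeSeriesSplit is a common alternative to shuffled folds, since it never trains on observations that come after the validation window.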
Custom Evaluation Metric
Mean Squared Error (MSE) alone does not reflect the practical utility of stock‑price predictions for traders. The team designed a composite metric that balances hit‑rate for the top‑10 predicted returns with ranking alignment:
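import numpy as np
from scipy.stats import spearmanr

# Assumes pred_rank[i] and true_rank[i] are NumPy arrays holding the
# predicted and true rank of stock i (1 = highest return)
# Top-10 hit rate: share of the predicted top 10 that is also in the true top 10
pred_top10 = np.flatnonzero(pred_rank <= 10)
true_top10 = np.flatnonzero(true_rank <= 10)
hits = np.intersect1d(pred_top10, true_top10).size / 10
# Spearman rank correlation over the full ranking
rank_corr, _ = spearmanr(pred_rank, true_rank)
# Composite score (equal weight)
custom_score = 0.5 * hits + 0.5 * rank_corr
This metric better captures both absolute performance and the relative ordering of predictions, aligning evaluation with real‑world trading decisions.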
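One way to package the composite metric as a reusable function is sketched below. It assumes the inputs are raw predicted and realized returns rather than precomputed ranks, and it measures the hit rate as the overlap between the two top-k sets. The names composite_score, k, and weight are illustrative and do not come from the team's code.

```python
import numpy as np
from scipy.stats import spearmanr


def composite_score(y_pred, y_true, k=10, weight=0.5):
    """Blend top-k hit rate with Spearman rank correlation."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Indices of the k largest predicted and realized returns
    pred_top = set(np.argsort(-y_pred)[:k])
    true_top = set(np.argsort(-y_true)[:k])
    hit_rate = len(pred_top & true_top) / k
    # Spearman correlation compares the full orderings
    rank_corr, _ = spearmanr(y_pred, y_true)
    return weight * hit_rate + (1 - weight) * rank_corr


# Sanity checks on 30 made-up returns: a perfect ordering gives
# 0.5 * 1 + 0.5 * 1, a fully reversed one gives 0.5 * 0 + 0.5 * (-1)
returns = np.linspace(-0.05, 0.05, 30)
perfect = composite_score(returns, returns)
reversed_ = composite_score(-returns, returns)
print(perfect, reversed_)
```

Because the Spearman term can be negative, the score ranges from -0.5 to 1.0 under equal weights rather than from 0 to 1.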
Key Takeaways for Practitioners
Invest substantial effort in data cleaning; subtle anomalies can dominate model error.
Prioritize feature relevance over quantity; correlation and importance analysis help prune redundant indicators.
Simple, well‑tuned models (e.g., Random Forest) often outperform sophisticated architectures when data volume is limited.
Design evaluation metrics that reflect the end‑user’s objectives rather than relying on generic loss functions.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
