Open-Source Kaggle Solution: Predicting Multi-Market Commodity Prices with Tree Models

An open-source, Kaggle‑ranked solution for the Mitsui Commodity Prediction Challenge details data preprocessing, feature engineering, and multiple tree‑based modeling strategies—including multi‑target, single‑target, and price‑difference models—with code, evaluation metrics, and suggestions for further improvements.

Sohu Tech Products
Sohu Tech Products
Sohu Tech Products
Open-Source Kaggle Solution: Predicting Multi-Market Commodity Prices with Tree Models

Kaggle Commodity Prediction Challenge Overview

The competition, hosted by Mitsui & Co., focuses on predicting multi‑market price spreads for commodities, metals, energy, stocks, and FX using time‑series data from LME, JPX, and US markets.

Data Description

Input data covers four global markets and includes standardized OHLCV metrics, settlement prices, and open interest where available. Core assets include:

Industrial metals (AH‑Aluminum, CA‑Copper, PB‑Lead, ZS‑Zinc)

Gold and platinum futures (Mini, Rolling‑Spot, Standard)

Rubber futures (RSS3)

US stock ETFs and individual stocks (e.g., ACWI, VT, VWO, GLD, SLV)

FX pairs (e.g., AUDJPY, EURUSD, USDJPY)

Most assets provide only closing prices; stocks and metals provide full OHLCV.

Data Preprocessing

All datasets are unified into a common OHLCV format before feature engineering.

import pandas as pd
import warnings
import os

warnings.filterwarnings("ignore", message="DeprecationWarning")
warnings.filterwarnings("ignore", message="See the caveats")
warnings.filterwarnings('ignore', category=FutureWarning, message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated.")

def preprocess(df):
    df = df.copy()
    df = df.rename(columns={'date_id': 'date'})
    result = pd.DataFrame(columns=['date','id','close','open','high','low','volume','sprice','interest'])
    # LME metals
    lme_metals = ['AH','CA','PB','ZS']
    for metal in lme_metals:
        temp_df = pd.DataFrame()
        temp_df['date'] = df['date']
        temp_df['id'] = f'LME_{metal}'
        temp_df['close'] = df[f'LME_{metal}_Close']
        temp_df[['open','high','low','volume','sprice','interest']] = None
        result = pd.concat([result, temp_df], ignore_index=True)
    # JPX futures
    jpx_products = {
        'Gold_Mini': ['Open','High','Low','Close','Volume','settlement_price','open_interest'],
        'Gold_Rolling-Spot': ['Open','High','Low','Close','Volume','settlement_price','open_interest'],
        'Gold_Standard': ['Open','High','Low','Close','Volume','open_interest'],
        'Platinum_Mini': ['Open','High','Low','Close','Volume','settlement_price','open_interest'],
        'Platinum_Standard': ['Open','High','Low','Close','Volume','open_interest'],
        'RSS3_Rubber': ['Open','High','Low','Close','Volume','settlement_price','open_interest']
    }
    for product, columns in jpx_products.items():
        temp_df = pd.DataFrame()
        temp_df['date'] = df['date']
        temp_df['id'] = f'JPX_{product}'
        if 'Close' in columns:
            temp_df['close'] = df[f'JPX_{product}_Futures_Close']
        if 'Open' in columns:
            temp_df['open'] = df[f'JPX_{product}_Futures_Open']
        if 'High' in columns:
            temp_df['high'] = df[f'JPX_{product}_Futures_High']
        if 'Low' in columns:
            temp_df['low'] = df[f'JPX_{product}_Futures_Low']
        if 'Volume' in columns:
            temp_df['volume'] = df[f'JPX_{product}_Futures_Volume']
        if 'settlement_price' in columns:
            temp_df['sprice'] = df[f'JPX_{product}_Futures_settlement_price']
        if 'open_interest' in columns:
            temp_df['interest'] = df[f'JPX_{product}_Futures_open_interest']
        result = pd.concat([result, temp_df], ignore_index=True)
    # US stocks
    us_stocks = ['ACWI','AEM','AG','AGG','ALB','AMP','BCS','BKR','BND','BNDX','BP','BSV','CAT','CCJ','CLF','COP','CVE','CVX','DE','DVN','EEM','EFA','EMB','ENB','EOG','EWJ','EWT','EWY','EWZ','FCX','FNV','FXI','GDX','GDXJ','GLD','GOLD','HAL','HES','HL','IAU','IEF','IEMG','IGSB','JNK','KGC','KMI','LQD','LYB','MBB','MPC','MS','NEM','NUE','NUGT','OIH','OKE','OXY','PAAS','RIO','RSP','RY','SCCO','SHEL','SHY','SLB','SLV','SPIB','SPTL','SPYV','STLD','TD','TECK','TIP','TRGP','URA','VALE','VCIT','VCSH','VEA','VGIT','VGK','VGLT','VGSH','VT','VTV','VWO','VXUS','VYM','WMB','WPM','X','XLB','XLE','XOM','YINN']
    for stock in us_stocks:
        temp_df = pd.DataFrame()
        temp_df['date'] = df['date']
        temp_df['id'] = f'US_Stock_{stock}'
        temp_df['close'] = df[f'US_Stock_{stock}_adj_close']
        temp_df['open'] = df[f'US_Stock_{stock}_adj_open']
        temp_df['high'] = df[f'US_Stock_{stock}_adj_high']
        temp_df['low'] = df[f'US_Stock_{stock}_adj_low']
        temp_df['volume'] = df[f'US_Stock_{stock}_adj_volume']
        temp_df[['sprice','interest']] = None
        result = pd.concat([result, temp_df], ignore_index=True)
    # FX pairs (close only)
    fx_pairs = ['AUDJPY','AUDUSD','CADJPY','CHFJPY','EURAUD','EURGBP','EURJPY','EURUSD','GBPAUD','GBPJPY','GBPUSD','NZDJPY','NZDUSD','USDCHF','USDJPY','ZARJPY','ZARUSD','NOKUSD','NOKEUR','CADUSD','AUDNZD','EURCHF','EURCAD','AUDCAD','GBPCHF','EURNZD','AUDCHF','GBPNZD','GBPCAD','CADCHF','NZDCAD','NZDCHF','ZAREUR','NOKGBP','NOKCHF','ZARCHF','NOKJPY','ZARGBP']
    for pair in fx_pairs:
        temp_df = pd.DataFrame()
        temp_df['date'] = df['date']
        temp_df['id'] = f'FX_{pair}'
        temp_df['close'] = df[f'FX_{pair}']
        temp_df[['open','high','low','volume','sprice','interest']] = None
        result = pd.concat([result, temp_df], ignore_index=True)
    result = result.reset_index(drop=True)
    return result

class CFG:
    if os.path.exists('./mitsui-commodity-prediction-challenge'):
        input_path = './mitsui-commodity-prediction-challenge/'
    else:
        input_path = '/kaggle/input/mitsui-commodity-prediction-challenge/'

config = CFG()

df = pd.read_csv(f'{config.input_path}/train.csv')

df_processed = preprocess(df)

Modeling Strategies

Three simple approaches are provided:

Multi‑target tree model predicting all 424 labels simultaneously (CatBoost with MultiRMSE loss).

Single‑target tree model predicting each asset’s return first, then computing differences.

Direct price‑difference tree model that concatenates features of paired samples.

Evaluation Metric

The metric computes daily Spearman rank correlation between predictions and true values, then calculates the Sharpe ratio of these correlations (risk‑adjusted return).

Feature Engineering

Basic features include price ratios, technical indicators, and rolling statistics.

import pandas as pd
import numpy as np

def create_features(df, windows=[5,10,20]):
    df = df.sort_values(['id','date']).reset_index(drop=True)
    grouped = df.groupby('id')
    features_list = []
    for asset_id, group in grouped:
        group = group.copy()
        for col1 in ['close','open','high','low']:
            for col2 in ['close','open','high','low']:
                if col1 > col2:
                    group[f'{col1}/{col2}'] = (group[col1]-group[col2])/(group[col1]+group[col2])
        group['open/close_shift1'] = group['open']/group['close'].shift(1)
        for window in windows:
            group[f'ret_{window}'] = group['close']/group['close'].shift(window) - 1
            group[f'vol_{window}'] = (group['close']/group['close'].shift(1) - 1).rolling(window).std()
            group[f'volume_{window}'] = group['volume'].rolling(window).mean()/group['volume'].rolling(window*2).mean()
            group[f'technical1_{window}'] = (group['close']>group['high'].ffill().shift(1)).astype(float) - (group['close']<group['low'].ffill().shift(1)).astype(float)
            group[f'technical2_{window}'] = (group['low']>group['high'].ffill().shift(1)).astype(float) - (group['high']<group['low'].ffill().shift(1)).astype(float)
        if 'sprice' in group.columns:
            group['sprice_change'] = group['sprice']/group['sprice'].ffill().shift(1) - 1
            group['premium_discount'] = (group['close']-group['sprice'])/group['sprice']
        if 'interest' in group.columns:
            group['volume_interest_ratio'] = group['volume']/(group['interest']+1)
        features_list.append(group)
    features_df = pd.concat(features_list, ignore_index=True)
    return features_df

df_features = create_features(df_processed)

Model Training Example

from catboost import CatBoostRegressor
model = CatBoostRegressor(
    loss_function='MultiRMSE',
    task_type='GPU',
    reg_lambda=2000
)
# model.fit(...)  # training code omitted for brevity

Model Fusion and Submission

A unified BaseModel interface (preprocess, train, predict) is defined so different models can be combined and submitted consistently.

Summary and Future Improvements

The solution unifies preprocessing and feature engineering across diverse assets and applies tree models in three ways, but challenges remain, especially for assets with only closing prices. Future work may include richer fundamental and alternative data, more advanced feature engineering, deep‑learning models, finer cross‑validation, and high‑frequency market micro‑structure features.

MachineLearningKaggleTimeSeriesCatBoostFeatureEngineeringCommodity
Sohu Tech Products
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.