How to Boost Kaggle NLP Scores with BERT, Tree Models, and Smart Post‑Processing
The article analyzes a recent Kaggle essay‑segmentation competition, explains why standard BERT‑based models plateau, and shows how a two‑stage pipeline that combines coarse BERT filtering with a feature‑rich tree model and post‑processing scaling can push scores well beyond the 70‑point barrier.
Task Overview
The Kaggle competition required participants to segment student essays automatically and to label each sentence with its argumentative component (claim, evidence, conclusion, etc.). Because every sentence receives a label, the BIO tagging scheme contains only B‑ and I‑tags; there is no "O" tag.
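As a minimal sketch of the tagging scheme described above, the snippet below expands labeled segments into per-token B-/I- tags and recovers the spans again. The function names and the (label, token_count) input format are assumptions made for illustration; they are not taken from the competition data format.

```python
from typing import List, Tuple


def segments_to_bio(segments: List[Tuple[str, int]]) -> List[str]:
    """Expand (label, token_count) segments into per-token B-/I- tags.

    The segments are assumed to cover the whole essay, so no "O" tag is emitted.
    """
    tags: List[str] = []
    for label, n_tokens in segments:
        if n_tokens <= 0:
            raise ValueError("each segment needs at least one token")
        tags.append(f"B-{label}")                      # first token opens the segment
        tags.extend([f"I-{label}"] * (n_tokens - 1))   # remaining tokens continue it
    return tags


def bio_to_segments(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Recover (label, start, end) spans from B-/I- tags; end is exclusive."""
    spans: List[List] = []
    for i, tag in enumerate(tags):
        prefix, label = tag.split("-", 1)
        # A B- tag, or an I- tag whose label differs from the open span, starts a new span.
        if prefix == "B" or not spans or spans[-1][0] != label:
            spans.append([label, i, i + 1])
        else:
            spans[-1][2] = i + 1
    return [tuple(span) for span in spans]


example_tags = segments_to_bio([("Claim", 3), ("Evidence", 2)])
example_spans = bio_to_segments(example_tags)
```

Here example_tags holds one tag per token and example_spans gives back the two original segments with their token offsets.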
Baseline Approaches
Initial solutions applied transformer models such as BERT, along with long‑text variants such as Longformer and BigBird, to predict segment boundaries directly. These models quickly saturated at a score in the low 70s, indicating that pure semantic modeling was insufficient for further gains.
Post‑processing Scaling Trick
A common post‑processing step rescales the predicted class probabilities before final decision making. By applying class‑specific scaling factors (e.g., multiplying the probability of the "B‑Claim" class by 1.2 while reducing the "B‑Evidence" probability), the adjusted predictions better match the competition’s evaluation metric, yielding a modest improvement.
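The rescaling step can be sketched in a few lines. The 1.2 factor for "B-Claim" comes from the example above; the 0.9 factor for "B-Evidence" and the probability values are hypothetical, and in practice such factors would presumably be tuned on held-out data rather than set by hand.

```python
from typing import Dict, Tuple


def rescale_and_pick(
    probs: Dict[str, float], scale: Dict[str, float]
) -> Tuple[str, Dict[str, float]]:
    """Multiply each class probability by its factor, renormalize, and take the argmax.

    Labels missing from `scale` keep a factor of 1.0.
    """
    scaled = {label: p * scale.get(label, 1.0) for label, p in probs.items()}
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("scaled probabilities must sum to a positive value")
    normalized = {label: value / total for label, value in scaled.items()}
    best = max(normalized, key=normalized.get)
    return best, normalized


# Hypothetical model output for one token.
probs = {"B-Claim": 0.40, "B-Evidence": 0.45, "I-Claim": 0.15}
# 1.2 is the factor quoted in the text; 0.9 is an illustrative assumption.
scale = {"B-Claim": 1.2, "B-Evidence": 0.9}

raw_choice = max(probs, key=probs.get)
scaled_choice, scaled_probs = rescale_and_pick(probs, scale)
```

In this toy case the unscaled argmax is "B-Evidence", while the rescaled prediction flips to "B-Claim", which is the kind of shift the trick relies on.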
Two‑Stage Pipeline
Stage 1 – Coarse BERT Filter
A lightweight BERT variant (often referred to as BERT‑689) is fine‑tuned to produce a high‑recall set of candidate labels. Because the second stage can only choose among the candidates it receives, higher recall at this stage raises the theoretical upper bound for the final system.
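One way to picture the high-recall filter is a deliberately low probability threshold, so that several candidate labels survive per sentence. The sketch below is an assumption about how such a filter could look, not the competition solution; the threshold value, function names, and toy probabilities are all hypothetical. The candidate recall it computes is the ceiling that any second stage choosing among those candidates can reach.

```python
from typing import Dict, List


def candidate_labels(probs: Dict[str, float], threshold: float = 0.1) -> List[str]:
    """Keep every label whose probability clears a low threshold."""
    kept = [label for label, p in probs.items() if p >= threshold]
    # Fall back to the argmax so no sentence ends up without candidates.
    return kept or [max(probs, key=probs.get)]


def candidate_recall(
    all_probs: List[Dict[str, float]], gold: List[str], threshold: float = 0.1
) -> float:
    """Fraction of sentences whose gold label survives the filter."""
    hits = sum(
        g in candidate_labels(p, threshold) for p, g in zip(all_probs, gold)
    )
    return hits / len(gold)


# Hypothetical stage-1 probabilities for three sentences.
all_probs = [
    {"Claim": 0.60, "Evidence": 0.30, "Conclusion": 0.10},
    {"Claim": 0.50, "Evidence": 0.45, "Conclusion": 0.05},
    {"Claim": 0.05, "Evidence": 0.90, "Conclusion": 0.05},
]
gold = ["Claim", "Evidence", "Conclusion"]

top1_accuracy = sum(max(p, key=p.get) == g for p, g in zip(all_probs, gold)) / len(gold)
upper_bound = candidate_recall(all_probs, gold, threshold=0.1)
```

On this toy data the top-1 prediction is right for one sentence out of three, while the gold label is among the candidates for two out of three, so the second stage has room to improve on the filter alone.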
Stage 2 – Gradient‑Boosted Decision Tree (GBDT)
The second stage trains a tree‑based model (e.g., XGBoost or LightGBM) on engineered, non‑semantic features derived from the raw essay layout. The GBDT model outputs the final BIO tags. An open‑source implementation that combines BERT‑689 with a GBDT model achieved a top‑3 ranking on the leaderboard.
Feature Engineering Details
Sentence length at the beginning or end of a paragraph (short sentences often indicate a claim or summary).
Presence of explicit line‑break markers, which signal paragraph boundaries.
Relative position of a sentence within the essay (e.g., first 5 sentences, last 5 sentences).
Paragraph‑level statistics such as average sentence length and variance.
Binary flags for whether a sentence is the first or last in its paragraph.
These cues are strong predictors of argumentative structure but are ignored by pure transformer models that focus solely on semantic content.
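The layout cues listed above can be computed without any model. The following is a minimal sketch using only the standard library; the feature names, the naive punctuation-based sentence splitter, and the assumption that a newline marks a paragraph break are illustrative choices, not details from the original solution.

```python
import re
from statistics import mean, pvariance
from typing import Dict, List


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def layout_features(essay: str, edge: int = 5) -> List[Dict[str, float]]:
    """Return one dict of non-semantic layout features per sentence."""
    # Assumption: each non-empty line of the raw essay is one paragraph.
    paragraphs = [split_sentences(p) for p in essay.split("\n") if p.strip()]
    total = sum(len(p) for p in paragraphs)
    rows: List[Dict[str, float]] = []
    index = 0  # running sentence index across the whole essay
    for p_idx, sentences in enumerate(paragraphs):
        lengths = [len(s.split()) for s in sentences]
        for s_idx, length in enumerate(lengths):
            rows.append({
                "n_words": length,
                "is_first_in_par": int(s_idx == 0),
                "is_last_in_par": int(s_idx == len(sentences) - 1),
                "after_line_break": int(s_idx == 0 and p_idx > 0),
                "rel_position": index / max(total - 1, 1),
                "in_first_k": int(index < edge),
                "in_last_k": int(index >= total - edge),
                "par_mean_len": mean(lengths),
                "par_var_len": pvariance(lengths),
            })
            index += 1
    return rows


essay = (
    "Schools should start later. Teens need sleep.\n"
    "Studies show later starts raise grades. Attendance improves too. "
    "So the change is worth it."
)
rows = layout_features(essay)
```

Each row can then be fed to a tree model next to whatever scores the transformer stage produces, since none of these features depend on the wording of the sentence.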
Practical Usage
There are two common ways to exploit the engineered features:
Feature‑augmented model: Concatenate the handcrafted features with the transformer’s hidden states and fine‑tune the combined representation.
Two‑stage pipeline (recommended): Keep the transformer as a high‑recall filter and let the GBDT model handle the final classification, leveraging its strength with heterogeneous, non‑semantic inputs.
This combination of deep‑learning semantic representations and lightweight, feature‑rich tree models provides a practical strategy to break through score plateaus in essay‑structure NLP tasks.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
