Build a CART Decision Tree from Scratch in Python – Full Step‑by‑Step Guide
This article walks through a complete from-scratch Python implementation of the CART (Classification and Regression Tree) algorithm, applied to the publicly available Banknote authentication dataset. It covers data loading, cross-validation splitting, Gini impurity calculation, recursive tree construction, prediction, and performance evaluation, with concrete code examples.
Data Loading and Preparation
The CSV file is read with a simple load_csv function, after which each column is converted from string to float via str_column_to_float. The dataset is then ready for processing.
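A minimal sketch of these two helpers, consistent with the description above, might look like this:

```python
from csv import reader

def load_csv(filename):
    """Read a CSV file into a list of rows (lists of strings)."""
    dataset = []
    with open(filename, 'r') as file:
        for row in reader(file):
            if row:  # skip blank lines
                dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    """Convert one column of the dataset from string to float, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())
```

For the Banknote dataset, every column (four features plus the class label) is numeric, so the conversion is applied to each column index in turn.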
Cross‑Validation Split
A custom cross_validation_split function creates n_folds random folds of equal size, sampling rows without replacement (any remainder after integer division is discarded). This enables a more robust evaluation of the model than a single train/test split.
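One way to implement this fold split, as a sketch:

```python
from random import randrange

def cross_validation_split(dataset, n_folds):
    """Split the dataset into n_folds random folds of equal size,
    sampling rows without replacement."""
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds  # leftover rows are discarded
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            # pop a random remaining row so no row appears in two folds
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        folds.append(fold)
    return folds
```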
Accuracy Metric
The accuracy_metric function computes the percentage of correctly predicted class labels.
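This metric is a one-liner in spirit:

```python
def accuracy_metric(actual, predicted):
    """Percentage of class labels predicted correctly."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0
```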
Gini Index Calculation
The gini_index function evaluates the impurity of a split. It first counts total instances, then for each group computes the weighted Gini score using class probabilities, finally aggregating the weighted scores.
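A sketch of the calculation described above. A pure split scores 0.0; for two classes, the worst possible split scores 0.5:

```python
def gini_index(groups, classes):
    """Weighted Gini impurity of a candidate split.

    groups:  the lists of rows produced by the split
    classes: the distinct class labels (last column of each row)
    """
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:  # avoid division by zero for an empty group
            continue
        # sum of squared class proportions within the group
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group's impurity by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
```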
Best Split Selection
get_split iterates over every attribute and every candidate split value taken from the data, uses test_split to partition the rows, and selects the split with the lowest Gini index. The result is a dictionary containing the best attribute index, split value, Gini score, and the resulting groups.
Terminal Node Creation
to_terminal returns the most frequent class label in a group, forming a leaf node.
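The majority vote can be expressed compactly:

```python
def to_terminal(group):
    """Return the most frequent class label in a group (the leaf prediction)."""
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)
```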
Recursive Tree Building
The split function recursively builds the tree. It creates a terminal node when a split leaves one group empty, when the maximum depth is reached, or when a group contains no more than min_size samples. Otherwise, it creates left and right child nodes by calling get_split on each group and recurses deeper.
Tree Construction
build_tree initiates the process by finding the root split and calling split with the user-defined max_depth and min_size.
Prediction
The predict function traverses the tree for a given row, comparing the row’s attribute value with the node’s split value, and follows left or right branches until a terminal node is reached.
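Because leaves are plain labels and internal nodes are dictionaries, the traversal only has to check the node type:

```python
def predict(node, row):
    """Walk the tree until a terminal (non-dict) node is reached."""
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        return node['right']
```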
Decision‑Tree Classifier
decision_tree builds the tree on the training set and generates predictions for each row in the test set.
Model Evaluation
Using a fixed random seed, the script loads the dataset from data_banknote_authentication.csv, converts all columns to floats, and evaluates the CART model with 5‑fold cross‑validation, a maximum depth of 5, and a minimum node size of 10. The resulting accuracy scores and mean accuracy are printed.
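The evaluation harness can be sketched as below. The cross_validation_split and accuracy_metric helpers are repeated, and a trivial majority-class algorithm stands in for decision_tree so the sketch runs standalone:

```python
from random import randrange

def cross_validation_split(dataset, n_folds):
    # random equal-size folds, sampled without replacement (earlier section)
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        folds.append(fold)
    return folds

def accuracy_metric(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    """k-fold cross-validation: train on k-1 folds, score the held-out fold."""
    folds = cross_validation_split(dataset, n_folds)
    scores = []
    for fold in folds:
        train_set = [row for f in folds if f is not fold for row in f]
        # hide the labels of the held-out fold from the algorithm
        test_set = [row[:-1] + [None] for row in fold]
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        scores.append(accuracy_metric(actual, predicted))
    return scores

def majority_class(train, test):
    # stand-in for decision_tree: always predict the most common training label
    labels = [row[-1] for row in train]
    guess = max(set(labels), key=labels.count)
    return [guess for _ in test]
```

With the real decision_tree from the sections above, the driver would be roughly `scores = evaluate_algorithm(dataset, decision_tree, 5, 5, 10)`, matching the 5 folds, max_depth of 5, and min_size of 10 described here.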
Execution Result
The printed output shows the list of accuracy scores for each fold and the overall mean accuracy, confirming that the implementation works as expected.
GitHub repository containing the full source code: https://github.com/fengbingchun/NN_Test