Genetic Algorithms: Theory, Steps, and Practical Implementation with TPOT for Data Science
This article introduces genetic algorithms, explains their biological inspiration, details each step of the algorithm, demonstrates solving the knapsack problem, and provides a complete Python implementation using the TPOT library for feature selection and regression on the Big Mart Sales dataset.
Introduction
Recently, an article titled "Introduction to Genetic Algorithm & Their Application in Data Science" was published on Analytics Vidhya, where the author Shubham Jain gave a concise overview of genetic algorithms and listed several real‑world applications, especially in data science.
A few days ago I tackled a practical problem: the Big Mart (supermarket) sales prediction competition. After building a few simple models and doing some feature engineering, I ranked 219th on the leaderboard. Wanting to improve, I turned to genetic algorithms.
Using only a genetic algorithm, I jumped from 219th place straight to 15th, which convinced me of the power of evolutionary optimization.
1. Origin of Genetic Algorithm Theory
"Those who survive are not necessarily the strongest or the smartest, but the most adaptable to the environment." – Charles Darwin
The whole concept of genetic algorithms rests on this idea of adaptability. To illustrate, imagine a king who wants to fill his kingdom with good citizens. He cannot change people directly, but he can control who reproduces:
Identify all good individuals and let them reproduce.
Repeat the process for several generations.
The population gradually becomes dominated by the desirable traits.
2. Biological Inspiration
Every cell contains a set of chromosomes, which are strands of DNA. In a genetic algorithm, a chromosome is conventionally represented as a binary string of 0s and 1s. Each gene (a segment of DNA) encodes a specific trait, such as hair or eye colour.
A chromosome consists of genes; each gene encodes a unique characteristic. Understanding this biological background sets the stage for defining genetic algorithms.
3. Definition of Genetic Algorithm
Summarising the earlier example, the algorithm works as follows:
Set the initial population size.
Define a fitness function that distinguishes good individuals from bad ones.
Select the good individuals and let them reproduce.
Replace part of the population with the offspring and repeat the process.
In essence, a genetic algorithm mimics the evolutionary process to search for optimal solutions.
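The four steps above can be sketched as a self-contained toy genetic algorithm in Python. The problem here (maximize the number of 1-bits, often called OneMax) and all parameter values are illustrative choices, not part of any library:

```python
import random

def evolve(pop_size=20, length=10, generations=30, seed=0):
    """Toy genetic algorithm maximizing the number of 1-bits (OneMax)."""
    rng = random.Random(seed)
    fitness = lambda chrom: sum(chrom)  # fitness = count of 1-bits
    # 1. initial population of random binary chromosomes
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. rank by fitness and keep the better half as parents
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # 3. reproduce via single-point crossover, with a small mutation chance
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, length)
            child = a[:point] + b[point:]
            if rng.random() < 0.1:            # mutation: flip one random bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        # 4. replace the weaker half of the population with the offspring
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best, sum(best))
```

Because the parents survive each generation unchanged, the best fitness never decreases, and the population quickly converges toward the all-ones chromosome.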
4. Detailed Steps of Genetic Algorithm
4.1 Initialization
For the knapsack problem, the initial population consists of several binary chromosomes, where each bit indicates whether an item is included (1) or excluded (0).
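A minimal initializer for such a binary population might look like this (the population size and item count are arbitrary illustrations):

```python
import random

def init_population(pop_size, n_items, seed=42):
    """Each chromosome is a list of n_items bits: 1 = item packed, 0 = left out."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_items)] for _ in range(pop_size)]

population = init_population(pop_size=6, n_items=6)
for chromosome in population:
    print(chromosome)
```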
4.2 Fitness Function
The fitness of a chromosome is the total survival points (value) of the items it selects; a chromosome whose items exceed the knapsack's weight limit is assigned the lowest possible fitness. For example, chromosome A1 = [100110] receives a higher fitness score than A2 = [001110] because the items it packs are worth more survival points.
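One way to code such a fitness function is to sum the survival points of the packed items and zero out any chromosome that exceeds the weight limit. The point values, weights, and capacity below are invented for illustration:

```python
values = [10, 5, 15, 7, 6, 18]    # survival points per item (illustrative)
weights = [2, 3, 5, 7, 1, 4]      # weight per item (illustrative)
CAPACITY = 10                     # knapsack weight limit (illustrative)

def fitness(chromosome):
    """Total value of the packed items; infeasible chromosomes score zero."""
    total_value = sum(v for v, bit in zip(values, chromosome) if bit)
    total_weight = sum(w for w, bit in zip(weights, chromosome) if bit)
    return total_value if total_weight <= CAPACITY else 0

a1 = [1, 0, 0, 1, 1, 0]
a2 = [0, 0, 1, 1, 1, 0]
print(fitness(a1), fitness(a2))  # a1 is feasible; a2 is over the weight limit
```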
4.3 Selection
Individuals are selected for mating using roulette‑wheel selection, where each chromosome occupies a slice of the wheel proportional to its fitness, so fitter chromosomes are more likely to be picked. A stochastic universal sampling variant uses several evenly spaced pointers to pick multiple parents in a single spin.
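A sketch of a single roulette-wheel spin, assuming fitness values are non-negative:

```python
import random

def roulette_select(population, fitnesses, rng):
    """Spin the wheel once: each chromosome's slice is proportional to its fitness."""
    total = sum(fitnesses)
    spin = rng.uniform(0, total)
    cumulative = 0.0
    for chromosome, fit in zip(population, fitnesses):
        cumulative += fit
        if spin <= cumulative:
            return chromosome
    return population[-1]  # guard against floating-point rounding at the edge

rng = random.Random(1)
pop = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
fits = [8, 3, 1]
picks = [roulette_select(pop, fits, rng) for _ in range(1000)]
print(picks.count([1, 1, 0]) / 1000)  # roughly 8/12 of the spins
```

Over many spins, the fittest chromosome (fitness 8 out of a total of 12) is returned about two-thirds of the time, which is exactly the selection pressure the wheel is meant to apply.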
4.4 Crossover
Selected parents exchange genetic material. The simplest form is single‑point crossover, where a random crossover point is chosen and the segments before and after that point are swapped, producing new offspring.
Using two crossover points yields multi‑point crossover, which can increase diversity.
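Both operators can be sketched in a few lines; using all-ones and all-zeros parents makes the swapped segments easy to see:

```python
import random

def single_point_crossover(a, b, rng):
    """Swap the tails of two parents after one random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def two_point_crossover(a, b, rng):
    """Swap the middle segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

rng = random.Random(0)
p1, p2 = [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]
c1, c2 = single_point_crossover(p1, p2, rng)
print(c1, c2)  # complementary children: each position keeps one 1 and one 0
```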
4.5 Mutation
During reproduction, random mutations may flip bits in a chromosome, introducing new genetic material and preventing premature convergence.
After mutation, the new individuals replace less fit members of the population. The algorithm stops when a termination condition is met, such as a maximum number of generations, stagnation, or reaching a predefined fitness threshold.
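A minimal bit-flip mutation, with the per-bit rate left as a free parameter:

```python
import random

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

rng = random.Random(3)
child = [1, 0, 1, 1, 0, 1]
print(mutate(child, rate=0.1, rng=rng))
print(mutate(child, rate=0.0, rng=rng))  # rate 0 leaves the chromosome unchanged
```

Keeping the rate small (often around 1/length) injects enough novelty to escape local optima without degenerating into a random search.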
5. Applications of Genetic Algorithms
5.1 Feature Selection
In data‑science competitions, selecting the most predictive features is crucial. Genetic algorithms can encode feature inclusion as a binary chromosome (1 = keep feature, 0 = discard) and use model accuracy as the fitness function.
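As a rough sketch of that idea (not how TPOT works internally), the fitness of a feature mask can be the cross-validated score of a model trained on only the selected columns. The dataset and model below are arbitrary illustrations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def feature_fitness(mask):
    """Fitness of a binary feature mask = mean CV accuracy on the kept columns."""
    if not any(mask):
        return 0.0  # selecting no features is the worst possible chromosome
    cols = [i for i, bit in enumerate(mask) if bit]
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, cols], y, cv=3).mean()

full_mask = [1] * X.shape[1]  # keep every feature
print(round(feature_fitness(full_mask), 3))
```

A genetic algorithm would then evolve a population of such masks, selecting and recombining the ones with the highest cross-validated scores.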
5.2 Implementing with TPOT
TPOT (Tree‑based Pipeline Optimisation Technique) automates machine‑learning pipeline design using genetic programming. Below is a complete Python workflow that installs the required packages, preprocesses the Big Mart Sales dataset, and runs TPOT to find an optimal regression pipeline.
<code># install the dependencies from the command line (not inside Python):
#   pip install deap update_checker tqdm
#   pip install tpot
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
# load the Big Mart Sales data (adjust the file names to your download)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# preprocessing (example snippets)
train['Item_Weight'].fillna(train['Item_Weight'].mean(), inplace=True)
test['Item_Weight'].fillna(test['Item_Weight'].mean(), inplace=True)
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['low fat','LF'], ['Low Fat','Low Fat'])
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['reg'], ['Regular'])
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['low fat','LF'], ['Low Fat','Low Fat'])
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['reg'], ['Regular'])
# ... (additional preprocessing omitted for brevity)
# prepare data for TPOT
tpot_train = train.drop(['Outlet_Identifier','Item_Type','Item_Identifier'], axis=1)
tpot_test = test.drop(['Outlet_Identifier','Item_Type','Item_Identifier'], axis=1)
target = tpot_train['Item_Outlet_Sales']
tpot_train.drop('Item_Outlet_Sales', axis=1, inplace=True)
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tpot_train, target, train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_bigmart_pipeline.py')  # write the best pipeline out as a Python script
# predict on test set
tpot_pred = tpot.predict(tpot_test)
sub1 = pd.DataFrame(data=tpot_pred, columns=['Item_Outlet_Sales'])
sub1['Item_Identifier'] = test['Item_Identifier']
sub1['Outlet_Identifier'] = test['Outlet_Identifier']
sub1 = sub1[['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales']]
sub1.to_csv('tpot.csv', index=False)
</code>
After the pipeline finishes, the exported script shows that an ExtraTreeRegressor achieved the best performance for this problem.
6. Real‑world Applications
6.1 Engineering Design
Genetic algorithms are widely used in engineering design to optimise structures, reduce cost, and accelerate development cycles.
Paper: "Engineering design using genetic algorithms"
Link: http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=16942&context=rtd
6.2 Travelling Salesman Problem (Routing)
The classic TSP can be tackled with genetic algorithms to find near‑optimal routes for logistics and shipping.
6.3 Robotics
Genetic algorithms help auto‑tune motion control parameters for mobile robots, enabling them to learn complex tasks such as cooking or cleaning.
Paper: "Genetic Algorithms for Auto‑tuning Mobile Robot Motion Control"
Link: https://pdfs.semanticscholar.org/7c8c/faa78795bcba8e72cd56f8b8e3b95c0df20c.pdf
7. Conclusion
By now you should have a solid understanding of genetic algorithms and be able to apply the TPOT library to your own data‑science projects. However, true mastery comes from hands‑on experimentation.
We encourage you to try these techniques in competitions or real‑world problems to fully appreciate their power.