Can You Predict Speed‑Dating Success? A Data‑Driven AI Analysis

This article explores the classic Speed Dating dataset, performing data cleaning, exploratory analysis of match rates, gender and age effects, correlation studies, and finally building a logistic regression model with SVMSMOTE oversampling to predict matchmaking success, achieving around 83% accuracy.

Alibaba Cloud Developer

Sep 9, 2020

This article presents a data‑driven exploration of the Speed Dating dataset, aiming to assess matchmaking success and build a predictive model.

The data were collected from offline speed‑dating experiments, comprising over 8,000 sessions and nearly 200 features covering demographics, personal ratings, and preferences.

Data loading and initial inspection were performed with pandas:

import pandas as pd

df = pd.read_csv('Speed Dating Data.csv', encoding='gbk')
print(df.shape)

Missing‑value analysis revealed several incomplete columns, prompting a review of data completeness:

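# Share of missing values per column, in percent
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})
# Print explicitly so the table also appears outside a notebook
print(missing_value_df.sort_values(by='percent_missing'))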
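
As a follow-up to the missing-value table, the sketch below lists only the columns whose missing share reaches a chosen cutoff. The helper name `missing_report`, the 50% threshold, and the toy DataFrame are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, threshold: float = 50.0) -> pd.DataFrame:
    """Return columns whose share of missing values is at least `threshold` percent."""
    percent_missing = frame.isnull().mean() * 100
    report = percent_missing.rename("percent_missing").to_frame()
    flagged = report[report["percent_missing"] >= threshold]
    return flagged.sort_values("percent_missing", ascending=False)


# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "age": [21, 25, np.nan, 30],
    "attr_o": [7, np.nan, np.nan, 8],
    "match": [0, 1, 0, 0],
})

sparse_columns = missing_report(toy, threshold=50.0)
print(sparse_columns)
```

On the real data, the same call would be `missing_report(df, threshold=50.0)`; the right cutoff depends on which columns the later analysis needs.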

Exploratory analysis showed an overall match rate of only 16.47%:

import matplotlib.pyplot as plt
from palettable.colorbrewer.qualitative import Pastel1_3

# value_counts() sorts by frequency, so index 0 is the larger "no match" group
size_of_groups = df.match.value_counts().values
single_percentage = round(size_of_groups[0] / sum(size_of_groups) * 100, 2)
matched_percentage = round(size_of_groups[1] / sum(size_of_groups) * 100, 2)
names = ['Single: ' + str(single_percentage) + '%', 'Matched: ' + str(matched_percentage) + '%']
plt.pie(size_of_groups, labels=names, labeldistance=1.2, colors=Pastel1_3.hex_colors)
plt.show()
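
The percentages above depend on `value_counts()` returning the larger class first. A slightly more defensive variant, sketched here on a hypothetical toy column, looks each class up by its label instead of by position.

```python
import pandas as pd

# Toy stand-in for df['match'] (hypothetical values): three non-matches, one match
toy = pd.DataFrame({"match": [0, 0, 0, 1]})

# normalize=True returns shares instead of raw counts
shares = toy["match"].value_counts(normalize=True).mul(100).round(2)

# Look up each class by label so the result does not depend on sort order
unmatched_pct = shares.get(0, 0.0)
matched_pct = shares.get(1, 0.0)

labels = [f"Single: {unmatched_pct}%", f"Matched: {matched_pct}%"]
print(labels)
```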

The gender‑specific analysis indicated that females have a slightly higher success probability (about 4% advantage):

# Female participants (gender == 0)
female_groups = df[df.gender == 0].match.value_counts().values
# Male participants (gender == 1)
male_groups = df[df.gender == 1].match.value_counts().values
# Compute percentages for each group as above and plot one pie chart per gender
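
The per-gender match rates can also be computed in one step with `groupby`, since the mean of a 0/1 column is the share of ones. This is a sketch on hypothetical toy data; it assumes the coding used above (0 = female, 1 = male).

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "gender": [0, 0, 0, 0, 1, 1, 1, 1],
    "match":  [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean of the binary match column per gender, expressed in percent
rate_by_gender = (
    toy.groupby("gender")["match"]
    .mean()
    .mul(100)
    .rename({0: "female", 1: "male"})
)
print(rate_by_gender)
```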

Age distribution analysis showed that most participants are between 22 and 28 years old, younger than one might expect for a matchmaking event:

import numpy as np

# Drop rows with a missing age before plotting
age = df[np.isfinite(df['age'])]['age']
plt.hist(age, bins=35)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
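
A histogram gives the visual impression; a numeric check of the 22 to 28 range could look like the sketch below. The ages are hypothetical toy values, so the printed share says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['age'] (hypothetical values, one missing)
toy_age = pd.Series([21, 23, 24, 26, 27, 28, 30, 35, np.nan, 22])

age = toy_age.dropna()

# between() is inclusive on both ends by default
share_22_28 = age.between(22, 28).mean() * 100
quartiles = age.quantile([0.25, 0.5, 0.75])

print(f"Share aged 22-28: {share_22_28:.2f}%")
print(quartiles)
```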

A correlation heatmap was generated to identify features most related to the match outcome. Features with strong positive correlation to match include attr_o, sinc_o, intel_o, fun_o, amb_o, and shar_o:

date_df = df[[
    'iid', 'gender', 'pid', 'match', 'int_corr', 'samerace', 'age_o',
    'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb',
    'pf_o_sha', 'dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o',
    'shar_o', 'like_o',
    'prob_o', 'met_o', 'age', 'race', 'imprace', 'imprelig', 'goal', 'date',
    'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining',
    'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv',
    'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'attr1_1',
    'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'attr3_1', 'sinc3_1',
    'fun3_1', 'intel3_1', 'dec', 'attr', 'sinc', 'intel', 'fun', 'like',
    'prob', 'met'
]]
import seaborn as sns

# Create one figure and axes, then draw the heatmap on that axes
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title('Correlation Heatmap')
corr = date_df.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)
plt.show()
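
With this many columns, a heatmap is hard to read by eye. One way to pull out the features most related to the outcome is to sort the `match` column of the correlation matrix, as in this sketch. The synthetic data and the rule that generates `match` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: match depends on attr_o, while tv is unrelated noise
rng = np.random.default_rng(0)
n = 500
attr_o = rng.normal(6, 2, n)
noise = rng.normal(0, 1, n)
toy = pd.DataFrame({
    "attr_o": attr_o,
    "tv": rng.normal(5, 2, n),
    "match": (attr_o + noise > 8).astype(int),
})

# Correlation of every feature with the target, strongest positive first
ranking = toy.corr()["match"].drop("match").sort_values(ascending=False)
print(ranking)
```

On the real data the equivalent line would be `date_df.corr()['match'].drop('match').sort_values(ascending=False)`. Sorting by absolute value instead would also surface strong negative correlations.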

For modeling, the selected features and the target were extracted, rows with missing values were removed, and the dataset was oversampled using SVMSMOTE to address class imbalance:

from imblearn.over_sampling import SVMSMOTE

features = ['attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o']
# Keep only rows where every selected feature and the target are present
clean_df = df[features + ['match']].dropna()
X = clean_df[features]
y = clean_df['match']
# Synthesize extra minority-class (match == 1) samples to reduce the imbalance
oversample = SVMSMOTE()
X, y = oversample.fit_resample(X, y)

The data were split into training and test sets, and a logistic regression classifier was trained:

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stratified 80/20 split of the oversampled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
model = LogisticRegression(C=1, random_state=0)
lrc = model.fit(X_train, y_train)
predict_train_lrc = lrc.predict(X_train)
predict_test_lrc = lrc.predict(X_test)
print('Training Accuracy:', metrics.accuracy_score(y_train, predict_train_lrc))
print('Validation Accuracy:', metrics.accuracy_score(y_test, predict_test_lrc))
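
Accuracy alone can be misleading when classes are imbalanced: with a 16.47% match rate, always predicting "no match" would already score 100 − 16.47 = 83.53% on the original, non-oversampled data. The sketch below compares a majority-class baseline with logistic regression using recall and a confusion matrix. It runs on synthetic data, so the numbers it prints are illustrative and are not results from the Speed Dating dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (hypothetical data, about 16% positives)
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, weights=[0.84, 0.16],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(C=1, random_state=0).fit(X_train, y_train)

baseline_pred = baseline.predict(X_test)
model_pred = model.predict(X_test)

baseline_acc = accuracy_score(y_test, baseline_pred)
model_acc = accuracy_score(y_test, model_pred)
# Recall on the positive class: share of true matches that were found
baseline_recall = recall_score(y_test, baseline_pred)
model_recall = recall_score(y_test, model_pred)

print("baseline accuracy / recall:", baseline_acc, baseline_recall)
print("model accuracy / recall:   ", model_acc, model_recall)
print(confusion_matrix(y_test, model_pred))
```

The baseline reaches high accuracy while finding no positives at all, which is why recall or a confusion matrix is worth reporting alongside accuracy.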

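The resulting validation accuracy was approximately 0.83. Because the test split is drawn from the oversampled data, this figure describes performance on a resampled set rather than on the original 16.47% match rate. Within that setting, the six partner ratings give a reasonably effective predictor of speed‑dating match outcomes.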

[Figures: match-rate pie charts, age histogram, correlation heatmap, and model performance]
