Can You Predict Speed‑Dating Success? A Data‑Driven AI Analysis
This article explores the classic Speed Dating dataset, performing data cleaning, exploratory analysis of match rates, gender and age effects, correlation studies, and finally building a logistic regression model with SVMSMOTE oversampling to predict matchmaking success, achieving around 83% accuracy.
This article presents a data‑driven exploration of the Speed Dating dataset, aiming to assess matchmaking success and build a predictive model.
The data were collected from offline speed‑dating experiments, comprising over 8,000 sessions and nearly 200 features covering demographics, personal ratings, and preferences.
Data loading and initial inspection were performed with pandas:
import pandas as pd
df = pd.read_csv('Speed Dating Data.csv', encoding='gbk')
print(df.shape)
Missing‑value analysis revealed several incomplete columns, prompting a review of data completeness:
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})
missing_value_df.sort_values(by='percent_missing')
Exploratory analysis showed an overall match rate of only 16.47%:
import matplotlib.pyplot as plt
from palettable.colorbrewer.qualitative import Pastel1_3
# value_counts() sorts by frequency, so index 0 is the unmatched majority
size_of_groups = df.match.value_counts().values
single_percentage = round(size_of_groups[0] / sum(size_of_groups) * 100, 2)
matched_percentage = round(size_of_groups[1] / sum(size_of_groups) * 100, 2)
names = ['Single:' + str(single_percentage) + '%', 'Matched:' + str(matched_percentage) + '%']
plt.pie(size_of_groups, labels=names, labeldistance=1.2, colors=Pastel1_3.hex_colors)
plt.show()
The gender‑specific analysis indicated that females have a slightly higher success probability (about 4% advantage):
# Female participants (gender == 0)
female_groups = df[df.gender == 0].match.value_counts().values
# Male participants (gender == 1)
male_groups = df[df.gender == 1].match.value_counts().values
# Compute percentages as above and plot one pie chart per group
Age distribution analysis revealed that participants are mainly aged 22‑28, contrary to the expectation of older participants:
import numpy as np
# Keep only rows with a recorded (finite) age
age = df[np.isfinite(df['age'])]['age']
plt.hist(age, bins=35)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
A correlation heatmap was generated to identify features most related to the match outcome. Features with strong positive correlation to match include attr_o, sinc_o, intel_o, fun_o, amb_o, and shar_o:
date_df = df[[
'iid', 'gender', 'pid', 'match', 'int_corr', 'samerace', 'age_o',
'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb',
'pf_o_sha', 'dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'like_o',
'prob_o', 'met_o', 'age', 'race', 'imprace', 'imprelig', 'goal', 'date',
'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining',
'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv',
'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'attr1_1',
'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'attr3_1', 'sinc3_1',
'fun3_1', 'intel3_1', 'dec', 'attr', 'sinc', 'intel', 'fun', 'like',
'prob', 'met'
]]
import seaborn as sns
# Create one figure and draw the heatmap on its axes
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title('Correlation Heatmap')
corr = date_df.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)
plt.show()
For modeling, the selected features and target were extracted, rows with missing values were removed, and the dataset was oversampled using SVMSMOTE to address class imbalance:
from imblearn.over_sampling import SVMSMOTE
# Select the features and target, dropping rows with missing values
clean_df = df[['attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o','match']].dropna()
X = clean_df[['attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o']]
y = clean_df['match']
# Generate synthetic minority-class (matched) samples
oversample = SVMSMOTE()
X, y = oversample.fit_resample(X, y)
The data were split into training and test sets, and a logistic regression classifier was trained:
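from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
model = LogisticRegression(C=1, random_state=0)
lrc = model.fit(X_train, y_train)
predict_train_lrc = lrc.predict(X_train)
predict_test_lrc = lrc.predict(X_test)
print('Training Accuracy:', metrics.accuracy_score(y_train, predict_train_lrc))
print('Validation Accuracy:', metrics.accuracy_score(y_test, predict_test_lrc))
The resulting validation accuracy was approximately 0.83, which suggests a reasonably effective predictor of speed‑dating match outcomes. Because SVMSMOTE was applied before the split, this figure is measured on the class‑balanced, partly synthetic data rather than on the original 16.47% match rate.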
Key visual results (pie charts, age histogram, correlation heatmap, and model performance) are shown below:
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
