Can You Predict Speed‑Dating Success? A Data‑Driven Exploration
This article walks through loading the Speed Dating dataset, examining its features and missing values, visualizing match rates by gender and age, performing correlation analysis, and building a logistic regression model with SMOTE oversampling to predict whether a pair will successfully match.
In this tutorial we explore the publicly available Speed Dating dataset, which contains over 8,000 rapid‑dating sessions and nearly 200 features describing participants and their evaluations.
We start by loading the CSV file with pandas and inspecting its shape. The percentage of missing values in each column is then calculated to identify incomplete columns that may affect the analysis.
import pandas as pd

# Load the dataset and report its dimensions (rows, columns)
df = pd.read_csv('Speed Dating Data.csv', encoding='gbk')
print(df.shape)

# Percentage of missing values in each column, sorted from most to least complete
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({
    'column_name': df.columns,
    'percent_missing': percent_missing
})
print(missing_value_df.sort_values(by='percent_missing'))

Exploratory analysis shows that only about 16.47% of the sessions resulted in a mutual match. Pie charts of the match rate by gender show that female participants have a slightly higher success rate than male participants (a difference of about 0.04).
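As a rough sketch of how the overall and per-gender match rates could be computed, the snippet below uses a tiny hand-made DataFrame in place of the real file. The toy values and the helper name match_rates are illustrative assumptions. The coding 0 = female, 1 = male follows the dataset's documentation and should be checked against your copy.

```python
import pandas as pd

def match_rates(frame: pd.DataFrame):
    """Return the overall match rate and the match rate per gender."""
    overall = frame['match'].mean()
    by_gender = frame.groupby('gender')['match'].mean()
    return overall, by_gender

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'gender': [0, 0, 0, 0, 1, 1, 1, 1],
    'match':  [1, 0, 0, 0, 0, 0, 0, 1],
})
overall, by_gender = match_rates(toy)
print(f'Overall match rate: {overall:.2%}')
print(by_gender.rename({0: 'female', 1: 'male'}))
```

On the real data, the same call would be match_rates(df), and the per-gender series could be passed to a pie or bar chart.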
The age distribution is examined with a histogram, which shows that most participants are aged 22-28, younger than one might expect.
import numpy as np
import matplotlib.pyplot as plt

# Keep only rows with a valid (non-missing, finite) age
age = df[np.isfinite(df['age'])]['age']
plt.hist(age, bins=35)
plt.xlabel('Age')
plt.ylabel('Frequency')

For the correlation analysis we select numeric features with low missing rates and plot a heatmap. Notable findings include a strong negative correlation between the importance a partner places on attractiveness (pf_o_att) and traits such as intelligence, ambition, and sincerity, while pf_o_att correlates positively with fun (pf_o_fun).
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric columns with few missing values
date_df = df[[
    'iid','gender','pid','match','int_corr','samerace','age_o','race_o',
    'pf_o_att','pf_o_sin','pf_o_int','pf_o_fun','pf_o_amb','pf_o_sha',
    'dec_o','attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o','like_o',
    'prob_o','met_o','age','race','imprace','imprelig','goal','date',
    'go_out','career_c','sports','tvsports','exercise','dining','museums',
    'art','hiking','gaming','clubbing','reading','tv','theater','movies',
    'concerts','music','shopping','yoga','attr1_1','sinc1_1','intel1_1',
    'fun1_1','amb1_1','attr3_1','sinc3_1','fun3_1','intel3_1','dec','attr',
    'sinc','intel','fun','like','prob','met'
]]

# Draw the pairwise correlation matrix as a heatmap
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title('Correlation Heatmap')
corr = date_df.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)

The features most strongly correlated with the target variable match are attr_o, sinc_o, intel_o, fun_o, amb_o, and shar_o. These become the inputs for a predictive model.
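One way to read such a ranking off the correlation matrix rather than the heatmap is to sort the match column by absolute value. This is a minimal sketch on synthetic data. The helper name top_correlated and the toy columns signal and noise are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def top_correlated(frame: pd.DataFrame, target: str, n: int = 3) -> pd.Series:
    """Return the n columns with the largest absolute Pearson correlation to target."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Synthetic stand-in: 'signal' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
noise = rng.normal(size=500)
toy = pd.DataFrame({
    'signal': signal,
    'noise': noise,
    'match': (signal + 0.5 * rng.normal(size=500) > 1).astype(int),
})
ranking = top_correlated(toy, 'match', n=2)
print(ranking)
```

With the article's data, top_correlated(date_df, 'match', n=10) would list the candidates in order. Sorting by absolute value keeps strongly negative correlations in view as well.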
We prepare the data by selecting these columns, dropping rows with missing values, and applying SVMSMOTE oversampling to address the class imbalance (only ~16% positive matches).
from imblearn.over_sampling import SVMSMOTE
from sklearn.model_selection import train_test_split

# Keep the six partner ratings plus the target, and drop incomplete rows
features = ['attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o']
clean_df = df[features + ['match']].dropna()
X = clean_df[features]
y = clean_df['match']

# Oversample the minority class (matches), then hold out 20% for validation
oversample = SVMSMOTE()
X, y = oversample.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

A logistic regression classifier is then trained and evaluated, achieving an accuracy of approximately 0.83 on the held-out validation set.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Fit the classifier and compare accuracy on the training and validation splits
model = LogisticRegression(C=1, random_state=0)
lrc = model.fit(X_train, y_train)
predict_train_lrc = lrc.predict(X_train)
predict_test_lrc = lrc.predict(X_test)
print('Training Accuracy:', metrics.accuracy_score(y_train, predict_train_lrc))
print('Validation Accuracy:', metrics.accuracy_score(y_test, predict_test_lrc))

The resulting model suggests that the ratings a participant receives from their partner (attractiveness, sincerity, intelligence, fun, ambition, and shared interests) can reasonably predict the likelihood of a successful speed-dating match.
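The pipeline above oversamples before calling train_test_split, so resampled minority rows can influence both sides of the split and may inflate the validation score. A common alternative is to split first and resample only the training portion. The sketch below illustrates that ordering on synthetic data. It swaps SVMSMOTE for plain random oversampling via sklearn.utils.resample to keep the example small, and the data, column names, and the helper oversample_minority are assumptions rather than the article's method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def oversample_minority(X: pd.DataFrame, y: pd.Series, seed: int = 0):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    data = X.assign(target=y)
    counts = data['target'].value_counts()
    majority = data[data['target'] == counts.idxmax()]
    minority = data[data['target'] == counts.idxmin()]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=seed)
    balanced = pd.concat([majority, minority_up])
    return balanced.drop(columns='target'), balanced['target']

# Synthetic, imbalanced stand-in for the rating features (values are made up).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])
y = pd.Series((X['a'] + X['b'] + rng.normal(size=1000) > 2).astype(int))

# Split first, so the validation set contains only original, untouched rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Resample the training portion only.
X_bal, y_bal = oversample_minority(X_train, y_train)

model = LogisticRegression(C=1, random_state=0).fit(X_bal, y_bal)
pred = model.predict(X_test)
print('Validation accuracy:', accuracy_score(y_test, pred))
print('Validation recall:', recall_score(y_test, pred))
```

Recall on the minority class is printed next to accuracy because, with only about 16% positive matches, accuracy alone can look good even when few matches are found.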
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
