Concrete Strength Data Analysis Using Pandas: A Step‑by‑Step Tutorial
This tutorial walks through a complete pandas‑based workflow for analyzing a concrete‑strength dataset, covering data loading, cleaning, exploratory visualizations, correlation analysis, and targeted sub‑group investigations to uncover factors influencing product strength and suggest improvement measures.
The article explains why many learners struggle with pandas despite following examples, emphasizing the need for a systematic analysis mindset that tells a coherent story with data.
It outlines the objectives of a concrete‑strength case study: verify whether product strength varies significantly, identify key factors affecting strength, and propose improvement suggestions.
First, essential libraries are imported and their versions printed:
# Import the required packages and print their versions; pandas 0.24 or later is recommended
import numpy as np
import pandas as pd
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
for module in (np, pd, mpl, sns):
    print(module.__name__, module.__version__)
Warnings are suppressed and a plotting style (e.g., plt.style.use('bmh')) is set for nicer figures.
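The setup code for this step is not shown in the article. A minimal sketch of what suppressing warnings and choosing a style could look like, assuming the standard-library warnings module is used (the article does not say which mechanism it relies on):

```python
import warnings

import matplotlib.pyplot as plt

# Silence library warnings (for example deprecation notices) for the rest of the session.
# Assumption: a blanket "ignore" filter; the article does not show its exact call.
warnings.filterwarnings("ignore")

# 'bmh' is one of matplotlib's built-in style sheets.
plt.style.use("bmh")
```

A blanket filter keeps notebook output tidy but also hides warnings that may matter, so a narrower filter may be preferable outside a tutorial setting.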
The dataset is loaded, column names are simplified, and df.head() is displayed to inspect the first rows.
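# Simplify the column names (in order: cement, blast-furnace slag, fly ash, water,
# superplasticizer, coarse aggregate, fine aggregate, age, strength in MPa)
df.columns = ['水泥含量','高炉矿渣含量','粉煤灰量','含水量','减水剂含量','粗骨料含量','细骨料含量','龄期','强度/Mpa']
df.head()
Basic data inspection uses df.info() and df.describe() to reveal data types, missing values, and summary statistics.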
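The article does not show how the file is read, so the following is only a sketch of the load, rename, and inspect steps on a tiny synthetic table. The four rows, the English header names, and the in-memory CSV are all made up for illustration and stand in for the real dataset file:

```python
import io

import pandas as pd

# Synthetic stand-in for the real file: four invented rows, one with a missing water value.
raw_csv = io.StringIO(
    "cement,slag,fly_ash,water,superplasticizer,coarse_agg,fine_agg,age,strength\n"
    "300.0,0.0,0.0,180.0,5.0,1000.0,700.0,28,35.0\n"
    "250.0,100.0,0.0,,0.0,950.0,750.0,7,18.0\n"
    "400.0,0.0,50.0,160.0,8.0,1050.0,650.0,56,52.0\n"
    "200.0,150.0,100.0,200.0,0.0,900.0,800.0,3,9.0\n"
)
df = pd.read_csv(raw_csv)

# Same simplified column names as in the article.
df.columns = ['水泥含量', '高炉矿渣含量', '粉煤灰量', '含水量', '减水剂含量',
              '粗骨料含量', '细骨料含量', '龄期', '强度/Mpa']

df.info()                    # dtypes and non-null counts per column
summary = df.describe()      # count, mean, std, min, quartiles, max
missing = df.isna().sum()    # explicit count of missing values per column

print(summary.loc[["mean", "min", "max"], "强度/Mpa"])
print(missing)
```

Counting missing values with isna().sum() is an addition to the two calls the article names; it gives the same information as the non-null counts from df.info() in a form that is easier to filter.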
>>> df.info()
... (output omitted for brevity) ...
Strength distribution is visualized with a histogram and a box plot:
plt.figure(figsize=(15,6))
# Left panel: histogram of strength
plt.subplot(121)
df['强度/Mpa'].plot(kind='hist', width=3.5)
plt.xlabel('Strength / MPa')
plt.title('Distribution of product strength')
# Right panel: box plot of strength
plt.subplot(122)
plt.boxplot(df['强度/Mpa'])
plt.title('Box plot of strength')
plt.show()
The two plots show a roughly normal distribution with a long tail of high-strength outliers, which is consistent with the client's complaint about unstable strength.
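The outliers a box plot draws can also be counted directly. The sketch below applies the 1.5 × IQR whisker rule, which is matplotlib's default for boxplot, to a small made-up strength series; the numbers are illustrative and are not taken from the dataset:

```python
import pandas as pd

# Invented strength values with one obvious high outlier.
strength = pd.Series(
    [20.0, 25.0, 28.0, 30.0, 32.0, 33.0, 35.0, 36.0, 38.0, 40.0, 42.0, 80.0],
    name="强度/Mpa",
)

# Quartiles and interquartile range.
q1, q3 = strength.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as fliers by a default box plot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = strength[(strength < lower) | (strength > upper)]

print(outliers)
print("skewness:", strength.skew())  # positive for a long right tail
```

On the real data, the same few lines give a count of high-strength outliers to put next to the visual impression from the plot.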
Next, the tutorial examines each predictor variable with box plots and scatter plots against strength, revealing positive correlations for cement and super‑plasticizer, a negative correlation for water content, and ambiguous relationships for slag and age.
# Box plot of each predictor variable
plt.figure(figsize=(20,16))
for i, feature in enumerate(list(df.columns[:-1])):
    plt.subplot(2,4,i+1)
    plt.boxplot(df[feature])
    plt.title(feature)
# Scatter plot of each predictor variable against strength
plt.figure(figsize=(20,16))
for i, feature in enumerate(list(df.columns[:-1])):
    plt.subplot(3,3,i+1)
    plt.scatter(df[feature], df['强度/Mpa'])
    plt.xlabel(feature)
    plt.ylabel('Strength / MPa')
plt.show()
Quantitative correlation is computed with df.corr(), highlighting that cement and super-plasticizer have notable positive Pearson coefficients with strength, while age shows a weaker but present relationship.
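As a sketch of the correlation step, the example below builds a synthetic frame in which strength rises with cement and falls with water, then ranks the predictors by their Pearson coefficient with strength. The coefficients, noise level, and random seed are assumptions chosen for illustration and say nothing about the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors and a strength that depends on them plus noise (illustrative only).
cement = rng.uniform(100, 500, n)
water = rng.uniform(120, 250, n)
strength = 0.08 * cement - 0.1 * water + 30 + rng.normal(0, 5, n)

df = pd.DataFrame({"水泥含量": cement, "含水量": water, "强度/Mpa": strength})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr_with_strength = (
    df.corr()["强度/Mpa"]
    .drop("强度/Mpa")              # drop the trivial self-correlation of 1.0
    .sort_values(ascending=False)
)
print(corr_with_strength)
```

Reading a single sorted column is usually easier than scanning the full correlation matrix when only the relationship with the target matters.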
To address noisy age values, samples with age > 56 days are filtered out, and the analysis is repeated, resulting in a clearer positive correlation between age and strength.
df_age56 = df[df['龄期'] <= 56]
df_age56.shape
# Repeat the visualizations and the correlation analysis on df_age56
The article concludes that combining qualitative visual inspection with quantitative correlation yields a robust understanding of factors influencing concrete strength, and encourages readers to apply similar grouping and variable-selection strategies to other datasets.
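The effect of the age filter can be illustrated by comparing the age-strength correlation before and after it is applied. The eight data points below are invented so that strength rises quickly at early ages and then levels off; they are not measurements from the dataset:

```python
import pandas as pd

# Invented data: strength climbs up to 56 days, then plateaus at later ages.
df = pd.DataFrame({
    "龄期":     [3, 7, 14, 28, 56, 90, 180, 365],
    "强度/Mpa": [10.0, 18.0, 25.0, 33.0, 40.0, 41.0, 39.0, 42.0],
})

# Keep only samples aged 56 days or less, as in the article.
df_age56 = df[df["龄期"] <= 56]

# Pearson correlation between age and strength, on all samples and on the filtered subset.
corr_all = df["龄期"].corr(df["强度/Mpa"])
corr_56 = df_age56["龄期"].corr(df_age56["强度/Mpa"])

print(f"all ages:   {corr_all:.2f}")
print(f"age <= 56:  {corr_56:.2f}")
```

Pearson correlation measures linear association only, so a relationship that flattens at high ages looks weaker over the full range than over the early range where it is close to linear.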
Additional resources and links to other data‑analysis case studies are provided at the end of the article.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.