Unlock Insights from 3.4GB Brazilian Car Service Sales Data with Python & Tableau
This article walks through a comprehensive analysis of a 3.43 GB sales dataset from a Brazilian automotive service chain, covering data loading, cleaning, exploratory visualizations, time‑series forecasting with ARIMA, RFM customer segmentation, product clustering, and key business insights using Python and Tableau.
1. Analysis Background
This is a real sales dataset from a Brazilian traditional offline automotive service chain on Kaggle, about 3.43 GB, covering roughly 26 million records from 2017‑03‑31 to 2020‑04‑01. Analyzing it can reveal overall sales, product distribution, customer segmentation, refined sales strategies, and employee productivity.
2. Analysis Framework
3. Data Cleaning
3.1 Load Data and Overview
The dataset is large; loading it into memory with an 8 GB environment quickly exhausts RAM. It is recommended to use 8‑12 GB of memory or close unnecessary applications.
# Import packages
import numpy as np
import pandas as pd
# Read data with delimiter
file_path = r'F:\ales Report.csv\Sales Report.csv'
df = pd.read_csv(file_path, iterator=True, sep=';')
data = df.get_chunk(30000000)
data.info()The Sale Date Time column is of type object and needs conversion to datetime.
# Check NULL values
data.isnull().sum()The data is clean; no NULL values were found.
# Describe statistics
data.describe()There is a negative value in Product Cost. These 20 rows are treated as anomalies and removed.
# Find negative product cost rows
data[data['Product Cost'] <= 0] # Delete negative cost rows
data.drop(index=data[data['Product Cost'] < 0].index, inplace=True)3.2 Remove Duplicate Records
# Identify duplicates (489,567 rows)
data[data.duplicated()] # Delete duplicates
data.drop(index=data[data.duplicated()].index, inplace=True)3.3 Convert Date Format
# Convert to datetime
data['Sale Date Time'] = pd.to_datetime(data['Sale Date Time'])
data.info()Data cleaning is now complete.
4. Analysis
4.1 Overall Situation
4.2 Time‑Based Analysis
4.2.1 Annual Sales
2017 has only the first nine months, 2020 only the first four months. 2019 sales reached 718,306,933, a 5.6 % increase over 2018 (680,191,151).
4.2.2 Quarterly Sales
From Q2 2017 to the end of 2018, order volume rose sharply; 2019 was relatively stable. Q4 2019 peaked with 208,548 orders and sales of 206,513,981.
4.2.3 Monthly Sales
2017’s second half shows a clear upward trend; 2018‑19 are relatively stable. December sales rise due to performance pushes.
4.2.4 Weekly Sales
Weekly sales amounts first rise then stabilize; weekly order volume remains dynamic, with average purchase amount increasing over time.
4.2.5 Daily Sales
Only the period around 2018‑06‑01 shows abnormal spikes and troughs; likely related to local consumer behavior.
Brazilian consumers tend to purchase less on Sundays.
4.2.6 Four‑Year Daily UV
Daily unique visitors (UV) show an upward trend year over year, with a 7‑day periodic pattern.
4.2.7 Weekday Sales Weight
Friday has the highest sales weight; Sunday the lowest.
4.2.8 Time‑Slot Sales
Sales and order volume peak between 07:00‑20:00, with a dip at 12:00.
4.2.9 Weekday Sales Weight Calculation
Using 2019 full‑year data, average sales per weekday are computed and weighted against the minimum weekday average.
4.2.10 Daily Sales Forecast
Data from 2019‑01‑01 to 2020‑02‑29 is used; 2020‑02 data serves for prediction comparison.
# Set index to sales time
data.set_index('Sale Date Time', inplace=True, drop=True)
# Resample to daily totals
day_data = data.resample('d').sum()['Total'] # Select training period
train_day_data = day_data[(day_data.index >= '2019-01-01') & (day_data.index <= '2020-02-29')]
train_day_data.to_excel('./日销售数据.xlsx') # Load for modeling
data = pd.read_excel('./日销售数据.xlsx')
data.rename(columns={'Sale Date Time':'date1'}, inplace=True) # Scale sales (unit = 100k)
data['Total'] = round(data['Total'] / 100000, 4) # Plot time series
import matplotlib.pyplot as plt
plt.figure(figsize=(18,8), dpi=256)
data['Total'][:-30].plot()# ACF plot
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data['Total'][:-30])# PACF plot
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(data['Total'][:-30])# Augmented Dickey‑Fuller test
from statsmodels.tsa.stattools import adfuller as ADF
print(ADF(data['Total'][:-30]))The p‑value (0.347) indicates a non‑stationary series; first‑order differencing is applied.
# First difference
D_data = data['Total'][:-30].diff().dropna()
print('First‑difference test:', ADF(D_data))After differencing, the series is stationary, allowing ARIMA modeling.
# Ljung‑Box white‑noise test
from statsmodels.stats.diagnostic import acorr_ljungbox
print('Ljung‑Box test:', acorr_ljungbox(D_data, lags=1))White‑noise test p‑value < 0.05, confirming a stationary non‑white‑noise series suitable for ARIMA.
# Grid search for best ARIMA (p,q) using AIC
from statsmodels.tsa.arima_model import ARIMA
from itertools import product
ps = range(0,5)
qs = range(0,5)
parameters = list(product(ps, qs))
best_aic = float('inf')
for param in parameters:
try:
model = ARIMA(data['Total'][:-30], order=(param[0],1,param[1])).fit()
except ValueError:
continue
if model.aic < best_aic:
best_aic = model.aic
best_model = model
best_param = param
print('Best model', best_model.summary())# Forecast next 30 days
forecast = best_model.forecast(30)[0]# Model evaluation
from sklearn.metrics import mean_absolute_error
pred_y = best_model.forecast(30)[0]
test_y = data['Total'][-30:].values
mae = mean_absolute_error(test_y, pred_y)
print('MAE:', mae)MAE of 2.38 (in 100 k units) suggests reasonable accuracy; thresholds should be set based on business needs.
# Plot actual vs. predicted
plt.figure(figsize=(14,7), dpi=256)
plt.plot(data['date1'][-30:], test_y, label='Actual')
plt.plot(data['date1'][-30:], pred_y, label='Forecast')
plt.xticks(data['date1'][-30:], rotation=70)
plt.legend(loc=3)4.3 Customer Perspective
4.3.1 City Distribution
10.98 % of customers are concentrated in Agirrie; overall city distribution is dispersed.
4.3.2 Top 10 Customers by Purchase Amount & Frequency
Customer Barry Barrett spent over 15 M and placed more than 100 K orders, representing a high‑value client.
4.3.3 Repurchase Rate
Overall repurchase rate (customers with >2 purchases) is 87.91 %, indicating strong loyalty.
# Compute purchase count per customer
client_data = data.groupby('Client').nunique()['Order Number']
client_data = client_data.reset_index().rename(columns={'Order Number':'user_num'}).sort_values('user_num', ascending=False)# Total repurchase rate
print('Total repurchase rate:', round((len(client_data[client_data['user_num'] > 1]) - 1) / (len(client_data) - 1) * 100, 4), "%")One‑Month Repurchase Rate
Defined as the proportion of customers who repurchased within the same month.
# Add year_month column
def parse_year_month(x):
return f"{x.year}-{x.month:02d}"
data['year_month'] = data['Sale Date Time'].apply(parse_year_month)# Compute monthly repurchase rate
y_m_data = data.groupby(['year_month','Client']).nunique()['Order Number'].reset_index()
month_list, rate_list = [], []
for every_m in y_m_data['year_month'].unique():
temp = y_m_data[y_m_data['year_month'] == every_m]
rate = round((len(temp[temp['Order Number'] > 1]) - 1) / (len(temp) - 1), 4)
month_list.append(every_m)
rate_list.append(rate)
rate_data = pd.DataFrame({'month': month_list, 'rate': rate_list})
rate_data.to_excel('./rate_data.xlsx', index=False)4.3.4 Purchase Frequency Distribution (April 2018)
# Filter April 2018 data
data_201804 = y_m_data[y_m_data['year_month'] == '2018-04']
data_201804.rename(columns={'Order Number':'buy_frequency'}, inplace=True)
# Bin frequencies
bins = [0,1,2,5,10,50,100,100000]
per_frequency = pd.cut(data_201804['buy_frequency'], bins)
per_frequency.value_counts().plot(kind='bar')4.3.5 RFM Customer Segmentation (April 2018)
# Aggregate R, F, M per customer for April 2018
RFM_data_all = data.groupby(['year_month','Client']).agg({'Order Number':'nunique','Sale Date Time':'max','Total':'sum'}).reset_index()
RFM_data_201804 = RFM_data_all[RFM_data_all['year_month'] == '2018-04']
reference_time = datetime.strptime('2018-05-01 23:59:59', "%Y-%m-%d %H:%M:%S")
RFM_data_201804['R'] = (reference_time - RFM_data_201804['Sale Date Time']).dt.days
RFM_data_201804.rename(columns={'Order Number':'F','Total':'M'}, inplace=True)
# Remove anonymous customer
RFM_data_201804.drop(index=26015, inplace=True)
# Standardize and K‑Means (5 clusters)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
ss = StandardScaler()
train = ss.fit_transform(RFM_data_201804[['R','F','M']])
kmeans = KMeans(n_clusters=5, random_state=0).fit(train)
RFM_data_201804['sk5_label'] = kmeans.labels_Cluster interpretation:
Cluster 0: low R, low F, low M – general customers.
Cluster 1: high R, low F, low M – potential growth customers.
Cluster 2: low R, high F, high M – key retained customers.
Clusters 3 & 4: low R, high F/M – high‑value customers.
4.4 Product Perspective
4.4.1 Top 10 Products by Sales & Orders
Special Gasoline, Gasoline, and Diesel Auto Clean dominate sales and order volume.
4.4.2 Product Clustering (April 2018)
# Aggregate cost (C), order count (F), sales (M) per product
month_product_data = data.groupby(['year_month','Product']).agg({'Product Cost':'mean','Order Number':'nunique','Total':'sum'}).reset_index()
month_product_data.rename(columns={'Product Cost':'C','Order Number':'F','Total':'M'}, inplace=True)
month_product_201804 = month_product_data[month_product_data['year_month'] == '2018-04']
# Standardize and DBSCAN (min_samples=4)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
ss = StandardScaler()
X = month_product_201804[['C','F','M']]
train_X = ss.fit_transform(X)
dbscan = DBSCAN(min_samples=4).fit(train_X)
month_product_201804['labels'] = dbscan.labels_Label ‑1 indicates outliers, split into:
A‑type: low cost, high order volume, high sales – priority products.
B‑type: high cost, low order volume, low sales – candidates for discontinuation.
4.4.3 Pareto Analysis of Product Sales
The top three products contribute over 82 % of total sales, confirming the 80/20 rule.
4.4.4 Product Cost Distribution
Most products have costs below 12; the next largest group falls between 12‑36, followed by 36‑95, with few above 95.
4.4.5 Product Count per Category
The "Filters" category has the most products (933), followed by "Chewing Gum And Candy" (839). "Extinguisher" has the fewest (6).
5. Payment Methods
Nearly 70 % of customers choose cash payments.
6. Conclusion
6.1 Three‑year sales reached 1.78 billion.
6.2 2017 was a growth phase; 2018‑19 stabilized.
6.3 Sales show weekly cycles; Sunday is the lowest.
6.4 Overall repurchase rate is 87.91 % – high loyalty.
6.5 Monthly repurchase rate exceeds 66 %.
6.6 Monthly user retention rate exceeds 74 %.
6.7 Top three products account for 82 % of sales (Pareto).
6.8 80 % of products have cost below 36.
6.9 About 70 % of users prefer cash.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
