Big Data 28 min read

Unlock Insights from 3.4GB Brazilian Car Service Sales Data with Python & Tableau

This article walks through a comprehensive analysis of a 3.43 GB sales dataset from a Brazilian automotive service chain, covering data loading, cleaning, exploratory visualizations, time‑series forecasting with ARIMA, RFM customer segmentation, product clustering, and key business insights using Python and Tableau.

Python Crawling & Data Mining

Sep 15, 2020

Unlock Insights from 3.4GB Brazilian Car Service Sales Data with Python & Tableau

1. Analysis Background

This is a real sales dataset from a Brazilian traditional offline automotive service chain on Kaggle, about 3.43 GB, covering roughly 26 million records from 2017‑03‑31 to 2020‑04‑01. Analyzing it can reveal overall sales, product distribution, customer segmentation, refined sales strategies, and employee productivity.

2. Analysis Framework

3. Data Cleaning

3.1 Load Data and Overview

The dataset is large; loading it into memory with an 8 GB environment quickly exhausts RAM. It is recommended to use 8‑12 GB of memory or close unnecessary applications.

# Import packages
import numpy as np
import pandas as pd

# Read data with delimiter
file_path = r'F:\ales Report.csv\Sales Report.csv'
df = pd.read_csv(file_path, iterator=True, sep=';')
data = df.get_chunk(30000000)
data.info()

The Sale Date Time column is of type object and needs conversion to datetime.

# Check NULL values
data.isnull().sum()

The data is clean; no NULL values were found.

# Describe statistics
data.describe()

There is a negative value in Product Cost. These 20 rows are treated as anomalies and removed.

# Find negative product cost rows
data[data['Product Cost'] <= 0]

# Delete negative cost rows
data.drop(index=data[data['Product Cost'] < 0].index, inplace=True)

3.2 Remove Duplicate Records

# Identify duplicates (489,567 rows)
data[data.duplicated()]

# Delete duplicates
data.drop(index=data[data.duplicated()].index, inplace=True)

3.3 Convert Date Format

# Convert to datetime
data['Sale Date Time'] = pd.to_datetime(data['Sale Date Time'])
data.info()

Data cleaning is now complete.

4. Analysis

4.1 Overall Situation

4.2 Time‑Based Analysis

4.2.1 Annual Sales

2017 has only the first nine months, 2020 only the first four months. 2019 sales reached 718,306,933, a 5.6 % increase over 2018 (680,191,151).

4.2.2 Quarterly Sales

From Q2 2017 to the end of 2018, order volume rose sharply; 2019 was relatively stable. Q4 2019 peaked with 208,548 orders and sales of 206,513,981.

4.2.3 Monthly Sales

2017’s second half shows a clear upward trend; 2018‑19 are relatively stable. December sales rise due to performance pushes.

4.2.4 Weekly Sales

Weekly sales amounts first rise then stabilize; weekly order volume remains dynamic, with average purchase amount increasing over time.

4.2.5 Daily Sales

Only the period around 2018‑06‑01 shows abnormal spikes and troughs; likely related to local consumer behavior.

Brazilian consumers tend to purchase less on Sundays.

4.2.6 Four‑Year Daily UV

Daily unique visitors (UV) show an upward trend year over year, with a 7‑day periodic pattern.

4.2.7 Weekday Sales Weight

Friday has the highest sales weight; Sunday the lowest.

4.2.8 Time‑Slot Sales

Sales and order volume peak between 07:00‑20:00, with a dip at 12:00.

4.2.9 Weekday Sales Weight Calculation

Using 2019 full‑year data, average sales per weekday are computed and weighted against the minimum weekday average.

4.2.10 Daily Sales Forecast

Data from 2019‑01‑01 to 2020‑02‑29 is used; 2020‑02 data serves for prediction comparison.

# Set index to sales time
data.set_index('Sale Date Time', inplace=True, drop=True)
# Resample to daily totals
day_data = data.resample('d').sum()['Total']

# Select training period
train_day_data = day_data[(day_data.index >= '2019-01-01') & (day_data.index <= '2020-02-29')]
train_day_data.to_excel('./日销售数据.xlsx')

# Load for modeling
data = pd.read_excel('./日销售数据.xlsx')
data.rename(columns={'Sale Date Time':'date1'}, inplace=True)

# Scale sales (unit = 100k)
data['Total'] = round(data['Total'] / 100000, 4)

# Plot time series
import matplotlib.pyplot as plt
plt.figure(figsize=(18,8), dpi=256)
data['Total'][:-30].plot()

# ACF plot
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data['Total'][:-30])

# PACF plot
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(data['Total'][:-30])

# Augmented Dickey‑Fuller test
from statsmodels.tsa.stattools import adfuller as ADF
print(ADF(data['Total'][:-30]))

The p‑value (0.347) indicates a non‑stationary series; first‑order differencing is applied.

# First difference
D_data = data['Total'][:-30].diff().dropna()
print('First‑difference test:', ADF(D_data))

After differencing, the series is stationary, allowing ARIMA modeling.

# Ljung‑Box white‑noise test
from statsmodels.stats.diagnostic import acorr_ljungbox
print('Ljung‑Box test:', acorr_ljungbox(D_data, lags=1))

White‑noise test p‑value < 0.05, confirming a stationary non‑white‑noise series suitable for ARIMA.

# Grid search for best ARIMA (p,q) using AIC
from statsmodels.tsa.arima_model import ARIMA
from itertools import product
ps = range(0,5)
qs = range(0,5)
parameters = list(product(ps, qs))
best_aic = float('inf')
for param in parameters:
    try:
        model = ARIMA(data['Total'][:-30], order=(param[0],1,param[1])).fit()
    except ValueError:
        continue
    if model.aic < best_aic:
        best_aic = model.aic
        best_model = model
        best_param = param
print('Best model', best_model.summary())

# Forecast next 30 days
forecast = best_model.forecast(30)[0]

# Model evaluation
from sklearn.metrics import mean_absolute_error
pred_y = best_model.forecast(30)[0]
test_y = data['Total'][-30:].values
mae = mean_absolute_error(test_y, pred_y)
print('MAE:', mae)

MAE of 2.38 (in 100 k units) suggests reasonable accuracy; thresholds should be set based on business needs.

# Plot actual vs. predicted
plt.figure(figsize=(14,7), dpi=256)
plt.plot(data['date1'][-30:], test_y, label='Actual')
plt.plot(data['date1'][-30:], pred_y, label='Forecast')
plt.xticks(data['date1'][-30:], rotation=70)
plt.legend(loc=3)

4.3 Customer Perspective

4.3.1 City Distribution

10.98 % of customers are concentrated in Agirrie; overall city distribution is dispersed.

4.3.2 Top 10 Customers by Purchase Amount & Frequency

Customer Barry Barrett spent over 15 M and placed more than 100 K orders, representing a high‑value client.

4.3.3 Repurchase Rate

Overall repurchase rate (customers with >2 purchases) is 87.91 %, indicating strong loyalty.

# Compute purchase count per customer
client_data = data.groupby('Client').nunique()['Order Number']
client_data = client_data.reset_index().rename(columns={'Order Number':'user_num'}).sort_values('user_num', ascending=False)

# Total repurchase rate
print('Total repurchase rate:', round((len(client_data[client_data['user_num'] > 1]) - 1) / (len(client_data) - 1) * 100, 4), "%")

One‑Month Repurchase Rate

Defined as the proportion of customers who repurchased within the same month.

# Add year_month column
def parse_year_month(x):
    return f"{x.year}-{x.month:02d}"
data['year_month'] = data['Sale Date Time'].apply(parse_year_month)

# Compute monthly repurchase rate
y_m_data = data.groupby(['year_month','Client']).nunique()['Order Number'].reset_index()
month_list, rate_list = [], []
for every_m in y_m_data['year_month'].unique():
    temp = y_m_data[y_m_data['year_month'] == every_m]
    rate = round((len(temp[temp['Order Number'] > 1]) - 1) / (len(temp) - 1), 4)
    month_list.append(every_m)
    rate_list.append(rate)
rate_data = pd.DataFrame({'month': month_list, 'rate': rate_list})
rate_data.to_excel('./rate_data.xlsx', index=False)

4.3.4 Purchase Frequency Distribution (April 2018)

# Filter April 2018 data
data_201804 = y_m_data[y_m_data['year_month'] == '2018-04']
data_201804.rename(columns={'Order Number':'buy_frequency'}, inplace=True)
# Bin frequencies
bins = [0,1,2,5,10,50,100,100000]
per_frequency = pd.cut(data_201804['buy_frequency'], bins)
per_frequency.value_counts().plot(kind='bar')

4.3.5 RFM Customer Segmentation (April 2018)

# Aggregate R, F, M per customer for April 2018
RFM_data_all = data.groupby(['year_month','Client']).agg({'Order Number':'nunique','Sale Date Time':'max','Total':'sum'}).reset_index()
RFM_data_201804 = RFM_data_all[RFM_data_all['year_month'] == '2018-04']
reference_time = datetime.strptime('2018-05-01 23:59:59', "%Y-%m-%d %H:%M:%S")
RFM_data_201804['R'] = (reference_time - RFM_data_201804['Sale Date Time']).dt.days
RFM_data_201804.rename(columns={'Order Number':'F','Total':'M'}, inplace=True)
# Remove anonymous customer
RFM_data_201804.drop(index=26015, inplace=True)
# Standardize and K‑Means (5 clusters)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
ss = StandardScaler()
train = ss.fit_transform(RFM_data_201804[['R','F','M']])
kmeans = KMeans(n_clusters=5, random_state=0).fit(train)
RFM_data_201804['sk5_label'] = kmeans.labels_

Cluster interpretation:

Cluster 0: low R, low F, low M – general customers.

Cluster 1: high R, low F, low M – potential growth customers.

Cluster 2: low R, high F, high M – key retained customers.

Clusters 3 & 4: low R, high F/M – high‑value customers.

4.4 Product Perspective

4.4.1 Top 10 Products by Sales & Orders

Special Gasoline, Gasoline, and Diesel Auto Clean dominate sales and order volume.

4.4.2 Product Clustering (April 2018)

# Aggregate cost (C), order count (F), sales (M) per product
month_product_data = data.groupby(['year_month','Product']).agg({'Product Cost':'mean','Order Number':'nunique','Total':'sum'}).reset_index()
month_product_data.rename(columns={'Product Cost':'C','Order Number':'F','Total':'M'}, inplace=True)
month_product_201804 = month_product_data[month_product_data['year_month'] == '2018-04']
# Standardize and DBSCAN (min_samples=4)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
ss = StandardScaler()
X = month_product_201804[['C','F','M']]
train_X = ss.fit_transform(X)
dbscan = DBSCAN(min_samples=4).fit(train_X)
month_product_201804['labels'] = dbscan.labels_

Label ‑1 indicates outliers, split into:

A‑type: low cost, high order volume, high sales – priority products.

B‑type: high cost, low order volume, low sales – candidates for discontinuation.

4.4.3 Pareto Analysis of Product Sales

The top three products contribute over 82 % of total sales, confirming the 80/20 rule.

4.4.4 Product Cost Distribution

Most products have costs below 12; the next largest group falls between 12‑36, followed by 36‑95, with few above 95.

4.4.5 Product Count per Category

The "Filters" category has the most products (933), followed by "Chewing Gum And Candy" (839). "Extinguisher" has the fewest (6).

5. Payment Methods

Nearly 70 % of customers choose cash payments.

6. Conclusion

6.1 Three‑year sales reached 1.78 billion.

6.2 2017 was a growth phase; 2018‑19 stabilized.

6.3 Sales show weekly cycles; Sunday is the lowest.

6.4 Overall repurchase rate is 87.91 % – high loyalty.

6.5 Monthly repurchase rate exceeds 66 %.

6.6 Monthly user retention rate exceeds 74 %.

6.7 Top three products account for 82 % of sales (Pareto).

6.8 80 % of products have cost below 36.

6.9 About 70 % of users prefer cash.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python ARIMA Customer Segmentation Tableau Sales Forecasting RFM

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

1. Analysis Background

2. Analysis Framework

3. Data Cleaning

3.1 Load Data and Overview

3.2 Remove Duplicate Records

3.3 Convert Date Format

4. Analysis

4.1 Overall Situation

4.2 Time‑Based Analysis

4.2.1 Annual Sales

4.2.2 Quarterly Sales

4.2.3 Monthly Sales

4.2.4 Weekly Sales

4.2.5 Daily Sales

4.2.6 Four‑Year Daily UV

4.2.7 Weekday Sales Weight

4.2.8 Time‑Slot Sales

4.2.9 Weekday Sales Weight Calculation

4.2.10 Daily Sales Forecast

4.3 Customer Perspective

4.3.1 City Distribution

4.3.2 Top 10 Customers by Purchase Amount & Frequency

4.3.3 Repurchase Rate

One‑Month Repurchase Rate

4.3.4 Purchase Frequency Distribution (April 2018)

4.3.5 RFM Customer Segmentation (April 2018)

4.4 Product Perspective

4.4.1 Top 10 Products by Sales & Orders

4.4.2 Product Clustering (April 2018)

4.4.3 Pareto Analysis of Product Sales

4.4.4 Product Cost Distribution

4.4.5 Product Count per Category

5. Payment Methods

6. Conclusion

6.1 Three‑year sales reached 1.78 billion.

6.2 2017 was a growth phase; 2018‑19 stabilized.

6.3 Sales show weekly cycles; Sunday is the lowest.

6.4 Overall repurchase rate is 87.91 % – high loyalty.

6.5 Monthly repurchase rate exceeds 66 %.

6.6 Monthly user retention rate exceeds 74 %.

6.7 Top three products account for 82 % of sales (Pareto).

6.8 80 % of products have cost below 36.

6.9 About 70 % of users prefer cash.

Python Crawling & Data Mining

How this landed with the community

Was this worth your time?

0 Comments

4.3.4 Purchase Frequency Distribution (April 2018)

4.3.5 RFM Customer Segmentation (April 2018)

4.4.2 Product Clustering (April 2018)

6.1 Three‑year sales reached 1.78 billion.

6.4 Overall repurchase rate is 87.91 % – high loyalty.

6.5 Monthly repurchase rate exceeds 66 %.

6.6 Monthly user retention rate exceeds 74 %.

6.7 Top three products account for 82 % of sales (Pareto).

6.8 80 % of products have cost below 36.

6.9 About 70 % of users prefer cash.