
Mastering Data Mining: A Deep Dive into CRISP‑DM and SEMMA Methodologies

This article explains two widely used data‑mining frameworks, CRISP‑DM and SEMMA. It walks through their six and five stages respectively, with a diagram for each framework, and highlights how the iterative nature of data mining drives continuous improvement.

Python Crawling & Data Mining

CRISP‑DM Methodology

CRISP‑DM (Cross‑Industry Standard Process for Data Mining) was distilled from the project experience of NCR, OHRA, and Daimler‑Benz together with SPSS, the vendor of the Clementine data‑mining tool, which also promoted the method. It divides the data‑mining project lifecycle into six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The diagram (Figure 1) illustrates these stages; their order is not rigid and can change depending on intermediate results and the task at hand.

Figure 1. CRISP-DM diagram

The outer loop in the figure represents the iterative nature of data mining, emphasizing that insights from one project can guide the next.

1) Business Understanding: Define project goals and requirements from a business perspective, identify actionable data‑mining problems, and draft an initial plan.

2) Data Understanding: Collect raw data, explore its quality, identify interesting subsets, and form hypotheses about relationships.

3) Data Preparation: Transform and clean raw data to create the dataset needed for mining, often repeating tasks without a fixed order.

4) Modeling: Select and apply appropriate modeling techniques, calibrate parameters, and possibly return to data preparation if required.

5) Evaluation: Assess the model against business objectives and ensure all critical issues have been considered.

6) Deployment: After the model is built and evaluated, the user decides how to apply it in the operational environment.
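
The flexible ordering of the six phases can be sketched as a small state machine in Python. This is only an illustration under stated assumptions, not part of the CRISP‑DM specification: the names Phase, BACK_EDGES, run_project, and the needs_rework callback are hypothetical, and only two backward transitions are modeled (Modeling back to Data Preparation, as noted in step 4, and Evaluation back to Business Understanding, as drawn in the standard CRISP‑DM diagram).

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumption: only these two backward transitions are modeled here.
BACK_EDGES = {
    Phase.MODELING: Phase.DATA_PREPARATION,
    Phase.EVALUATION: Phase.BUSINESS_UNDERSTANDING,
}


def run_project(needs_rework, max_steps=50):
    """Walk the phases in order and return the list of phases visited.

    needs_rework(phase, visit_count) is a hypothetical callback that
    returns True when the project should step back from `phase`.
    """
    order = list(Phase)
    visits = {phase: 0 for phase in order}
    trace = []
    index = 0
    for _ in range(max_steps):  # guard against endless rework loops
        phase = order[index]
        visits[phase] += 1
        trace.append(phase)
        if phase is Phase.DEPLOYMENT:
            return trace
        if phase in BACK_EDGES and needs_rework(phase, visits[phase]):
            index = order.index(BACK_EDGES[phase])
        else:
            index += 1
    raise RuntimeError("project did not reach deployment within max_steps")


# Example: the first modeling attempt sends the project back to data preparation.
trace = run_project(lambda phase, count: phase is Phase.MODELING and count == 1)
print([phase.name for phase in trace])
```

In this example the trace visits Data Preparation and Modeling twice before moving on to Evaluation and Deployment, which is one simple way to picture the iteration that the outer loop of the diagram stands for.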

SEMMA Methodology

Developed by SAS, SEMMA (Sample, Explore, Modify, Model, Assess) follows a similar flow to CRISP‑DM. It starts with defining the business problem, assessing the environment, preparing data, then iteratively performing mining steps, and finally deploying and reviewing the model.

SEMMA diagram

1) Sample: Gather and combine data to construct the analysis dataset.

2) Explore: Examine data quality (errors, missing values, inconsistencies) and variable distributions, deciding which variables need cleaning or transformation.

3) Modify: Apply corrections, fill missing values, unify units, and perform transformations or standardization as required for modeling.

4) Model: Choose suitable models based on analysis goals (details omitted here).

5) Assess: Perform in‑sample validation, using metrics such as ROC curves or lift charts to evaluate predictive performance.
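
The five steps above can be sketched end to end on a toy dataset using only the Python standard library. Everything here is an assumption made for illustration: the records, the field names income and default, the median fill, and the scoring rule in model (lower income means higher risk) are hypothetical stand-ins, not a method taken from SAS or from the book. The assess step computes the area under the ROC curve as the share of positive/negative pairs ranked correctly, on the same rows used for the earlier steps, that is, in‑sample.

```python
import statistics

# Hypothetical records from two sources; None marks a missing value.
branch_a = [
    {"income": 20, "default": 1},
    {"income": None, "default": 1},
    {"income": 35, "default": 0},
]
branch_b = [
    {"income": 50, "default": 0},
    {"income": 28, "default": 1},
    {"income": 60, "default": 0},
]


def sample(*sources):
    """Sample: gather and combine records into one analysis dataset."""
    return [dict(row) for source in sources for row in source]


def explore(rows, field):
    """Explore: count missing values and summarize the distribution."""
    present = [r[field] for r in rows if r[field] is not None]
    return {"missing": len(rows) - len(present),
            "median": statistics.median(present)}


def modify(rows, field, fill_value):
    """Modify: fill missing values, then standardize to z-scores."""
    filled = [fill_value if r[field] is None else r[field] for r in rows]
    mean = statistics.mean(filled)
    spread = statistics.pstdev(filled)
    return [(value - mean) / spread for value in filled]


def model(z_income):
    """Model: stand-in scoring rule (assumption): lower income, higher risk."""
    return [-z for z in z_income]


def assess(scores, labels):
    """Assess: ROC AUC as the fraction of correctly ranked pairs (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rows = sample(branch_a, branch_b)
report = explore(rows, "income")
z_income = modify(rows, "income", report["median"])
scores = model(z_income)
auc = assess(scores, [r["default"] for r in rows])
print(report, round(auc, 3))
```

In a real project the model step would fit an actual model, and each function would typically be revisited several times as exploration uncovers new data problems.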

This article is excerpted from “Financial Business Algorithm Modeling: Based on Python and SAS”, published with permission.
