Mastering Data Mining: A Deep Dive into CRISP‑DM and SEMMA Methodologies
This article explains the two most common data‑mining frameworks, CRISP‑DM and SEMMA. It details their six and five stages respectively, illustrates the phases with diagrams, and highlights how the iterative nature of data mining drives continuous improvement.
CRISP‑DM Methodology
CRISP‑DM (Cross‑Industry Standard Process for Data Mining), originally distilled from project experience at NCR, Clementine, OHRA and Daimler‑Benz and promoted by SPSS, divides the data‑mining project lifecycle into six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The diagram (Figure 1) illustrates these phases. Their order is not rigid: a project may move back and forth between phases depending on the results of each phase and the task at hand.
The outer loop in the figure represents the iterative nature of data mining, emphasizing that insights from one project can guide the next.
1) Business Understanding: Define project goals and requirements from a business perspective, identify actionable data‑mining problems, and draft an initial plan.
2) Data Understanding: Collect raw data, explore its quality, identify interesting subsets, and form hypotheses about relationships.
3) Data Preparation: Transform and clean raw data to create the dataset needed for mining, often repeating tasks without a fixed order.
4) Modeling: Select and apply appropriate modeling techniques, calibrate parameters, and possibly return to data preparation if required.
5) Evaluation: Assess the model against business objectives and ensure all critical issues have been considered.
6) Deployment: After the model is built and evaluated, the user decides how to apply it in the operational environment.
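The back-and-forth between phases described above can be sketched as a small transition table. This is only an illustration, not part of any official CRISP‑DM tooling: the names `Phase`, `ALLOWED` and `is_valid_path` are hypothetical, and the set of allowed moves is an assumption based on the arrows in the commonly reproduced CRISP‑DM diagram.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed transitions, following the arrows of the commonly drawn diagram.
ALLOWED = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    # Modeling may send the project back to data preparation.
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    # Evaluation either restarts from the business question or moves on.
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    # Outer loop: lessons from one project feed the next one.
    Phase.DEPLOYMENT: {Phase.BUSINESS_UNDERSTANDING},
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


# A project that returns once from Modeling to Data Preparation.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]
print(is_valid_path(project))
```

Under these assumed transitions, a path that jumps straight from Business Understanding to Deployment is rejected, while the loop between Data Preparation and Modeling is accepted.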
SEMMA Methodology
Developed by SAS, SEMMA (Sample, Explore, Modify, Model, Assess) follows a similar flow to CRISP‑DM. It starts with defining the business problem, assessing the environment, preparing data, then iteratively performing mining steps, and finally deploying and reviewing the model.
1) Sample: Gather and combine data to construct the analysis dataset.
2) Explore: Examine data quality (errors, missing values, inconsistencies) and variable distributions, deciding which variables need cleaning or transformation.
3) Modify: Apply corrections, fill missing values, unify units, and perform transformations or standardization as required for modeling.
4) Model: Choose suitable models based on analysis goals (details omitted here).
5) Assess: Perform in‑sample validation, using metrics such as ROC curves or lift charts to evaluate predictive performance.
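The five SEMMA steps can be walked through on a toy dataset using only the Python standard library. Everything below is an assumption made for illustration: the synthetic `income`/`defaulted` records, the mean imputation, the hand-written scoring rule that stands in for a fitted model, and the helper names `roc_auc` and `lift_at`. It is a sketch of the flow, not the procedure used by SAS tools or by the book this article is excerpted from.

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Hypothetical population: (income, defaulted); None marks a missing value.
population = []
for _ in range(1000):
    defaulted = 1 if random.random() < 0.3 else 0
    income = random.gauss(40 if defaulted else 60, 10)
    if random.random() < 0.05:
        income = None
    population.append((income, defaulted))

# 1) Sample: draw the analysis dataset.
sample = random.sample(population, 200)

# 2) Explore: count missing values in the input variable.
n_missing = sum(1 for income, _ in sample if income is None)

# 3) Modify: fill missing values with the mean, then standardize.
observed = [income for income, _ in sample if income is not None]
mu, sigma = mean(observed), pstdev(observed)
z = [((income if income is not None else mu) - mu) / sigma for income, _ in sample]
labels = [defaulted for _, defaulted in sample]

# 4) Model: a stand-in rule (lower income means higher risk), not a fitted model.
scores = [-value for value in z]


# 5) Assess: area under the ROC curve and lift, computed on the same sample.
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(labels, scores, fraction):
    """Positive rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    return top_rate / (sum(labels) / len(labels))


print(f"missing={n_missing}, auc={roc_auc(labels, scores):.3f}, "
      f"lift@20%={lift_at(labels, scores, 0.2):.2f}")
```

The metrics here are computed on the same sample that was explored and modified, matching the in-sample validation described in the Assess step. A held-out sample would usually give a more conservative estimate of predictive performance.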
This article is excerpted from “Financial Business Algorithm Modeling: Based on Python and SAS”, published with permission.