How Do Process Mining Algorithms Compare? A Deep Dive into Control‑Flow Techniques
This article examines the rapid growth of process‑mining techniques, explains the core concepts and modeling languages, evaluates model quality criteria, and provides a detailed comparison of direct, heuristic, genetic, and log‑classification algorithms from the perspective of control‑flow analysis.
Introduction
Enterprises increasingly focus on process governance, driving rapid development of process‑mining techniques and a proliferation of algorithms. Effective analysis requires understanding each algorithm’s strengths and weaknesses and comparing them using the control‑flow view.
Process‑Mining Overview
Process mining extracts knowledge from event logs recorded by information systems to discover, monitor, and improve actual business processes. It comprises three main tasks: process discovery, conformance checking, and process enhancement.
Control‑flow View and Modeling Languages
The control‑flow view captures the order of activities and the branching structure of a process. Common modeling formalisms include:
Petri nets – bipartite graphs of places, transitions and tokens, supporting concurrency.
Workflow nets (WF‑nets) – a subclass of Petri nets with a single source place and a single sink place, suitable for α‑algorithm discovery.
Causal nets (C‑nets) – activity‑centric representation using input and output bindings without explicit places.
BPMN – graphical notation with activities, events and gateways (AND, XOR, OR).
Process trees – hierarchical representation using four operators: sequence (→), exclusive choice (×), parallel (∧) and redo loop (↺).
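To make the process-tree idea concrete, here is a minimal sketch in Python, assuming a nested-tuple encoding of my own choosing (the operator names and activity labels are illustrative, not from the source):

```python
# A process tree as nested tuples: (operator, child1, child2, ...); leaves are
# activity names. Operators: "seq" (→), "xor" (×), "and" (∧), "loop" (↺).
tree = ("seq", "register",
        ("xor", ("and", "check_history", "check_ticket"), "skip_checks"),
        "decide")

def leaves(node):
    """Collect the activity labels of a process tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [a for child in node[1:] for a in leaves(child)]
```

The hierarchy makes block structure explicit: every subtree is a single-entry, single-exit fragment, which is why genetic miners often use this encoding.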
Event Log Characteristics
An event log consists of a set of traces, each trace being an ordered list of events. Each event records at least an activity name, a timestamp, a case identifier and optionally resources, costs, or data objects. Two practical issues affect mining:
Noise – low‑frequency events caused by logging errors or exceptional executions. Some algorithms (heuristic miner, genetic miner) incorporate noise handling.
Completeness – the degree to which the log contains all possible behavior. Formal completeness levels include global completeness (GC), direct succession (DS), short‑loop (DS+), long‑loop (DS++) and frequency completeness (FS).
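The trace structure described above can be sketched in a few lines of Python. The field names (`case`, `activity`, `timestamp`) and the helper `to_traces` are illustrative assumptions; real logs (e.g. XES files) carry richer attributes:

```python
from collections import defaultdict

# Hypothetical flat event records, as an information system might export them.
events = [
    {"case": "c1", "activity": "register", "timestamp": 1},
    {"case": "c2", "activity": "register", "timestamp": 2},
    {"case": "c1", "activity": "check", "timestamp": 3},
    {"case": "c2", "activity": "decide", "timestamp": 4},
    {"case": "c1", "activity": "decide", "timestamp": 5},
]

def to_traces(events):
    """Group events by case identifier and order each trace by timestamp."""
    cases = defaultdict(list)
    for e in sorted(events, key=lambda e: e["timestamp"]):
        cases[e["case"]].append(e["activity"])
    return dict(cases)

traces = to_traces(events)
```

Once events are grouped into traces, noise shows up as low-frequency trace variants and completeness as the fraction of possible orderings actually observed.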
Model‑quality Criteria
Four dimensions are used to evaluate discovered models:
Reproducibility (Fitness) – proportion of log traces that can be reproduced by the model.
Precision – extent to which the model does not allow behavior not seen in the log.
Generality – ability of the model to allow correct but unseen behavior (avoid over‑fitting).
Simplicity – preference for fewer elements while preserving the other criteria.
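As a toy illustration of the first two criteria, suppose a model is approximated by the finite set of traces it allows (real models permit infinite behavior, so practical metrics replay the log against a Petri net instead; this simplification and the function names are my own):

```python
def fitness(log, model_traces):
    """Trace-level fitness: fraction of log traces the model can reproduce."""
    return sum(tuple(t) in model_traces for t in log) / len(log)

def precision(log, model_traces):
    """Rough precision: fraction of modeled behavior actually seen in the log."""
    seen = {tuple(t) for t in log}
    return len(seen & model_traces) / len(model_traces)

log = [["a", "b"], ["a", "b"], ["a", "c"]]
model = {("a", "b"), ("a", "c"), ("a", "d")}  # allows one trace never observed
```

Here the model replays every log trace (perfect fitness) but also allows unseen behavior, so precision drops below 1; generality would reward exactly that extra behavior if it is plausible.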
Algorithm Families
1. Direct (α‑type) Algorithms
These algorithms derive four ordering relations (direct succession, causality, parallel, unrelated) from the log and construct five control‑flow patterns (sequence, exclusive choice, inclusive choice, parallel split, parallel join) to produce a WF‑net. They assume noise‑free, complete logs and cannot handle non‑free‑choice or short‑loop constructs.
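The four ordering relations can be derived directly from the direct-succession pairs in the log. A minimal sketch (relation symbols follow the usual α-algorithm notation; the function name is illustrative):

```python
def alpha_relations(log):
    """Classify every activity pair by the alpha-algorithm ordering relations."""
    activities = {a for trace in log for a in trace}
    # Direct succession: a > b iff a is immediately followed by b in some trace.
    direct = {(a, b) for trace in log for a, b in zip(trace, trace[1:])}
    rel = {}
    for a in activities:
        for b in activities:
            ab, ba = (a, b) in direct, (b, a) in direct
            if ab and not ba:
                rel[(a, b)] = "->"   # causality
            elif ab and ba:
                rel[(a, b)] = "||"   # parallel
            elif not ab and not ba:
                rel[(a, b)] = "#"    # unrelated
            else:
                rel[(a, b)] = "<-"   # reverse causality
    return rel
```

Because the classification hinges entirely on which successions appear, a single missing or spurious pair flips a relation, which is exactly why these algorithms need noise-free, complete logs.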
2. Heuristic Miner
The heuristic miner builds a dependency/frequency matrix. For each activity pair (a,b) it computes a dependency measure:
dep(a,b) = (|a→b| − |b→a|) / (|a→b| + |b→a| + 1)

Pairs with dep(a,b) above a user‑defined threshold are kept; low‑frequency relations are treated as noise. The resulting dependency graph is interpreted as sequence, choice or parallel structures, yielding a C‑net or Petri net.
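The dependency measure above translates directly into code, where |a→b| is the number of times a is immediately followed by b across all traces (the function name is illustrative):

```python
from collections import Counter

def dependency(log, a, b):
    """Heuristic-miner dependency measure dep(a,b); values lie in (-1, 1)."""
    # Count every direct succession (x, y) across all traces.
    direct = Counter((x, y) for trace in log for x, y in zip(trace, trace[1:]))
    ab, ba = direct[(a, b)], direct[(b, a)]
    return (ab - ba) / (ab + ba + 1)
```

The +1 in the denominator keeps rare observations from producing overconfident values: a single a→b occurrence yields dep = 0.5, not 1.0, so one-off noise stays below a typical threshold.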
3. Genetic Algorithms
Process models are encoded as individuals (often as process trees). A fitness function combines the four quality criteria:
fitness = w1·Reproducibility + w2·Precision + w3·Generality + w4·Simplicity

Typical weights (w1…w4) are set by the user. The algorithm iterates selection, crossover and mutation to evolve a population toward higher fitness. This approach is robust to noise and incomplete logs but requires more computation.
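The weighted fitness and the evolution loop can be sketched generically. The weight values below are illustrative defaults, not prescribed by the literature, and the `evolve` skeleton abstracts over the model encoding (crossover and mutation would operate on process trees in a real miner):

```python
import random

def weighted_fitness(scores, weights=(0.5, 0.2, 0.2, 0.1)):
    """Combine (reproducibility, precision, generality, simplicity) scores,
    each in [0, 1], into a single fitness value. Weights are user-chosen."""
    return sum(w * s for w, s in zip(weights, scores))

def evolve(population, score, crossover, mutate, generations=50):
    """Generic GA skeleton: keep the fitter half, refill by crossover + mutation."""
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: len(population) // 2]
        children = [
            mutate(crossover(random.choice(survivors), random.choice(survivors)))
            for _ in range(len(population) - len(survivors))
        ]
        population = survivors + children
    return max(population, key=score)
```

Because fitness is evaluated by replaying the whole log against every candidate in every generation, the runtime cost grows quickly with log size, which is the trade-off noted above.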
4. Log‑classification (Cluster‑based) Approaches
To reduce log diversity, each trace is transformed into a feature vector (e.g., binary activity presence, frequency counts, data‑object attributes). Standard similarity measures (Euclidean, Hamming, Jaccard) are used to cluster traces into sub‑logs. A chosen mining algorithm (α‑algorithm, heuristic, genetic) is applied to each sub‑log, producing local models that can be merged into a global view. This improves understandability for highly heterogeneous processes.
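As one concrete sketch of this idea, traces can be compared by Jaccard similarity on their activity sets (binary presence features) and grouped with a simple greedy pass; the threshold, the greedy scheme, and the function names are illustrative assumptions, where a real system would use a proper clustering algorithm such as k-means or hierarchical clustering:

```python
def jaccard(t1, t2):
    """Jaccard similarity of two traces on their activity sets."""
    s1, s2 = set(t1), set(t2)
    return len(s1 & s2) / len(s1 | s2)

def cluster_traces(log, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose
    representative (first trace) is similar enough, else start a new one."""
    clusters = []
    for trace in log:
        for cluster in clusters:
            if jaccard(trace, cluster[0]) >= threshold:
                cluster.append(trace)
                break
        else:
            clusters.append([trace])
    return clusters
```

Each resulting cluster is a more homogeneous sub-log, so even a strict miner like the α-algorithm can produce a readable local model per cluster.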
Algorithm Comparison
Direct and heuristic miners are fast but struggle with noisy or highly diverse logs. Genetic algorithms handle noise and incompleteness at the cost of higher runtime. Log‑classification techniques mitigate diversity but depend on effective clustering and subsequent merging.
Technical Trends
Future work focuses on hybrid methods that combine the efficiency of direct/heuristic algorithms with the robustness of genetic or causal techniques, and on leveraging machine‑learning and increased computational power to improve model quality while scaling to large, noisy logs.
References
van der Aalst, W.M.P. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011.
Weijters, A.J.M.M., van der Aalst, W.M.P. “Process mining with the heuristics miner-algorithm”, 2006.
Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P. “A genetic algorithm for discovering process trees”, 2012 IEEE Congress on Evolutionary Computation (CEC).
AsiaInfo Technology: New Tech Exploration