Big Data 20 min read

URL Normalization and Statistical Analysis in MDAP Using Probabilistic and Machine Learning Techniques

MDAP normalizes URLs by automatically learning pattern‑tree rule models using entropy‑based splits, gibberish and numeric detection, and scalable Flink processing, which groups millions of raw URLs into concise patterns for accurate statistical monitoring, dramatically reducing data noise while still facing latency and model‑iteration challenges.

Shopee Tech Team
Shopee Tech Team
Shopee Tech Team
URL Normalization and Statistical Analysis in MDAP Using Probabilistic and Machine Learning Techniques

In traditional client‑side monitoring, URLs are counted individually. When an application accesses thousands of distinct URLs, the results become noisy and it is hard to pinpoint problematic URLs.

The MDAP (Multi‑Dimension Analysis Platform) addresses this by applying probabilistic statistics and machine learning to group similar URLs into a single rule model. Statistics are then performed on the url_pattern field instead of the raw url , greatly improving the usability and accuracy of URL‑based monitoring.

1. Introduction

Counting raw URLs directly (e.g., with a simple SQL query) produces poor results:

SELECT `url`, count(1) as `cnt`
FROM `web_analysis_tab`
WHERE `app_name` = 'app_1'
GROUP BY `url`;

The issues include:

Developers cannot quickly and accurately locate potential problems.

Results are overly scattered, reducing user interest.

Processing such dispersed data incurs high system overhead (query efficiency, network transfer, UI rendering).

For example, an app may visit 1,000,000 distinct URLs while its URL rule model contains fewer than 100 entries.

2. Problem Thinking

Three solution directions were evaluated.

2.1 User‑Configured Scheme

Allowing users to upload URL rule configurations was found infeasible because:

Configuration is cumbersome and error‑prone.

Different languages/frameworks have vastly different routing syntaxes (e.g., GET http://example.com/users/:user_name for Go/Gin, GET http://example.com/{name=users/*} for grpc‑gateway, GET http://example.com/users/{user_name} for Java/Spring).

External URLs (Google, Facebook, etc.) constitute 10‑30% of traffic and users rarely know their patterns.

Thus, a user‑configuration approach is not viable .

2.2 Machine‑Learning Scheme

2.2.1 URL Grammar Overview

A URL can be decomposed into components: schema , authority , path , query , and fragment . For example:

schema: http
authority: example.com
path: {"path0": "books", "path1": "search"}
query: {"name": "go", "isbn": "1234"}

MDAP focuses on path and query , converting a URL into a tuple (authority, Array[Tuple(K, V)]) :

(
    "example.com",
    [
        ("path0", "books"),
        ("path1", "search"),
        ("name", "go"),
        ("isbn", "1234")
    ]
)

2.2.2 Bottom‑Up Pairing Strategy

Eight sample URLs are transformed into KV structures (e.g., U1 → {"K1": "a", "K2": "b", "K3": "y", "K4": "c", "K5": "*"} ). Pairing proceeds as:

U5 + U6 → P1 : ({"K1": "a", "K2": "b", "K3": "y", "K4": "c", "K5": "*"})

U7 + U8 → P2 : ({"K1": "a", "K2": "b", "K3": "z", "K4": "c", "K5": "*"})

P1 + P2 → P3 : ({"K1": "a", "K2": "b", "K3": "*", "K4": "c", "K5": "*"})

If any intermediate URL (e.g., U6) is missing, the chain breaks, illustrating the fragility of pure pairing.

2.2.3 Pattern‑Tree Approach

Pattern trees leverage the entire training set, making the learning process more robust against missing URLs. They also enable direct rule extraction without intermediate pair steps.

3. Framework Overview

The MDAP pipeline consists of a URL model learner and a URL model matcher.

The learner ingests reported URLs, automatically discovers rule models, and stores them.

The matcher applies the learned models to incoming URLs, generating Tuple(url, url_pattern) records for downstream analytics.

Given the massive volume of URL data, MDAP uses Apache Flink for scalable processing:

Read URL data from the source.

Transform each URL into (authority, Array[Tuple(K, V)]) .

Group by authority + salt (the salt mitigates data skew, e.g., using length(Array[Tuple(K, V)]) ).

Within each group, build a pattern‑tree to generate URL rule models.

Re‑group by authority and merge models.

Persist the final URL rule models.

The matcher employs a customized Trie structure for fast rule lookup.

4. Algorithm Description

The pattern‑tree construction follows six steps:

Initialize the root node containing all URLs.

Identify the URL key (either path_index or query_key ) whose value distribution yields the lowest entropy.

Separate the values of that key into salient values and trivial values .

Retain salient values, replace trivial values with * , and split the tree accordingly.

Repeat steps 2‑4 until all keys are processed.

Collect rule models from leaf nodes.

4.1 Entropy‑Based Key Selection

Entropy measures randomness of a key’s value distribution. The key with the smallest entropy is chosen for splitting. The entropy formula is:

where V is the number of distinct values, N is the total count, and vi is the frequency of the i‑th value.

4.2 Distinguishing Salient and Trivial Values

4.2.1 Gaussian‑Based Separation

Assuming values follow a Gaussian distribution, values are sorted by frequency. The two points with the greatest drop in frequency define a boundary; values left of the boundary are salient, right are trivial.

["index", "users", "books", "videos"]
["0", "12323", "a3df56", "bher43"]

4.2.2 Markov‑Chain and Density‑Function Pruning

Gaussian separation can misclassify when the distribution is flat. MDAP therefore adds a pruning stage using:

Gibberish detection (non‑human strings, e.g., hashes, base64).

Numeric‑heavy strings (e.g., IDs).

The pruning algorithm:

Check if the string appears in a predefined abbreviation list – if yes, keep as salient.

Apply a Markov‑chain‑based gibberish detector (2‑gram model) – if gibberish, mark as trivial.

Assess numeric density – if high, mark as trivial; otherwise, keep as salient.

Markov‑Chain Gibberish Detection

A 2‑gram model is trained on large corpora to build a probability matrix. The matrix is then used to compute a likelihood score for a given string; scores below a threshold indicate gibberish.

Numeric Density Detection

Strings dominated by digits (e.g., 1234 ) are treated as trivial, whereas version‑like strings (e.g., v1 ) remain salient. The decision uses a density‑based formula.

5. Optimization Tests and Results

5.1 Algorithm Optimization Tests

5.1.1 Compression Rate Test

Training data of 100 k–2 M raw URLs were reduced to 10–16 k distinct URLs after simple deduplication. After applying the pattern‑tree algorithm, only 1–85 URL rule models remained.

5.1.2 Matching Rate Test

Two test sets were used: Test‑1 (same day, different time) achieved ≥99.99% hit rate; Test‑2 (one week later) achieved 80.89%–100% hit rate.

5.1.3 Test Conclusions

Pattern‑tree generated rule models dramatically improve result convergence.

Matching accuracy degrades with increasing time gap between training and testing data.

5.2 Effect Demonstration

MDAP uses a T + 1 matching strategy, yielding an average failure rate of 0.3%. Visual comparison shows that distinct‑deduplicated URLs (~8 110 entries) produce a cluttered view, whereas rule‑model‑based statistics (~60 entries) are concise and insightful.

6. Summary and Outlook

MDAP’s pattern‑tree approach successfully normalizes URLs and enhances statistical analysis. Remaining challenges include:

Long rule‑learning latency, limiting near‑real‑time processing.

Incomplete model‑iteration capabilities.

Future work will focus on reducing learning cycles and strengthening model iteration.

References

A pattern tree‑based approach to learning URL normalization rules

Gibberish‑Detector GitHub repository

URL – Wikipedia

Big Datamachine learningflinkpattern treeprobabilistic analysisURL normalization
Shopee Tech Team
Written by

Shopee Tech Team

How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.