Phishing Website Detection Using Machine Learning Models in R
This article presents a step-by-step machine-learning analysis of the UCI Phishing Websites dataset in R: loading the data, training boosted logistic regression, SVM, tree-bagging, and random-forest models, comparing their test accuracies, and identifying the most important predictors for phishing detection.
Phishing website detection is crucial for online security, and this article demonstrates a practical analysis using the UCI Phishing Websites dataset in R.
First, the CSV file is loaded, column names are assigned, and all features are treated as factors. The dataset contains 30 features and a binary target indicating phishing or legitimate sites.
After splitting the data with createDataPartition (75% training, 25% testing), several models are trained using the caret package with repeated 5‑fold cross‑validation.
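A minimal sketch of that split, assuming `data` has been loaded as in the snippet below (the seed value here is an assumption; the original does not state one):

```r
library(caret)

# Reproducible, stratified 75/25 split on the target column
set.seed(42)                                   # assumed seed, not from the source
in_train <- createDataPartition(data$target, p = 0.75, list = FALSE)
training <- data[in_train, ]
testing  <- data[-in_train, ]
```

Because `createDataPartition` stratifies on `target`, the phishing/legitimate class proportions are preserved in both partitions.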
Example code to load libraries and read the data:
library(caret)
library(doMC)
registerDoMC(4)
data <- read.csv('Datasets/phising.csv', header=F, colClasses="factor")
names(data) <- c("has_ip","long_url","short_service","has_at","double_slash_redirect","pref_suf","has_sub_domain","ssl_state","long_domain","favicon","port","https_token","req_url","url_of_anchor","tag_links","SFH","submit_to_email","abnormal_url","redirect","mouseover","right_click","popup","iframe","domain_Age","dns_record","traffic","page_rank","google_index","links_to_page","stats_report","target")

Models trained include:
# Boosted Logistic Regression
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
log.fit <- train(target ~ ., data=training, method='LogitBoost', trControl=fitControl, tuneLength=5)
log.predict <- predict(log.fit, testing[,-31])
confusionMatrix(log.predict, testing$target)

# SVM with RBF kernel
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
rbfsvm.fit <- train(target ~ ., data=training, method='svmRadial', trControl=fitControl, tuneLength=5)
rbfsvm.predict <- predict(rbfsvm.fit, testing[,-31])
confusionMatrix(rbfsvm.predict, testing$target)

# Tree bagging
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
treebag.fit <- train(target ~ ., data=training, method='treebag', importance=TRUE, trControl=fitControl)
treebag.predict <- predict(treebag.fit, testing[,-31])
confusionMatrix(treebag.predict, testing$target)

# Random Forest
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
rf.fit <- train(target ~ ., data=training, method='rf', importance=TRUE, trControl=fitControl, tuneLength=5)
rf.predict <- predict(rf.fit, testing[,-31])
confusionMatrix(rf.predict, testing$target)

The test accuracies are 93.57% for boosted logistic regression, 97.06% for the SVM, 97.39% for tree bagging, and 97.39% for the random forest.
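Since all four models were fitted with caret under the same resampling scheme, their cross-validation results can also be compared directly with `resamples` (a sketch, assuming the four `train` objects above exist in the session):

```r
library(caret)

# Collect the repeated-CV results of all four fitted models
results <- resamples(list(LogitBoost = log.fit,
                          SVM        = rbfsvm.fit,
                          TreeBag    = treebag.fit,
                          RF         = rf.fit))

summary(results)   # accuracy and kappa distributions per model
bwplot(results)    # side-by-side box plots of the resampled accuracies
```

Unlike a single held-out test accuracy, this view shows the spread across resamples, which helps judge whether the gap between tree bagging and the random forest is meaningful.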
The random forest model is used to extract feature importance. The top ten most important predictors are:
pref_suf-1 (100.00)
url_of_anchor-1 (85.89)
ssl_state1 (84.59)
has_sub_domain-1 (69.18)
traffic1 (64.39)
req_url-1 (43.23)
url_of_anchor1 (37.58)
long_domain-1 (36.00)
domain_Age-1 (34.68)
domain_Age1 (29.54)

These results illustrate that relatively simple machine-learning models can achieve high detection rates, and the identified features provide valuable insight for building effective phishing detection systems.
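A ranking like the one above can be obtained from the fitted random forest with caret's `varImp` (a sketch, assuming `rf.fit` from the training step above; with `scale = TRUE` the top predictor is rescaled to 100, matching the scores listed):

```r
library(caret)

# Scaled variable importance from the random-forest fit
rf_imp <- varImp(rf.fit, scale = TRUE)

print(rf_imp)           # full ranking of all predictors
plot(rf_imp, top = 10)  # the ten most important predictors
```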
Architects Research Society