
Phishing Website Detection Using Machine Learning Models in R

This article presents a step‑by‑step machine‑learning analysis of the UCI Phishing Websites dataset in R, loading the data, training boosted logistic regression, SVM, tree‑bagging, and random‑forest models, comparing their accuracies, and identifying the most important predictive features for phishing detection.

Architects Research Society

Phishing website detection is crucial for online security, and this article demonstrates a practical analysis using the UCI Phishing Websites dataset in R.

First, the CSV file is loaded, column names are assigned, and all features are treated as factors. The dataset contains 30 features and a binary target indicating phishing or legitimate sites.

After splitting the data with createDataPartition (75% training, 25% testing), several models are trained using the caret package with 5‑fold cross‑validation repeated five times.

Example code to load libraries and read the data:

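library(caret)   # model training and evaluation
library(doMC)    # parallel backend (Unix-like systems only)
registerDoMC(4)  # use 4 worker processes

# The file has no header row; read every column as a factor
data <- read.csv('Datasets/phising.csv', header=FALSE, colClasses="factor")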
names(data) <- c("has_ip","long_url","short_service","has_at","double_slash_redirect","pref_suf","has_sub_domain","ssl_state","long_domain","favicon","port","https_token","req_url","url_of_anchor","tag_links","SFH","submit_to_email","abnormal_url","redirect","mouseover","right_click","popup","iframe","domain_Age","dns_record","traffic","page_rank","google_index","links_to_page","stats_report","target")
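The model code in this article uses training and testing data frames, but the split itself is not shown. A minimal sketch of a 75/25 split with createDataPartition might look like the following. The seed, the index name train_idx, and the small stand-in data frame are assumptions made for illustration; in the real analysis the data object loaded above would be used instead.

```r
library(caret)

set.seed(42)  # assumed seed for reproducibility, not from the article

# Stand-in for the loaded dataset so the sketch runs on its own
# (hypothetical values; the real data has 30 factor features plus target)
data <- data.frame(
  has_ip    = factor(sample(c("-1", "1"), 200, replace = TRUE)),
  ssl_state = factor(sample(c("-1", "0", "1"), 200, replace = TRUE)),
  target    = factor(sample(c("-1", "1"), 200, replace = TRUE))
)

# Stratified split on the target: about 75% of each class goes to training
train_idx <- createDataPartition(data$target, p = 0.75, list = FALSE)
training  <- data[train_idx, ]
testing   <- data[-train_idx, ]

# Every row lands in exactly one of the two sets
nrow(training) + nrow(testing) == nrow(data)
```

Because createDataPartition samples within each class of the target, the phishing/legitimate ratio stays roughly the same in both sets.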

Models trained include:

# Shared resampling setup: 5-fold cross-validation repeated 5 times
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)

# Boosted Logistic Regression (column 31 is the target, dropped before predicting)
log.fit <- train(target ~ ., data=training, method='LogitBoost', trControl=fitControl, tuneLength=5)
log.predict <- predict(log.fit, testing[,-31])
confusionMatrix(log.predict, testing$target)

# SVM with RBF kernel
rbfsvm.fit <- train(target ~ ., data=training, method='svmRadial', trControl=fitControl, tuneLength=5)
rbfsvm.predict <- predict(rbfsvm.fit, testing[,-31])
confusionMatrix(rbfsvm.predict, testing$target)

# Tree bagging
treebag.fit <- train(target ~ ., data=training, method='treebag', importance=TRUE, trControl=fitControl)
treebag.predict <- predict(treebag.fit, testing[,-31])
confusionMatrix(treebag.predict, testing$target)

# Random Forest
rf.fit <- train(target ~ ., data=training, method='rf', importance=TRUE, trControl=fitControl, tuneLength=5)
rf.predict <- predict(rf.fit, testing[,-31])
confusionMatrix(rf.predict, testing$target)

The test accuracies are 93.57% for boosted logistic regression, 97.06% for the SVM, 97.39% for tree bagging, and 97.39% for the random forest.
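Accuracy figures like these can be read from the object that confusionMatrix returns. The sketch below uses two short hypothetical label vectors (truth and pred, standing in for testing$target and a model's predictions) to show where the number lives. It illustrates the mechanism only and does not reproduce the article's results.

```r
library(caret)

# Hypothetical labels and predictions, eight cases each
truth <- factor(c("1", "1", "-1", "-1", "1", "-1", "1", "-1"), levels = c("-1", "1"))
pred  <- factor(c("1", "1", "-1", "1", "1", "-1", "-1", "-1"), levels = c("-1", "1"))

cm <- confusionMatrix(pred, truth)

# Overall accuracy is stored in the named vector cm$overall
acc <- unname(cm$overall["Accuracy"])
acc  # 6 of 8 predictions match, so 0.75
```

Collecting this value for each fitted model gives a compact side-by-side comparison.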

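The random forest model is used to extract feature importance. Each name below combines a feature with one of its factor levels, because the formula interface expands every factor into dummy variables (pref_suf-1 is pref_suf at level -1, ssl_state1 is ssl_state at level 1). Scores are scaled so that the strongest predictor is 100. The top ten most important predictors are: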

pref_suf-1 (100.00)
url_of_anchor-1 (85.89)
ssl_state1 (84.59)
has_sub_domain-1 (69.18)
traffic1 (64.39)
req_url-1 (43.23)
url_of_anchor1 (37.58)
long_domain-1 (36.00)
domain_Age-1 (34.68)
domain_Age1 (29.54)
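A ranking of this kind can be produced with caret's varImp function. The following is a minimal sketch on a small synthetic data frame: the toy data, the 3-fold control object ctrl, and the seed are assumptions chosen so the example runs quickly, and the scores it prints are unrelated to those listed above.

```r
library(caret)

set.seed(1)  # assumed seed, for illustration only
n <- 200
lv2 <- c("-1", "1")

# Hypothetical toy data: the target depends on pref_suf and ssl_state, not on noise
toy <- data.frame(
  pref_suf  = factor(sample(lv2, n, replace = TRUE), levels = lv2),
  ssl_state = factor(sample(c("-1", "0", "1"), n, replace = TRUE),
                     levels = c("-1", "0", "1")),
  noise     = factor(sample(lv2, n, replace = TRUE), levels = lv2)
)
toy$target <- factor(ifelse(toy$pref_suf == "1" & toy$ssl_state != "-1", "1", "-1"),
                     levels = lv2)

# Lightweight resampling so the sketch finishes quickly
ctrl <- trainControl(method = "cv", number = 3)
rf.fit <- train(target ~ ., data = toy, method = "rf",
                importance = TRUE, trControl = ctrl, tuneLength = 2)

# varImp scales the scores to a 0-100 range by default
imp <- varImp(rf.fit)
print(imp, top = 4)  # one row per dummy variable, e.g. pref_suf1, ssl_state0
```

On the full dataset, print(varImp(rf.fit), top = 10) would list the ten highest-scoring dummy variables in the same style.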

These results illustrate that relatively simple machine‑learning models can achieve high detection rates, and the identified features provide valuable insight for building effective phishing detection systems.
