Phishing Website Detection Using Machine Learning Models in R
This article presents a step-by-step machine-learning analysis of the UCI Phishing Websites dataset in R: loading the data, training boosted logistic regression, SVM, tree-bagging, and random-forest models, comparing their test accuracies, and identifying the most important predictors for phishing detection.
Phishing website detection is crucial for online security, and this article demonstrates a practical analysis using the UCI Phishing Websites dataset in R.
First, the CSV file is loaded, column names are assigned, and all features are treated as factors. The dataset contains 30 features and a binary target indicating phishing or legitimate sites.
After splitting the data with createDataPartition (75% training, 25% testing), several models are trained using the caret package with repeated 5‑fold cross‑validation.
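A minimal sketch of that split, assuming `data` has been loaded as in the snippet below (the seed value here is an assumption; the original does not state one):

```r
library(caret)

# Reproducible, stratified 75/25 split on the target column
set.seed(42)                                   # assumed seed, not from the source
in_train <- createDataPartition(data$target, p = 0.75, list = FALSE)
training <- data[in_train, ]
testing  <- data[-in_train, ]
```

Because `createDataPartition` stratifies on `target`, the phishing/legitimate class proportions are preserved in both partitions.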
Example code to load libraries and read the data:
library(caret)
library(doMC)
registerDoMC(4)
data <- read.csv('Datasets/phising.csv', header=F, colClasses="factor")
names(data) <- c("has_ip","long_url","short_service","has_at","double_slash_redirect","pref_suf","has_sub_domain","ssl_state","long_domain","favicon","port","https_token","req_url","url_of_anchor","tag_links","SFH","submit_to_email","abnormal_url","redirect","mouseover","right_click","popup","iframe","domain_Age","dns_record","traffic","page_rank","google_index","links_to_page","stats_report","target")

Models trained include:
# Boosted Logistic Regression
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
log.fit <- train(target ~ ., data=training, method='LogitBoost', trControl=fitControl, tuneLength=5)
log.predict <- predict(log.fit, testing[,-31])
confusionMatrix(log.predict, testing$target)

# SVM with RBF kernel
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
rbfsvm.fit <- train(target ~ ., data=training, method='svmRadial', trControl=fitControl, tuneLength=5)
rbfsvm.predict <- predict(rbfsvm.fit, testing[,-31])
confusionMatrix(rbfsvm.predict, testing$target)

# Tree bagging
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
treebag.fit <- train(target ~ ., data=training, method='treebag', importance=TRUE, trControl=fitControl)
treebag.predict <- predict(treebag.fit, testing[,-31])
confusionMatrix(treebag.predict, testing$target)

# Random Forest
fitControl <- trainControl(method='repeatedcv', repeats=5, number=5, verboseIter=TRUE)
rf.fit <- train(target ~ ., data=training, method='rf', importance=TRUE, trControl=fitControl, tuneLength=5)
rf.predict <- predict(rf.fit, testing[,-31])
confusionMatrix(rf.predict, testing$target)

The test accuracies are 93.57% for boosted logistic regression, 97.06% for the SVM, 97.39% for tree bagging, and 97.39% for the random forest.
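Since all four models were fitted with caret under the same resampling scheme, their cross-validation results can also be compared directly with `resamples` (a sketch, assuming the four `train` objects above exist in the session):

```r
library(caret)

# Collect the repeated-CV results of all four fitted models
results <- resamples(list(LogitBoost = log.fit,
                          SVM        = rbfsvm.fit,
                          TreeBag    = treebag.fit,
                          RF         = rf.fit))

summary(results)   # accuracy and kappa distributions per model
bwplot(results)    # side-by-side box plots of the resampled accuracies
```

Unlike a single held-out test accuracy, this view shows the spread across resamples, which helps judge whether the gap between tree bagging and the random forest is meaningful.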
The random forest model is used to extract feature importance. The top ten most important predictors are:
pref_suf-1 (100.00)
url_of_anchor-1 (85.89)
ssl_state1 (84.59)
has_sub_domain-1 (69.18)
traffic1 (64.39)
req_url-1 (43.23)
url_of_anchor1 (37.58)
long_domain-1 (36.00)
domain_Age-1 (34.68)
domain_Age1 (29.54)

These results illustrate that relatively simple machine-learning models can achieve high detection rates, and the identified features provide valuable insight for building effective phishing detection systems.
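A ranking like the one above can be obtained from the fitted random forest with caret's `varImp` (a sketch, assuming `rf.fit` from the training step above; with `scale = TRUE` the top predictor is rescaled to 100, matching the scores listed):

```r
library(caret)

# Scaled variable importance from the random-forest fit
rf_imp <- varImp(rf.fit, scale = TRUE)

print(rf_imp)           # full ranking of all predictors
plot(rf_imp, top = 10)  # the ten most important predictors
```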
Architects Research Society