News Page Identification Using Machine Learning: Feature Engineering, Model Selection, and Evaluation
To distinguish news pages from other web page types, this study formulates the task as a binary classification problem, extracts 19 engineered features from each page's HTML and URL, and evaluates logistic regression (LR) and SVM models with 5‑fold cross‑validation. The best configuration, LR trained with Newton's method, reaches an overall precision, recall, and F1‑score of 0.91.
Background – Web crawlers collect HTML pages from many international sites, including homepages, forums, news, lists, videos, downloads, galleries, etc. Simple heuristics based on word count and image size often misclassify novel, gallery, or video pages as news pages.
Goal – Treat news‑page identification as a binary classification problem, building a model that labels pages as news (positive) or non‑news (negative) with an F1‑score above 90%.
Metrics
Precision: proportion of correctly predicted positive samples among all predicted positives.
Recall: proportion of actual positives correctly identified.
F1‑score: harmonic mean of precision and recall, F1 = 2PR / (P + R). The more general Fβ score shifts the emphasis: F0.5 weights precision more heavily, F2 weights recall more heavily.
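As a small illustration of the three metrics, the sketch below computes precision, recall, and Fβ from confusion counts for the positive (news) class. The function name and the example counts are hypothetical and do not come from the study.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) for the positive class.

    tp: positives predicted as positive
    fp: negatives predicted as positive
    fn: positives predicted as negative
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts: 78 news pages found, 10 false alarms, 22 news pages missed.
p, r, f1 = precision_recall_fbeta(tp=78, fp=10, fn=22)
_, _, f_half = precision_recall_fbeta(tp=78, fp=10, fn=22, beta=0.5)
_, _, f2 = precision_recall_fbeta(tp=78, fp=10, fn=22, beta=2.0)
```

Because recall is lower than precision in this example, F2 (recall‑weighted) comes out below F1, and F0.5 (precision‑weighted) comes out above it.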
Identification Algorithm Flow
Step Details
5.1 Data Selection – Randomly sample diverse HTML pages from the crawl, ensuring coverage of homepages, forums, downloads, galleries, news, lists, videos, etc. The final dataset contains about 1,000 pages.
5.2 Data Cleaning – Remove malformed HTML, strip useless tags (script, style, comments), and keep only clean, parsable content.
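A minimal sketch of the tag‑stripping part of this step, assuming a regex‑based pass is acceptable for well‑formed input. The function name `strip_noise` is hypothetical, and detection of malformed HTML is not covered here.

```python
import re

# Comments: <!-- ... -->, possibly spanning several lines.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# <script>...</script> and <style>...</style> blocks, including their bodies.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)


def strip_noise(html):
    """Remove comments, scripts, and stylesheets; leave other markup as-is."""
    html = _COMMENT.sub("", html)
    html = _SCRIPT_STYLE.sub("", html)
    return html


sample = '<p>a</p><script type="x">var s="<b>";</script><!-- c --><STYLE>p{}</STYLE><p>b</p>'
cleaned = strip_noise(sample)
```

Regular expressions are fragile on badly broken markup, so a real pipeline would more likely rely on an HTML parser; the regex version is used here only to keep the example short.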
5.3 Data Labeling – Manually label pages as news (positive) or non‑news (negative). Approximately 40% of the dataset is news pages, giving a positive‑to‑negative ratio of roughly 4:6.
5.4 Feature Engineering – Iteratively select effective features, ending with 19 key attributes (see table below).
| Index | Feature Name | Meaning | Remark |
| --- | --- | --- | --- |
| 1 | is_exist_author | Whether the page contains an author | Discriminative power: 16.5% contain |
| 2 | is_exist_title | Whether the page contains a title | Discriminative power: 41.7% contain |
| 3 | is_exist_publish_time | Whether the page contains a publish time | Discriminative power: 19.4% contain |
| 4 | is_include_date_URL | Whether the URL includes a date | Discriminative power: 10.3% contain |
| 5 | is_include_news_URL | Whether the URL includes the word "news" | Discriminative power: 11.2% contain |
| 6 | is_include_forum_URL | Whether the URL includes "forum" or "bbs" | Discriminative power: 12.6% contain |
| 7 | is_include_music_URL | Whether the URL includes music indicators (mp3, music, etc.) | Discriminative power: 6.1% contain |
| 8 | is_include_download_URL | Whether the URL includes "download" | Discriminative power: 5.8% contain |
| 9 | is_include_media_URL | Whether the URL includes media indicators (video, mp4, movie, etc.) | Discriminative power: 19.2% contain |
| 10 | is_include_image_set_URL | Whether the URL includes gallery/novel indicators | Discriminative power: 2.6% contain |
| 11 | is_include_torrent_URL | Whether the URL includes torrent indicators | Discriminative power: 1% contain |
| 12 | content_media_count | Number of media tags (`<img>`, `<video>`, `<audio>`) in the content | |
| 13 | page_media_count | Number of media tags on the whole page | |
| 14 | content_text_count | Number of text tags (`<p>`, `<i>`, `<b>`, `<u>`) in the content | |
| 15 | page_text_count | Number of text tags on the whole page | |
| 16 | content_link_count | Number of hyperlink tags (`<a>`) in the content | |
| 17 | page_link_count | Number of hyperlink tags on the whole page | |
| 18 | content_text_percent | Proportion of textual content in the page that belongs to the main content | |
| 19 | content_html_percent | Proportion of HTML markup belonging to the main content | |
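To make a few of these features concrete, the sketch below derives some of the URL flags and the page‑level tag counts using only the Python standard library. The keyword lists, the date pattern, and the helper names (`url_features`, `count_tags`) are assumptions for illustration; the source names only a handful of example keywords and does not describe how the main content is located.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative keyword lists (assumed, not the study's actual lists).
URL_KEYWORDS = {
    "is_include_news_URL": ("news",),
    "is_include_forum_URL": ("forum", "bbs"),
    "is_include_music_URL": ("mp3", "music"),
    "is_include_download_URL": ("download",),
    "is_include_media_URL": ("video", "mp4", "movie"),
}
# Assumed date shapes in the path: 2017/03/15, 2017-03-15, 2017_03_15, 20170315.
DATE_IN_URL = re.compile(
    r"(?<!\d)(19|20)\d{2}[/\-_]?(0[1-9]|1[0-2])[/\-_]?(0[1-9]|[12]\d|3[01])(?!\d)"
)

MEDIA_TAGS = {"img", "video", "audio"}
TEXT_TAGS = {"p", "i", "b", "u"}
LINK_TAGS = {"a"}


def url_features(url):
    """Return 0/1 flags for a subset of the URL-based features."""
    parsed = urlparse(url.lower())
    target = parsed.netloc + parsed.path
    feats = {
        name: int(any(k in target for k in kws))
        for name, kws in URL_KEYWORDS.items()
    }
    feats["is_include_date_URL"] = int(bool(DATE_IN_URL.search(parsed.path)))
    return feats


class _TagCounter(HTMLParser):
    """Counts media, text, and link start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.media = self.text = self.link = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag in MEDIA_TAGS:
            self.media += 1
        elif tag in TEXT_TAGS:
            self.text += 1
        elif tag in LINK_TAGS:
            self.link += 1


def count_tags(html):
    """Return page-level counts; pass a content fragment for the content_* variants."""
    counter = _TagCounter()
    counter.feed(html)
    counter.close()
    return {
        "page_media_count": counter.media,
        "page_text_count": counter.text,
        "page_link_count": counter.link,
    }


news_url = url_features("http://example.com/news/2017/03/15/story.html")
forum_url = url_features("http://bbs.example.com/thread-42.html")
tags = count_tags('<p>Hi <b>there</b></p><img src="a.png"/><a href="#">x</a>')
```

The content‑level counts (features 12, 14, 16) and the two percentage features would reuse the same counter on the extracted main‑content fragment, once some content‑extraction step has identified it.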
5.5 Model Selection – Binary classification algorithms considered: SVM, Logistic Regression (LR), Decision Trees, Random Forest, GBDT, Naïve Bayes, Neural Networks. Continuous features required discretization for tree‑based models, so the final candidates were LR and SVM.
5.5.1 Logistic Regression (LR) – Tested with two solvers: Newton's method with L2 regularization, and coordinate descent with L1 regularization. Because the dataset is small (about 1,000 pages), these two solvers were sufficient.
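For readers unfamiliar with the Newton variant, the following is a minimal pure‑Python sketch of L2‑regularized logistic regression fitted with full Newton steps. It is not the study's implementation: the names `fit_lr_newton` and `predict_proba`, the parameters `lam` and `iters`, and the toy data are all hypothetical, and for simplicity the bias term is regularized along with the weights.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_lr_newton(rows, labels, lam=1.0, iters=25):
    """Fit weights by Newton's method; the last weight is the bias."""
    xs = [list(r) + [1.0] for r in rows]  # append constant bias input
    d = len(xs[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient and Hessian of the L2-penalized negative log-likelihood.
        grad = [lam * wj for wj in w]
        hess = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
        for x, y in zip(xs, labels):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for i in range(d):
                grad[i] += (p - y) * x[i]
                for j in range(d):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve(hess, grad)  # Newton step: H^{-1} g
        w = [wj - sj for wj, sj in zip(w, step)]
    return w


def predict_proba(w, row):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, list(row) + [1.0])))


# Toy one-feature data: larger values are labeled positive.
w = fit_lr_newton([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

Each iteration builds a d×d Hessian, which is cheap with 19 features and about 1,000 samples; this is the sense in which a second‑order solver is practical for a dataset of this size.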
5.5.2 Support Vector Machine (SVM) – Used an RBF kernel with L2 regularization. A linear SVM would produce a linear decision boundary much like LR's, so the RBF kernel was chosen for its potential to capture non‑linear structure.
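The RBF kernel itself is simple to write down: K(x, y) = exp(-γ‖x − y‖²). The sketch below computes it and shows a general property worth keeping in mind with mixed 0/1 flags and raw tag counts: unscaled count features dominate the distance. The feature rows, the `gamma` value, and the `min_max_scale` helper are hypothetical, and the source does not say whether the study scaled its features.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def min_max_scale(rows):
    """Rescale every column to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]


# Hypothetical pages: [is_include_news_URL, page_link_count]
rows = [[1, 120], [1, 180], [0, 400]]
raw = rbf_kernel(rows[0], rows[1], gamma=0.5)
scaled_rows = min_max_scale(rows)
scaled = rbf_kernel(scaled_rows[0], scaled_rows[1], gamma=0.5)
```

On the raw values the two news‑like pages look completely dissimilar (the kernel value underflows to zero) purely because their link counts differ by 60; after scaling they are close to 1.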
5.6 Cross‑Validation – 5‑fold cross‑validation was performed. Results:
LR – Newton method (L2)
| Label | Precision | Recall | F1‑score | Sample Ratio |
| --- | --- | --- | --- | --- |
| Non‑news | 0.92 | 0.96 | 0.94 | 62% |
| News | 0.89 | 0.78 | 0.83 | 38% |
| Total | 0.91 | 0.91 | 0.91 | 100% |
LR – Coordinate descent (L1)
| Label | Precision | Recall | F1‑score | Sample Ratio |
| --- | --- | --- | --- | --- |
| Non‑news | 0.87 | 0.96 | 0.91 | 64.5% |
| News | 0.90 | 0.75 | 0.82 | 35.5% |
| Total | 0.88 | 0.88 | 0.88 | 100% |
SVM – RBF kernel (L2)
| Label | Precision | Recall | F1‑score | Sample Ratio |
| --- | --- | --- | --- | --- |
| Non‑news | 0.78 | 1.00 | 0.87 | 63% |
| News | 1.00 | 0.16 | 0.28 | 37% |
| Total | 0.83 | 0.78 | 0.72 | 100% |
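The 5‑fold procedure behind these tables can be sketched as follows: shuffle the sample indices once, cut them into five folds, and let each fold serve as the test set exactly once. The helper name `k_fold_indices` and the fixed `seed` are assumptions; the source does not say whether the folds were stratified by label.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# About 1,000 pages, as in the dataset described in 5.1.
splits = list(k_fold_indices(1000, k=5))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the per‑fold predictions pooled or averaged to produce one table per model.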
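Conclusion – Extracting 19 features from each page's HTML and URL and applying LR trained with Newton's method (L2 regularization) gives the best performance of the three configurations tested. Its overall precision, recall, and F1‑score are all 0.91, which meets the project's target of an F1‑score above 90%. On the news class alone the F1‑score is 0.83 (precision 0.89, recall 0.78), so recall on news pages is where the most room for improvement remains.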
UC Tech Team