
News Page Identification Using Machine Learning: Feature Engineering, Model Selection, and Evaluation

To distinguish news pages from other web page types, this study frames the task as a binary classification problem, extracts 19 engineered features from each page's HTML and URL, and compares logistic regression (LR) and SVM models under 5‑fold cross‑validation. LR trained with Newton's method performs best, with a reported overall score of 0.91.

UC Tech Team

Nov 5, 2018


Background – Web crawlers collect HTML pages from many international sites, including homepages, forums, news, lists, videos, downloads, galleries, etc. Simple heuristics based on word count and image size often misclassify novel, gallery, or video pages as news pages.

Goal – Treat news‑page identification as a binary classification problem, building a model that labels pages as news (positive) or non‑news (negative) with an F1‑score above 90%.

Metrics

Precision: proportion of correctly predicted positive samples among all predicted positives.

Recall: proportion of actual positives correctly identified.

F1‑score: harmonic mean of precision and recall. The balance can be shifted with the more general Fβ score: F0.5 weights precision more heavily, F2 weights recall more heavily.
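The three metrics can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function name and the example counts are hypothetical and are not taken from the study.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts.

    tp: news pages predicted as news
    fp: non-news pages predicted as news
    fn: news pages predicted as non-news
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts, for illustration only (not from the study).
p, r, f1 = precision_recall_fbeta(tp=80, fp=10, fn=20)
_, _, f2 = precision_recall_fbeta(tp=80, fp=10, fn=20, beta=2.0)
```

With beta=1 the function returns the ordinary F1-score. In this example recall is lower than precision, so the recall-weighted F2 comes out below F1.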

Identification Algorithm Flow

Step Details

5.1 Data Selection – Randomly sample diverse HTML pages from the crawl, ensuring coverage of homepages, forums, downloads, galleries, news, lists, videos, etc. The final dataset contains about 1,000 pages.

5.2 Data Cleaning – Remove malformed HTML, strip useless tags (script, style, comments), and keep only clean, parsable content.
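The tag-stripping part of this step could look roughly like the sketch below. It uses only Python's standard `html.parser` module; the class and function names are hypothetical, the article does not say which parser the team used, and detection of malformed HTML is not covered here.

```python
from html.parser import HTMLParser

# Assumption: the "useless tags" are the ones named in step 5.2.
SKIPPED_TAGS = {"script", "style"}


class HtmlCleaner(HTMLParser):
    """Re-emit HTML while dropping <script>, <style>, and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if self.skip_depth == 0 and tag not in SKIPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")

    # handle_comment is deliberately not overridden: the base class
    # ignores comments, so they never reach the output.


def clean_html(raw_html):
    cleaner = HtmlCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.parts)


sample = (
    "<html><head><style>p{color:red}</style><script>var a = 1;</script></head>"
    "<body><!-- banner --><p>Hello &amp; bye</p><br/></body></html>"
)
cleaned = clean_html(sample)
```

The base class also ignores declarations such as a doctype, so they are dropped as well; whether that is desirable depends on what the downstream feature extractor expects.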

5.3 Data Labeling – Manually label pages as news (positive) or non‑news (negative). Approximately 40% of the pages in the dataset are news pages, giving a positive‑to‑negative ratio of about 4:6.

5.4 Feature Engineering – Iteratively select effective features, ending with 19 key attributes (see table below).

| Index | Feature Name | Meaning | Remark (discriminative power) |
|---|---|---|---|
| 1 | is_exist_author | Whether the page contains an author | 16.5% contain |
| 2 | is_exist_title | Whether the page contains a title | 41.7% contain |
| 3 | is_exist_publish_time | Whether the page contains a publish time | 19.4% contain |
| 4 | is_include_date_URL | Whether the URL includes a date | 10.3% contain |
| 5 | is_include_news_URL | Whether the URL includes the word "news" | 11.2% contain |
| 6 | is_include_forum_URL | Whether the URL includes "forum" or "bbs" | 12.6% contain |
| 7 | is_include_music_URL | Whether the URL includes music indicators (mp3, music, etc.) | 6.1% contain |
| 8 | is_include_download_URL | Whether the URL includes "download" | 5.8% contain |
| 9 | is_include_media_URL | Whether the URL includes media indicators (video, mp4, movie, etc.) | 19.2% contain |
| 10 | is_include_image_set_URL | Whether the URL includes gallery/novel indicators | 2.6% contain |
| 11 | is_include_torrent_URL | Whether the URL includes torrent indicators | 1% contain |
| 12 | content_media_count | Number of media tags (<img>, <video>, <audio>) in the content | |
| 13 | page_media_count | Number of media tags on the whole page | |
| 14 | content_text_count | Number of text tags (<p>, <i>, <b>, <u>) in the content | |
| 15 | page_text_count | Number of text tags on the whole page | |
| 16 | content_link_count | Number of hyperlink tags (<a>) in the content | |
| 17 | page_link_count | Number of hyperlink tags on the whole page | |
| 18 | content_text_percent | Proportion of textual content in the page that belongs to the main content | |
| 19 | content_html_percent | Proportion of HTML markup belonging to the main content | |
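To make the feature definitions concrete, the sketch below computes a handful of the URL flags and the three page-level tag counts. The regular expressions, the `extract_features` helper, and the example URL are assumptions made for illustration; the article does not give the exact rules. The content-level features are left out because they depend on a main-content extractor that the article does not describe.

```python
import re
from html.parser import HTMLParser

# Assumed keyword patterns; the article lists only some of the indicators.
URL_PATTERNS = {
    "is_include_date_URL": re.compile(
        r"(19|20)\d{2}[/\-_]?(0[1-9]|1[0-2])[/\-_]?(0[1-9]|[12]\d|3[01])"
    ),
    "is_include_news_URL": re.compile(r"news", re.I),
    "is_include_forum_URL": re.compile(r"forum|bbs", re.I),
    "is_include_download_URL": re.compile(r"download", re.I),
    "is_include_media_URL": re.compile(r"video|mp4|movie", re.I),
}

MEDIA_TAGS = {"img", "video", "audio"}
TEXT_TAGS = {"p", "i", "b", "u"}
LINK_TAGS = {"a"}


class TagCounter(HTMLParser):
    """Count media, text, and link start tags across the whole page."""

    def __init__(self):
        super().__init__()
        self.counts = {
            "page_media_count": 0,
            "page_text_count": 0,
            "page_link_count": 0,
        }

    def handle_starttag(self, tag, attrs):
        # Self-closing tags also arrive here via the base class.
        if tag in MEDIA_TAGS:
            self.counts["page_media_count"] += 1
        elif tag in TEXT_TAGS:
            self.counts["page_text_count"] += 1
        elif tag in LINK_TAGS:
            self.counts["page_link_count"] += 1


def extract_features(url, html):
    """Return a dict of 0/1 URL flags plus page-level tag counts."""
    features = {
        name: int(bool(pattern.search(url)))
        for name, pattern in URL_PATTERNS.items()
    }
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    features.update(counter.counts)
    return features


example = extract_features(
    "http://example.com/news/2018/11/05/story.html",
    '<p>a</p><p>b <b>c</b></p><img src="x.png"><a href="/y">y</a>',
)
```

Each page then becomes one row of numeric features that LR or SVM can consume directly.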

5.5 Model Selection – The binary classification algorithms considered were SVM, Logistic Regression (LR), Decision Trees, Random Forest, GBDT, Naïve Bayes, and Neural Networks. The authors judged that the continuous features would need discretization before tree‑based models could be applied, so the final candidates were LR and SVM.

5.5.1 Logistic Regression (LR) – Tested with two solvers: Newton's method with L2 regularization and coordinate descent with L1 regularization. Because the dataset is small (about 1,000 pages), both solvers were sufficient.
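As an illustration of what Newton's method does for L2-regularized LR, here is a from-scratch sketch in pure Python. It is not the authors' implementation: the article does not name a library, and the function names, the toy data, and the choice to penalize the bias term are all simplifying assumptions. Each iteration builds the gradient and Hessian of the penalized log-loss and solves one linear system.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_lr_newton(X, y, l2=1.0, iterations=20):
    """Return weights (bias first) minimizing log-loss + (l2 / 2) * ||w||^2.

    Simplification: the bias is penalized too, which keeps the Hessian
    positive definite without special-casing it.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend a bias column
    d = len(rows[0])
    w = [0.0] * d
    for _ in range(iterations):
        grad = [l2 * w[j] for j in range(d)]
        hess = [[l2 if i == j else 0.0 for j in range(d)] for i in range(d)]
        for x, target in zip(rows, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for i in range(d):
                grad[i] += (p - target) * x[i]
                for j in range(d):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve(hess, grad)  # Newton step: H^-1 * gradient
        w = [wi - si for wi, si in zip(w, step)]
    return w


def predict_proba(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Toy one-feature data, for illustration only.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
w_toy = fit_lr_newton(X_toy, y_toy)
```

In practice a library solver would normally be used instead. In scikit-learn, for instance, `LogisticRegression(solver="newton-cg", penalty="l2")` and `LogisticRegression(solver="liblinear", penalty="l1")` are the closest counterparts to the two configurations described above, though whether the team used them is not stated.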

5.5.2 Support Vector Machine (SVM) – Used an RBF kernel with L2 regularization. A linear SVM would behave much like LR, since both learn a linear decision boundary, so the RBF kernel was chosen for its potential to capture non‑linear structure.
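For reference, the RBF kernel scores the similarity of two feature vectors from their squared Euclidean distance. The sketch below is a minimal version; the function name and the `gamma` value are illustrative assumptions, not the setting used in the study.

```python
import math


def rbf_kernel(x, y, gamma=0.1):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical vectors
apart = rbf_kernel([0.0, 0.0], [3.0, 4.0])  # squared distance is 25
```

Because the kernel depends on raw distances, features with large numeric ranges (such as tag counts) contribute far more to the similarity than 0/1 flags unless the inputs are rescaled first.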

5.6 Cross‑Validation – 5‑fold cross‑validation was performed. Results:

LR – Newton method (L2)

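| Label | Precision | Recall | F1‑score | Sample Ratio |
|---|---|---|---|---|
| Non‑news | 0.92 | 0.96 | 0.94 | 62% |
| News | 0.89 | 0.78 | 0.83 | 38% |

Total: 0.91 (the source reports a single aggregate value); sample ratio 100%.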

LR – Coordinate descent (L1)

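| Label | Precision | Recall | F1‑score | Sample Ratio |
|---|---|---|---|---|
| Non‑news | 0.87 | 0.96 | 0.91 | 64.5% |
| News | 0.90 | 0.75 | 0.82 | 35.5% |

Total: 0.88 (the source reports a single aggregate value); sample ratio 100%.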

SVM – RBF kernel (L2)

| Label | Precision | Recall | F1‑score | Sample Ratio |
|---|---|---|---|---|
| Non‑news | 0.78 | 1.00 | 0.87 | 63% |
| News | 1.00 | 0.16 | 0.28 | 37% |
| Total | 0.83 | 0.78 | 0.72 | 100% |
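The 5-fold procedure behind these tables can be sketched as follows: the labeled pages are shuffled, split into five folds, and each fold serves once as the test set while the other four are used for training. The generator name and the fixed seed are assumptions; the article does not say whether the folds were shuffled or stratified by label.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# About 1,000 pages, as in step 5.1.
splits = list(k_fold_indices(1000, k=5))
```

A model is fitted on each `train` list and scored on the matching `test` list, and the per-fold scores are then combined into the figures reported above.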

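Conclusion – Extracting 19 features from each page's HTML and URL and applying LR trained with Newton's method (L2) gives the best result of the three models, with a reported overall score of 0.91 in 5‑fold cross‑validation, above the 90% target. Per‑class results are less even: on news pages, precision is 0.89 and recall is 0.78 (F1‑score 0.83), so missed news pages remain the main source of error.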
