Tagged articles
6 articles
Architect
Mar 4, 2025 · Backend Development

Apache Tika: Extract Multi-Format Content & Detect Sensitive Data in Spring Boot

This article introduces Apache Tika's capabilities, including parsing a wide range of file formats, automatic type detection, OCR, and language detection, then demonstrates how to integrate Tika into a Spring Boot service that extracts text and identifies sensitive information such as ID numbers, credit card numbers, and phone numbers.

Apache Tika · Content Extraction · File Parsing
22 min read
Java Web Project
Feb 11, 2025 · Information Security

How to Use Apache Tika in Spring Boot for Automatic Sensitive Data Detection

This article explains Apache Tika's core features and architecture, outlines common use cases, and provides a step-by-step Spring Boot tutorial covering Maven/Gradle setup, a service that extracts text with Tika, regex-based detection of sensitive information, a REST controller, an optional front end, testing instructions, expected output, and extension ideas.

Apache Tika · Content Extraction · Information Security
24 min read
Laravel Tech Community
Apr 2, 2023 · Backend Development

QueryList: A Modern PHP Content Scraping Library – Features, Installation, and Usage Guide

This article introduces QueryList, a modern PHP content-scraping tool that uses CSS selectors instead of regular expressions. It explains the two versions (V3 and V4), shows how to install the library via Composer, and demonstrates basic crawling code, collection methods such as flatten, take, reverse, filter, and map, and concurrent multi-request fetching.

Content Extraction · Web Scraping · data-processing
7 min read
Sohu Tech Products
May 18, 2022 · Fundamentals

Overview of a Web Page Content Extraction Algorithm and Its Practical Demo

This article introduces a web page content extraction algorithm that automatically extracts structured fields (title, timestamp, body text, author, and source) from arbitrary news pages, explains how to use an online demo, compares the algorithm with existing solutions, and discusses its broader applications and limitations.

Content Extraction · GNE · Web Scraping
8 min read
MaGe Linux Operations
Oct 21, 2018 · Backend Development

Mastering Web Crawlers: Core Modules, HTTP Strategies, and Scaling Tips

This article explains the fundamentals of web crawlers, covering their three main modules, HTTP request composition, flow-control techniques for large-scale scraping, content extraction methods for static and dynamic pages, and current challenges such as interaction hurdles, JavaScript parsing, and IP restrictions.

Content Extraction · HTTP requests · distributed scraping
13 min read