Tagged articles

6 articles

Mar 4, 2025 · Backend Development

Apache Tika: Extract Multi-Format Content & Detect Sensitive Data in Spring Boot

This article introduces Apache Tika's capabilities, including parsing a wide range of file formats, automatic type detection, OCR, and language detection, then demonstrates how to integrate Tika into a Spring Boot service to extract text and identify sensitive information such as ID numbers, credit card numbers, and phone numbers.

Apache Tika · Content Extraction · File Parsing

22 min read

Java Web Project

Feb 11, 2025 · Information Security

How to Use Apache Tika in Spring Boot for Automatic Sensitive Data Detection

This article explains Apache Tika's core features and architecture, outlines common use cases, and provides a step-by-step Spring Boot tutorial covering Maven/Gradle setup, a service that extracts text with Tika, regex-based sensitive-information detection, a REST controller, an optional front end, testing instructions, expected output, and extension ideas.

Apache Tika · Content Extraction · Information Security

24 min read

Laravel Tech Community

Apr 2, 2023 · Backend Development

QueryList: A Modern PHP Content Scraping Library – Features, Installation, and Usage Guide

This article introduces QueryList, a modern PHP content-scraping tool that uses CSS selectors instead of regex. It explains the two versions (V3 and V4), shows how to install the library via Composer, and demonstrates basic crawling code, collection methods such as flatten, take, reverse, filter, and map, and concurrent multi-request crawling.

Content Extraction · Web Scraping · data-processing

7 min read

Sohu Tech Products

May 18, 2022 · Fundamentals

Overview of a Web Page Content Extraction Algorithm and Its Practical Demo

This article introduces a web page content extraction algorithm that automatically structures titles, timestamps, body text, authors, and sources from arbitrary news pages, explains how to use an online demo, compares it with existing solutions, and discusses its broader applications and limitations.

Content Extraction · GNE · Web Scraping

8 min read

Ctrip Technology

Oct 11, 2019 · Artificial Intelligence

Intelligent Content Extraction and Generation Practices on Ctrip's Marco Polo AI Platform

This article details Ctrip's AI‑driven Marco Polo platform, describing how large‑scale NLP pipelines combine extraction, richness evaluation, semantic matching and deep‑learning generation (CopyNet, TA‑seq2seq) to produce high‑quality recommendation reasons across multiple product scenarios.

Content Extraction · NLP · Spark

16 min read

MaGe Linux Operations

Oct 21, 2018 · Backend Development

Mastering Web Crawlers: Core Modules, HTTP Strategies, and Scaling Tips

This article explains the fundamentals of web crawlers, covering their three main modules, HTTP request composition, flow‑control techniques for large‑scale scraping, content extraction methods for static and dynamic pages, and the current challenges such as interaction hurdles, JavaScript parsing, and IP restrictions.

Content Extraction · HTTP requests · distributed scraping

13 min read
