Integrating AI with Node.js Crawling Using the x‑crawl Library
This article explains how combining AI with Node.js crawling through the x‑crawl library makes web data collection more flexible and more resilient to website changes. It covers the library's features, code examples, and practical usage scenarios.
Why AI‑Assisted Crawling Is Needed
Websites update frequently, causing class names or DOM structures to change and breaking traditional rule‑based crawlers; AI can understand page semantics and adapt to these changes, improving accuracy and robustness.
What Is x‑crawl?
x‑crawl is a flexible Node.js library that adds AI assistance (currently powered by OpenAI models) to web crawling, allowing both pure crawling and AI‑enhanced operations.
It consists of two parts:
Crawler API with various functions that work even without AI.
AI module that leverages large language models to simplify complex tasks.
x‑crawl GitHub: https://github.com/coder-hxl/x-crawl
Documentation: https://coder-hxl.github.io/x-crawl/cn/
x‑crawl Features
🤖 AI Assistance – enhances efficiency and intelligence.
🖋️ Flexible Syntax – each crawl API supports multiple configurations.
⚙️ Multiple Use Cases – supports dynamic pages, static pages, API data, and file data.
⚒️ Page Control – automates interactions, keyboard input, and events.
👀 Device Fingerprint – avoids detection with zero‑config or custom settings.
🔥 Async/Sync – switch between asynchronous and synchronous crawling without changing APIs.
⏱️ Interval Crawling – supports no interval, a fixed interval, or a random interval between requests, so you decide whether to crawl at high concurrency.
🔄 Retry on Failure – customizable retry counts.
➡️ Proxy Rotation – auto‑rotate proxies based on errors or status codes.
🚀 Priority Queue – prioritize specific crawl targets.
🧾 Crawl Logging – colored terminal output.
🦾 TypeScript – full type support via generics.
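To make the interval, retry, and priority features above more concrete, here is a small stand-alone sketch of the general ideas. It is not x‑crawl's implementation: the helper names `randomInterval`, `sleep`, `withRetry`, and `sortByPriority` are hypothetical, and the "larger priority value goes first" ordering is an assumption made for this illustration.

```javascript
// Hypothetical helpers for illustration only; none of these names come from x-crawl.

// Pick a random delay in [min, max] milliseconds (both ends inclusive).
function randomInterval(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min
}

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Run `task` once, then retry up to `maxRetry` more times on failure,
// waiting a random interval between attempts.
async function withRetry(task, { maxRetry = 3, min = 1000, max = 2000 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      lastError = error
      if (attempt < maxRetry) await sleep(randomInterval(min, max))
    }
  }
  throw lastError
}

// Return a copy of `targets` ordered so that larger `priority` values come first
// (assumption for this sketch); targets without a priority count as 0.
function sortByPriority(targets) {
  return [...targets].sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0))
}
```

With x‑crawl itself you would not write this by hand; retries and intervals are set through options such as `maxRetry` and `intervalTime`, as in the example that follows.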
AI + x‑crawl Example
The following code demonstrates using AI to extract high‑scoring Airbnb house images and then download them:
import { createCrawl, createCrawlOpenAI } from 'x-crawl'
// Create the crawler app
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})
// Create the AI app
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})
// Crawl the page
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data
  // Wait for the target element to appear, then grab its HTML
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
  await page.waitForSelector(targetSelector)
  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)
  // Let AI extract the image URLs and deduplicate them
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    'Get the image links, excluding those inside source elements, and deduplicate'
  )
  // Close the browser once the HTML has been processed
  await browser.close()
  // Download the images
  crawlApp.crawlFile({
    targets: srcResult.elements.map((item) => item.src),
    storeDirs: './upload'
  })
})
AI can also analyze arbitrary HTML to extract specific elements, generate CSS selectors, or answer crawling‑related questions.
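Returning to the Airbnb example: because the image list there is produced by a language model, it may be worth validating it before passing it to `crawlFile`. The following is a minimal sketch, assuming each extracted element is an object with a `src` string as in that example. `cleanImageUrls` is a hypothetical helper, not part of x‑crawl.

```javascript
// Hypothetical helper (not part of x-crawl): keep only absolute http(s) URLs
// from AI-extracted elements and drop duplicates, preserving the original order.
function cleanImageUrls(elements) {
  const seen = new Set()
  const urls = []
  for (const item of elements ?? []) {
    const src = typeof item?.src === 'string' ? item.src.trim() : ''
    let parsed
    try {
      parsed = new URL(src) // throws for empty or relative values
    } catch {
      continue
    }
    // Skip data:, blob:, and other non-HTTP schemes
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue
    if (seen.has(parsed.href)) continue
    seen.add(parsed.href)
    urls.push(parsed.href)
  }
  return urls
}
```

In the Airbnb example, the `targets` value could then be written as `cleanImageUrls(srcResult.elements)`, so that a malformed or repeated entry in the model's output does not turn into a failed or redundant download.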
AI‑Driven Element Analysis
import { createCrawlOpenAI } from 'x-crawl'
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: 'your API key' }
})
const HTMLContent = `
  <div class="scroll-list">
    <div class="list-item">Women's hooded sweatshirt</div>
    <div class="list-item">Men's sweatshirt</div>
    ...
  </div>
`
// Ask AI for the men's clothing items, deduplicated
crawlOpenAIApp.parseElements(HTMLContent, "Get the men's clothing items and deduplicate").then((res) => console.log(res))
AI‑Generated Selectors
import { createCrawlOpenAI } from 'x-crawl'
const crawlOpenAIApp = createCrawlOpenAI({ clientOptions: { apiKey: 'your API key' } })
const HTMLContent = `...`
// Ask AI for selectors that match all women's clothing items
crawlOpenAIApp.getElementSelectors(HTMLContent, "Get all women's clothing items").then((res) => console.log(res))
AI‑Powered Q&A
import { createCrawlOpenAI } from 'x-crawl'
const crawlOpenAIApp = createCrawlOpenAI({ clientOptions: { apiKey: 'your API key' } })
// Ask AI crawling-related questions
crawlOpenAIApp.help('What is x-crawl?').then((res) => console.log(res))
crawlOpenAIApp.help('What are three key things to watch out for when crawling?').then((res) => console.log(res))
Summary
1. Intelligent On‑Demand Element Analysis – AI automatically parses HTML and extracts required element attributes without manual inspection.
2. Automatic Selector Generation – AI creates accurate CSS selectors based on page structure, simplifying target identification.
3. AI‑Assisted Problem Solving – Developers can ask AI for crawling strategies, anti‑scraping tips, and data‑handling advice.
Overall, AI‑enhanced crawling with x‑crawl adapts to website changes, reduces maintenance effort, and improves data collection efficiency.
If you find x‑crawl helpful, please star the repository on GitHub – your support drives continuous improvement.
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.
