Integrating AI with Node.js Crawling Using the x‑crawl Library

This article explains how combining AI with Node.js crawling through the x‑crawl library makes web data collection more intelligent, flexible, and resilient to website changes, and provides detailed feature descriptions, code examples, and practical usage scenarios.

Rare Earth Juejin Tech Community

Why AI‑Assisted Crawling Is Needed

Websites change frequently. When class names or the DOM structure change, traditional rule‑based crawlers that depend on fixed selectors break. An AI model can work from the meaning of the page rather than from exact selectors, so it can adapt to these changes, which improves accuracy and robustness.

What Is x‑crawl?

x‑crawl is a flexible Node.js library that adds AI assistance (currently powered by OpenAI models) to web crawling, allowing both pure crawling and AI‑enhanced operations.

It consists of two parts:

Crawler API with various functions that work even without AI.

AI module that leverages large language models to simplify complex tasks.

x‑crawl GitHub: https://github.com/coder-hxl/x-crawl

Documentation: https://coder-hxl.github.io/x-crawl/cn/

x‑crawl Features

🤖 AI Assistance – enhances efficiency and intelligence.

🖋️ Flexible Syntax – each crawl API supports multiple configurations.

⚙️ Multiple Use Cases – supports dynamic pages, static pages, API data, and file data.

⚒️ Page Control – automates interactions, keyboard input, and events.

👀 Device Fingerprint – avoids detection with zero‑config or custom settings.

🔥 Async/Sync – switch between asynchronous and synchronous crawling without changing APIs.

⏱️ Interval Crawling – no interval, a fixed interval, or a random interval between requests, so you decide whether to crawl at high concurrency.

🔄 Retry on Failure – customizable retry counts.

➡️ Proxy Rotation – auto‑rotate proxies based on errors or status codes.

🚀 Priority Queue – prioritize specific crawl targets.

🧾 Crawl Logging – colored terminal output.

🦾 TypeScript – full type support via generics.
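
To make the "Interval Crawling" and "Retry on Failure" items above more concrete, here is a minimal sketch of the general idea in plain JavaScript. The helpers `randomInterval`, `sleep`, and `withRetry` are hypothetical names introduced only for illustration; they are not part of x‑crawl, and the library's internal behavior may differ (for example, in how it counts retries).

```javascript
// Hypothetical helpers: they illustrate the idea only and are not x-crawl's internals.

// Pick a wait time between min and max milliseconds, inclusive (random interval).
function randomInterval(min, max) {
  return min + Math.floor(Math.random() * (max - min + 1))
}

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Run `task`; on failure, retry up to `maxRetry` more times,
// waiting a random interval between attempts.
async function withRetry(task, { maxRetry = 3, min = 1000, max = 2000 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      lastError = error
      if (attempt < maxRetry) await sleep(randomInterval(min, max))
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError
}
```

With x‑crawl itself you would normally not write this by hand; options such as `maxRetry` and `intervalTime` (both shown in the example later in this article) cover the same ground.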

AI + x‑crawl Example

The following code uses AI to extract the image URLs of highly rated Airbnb listings and then downloads the images:

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Create crawl app
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})

// Create AI app
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// Crawl page
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
  await page.waitForSelector(targetSelector)
  const highlyHTML = await page.$eval(targetSelector, el => el.innerHTML)

  // Let AI extract image URLs and deduplicate
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    'Get the image links, excluding the ones inside source, and deduplicate'
  )

  await browser.close()

  // Download images
  crawlApp.crawlFile({
    targets: srcResult.elements.map(item => item.src),
    storeDirs: './upload'
  })
})
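
Because the list of image URLs comes from a language model, it can be worth validating it before passing it to `crawlFile`. The sketch below is one possible approach, not something the article or x‑crawl prescribes: `cleanImageUrls` is a hypothetical helper, and it assumes each extracted element is an object with a `src` string, as in the example above.

```javascript
// Hypothetical helper: normalise, validate, and deduplicate image URLs returned by the model.
function cleanImageUrls(elements, baseUrl) {
  const seen = new Set()
  for (const item of elements) {
    // Skip entries without a usable src value.
    if (!item || typeof item.src !== 'string') continue
    const raw = item.src.trim()
    if (raw === '') continue
    try {
      // Resolve relative and protocol-relative URLs against the page URL.
      const url = new URL(raw, baseUrl)
      // Keep only http(s) links; this drops data: URIs and similar.
      if (url.protocol === 'http:' || url.protocol === 'https:') seen.add(url.href)
    } catch {
      // Not a valid URL: ignore it.
    }
  }
  return [...seen]
}
```

In the example above, this could be used as `targets: cleanImageUrls(srcResult.elements, 'https://www.airbnb.cn/s/select_homes')`.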

AI can also analyze arbitrary HTML to extract specific elements, generate CSS selectors, or answer crawling‑related questions.

AI‑Driven Element Analysis

import { createCrawlOpenAI } from 'x-crawl'

const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

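const HTMLContent = `
  <div class="scroll-list">
    <div class="list-item">Women's hooded sweatshirt</div>
    <div class="list-item">Men's sweatshirt</div>
    ...
  </div>
`

// Ask the AI for the men's clothing items, deduplicated
crawlOpenAIApp.parseElements(HTMLContent, "Get the men's clothing, and deduplicate").then(res => console.log(res))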
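
Since the HTML is sent to a language model, large pages cost more tokens. One option, offered here only as a rough sketch, is to strip content the model is unlikely to need before calling `parseElements`. `trimHtmlForPrompt` is a hypothetical helper; it relies on regular expressions rather than a real HTML parser, so treat it as an approximation that may mishandle unusual markup.

```javascript
// Rough, regex-based trimming: a sketch for reducing prompt size, not a real HTML parser.
function trimHtmlForPrompt(html) {
  return html
    // Remove HTML comments.
    .replace(/<!--[\s\S]*?-->/g, '')
    // Remove <script> and <style> blocks together with their content.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Collapse runs of whitespace into a single space.
    .replace(/\s+/g, ' ')
    // Drop whitespace that only separates tags.
    .replace(/>\s+</g, '><')
    .trim()
}
```

The trimmed string can then be passed in place of `HTMLContent` in the call above.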

AI‑Generated Selectors

import { createCrawlOpenAI } from 'x-crawl'

const crawlOpenAIApp = createCrawlOpenAI({ clientOptions: { apiKey: 'Your API Key' } })

const HTMLContent = `...`

crawlOpenAIApp.getElementSelectors(HTMLContent, "Get all women's clothing").then(res => console.log(res))
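
Calling the model on every crawl is slow and costs money, so one possible pattern is to cache a generated selector and regenerate it only when it stops matching the page. The sketch below is an assumption-laden illustration: `createSelectorCache`, `generateSelector`, and `matches` are hypothetical names, and `generateSelector` stands in for whatever wraps `getElementSelectors` and pulls a selector string out of its result (the article does not show that result's shape).

```javascript
// Hypothetical wrapper: reuse a cached selector and regenerate it only when it stops matching.
function createSelectorCache(generateSelector) {
  const cache = new Map()

  // key: a name for the target, html: the current page HTML,
  // matches: caller-supplied check that a selector still finds elements.
  return async function resolveSelector(key, html, matches) {
    const cached = cache.get(key)
    if (cached && (await matches(cached))) {
      return { selector: cached, regenerated: false }
    }
    // No cached selector, or the page changed: ask the generator (for example, the AI) again.
    const selector = await generateSelector(html)
    cache.set(key, selector)
    return { selector, regenerated: true }
  }
}
```

With Puppeteer, `matches` could be something like `async (selector) => (await page.$(selector)) !== null`, so the model is only consulted after a site redesign breaks the cached selector.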

AI‑Powered Q&A

import { createCrawlOpenAI } from 'x-crawl'

const crawlOpenAIApp = createCrawlOpenAI({ clientOptions: { apiKey: 'Your API Key' } })

crawlOpenAIApp.help('What is x-crawl').then(res => console.log(res))

crawlOpenAIApp.help('Three major considerations for crawlers').then(res => console.log(res))

Summary

1. Intelligent On‑Demand Element Analysis – AI automatically parses HTML and extracts required element attributes without manual inspection.

2. Automatic Selector Generation – AI creates accurate CSS selectors based on page structure, simplifying target identification.

3. AI‑Assisted Problem Solving – Developers can ask AI for crawling strategies, anti‑scraping tips, and data‑handling advice.

Overall, AI‑enhanced crawling with x‑crawl adapts to website changes, reduces maintenance effort, and improves data collection efficiency.

If you find x‑crawl helpful, please star the repository on GitHub – your support drives continuous improvement.