
Crawling and Downloading Thousands of Images from Sogou Using Java

This article explains how to programmatically fetch and save thousands of images from Sogou by analyzing the XHR request parameters, constructing the appropriate URL, extracting image URLs from the JSON response, and using a multithreaded Java downloader with custom HTTP utilities.

Architecture Digest

Purpose: Retrieve thousands of images from Sogou Image Search for the keyword 美女 ("beautiful women") and store them locally.

Preparation: The target page is https://pic.sogou.com/pics?query=%E7%BE%8E%E5%A5%B3 . Open it, watch the browser's DevTools (Network → XHR) while scrolling, and the paginated request URL appears:

https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=%E7%BE%8E%E5%A5%B3

Key parameters: start (starting index), xml_len (number of images per request), and query (search keyword, URL‑encoded).
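As an illustration of how the three parameters fit together, here is a minimal sketch that builds the request URL for one batch and for a series of batches. The class SogouUrlBuilder and its method names are hypothetical and do not appear in the original code; mode=1 is simply copied from the captured request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: builds paged Sogou search URLs from start, xml_len and query. */
final class SogouUrlBuilder {
    // Endpoint as observed in DevTools; mode=1 is taken unchanged from the captured request.
    private static final String TEMPLATE =
            "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%d&xml_len=%d&query=%s";

    /** Builds one request URL; the keyword is percent-encoded as UTF-8. */
    static String build(int start, int xmlLen, String keyword) {
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return String.format(Locale.ROOT, TEMPLATE, start, xmlLen, encoded);
    }

    /** Builds one URL per batch until `limit` results are covered. */
    static List<String> buildPages(int start, int xmlLen, int limit, String keyword) {
        List<String> urls = new ArrayList<>();
        for (int i = start; i < start + limit; i += xmlLen) {
            urls.add(build(i, xmlLen, keyword));
        }
        return urls;
    }

    public static void main(String[] args) {
        // The article's keyword, written with Unicode escapes to stay encoding-independent.
        System.out.println(build(48, 48, "\u7f8e\u5973"));
    }
}
```

With start=48, xml_len=48 and the article's keyword, build should reproduce the URL captured above.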

Analysis: The JSON response lists the results under data.items, and each item carries the desired image URL in its picUrl field.
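To make the extraction step concrete, the following sketch pulls picUrl values out of a response body. It is dependency-free and uses a regular expression rather than the fastjson parser the article relies on, so it is an approximation and not a full JSON parser. The class name PicUrlExtractor is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects picUrl values from a JSON response body. */
final class PicUrlExtractor {
    // Matches "picUrl":"<value>", allowing backslash escapes inside the value.
    private static final Pattern PIC_URL =
            Pattern.compile("\"picUrl\"\\s*:\\s*\"((?:\\\\.|[^\"\\\\])*)\"");

    static List<String> extract(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = PIC_URL.matcher(json);
        while (m.find()) {
            // JSON may escape '/' as "\/"; only that escape is undone here.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }
}
```

For production use a real JSON parser is the safer choice, since a regular expression cannot handle every escape sequence or nested structure.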

Approach: The workflow consists of four steps:

1. Configure the request parameters (start, xml_len, query).
2. Fetch the URL and parse the JSON to collect the picUrl values.
3. Accumulate the URLs in a list.
4. Iterate over the list with a thread pool to download each image to a local directory.
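The last step can be sketched as follows. This is not the article's SougouImgPipeline, whose methods are not shown; ParallelDownloader, downloadAll and saveImage are hypothetical names, and the timeouts and the index-based file naming are assumptions chosen for the example.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiPredicate;

/** Hypothetical sketch of a thread-pool downloader. */
final class ParallelDownloader {
    /** Runs `task` for every URL on a fixed pool and returns the number of successes. */
    static int downloadAll(List<String> urls, int threads, BiPredicate<String, Integer> task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger ok = new AtomicInteger();
        for (int i = 0; i < urls.size(); i++) {
            final int index = i;
            pool.execute(() -> {
                try {
                    if (task.test(urls.get(index), index)) ok.incrementAndGet();
                } catch (RuntimeException e) {
                    // An exception in one task counts as a failure and does not stop the others.
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) pool.shutdownNow();
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return ok.get();
    }

    /** Saves one image as <dir>/<index>.jpg; returns false on any error. */
    static boolean saveImage(String url, int index, Path dir) {
        try {
            Files.createDirectories(dir);
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(5_000);  // assumed values, tune as needed
            conn.setReadTimeout(10_000);
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, dir.resolve(index + ".jpg"), StandardCopyOption.REPLACE_EXISTING);
            }
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

A call such as ParallelDownloader.downloadAll(urls, 8, (u, i) -> ParallelDownloader.saveImage(u, i, dir)) would then download the collected list with eight worker threads. Passing the per-URL work as a function keeps the pool logic testable without network access.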

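Code: The core implementation is provided in two Java classes: SougouImgProcessor, which requests the search API and collects the image URLs, and SougouImgPipeline, which downloads them.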

import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;
import java.util.ArrayList;
import java.util.List;

/**
 * A simple PageProcessor: requests search result batches and collects the image URLs.
 */
public class SougouImgProcessor {
    private String url;                  // URL template with start, xml_len and query placeholders
    private SougouImgPipeline pipeline;  // downloads the collected URLs
    private List<JSONObject> dataList;   // raw result items
    private List<String> urlList;        // picUrl values to download
    private String word;                 // search keyword

    public SougouImgProcessor(String url, String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }

    // Requests one batch of `size` results starting at index `idx`.
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        @SuppressWarnings("unchecked")
        List<JSONObject> items = (List<JSONObject>) ((JSONObject) object.get("data")).get("items");
        for (JSONObject item : items) {
            this.urlList.add(item.getString("picUrl"));
        }
        this.dataList.addAll(items);
    }

    // download
    public void pipelineData() {
        // multithread
        pipeline.processSync(this.urlList, this.word);
    }

    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url, "美女");
        int start = 0, size = 50, limit = 1000; // start index, batch size, total
        // The original listing is cut off inside this loop header; the bounds and the
        // calls below are reconstructed from the variables and methods defined above.
        for (int i = start; i < start + limit; i += size) {
            processor.process(i, size);
        }
        processor.pipelineData();
    }
}

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
/**
 * Store results in files.
 */
public class SougouImgPipeline {
    private String extension = ".jpg";
    private String path;
    private volatile AtomicInteger suc;
    private volatile AtomicInteger fails;
    public SougouImgPipeline() {
        setPath("E:/pipeline/sougou");
        suc = new AtomicInteger();
        fails = new AtomicInteger();
    }
    // ... (methods for downloadImg, process, processSync, etc.)
}

A single run may not download every image because of network errors, but repeated executions increase the success rate.
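Instead of rerunning the whole program, failed downloads can be retried within one run. The sketch below shows one way to do that, assuming a download function that returns false on failure; RetryingDownloader and downloadWithRetry are hypothetical names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: retries only the URLs that failed in the previous round. */
final class RetryingDownloader {
    /**
     * Tries every URL once, then retries the failures, for at most maxRounds rounds.
     * Returns the URLs that still failed after the last round.
     */
    static List<String> downloadWithRetry(List<String> urls, int maxRounds,
                                          Predicate<String> download) {
        List<String> pending = new ArrayList<>(urls);
        for (int round = 0; round < maxRounds && !pending.isEmpty(); round++) {
            List<String> failed = new ArrayList<>();
            for (String url : pending) {
                boolean ok;
                try {
                    ok = download.test(url);
                } catch (RuntimeException e) {
                    ok = false; // treat an exception like a failed download
                }
                if (!ok) failed.add(url);
            }
            pending = failed; // the next round sees only the failures
        }
        return pending;
    }
}
```

The loop here is sequential to keep the retry logic easy to follow; it could be combined with a thread pool by collecting the failures of each concurrent round.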

Conclusion: By analyzing the Sogou API, extracting image URLs, and employing a multithreaded Java downloader, large‑scale image collection can be automated efficiently.
