Crawling and Downloading Thousands of Images from Sogou Using Java

This article explains how to programmatically fetch and save thousands of images from Sogou by analyzing the XHR request parameters, constructing the appropriate URL, extracting image URLs from the JSON response, and using a multithreaded Java downloader with custom HTTP utilities.

Architecture Digest

Purpose: Retrieve and locally store thousands of images returned by Sogou Image Search for the keyword 美女 ("beautiful women").

Preparation: The target page is https://pic.sogou.com/pics?query=%E7%BE%8E%E5%A5%B3. Open it, switch to the Network → XHR tab of the browser's DevTools, and scroll down. Each additional batch of images the page loads appears as a request with the following URL pattern:

https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=%E7%BE%8E%E5%A5%B3

Key parameters: start (index of the first result to return), xml_len (number of images per request), and query (the search keyword, URL‑encoded; %E7%BE%8E%E5%A5%B3 is the UTF-8 percent-encoding of 美女).
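
The parameters above can be assembled into a request URL as in the following minimal sketch. The class and method names (SogouUrlSketch, buildUrl) are hypothetical and only illustrate the idea; the endpoint and parameter names are the ones observed in DevTools.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SogouUrlSketch {
    // Endpoint and parameter names as observed in the browser's DevTools.
    static final String TEMPLATE =
        "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%d&xml_len=%d&query=%s";

    // Build the request URL for one page of results (hypothetical helper).
    static String buildUrl(int start, int pageSize, String keyword) {
        try {
            // Percent-encode the keyword as UTF-8 so non-ASCII text is URL-safe.
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return String.format(TEMPLATE, start, pageSize, encoded);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // The article's keyword, written with Unicode escapes to stay ASCII-only.
        System.out.println(buildUrl(48, 48, "\u7F8E\u5973"));
    }
}
```

Note that URLEncoder applies form encoding, so a space in the keyword becomes "+", which is generally acceptable inside a query string.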

Analysis: In the JSON response, the data object holds an items array, and each item carries the desired image URL in its picUrl field.
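
To illustrate where picUrl sits, here is a dependency-free sketch that pulls the values out of a sample response. The sample JSON and its example.com URLs are invented, and the regular expression is only a stand-in so the snippet runs without extra libraries; with real responses a JSON parser such as the fastjson library used in the article's code is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PicUrlExtractSketch {
    // Matches "picUrl":"..." and captures the string value (escaped chars allowed).
    private static final Pattern PIC_URL =
        Pattern.compile("\"picUrl\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Collect every picUrl value found in the response text, in order.
    static List<String> extractPicUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = PIC_URL.matcher(json);
        while (m.find()) {
            // JSON may escape '/' as "\/"; undo that so the URL is usable.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Invented sample with the data -> items -> picUrl shape.
        String sample = "{\"data\":{\"items\":["
            + "{\"picUrl\":\"https://example.com/a.jpg\"},"
            + "{\"picUrl\":\"https:\\/\\/example.com\\/b.jpg\"}]}}";
        System.out.println(extractPicUrls(sample));
    }
}
```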

Approach: The workflow consists of four steps:

1. Configure the request parameters.
2. Fetch the URL and parse the JSON to collect the picUrl values.
3. Accumulate the URLs in a list.
4. Iterate over the list with a thread pool to download each image to a local directory.

Code: The core implementation is provided in two Java classes.

import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;
import java.util.ArrayList;
import java.util.List;
/**
 * A simple page processor: fetches Sogou search result pages
 * and collects the image URLs they contain.
 */
public class SougouImgProcessor {
    private String url;
    private SougouImgPipeline pipeline;
    private List<JSONObject> dataList;
    private List<String> urlList;
    private String word;
    public SougouImgProcessor(String url,String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }
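    // Fetch one page of results and collect the picUrl of every item.
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        JSONObject data = object == null ? null : object.getJSONObject("data");
        // JSONArray implements List<Object>, so no unchecked cast is needed.
        List<Object> items = data == null ? null : data.getJSONArray("items");
        if (items == null) {
            return; // empty or unexpected response: skip this page
        }
        for (Object o : items) {
            JSONObject item = (JSONObject) o; // fastjson parses nested objects as JSONObject
            this.urlList.add(item.getString("picUrl"));
            this.dataList.add(item);
        }
    }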
    // download
    public void pipelineData(){
        // multithread
        pipeline.processSync(this.urlList, this.word);
    }
    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url,"美女");
        int start = 0, size = 50, limit = 1000; // start index, batch size, total
        for(int i=start;i<start+limit;i+=size)
            processor.process(i, size);
        processor.pipelineData();
    }
}

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
/**
 * Store results in files.
 */
public class SougouImgPipeline {
    private String extension = ".jpg";
    private String path;
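    // AtomicInteger is already thread-safe; final is enough for references set once.
    private final AtomicInteger suc;   // number of successful downloads
    private final AtomicInteger fails; // number of failed downloads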
    public SougouImgPipeline() {
        setPath("E:/pipeline/sougou");
        suc = new AtomicInteger();
        fails = new AtomicInteger();
    }
    // ... (methods for downloadImg, process, processSync, etc.)
}
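
The download methods of SougouImgPipeline are elided above, so the following is not the author's implementation. It is a minimal sketch of the general technique the article describes, downloading a list of URLs on a fixed-size thread pool while counting successes and failures. All names (ParallelDownloadSketch, Fetcher, downloadAll, fetchOverHttp), the index-based file names, and the 10-minute timeout are assumptions made for illustration.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelDownloadSketch {
    // Hypothetical abstraction so the fetching step can be swapped or stubbed.
    interface Fetcher {
        void fetch(String url, Path target) throws Exception;
    }

    final AtomicInteger succeeded = new AtomicInteger();
    final AtomicInteger failed = new AtomicInteger();

    // Plain HTTP fetcher: stream the remote bytes straight into the target file.
    static void fetchOverHttp(String url, Path target) throws Exception {
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Download every URL on a fixed-size pool and wait for the tasks to finish.
    void downloadAll(List<String> urls, Path dir, int threads, Fetcher fetcher)
            throws Exception {
        Files.createDirectories(dir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < urls.size(); i++) {
            final String url = urls.get(i);
            final Path target = dir.resolve(i + ".jpg"); // index-based file name
            pool.execute(() -> {
                try {
                    fetcher.fetch(url, target);
                    succeeded.incrementAndGet();
                } catch (Exception e) {
                    failed.incrementAndGet(); // one bad URL must not stop the rest
                }
            });
        }
        pool.shutdown();                             // accept no new tasks
        pool.awaitTermination(10, TimeUnit.MINUTES); // wait until done or timed out
    }
}
```

A call might look like new ParallelDownloadSketch().downloadAll(urls, java.nio.file.Paths.get("images"), 8, ParallelDownloadSketch::fetchOverHttp). Passing the fetcher as a parameter keeps the pool logic testable without network access.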

Running the program may not download every image due to network issues, but repeated executions increase the success rate.
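
As an alternative to re-running the whole program, transient network failures can also be retried per download. The article does not do this; the helper below (RetrySketch.withRetry, a hypothetical name) is a small sketch of that swapped-in technique, with a pause that grows linearly between attempts.

```java
import java.util.concurrent.Callable;

class RetrySketch {
    // Run the task up to maxAttempts times, pausing between failed attempts.
    static <T> T withRetry(Callable<T> task, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call(); // success: return immediately
            } catch (Exception e) {
                last = e; // remember the failure in case all attempts fail
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis * attempt); // wait longer each time
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }
}
```

A download task could then be wrapped as RetrySketch.withRetry(() -> { download(url); return null; }, 3, 500L), where download stands for whatever method performs the actual transfer.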

Conclusion: By analyzing the Sogou API, extracting image URLs, and employing a multithreaded Java downloader, large‑scale image collection can be automated efficiently.
