Crawling and Downloading Thousands of Images from Sogou Using Java
This tutorial explains how to crawl thousands of images from Sogou using Java, detailing the request URL analysis, parameter extraction, multithreaded downloading logic, and providing complete source code for the image processor, pipeline, and HTTP utility classes.
Purpose
Crawl thousands of images of a given keyword from Sogou Image and save them locally.
Preparation
Target URL: https://pic.sogou.com/pics?query=美女
Analysis
Open the URL in a browser, open the developer tools (Network → XHR), and scroll down the page to capture the XHR request that loads the next batch of images. The request URL looks like:
https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=美女
Key parameters:
start=48 – start index of the image batch.
xml_len=48 – number of images to fetch per request.
query=美女 – search keyword (the browser URL-encodes it automatically).
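The parameters above can also be assembled into a request URL in code. The following is a minimal sketch, not the tutorial's own code: the class name `SearchUrlBuilder` and the method `build` are hypothetical, and the keyword is encoded explicitly with `URLEncoder` instead of relying on the browser to do it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Locale;

// Hypothetical helper for illustration; it is not part of the tutorial's source code.
class SearchUrlBuilder {
    // Endpoint and parameter names are the ones seen in the captured XHR request.
    private static final String TEMPLATE =
            "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%d&xml_len=%d&query=%s";

    /** Builds the URL for one batch: start is the index of the first image, len the batch size. */
    static String build(int start, int len, String keyword) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return String.format(Locale.ROOT, TEMPLATE, start, len, encoded);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The tutorial's keyword, written with Unicode escapes
        System.out.println(build(48, 48, "\u7f8e\u5973"));
        // prints https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=%E7%BE%8E%E5%A5%B3
    }
}
```

Calling `build(0, 48, keyword)`, `build(48, 48, keyword)`, and so on walks through the result pages one batch at a time.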
Approach
Based on the analysis, the download process follows these steps:
Set the request URL and its parameters.
Send HTTP requests to obtain image URLs.
Collect the URLs into a list.
Iterate over the list and download images concurrently using a thread pool.
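The last two steps can be sketched independently of Sogou. The example below is an illustration under stated assumptions rather than the tutorial's pipeline class: `DownloadPoolSketch` and `runAll` are hypothetical names, the download itself is passed in as a `Predicate<String>` so the sketch runs without network access, and a fixed-size pool is used to cap the number of threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch; the names are not taken from the tutorial's source code.
class DownloadPoolSketch {

    /**
     * Runs the task for every non-null URL on a fixed-size thread pool.
     * The task returns true on success. Returns {succeeded, failed}.
     */
    static int[] runAll(List<String> urls, int threads, Predicate<String> task) {
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            if (url == null) continue; // skip entries without a URL
            pool.execute(() -> {
                try {
                    if (task.test(url)) ok.incrementAndGet();
                    else failed.incrementAndGet();
                } catch (RuntimeException e) {
                    failed.incrementAndGet(); // a throwing task counts as a failure
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let the queued ones finish
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return new int[] {ok.get(), failed.get()};
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://example.com/a.jpg", null, "https://example.com/b.png");
        // Stand-in for a real download: only .jpg URLs "succeed"
        int[] result = runAll(urls, 4, u -> u.endsWith(".jpg"));
        System.out.println("Success: " + result[0] + ", Failed: " + result[1]);
        // prints Success: 1, Failed: 1
    }
}
```

Passing the download step in as a parameter keeps the pooling logic testable; in a real crawler the predicate would wrap the actual HTTP download.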
Code
SougouImgProcessor.java – image crawling class
import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;
import java.util.ArrayList;
import java.util.List;
/**
* A simple PageProcessor.
* @author [email protected]
* @since 0.1.0
*/
public class SougouImgProcessor {

    private String url;
    private SougouImgPipeline pipeline;
    private List<JSONObject> dataList;
    private List<String> urlList;
    private String word;

    public SougouImgProcessor(String url, String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }

    // Fetch one batch of search results and collect the image URLs
    @SuppressWarnings("unchecked")
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        if (object == null || object.get("data") == null) {
            return; // no usable response for this batch
        }
        List<JSONObject> items = (List<JSONObject>) ((JSONObject) object.get("data")).get("items");
        if (items == null) {
            return;
        }
        for (JSONObject item : items) {
            this.urlList.add(item.getString("picUrl"));
        }
        this.dataList.addAll(items);
    }

    // Download
    public void pipelineData() {
        // Multi-threaded download
        pipeline.processSync(this.urlList, this.word);
    }

    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url, "美女");
        int start = 0, size = 50, limit = 1000; // start index, batch size, total number
        for (int i = start; i < start + limit; i += size) {
            processor.process(i, size);
        }
        processor.pipelineData();
    }
}
SougouImgPipeline.java – image download class
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
/**
* Store results in files.
* @author [email protected]
* @since 0.1.0
*/
public class SougouImgPipeline {

    private String extension = ".jpg";
    private String path;
    // AtomicInteger is thread-safe on its own, so the fields are final rather than volatile
    private final AtomicInteger suc = new AtomicInteger();
    private final AtomicInteger fails = new AtomicInteger();

    public SougouImgPipeline() {
        setPath("E:/pipeline/sougou");
    }

    public SougouImgPipeline(String path) {
        setPath(path);
    }

    public SougouImgPipeline(String path, String extension) {
        setPath(path);
        this.extension = extension;
    }

    public void setPath(String path) {
        this.path = path;
    }

    /**
     * Download a single image.
     */
    private void downloadImg(String url, String cate, String name) throws Exception {
        String path = this.path + "/" + cate + "/";
        File dir = new File(path);
        if (!dir.exists()) {
            dir.mkdirs(); // create directory if not exists
        }
        // Use the extension found in the URL; fall back to the default when it is missing or malformed
        int dot = url.lastIndexOf('.');
        String realExt = dot >= 0 ? url.substring(dot) : "";
        if (realExt.length() < 2 || realExt.length() > 5 || realExt.contains("/") || realExt.contains("?")) {
            realExt = this.extension;
        }
        String fileName = (name + realExt).replace("-", "");
        File img = new File(path + fileName);
        if (img.exists()) {
            System.out.println(String.format("File %s already exists", fileName));
            return;
        }
        URLConnection con = new URL(url).openConnection();
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        // try-with-resources closes both streams even when the download fails
        try (InputStream inputStream = con.getInputStream();
             FileOutputStream os = new FileOutputStream(img)) {
            byte[] bs = new byte[1024];
            int len;
            while ((len = inputStream.read(bs)) != -1) {
                os.write(bs, 0, len);
            }
        } catch (Exception e) {
            img.delete(); // remove the partial file so that a re-run retries it
            throw e;
        }
        System.out.println("picUrl: " + url);
        System.out.println(String.format("Downloaded image %s", suc.incrementAndGet()));
    }

    // Single-thread processing
    public void process(List<String> data, String word) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < data.size(); i++) {
            String picUrl = data.get(i);
            if (picUrl == null) continue;
            try {
                // Name the file by its zero-padded index; the URL itself contains characters that are invalid in file names
                downloadImg(picUrl, word, String.format("%04d", i));
            } catch (Exception e) {
                fails.incrementAndGet();
            }
        }
        System.out.println("Success: " + suc);
        System.out.println("Failed: " + fails);
        long end = System.currentTimeMillis();
        System.out.println("Time elapsed: " + (end - start) / 1000 + " seconds");
    }

    // Multi-threaded processing
    public void processSync(List<String> data, String word) {
        long start = System.currentTimeMillis();
        // Note: a cached pool is unbounded and may start one thread per URL
        ExecutorService executorService = Executors.newCachedThreadPool();
        for (int i = 0; i < data.size(); i++) {
            String picUrl = data.get(i);
            if (picUrl == null) continue;
            // Zero-padded index as file name (e.g. 0007); stays unique for i >= 1000 as well
            final String finalName = String.format("%04d", i);
            executorService.execute(() -> {
                try {
                    downloadImg(picUrl, word, finalName);
                } catch (Exception e) {
                    fails.incrementAndGet();
                }
            });
        }
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(60, TimeUnit.SECONDS)) {
                // timeout handling: ask the pool to stop the tasks that are still running
                executorService.shutdownNow();
                System.out.println("Timed out after 60 seconds; some downloads may be incomplete");
            }
            System.out.println("AwaitTermination Finished");
            System.out.println("Total URLs: " + data.size());
            System.out.println("Success: " + suc);
            System.out.println("Failed: " + fails);
            File dir = new File(this.path + "/" + word + "/");
            String[] files = dir.list(); // null when the directory was never created
            int len = files == null ? 0 : files.length;
            System.out.println("Current file count: " + len);
            long end = System.currentTimeMillis();
            System.out.println("Time elapsed: " + (end - start) / 1000.0 + " seconds");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
HttpClientUtils.java – HTTP request utility class (excerpt)
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* HTTP client utilities.
*/
public abstract class HttpClientUtils {

    public static String get(String url) {
        return get(url, "UTF-8");
    }

    public static String get(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        return executeRequest(httpGet, charset);
    }

    public static String ajaxGet(String url) {
        return ajaxGet(url, "UTF-8");
    }

    public static String ajaxGet(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        // Mark the request as an XHR call
        httpGet.setHeader("X-Requested-With", "XMLHttpRequest");
        return executeRequest(httpGet, charset);
    }

    public static String post(String url, Map<String, String> dataMap) {
        return post(url, dataMap, "UTF-8");
    }

    public static String post(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        try {
            if (dataMap != null) {
                // Convert the map into form fields
                List<NameValuePair> nvps = new ArrayList<>();
                for (Map.Entry<String, String> entry : dataMap.entrySet()) {
                    nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
                }
                UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(nvps, charset);
                formEntity.setContentEncoding(charset);
                httpPost.setEntity(formEntity);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return executeRequest(httpPost, charset);
    }

    // ... (other utility methods omitted for brevity)

    public static String executeRequest(HttpUriRequest httpRequest, String charset) {
        String result = "";
        // try-with-resources closes the response first, then the client, even on failure
        try (CloseableHttpClient httpclient = "https".equals(httpRequest.getURI().getScheme())
                     ? createSSLInsecureClient()
                     : HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(httpRequest)) {
            HttpEntity entity = response.getEntity();
            result = EntityUtils.toString(entity, charset);
            EntityUtils.consume(entity);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return result;
    }

    // Insecure by design: trusts every certificate and skips host name verification
    public static CloseableHttpClient createSSLInsecureClient() {
        try {
            SSLContext sslContext = new SSLContextBuilder()
                    .loadTrustMaterial((X509Certificate[] chain, String authType) -> true)
                    .build();
            SSLConnectionSocketFactory sslsf =
                    new SSLConnectionSocketFactory(sslContext, (hostname, session) -> true);
            return HttpClients.custom().setSSLSocketFactory(sslsf).build();
        } catch (GeneralSecurityException ex) {
            throw new RuntimeException(ex);
        }
    }
}
Run
Some downloads may fail because of network instability. Files that already exist are skipped, so re-running the program a few times mainly retries the missing images and raises the success rate.
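Instead of re-running the whole program, a failed download could also be retried in place. The sketch below shows one way to do that under stated assumptions: `RetrySketch` and `withRetry` are hypothetical names that do not appear in the tutorial's source, and the linearly growing pause between attempts is an arbitrary choice.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical retry helper; it is not part of the tutorial's source code.
class RetrySketch {

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * Waits delayMillis * attempt between attempts (linear back-off).
     */
    static <T> T withRetry(Callable<T> action, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis * attempt);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break; // stop retrying when interrupted
                    }
                }
            }
        }
        throw new IllegalStateException("All attempts failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated download that times out twice before succeeding
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("simulated timeout");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints ok after 3 attempts
    }
}
```

Wrapping a single image download in such a helper would retry only the image that failed, while the existing file check still protects images that were already saved.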
Enjoy the results!