How to Scrape and Download Thousands of Sogou Images with Java
This guide explains how to analyze Sogou's image search XHR request, extract image URLs from the JSON response, and use Java with multithreaded HTTP requests to download thousands of pictures efficiently, including full source code and execution tips.
Purpose
Scrape thousands of beautiful images from Sogou and download them locally.
Preparation
Target URL: https://pic.sogou.com/pics?query=美女
Analysis
Open the URL above, open the browser developer tools (Network → XHR), and scroll down the results page. As more images load, a request like this appears:
Request URL: https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=美女
Key parameters:
start=48 – index of the first image to return
xml_len=48 – number of images to return per request
query – search keyword (e.g., 美女)
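As a rough illustration of how these parameters combine, the sketch below builds one request URL per page by stepping start in increments of xml_len. The class and method names (SearchUrlBuilder, pageUrl, pageUrls) are hypothetical, and percent-encoding the keyword with URLEncoder is an added precaution rather than something the original request shows.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class SearchUrlBuilder {
    // Endpoint and parameter names are taken from the XHR request shown above.
    static final String ENDPOINT = "https://pic.sogou.com/napi/pc/searchList";

    // Builds the URL for one page of results.
    static String pageUrl(String keyword, int start, int pageSize) {
        try {
            String q = URLEncoder.encode(keyword, "UTF-8"); // percent-encode non-ASCII keywords
            return ENDPOINT + "?mode=1&start=" + start + "&xml_len=" + pageSize + "&query=" + q;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Builds one URL per page until `total` images are covered.
    static List<String> pageUrls(String keyword, int total, int pageSize) {
        List<String> urls = new ArrayList<>();
        for (int start = 0; start < total; start += pageSize) {
            urls.add(pageUrl(keyword, start, pageSize));
        }
        return urls;
    }

    public static void main(String[] args) {
        // The article's keyword, written with Unicode escapes to stay encoding-independent.
        for (String url : pageUrls("\u7f8e\u5973", 144, 48)) {
            System.out.println(url);
        }
    }
}
```

Running main prints three URLs with start=0, start=48, and start=96.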
The response is JSON. Each entry under data.items carries the image URL in its picUrl field, which is the path the code below reads.
Paste the response into a JSON formatter to inspect the full structure.
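To make the picUrl extraction concrete without any third-party dependency, here is a minimal sketch that pulls every picUrl value out of a response string with a regular expression. The sample JSON in main is a hypothetical, heavily trimmed stand-in for a real response, and the regex is only a quick illustration; a real JSON parser (the article's code uses fastjson) is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PicUrlExtractor {
    // Matches "picUrl":"<value>" and captures the value (no embedded quotes expected in a URL).
    private static final Pattern PIC_URL = Pattern.compile("\"picUrl\"\\s*:\\s*\"([^\"]*)\"");

    static List<String> extract(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = PIC_URL.matcher(json);
        while (m.find()) {
            // JSON may escape slashes as \/ ; undo that so the URL is usable.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like data.items[].picUrl.
        String sample = "{\"data\":{\"items\":["
                + "{\"title\":\"a\",\"picUrl\":\"https://example.com/img/1.jpg\"},"
                + "{\"title\":\"b\",\"picUrl\":\"https:\\/\\/example.com\\/img\\/2.png\"}]}}";
        System.out.println(extract(sample));
    }
}
```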
Approach
Set URL request parameters.
Send the request and extract image URLs.
Store URLs in a list.
Iterate the list and download images using a thread pool.
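The last step can be sketched in isolation as follows. This is not the article's implementation: it assumes a fixed-size pool (so the number of simultaneous connections stays bounded) and takes the download action as a parameter so the example runs without network access. The names BoundedDownloadRunner and runAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

class BoundedDownloadRunner {
    // Runs `download` for every URL on at most `threads` workers and returns the success count.
    // `download` returns true on success; an exception thrown by it counts as a failure.
    static int runAll(List<String> urls, int threads, Predicate<String> download) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int ok = 0;
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> download.test(url));
            }
            // invokeAll blocks until every task has finished.
            for (Future<Boolean> f : pool.invokeAll(tasks)) {
                try {
                    if (f.get()) {
                        ok++;
                    }
                } catch (ExecutionException e) {
                    // The download threw; treat it as a failed image.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdown();
        }
        return ok;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a.jpg", "b.png", "c.jpg");
        // Stub download that "succeeds" only for .jpg names.
        int ok = runAll(urls, 2, u -> u.endsWith(".jpg"));
        System.out.println("succeeded: " + ok + " of " + urls.size());
    }
}
```

A bounded pool trades some speed for predictability; the pool size is a tuning knob and the value 2 here is arbitrary.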
Code
SougouImgProcessor.java – image crawling class
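// Adjust the two project imports below to the packages where you place
// HttpClientUtils and SougouImgPipeline (both classes are listed in this article).
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;
import java.util.ArrayList;
import java.util.List;

public class SougouImgProcessor {
    private final String url;   // format string with start, xml_len and query placeholders
    private final SougouImgPipeline pipeline;
    private final List<JSONObject> dataList;
    private final List<String> urlList;
    private final String word;

    public SougouImgProcessor(String url, String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }

    // Fetches one page of results and collects the picUrl of every item.
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        // A failed request yields an empty body, so guard each level before reading items.
        JSONObject data = object == null ? null : object.getJSONObject("data");
        JSONArray items = data == null ? null : data.getJSONArray("items");
        if (items == null) {
            System.out.println("No items returned for start=" + idx);
            return;
        }
        for (int i = 0; i < items.size(); i++) {
            JSONObject item = items.getJSONObject(i);
            this.urlList.add(item.getString("picUrl"));
            this.dataList.add(item);
        }
    }

    // Hands the collected URLs to the pipeline for download.
    public void pipelineData() {
        pipeline.processSync(this.urlList, this.word);
    }

    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url, "美女");
        // Request `limit` images in pages of `size`.
        int start = 0, size = 50, limit = 1000;
        for (int i = start; i < start + limit; i += size) {
            processor.process(i, size);
        }
        processor.pipelineData();
    }
}
SougouImgPipeline.java – image download class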
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SougouImgPipeline {
    private String extension = ".jpg";   // fallback when the URL has no usable extension
    private String path;
    private final AtomicInteger suc = new AtomicInteger();
    private final AtomicInteger fails = new AtomicInteger();

    public SougouImgPipeline() {
        this("E:/pipeline/sougou");   // default output directory; change it for your machine
    }

    public SougouImgPipeline(String path) {
        setPath(path);
    }

    public SougouImgPipeline(String path, String extension) {
        setPath(path);
        this.extension = extension;
    }

    public void setPath(String path) {
        this.path = path;
    }

    // Downloads one image into <path>/<cate>/<name><ext>, skipping files that already exist.
    private void downloadImg(String url, String cate, String name) throws Exception {
        String path = this.path + "/" + cate + "/";
        File dir = new File(path);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        // Take the extension from the URL only when it looks like one; otherwise use the fallback.
        int dot = url.lastIndexOf('.');
        String realExt = dot < 0 ? "" : url.substring(dot);
        if (!realExt.matches("\\.[A-Za-z0-9]{1,5}")) {
            realExt = this.extension;
        }
        String fileName = (name + realExt).replace("-", "");
        File img = new File(path + fileName);
        if (img.exists()) {
            System.out.println(String.format("File %s already exists locally", fileName));
            return;
        }
        URLConnection con = new URL(url).openConnection();
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        // try-with-resources closes both streams even when the transfer fails.
        try (InputStream inputStream = con.getInputStream();
             OutputStream os = new FileOutputStream(img)) {
            byte[] bs = new byte[1024];
            int len;
            while ((len = inputStream.read(bs)) != -1) {
                os.write(bs, 0, len);
            }
        } catch (Exception e) {
            img.delete();   // drop the partial file so a later run retries it
            throw e;
        }
        System.out.println("picUrl: " + url);
        System.out.println(String.format("Downloaded image #%d", suc.incrementAndGet()));
    }

    // Sequential variant: downloads one image at a time.
    public void process(List<String> data, String word) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < data.size(); i++) {
            String picUrl = data.get(i);
            if (picUrl == null) continue;
            try {
                // Name files by index; a raw URL contains characters that are invalid in file names.
                downloadImg(picUrl, word, String.format("%04d", i));
            } catch (Exception e) {
                fails.incrementAndGet();
            }
        }
        System.out.println("Succeeded: " + suc.get());
        System.out.println("Failed: " + fails.get());
        long end = System.currentTimeMillis();
        System.out.println("Elapsed: " + (end - start) / 1000 + " s");
    }

    // Despite its name, this variant downloads concurrently on a thread pool.
    public void processSync(List<String> data, String word) {
        long start = System.currentTimeMillis();
        // A cached pool creates threads on demand without an upper bound.
        ExecutorService executorService = Executors.newCachedThreadPool();
        for (int i = 0; i < data.size(); i++) {
            String picUrl = data.get(i);
            if (picUrl == null) continue;
            // Zero-pad to four digits (0000, 0001, ...) so files sort in list order.
            final String finalName = String.format("%04d", i);
            executorService.execute(() -> {
                try {
                    downloadImg(picUrl, word, finalName);
                } catch (Exception e) {
                    fails.incrementAndGet();
                }
            });
        }
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(60, TimeUnit.SECONDS)) {
                // Timed out: stop waiting and ask the pool to interrupt its workers.
                executorService.shutdownNow();
                System.out.println("Timed out after 60 seconds with downloads still pending");
            }
            System.out.println("AwaitTermination Finished");
            System.out.println("Total URLs: " + data.size());
            System.out.println("Succeeded: " + suc);
            System.out.println("Failed: " + fails);
            String[] files = new File(this.path + "/" + word + "/").list();
            int len = files == null ? 0 : files.length;   // null when the directory does not exist
            System.out.println("Files now in directory: " + len);
            long end = System.currentTimeMillis();
            System.out.println("Elapsed: " + (end - start) / 1000.0 + " s");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            e.printStackTrace();
        }
    }
}
HttpClientUtils.java – HTTP request utility
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public abstract class HttpClientUtils {
    public static final Logger logger = LoggerFactory.getLogger(HttpClientUtils.class);

    // Groups header values by header name.
    public static Map<String, List<String>> convertHeaders(Header[] headers) {
        Map<String, List<String>> results = new HashMap<>();
        for (Header header : headers) {
            results.computeIfAbsent(header.getName(), k -> new ArrayList<>()).add(header.getValue());
        }
        return results;
    }

    public static String get(String url) {
        return get(url, "UTF-8");
    }

    public static String get(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        return executeRequest(httpGet, charset);
    }

    public static String ajaxGet(String url) {
        return ajaxGet(url, "UTF-8");
    }

    public static String ajaxGet(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("X-Requested-With", "XMLHttpRequest");
        return executeRequest(httpGet, charset);
    }

    public static String post(String url, Map<String, String> dataMap) {
        return post(url, dataMap, "UTF-8");
    }

    public static String post(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        setFormEntity(httpPost, dataMap, charset);
        return executeRequest(httpPost, charset);
    }

    public static String ajaxPost(String url, Map<String, String> dataMap) {
        return ajaxPost(url, dataMap, "UTF-8");
    }

    public static String ajaxPost(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("X-Requested-With", "XMLHttpRequest");
        setFormEntity(httpPost, dataMap, charset);
        return executeRequest(httpPost, charset);
    }

    // Shared by post and ajaxPost: attaches the map as a URL-encoded form body.
    // No Content-Encoding header is set: that header describes compression, not the character
    // set, and the charset is already part of the entity's Content-Type.
    private static void setFormEntity(HttpPost httpPost, Map<String, String> dataMap, String charset) {
        if (dataMap == null) {
            return;
        }
        List<NameValuePair> nvps = new ArrayList<>();
        for (Map.Entry<String, String> entry : dataMap.entrySet()) {
            nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        try {
            httpPost.setEntity(new UrlEncodedFormEntity(nvps, charset));
        } catch (UnsupportedEncodingException e) {
            logger.error("Unsupported charset: {}", charset, e);
        }
    }

    public static String ajaxPostJson(String url, String jsonString) {
        return ajaxPostJson(url, jsonString, "UTF-8");
    }

    public static String ajaxPostJson(String url, String jsonString, String charset) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("X-Requested-With", "XMLHttpRequest");
        StringEntity stringEntity = new StringEntity(jsonString, charset);
        stringEntity.setContentType("application/json");
        httpPost.setEntity(stringEntity);
        return executeRequest(httpPost, charset);
    }

    public static String executeRequest(HttpUriRequest httpRequest) {
        return executeRequest(httpRequest, "UTF-8");
    }

    // Executes the request and returns the body as a string, or "" when the request fails.
    public static String executeRequest(HttpUriRequest httpRequest, String charset) {
        boolean https = "https".equals(httpRequest.getURI().getScheme());
        // try-with-resources closes the response and the client in every case.
        try (CloseableHttpClient httpclient = https ? createSSLInsecureClient() : HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(httpRequest)) {
            HttpEntity entity = response.getEntity();
            return entity == null ? "" : EntityUtils.toString(entity, charset);
        } catch (IOException ex) {
            logger.error("Request failed: {}", httpRequest.getURI(), ex);
            return "";
        }
    }

    // WARNING: this client trusts every certificate and every host name.
    // That is acceptable for a throwaway crawler but unsafe for anything handling sensitive data.
    public static CloseableHttpClient createSSLInsecureClient() {
        try {
            SSLContext sslContext = new SSLContextBuilder()
                    .loadTrustMaterial((X509Certificate[] chain, String authType) -> true)
                    .build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
                    sslContext, (String hostname, SSLSession session) -> true);
            return HttpClients.custom().setSSLSocketFactory(sslsf).build();
        } catch (GeneralSecurityException ex) {
            throw new RuntimeException(ex);
        }
    }
}
Run
Network instability causes some downloads to fail. Files that already exist locally are skipped, so running the program again fetches only the missing images and raises the overall success rate.
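An alternative to rerunning the whole program is to retry each failed download a few times in place. The sketch below is one possible way to do that, not part of the original code: Retry, ThrowingSupplier, and withRetry are hypothetical names, and the doubling delay between attempts is an assumed policy.

```java
import java.io.IOException;

class Retry {
    // Like java.util.function.Supplier, but allowed to throw checked exceptions.
    interface ThrowingSupplier<T> {
        T get() throws Exception;
    }

    // Calls `action` up to `maxAttempts` times, doubling the pause after each failure.
    // Throws IllegalStateException (wrapping the last error) if every attempt fails.
    static <T> T withRetry(ThrowingSupplier<T> action, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (Exception e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        throw new IllegalStateException("All " + maxAttempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky download: fails twice, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated timeout");
            }
            return "downloaded";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the pipeline, a call such as withRetry could wrap the body of downloadImg; whether three or five attempts is appropriate depends on how unstable the connection is.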
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
