Below is a simple Java program that crawls a website for images and downloads them to a local folder:
```java
import java.io.*;
import java.net.*;
import java.util.*;

public class ImageSpider {
    public static void main(String[] args) {
        // Target site URL and local directory to save images into
        String url = "https://www.example.com";
        String saveDir = "D:/images/";
        // Fetch the page's HTML source
        String html = getHtml(url);
        // Extract the image URLs it contains
        String[] imgUrls = extractImgUrls(html);
        // Download each image to the local directory
        for (String imgUrl : imgUrls) {
            downloadImage(imgUrl, saveDir);
        }
    }

    // Fetch the page's HTML source
    public static String getHtml(String url) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openConnection().getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html.toString();
    }

    // Extract image URLs by scanning for src="..." inside <img> tags.
    // This string-splitting approach is simple but fragile: it assumes
    // double-quoted src attributes and absolute URLs.
    public static String[] extractImgUrls(String html) {
        List<String> result = new ArrayList<>();
        String[] chunks = html.split("<img");
        for (int i = 1; i < chunks.length; i++) {
            int start = chunks[i].indexOf("src=\"");
            if (start < 0) {
                continue; // this tag has no double-quoted src attribute
            }
            start += "src=\"".length();
            int end = chunks[i].indexOf('"', start);
            if (end < 0) {
                continue; // unterminated attribute value
            }
            result.add(chunks[i].substring(start, end));
        }
        return result.toArray(new String[0]);
    }

    // Download one image into the local directory
    public static void downloadImage(String imgUrl, String saveDir) {
        try {
            URL url = new URL(imgUrl);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5 * 1000);
            conn.setReadTimeout(5 * 1000);
            byte[] data;
            try (InputStream inStream = conn.getInputStream()) {
                data = readInputStream(inStream);
            }
            File dir = new File(saveDir);
            dir.mkdirs(); // make sure the target directory exists
            File file = new File(dir, imgUrl.substring(imgUrl.lastIndexOf('/') + 1));
            try (FileOutputStream outStream = new FileOutputStream(file)) {
                outStream.write(data);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Read an input stream fully into a byte array
    public static byte[] readInputStream(InputStream inputStream) throws IOException {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, len);
        }
        return outputStream.toByteArray();
    }
}
```
This program first fetches the HTML source of the given URL, then extracts all image URLs from it. Finally, it uses Java's I/O facilities to download those images and save them to the specified local folder.
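As a sketch of a more robust alternative to the string-splitting extraction above, the `src` attributes could also be pulled out with a regular expression. The pattern below still assumes double-quoted attributes; for production use, a real HTML parser such as jsoup would be more reliable:

```java
import java.util.*;
import java.util.regex.*;

public class ImgSrcExtractor {
    // Extract double-quoted src attributes from <img> tags via regex.
    // CASE_INSENSITIVE also matches <IMG>; DOTALL lets a tag span lines.
    static List<String> extract(String html) {
        Pattern p = Pattern.compile("<img[^>]*\\ssrc=\"([^\"]+)\"",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        List<String> urls = new ArrayList<>();
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the URL between the quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"https://www.example.com/a.png\">"
                + "<IMG alt=\"x\" src=\"https://www.example.com/b.jpg\"></p>";
        System.out.println(extract(html));
    }
}
```

Unlike the split-based version, this handles attributes in any order and uppercase tags, and simply yields no match for `<img>` tags without a `src`.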
Note that this program is only a simple demonstration. In real use you need to address robustness and correctness concerns, such as setting timeouts and checking whether files already exist. In addition, a crawler must respect the target site's crawling policy (e.g. robots.txt) and applicable laws, and must not infringe on others' rights.
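Two of the caveats above can be sketched with small helpers: resolving relative image links against the page URL, and skipping files that already exist. The names `resolve` and `alreadyDownloaded` are hypothetical, not part of the program above:

```java
import java.io.File;
import java.net.URI;

public class CrawlHelpers {
    // Resolve a possibly-relative image URL against the page URL,
    // so links like "/img/a.png" become absolute before downloading.
    static String resolve(String pageUrl, String imgUrl) {
        return URI.create(pageUrl).resolve(imgUrl).toString();
    }

    // Check whether an image was already saved, so re-runs
    // don't redownload everything.
    static boolean alreadyDownloaded(String saveDir, String imgUrl) {
        String name = imgUrl.substring(imgUrl.lastIndexOf('/') + 1);
        return new File(saveDir, name).exists();
    }

    public static void main(String[] args) {
        System.out.println(resolve("https://www.example.com/page/index.html", "/img/a.png"));
    }
}
```

`URI.resolve` follows the standard RFC 3986 rules, so both root-relative (`/img/a.png`) and document-relative (`a.png`) links come out absolute.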