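Introduction
This post implements a web crawler that must collect at least 100 records from a page. The example code targets Douban Music, using the folk tag page: 豆瓣音乐标签: 民谣 (douban.com).
Implementation
Apart from the added crawler, the rest of the code works the same way as in the earlier posts on this blog. To keep things separate, we create a new database named music. As before, the vo package holds the data class, that is, the constructors and accessors that can be generated automatically. The dao package holds the database functions, mainly the four basic operations: insert, delete, update, and query. The util package holds the connection helper that links Java to the database. The ui package holds the main function, which calls the other functions. The service package holds the crawler functions, which scrape the data from the target pages.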
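The MusicService class defines several lists that hold the different fields of the scraped music records:
- musicName: stores the album name.
- musicURLaddress: stores the album URL.
- musicScore: stores the album rating (score).
- musicPeople: stores the number of people who rated the album.
- musicSinger: stores the singer or artist name.
- musicTime: stores the album release date.
- musicType: stores the album type.
- musicMedium: stores the album medium (for example, CD or vinyl).
- musicSect: stores the genre (optional).
- musicBarcode: stores the barcode (optional).
These lists collect the scraped data, which is then inserted into the database.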
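Because the lists are filled in lockstep, the entries at the same index across all lists describe one album. The sketch below illustrates that convention with two of the lists. The class name ParallelListsDemo, its helper methods, and the sample values are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

final class ParallelListsDemo {
    // Two of the parallel lists; index i in each list refers to the same album.
    static final List<String> musicName = new ArrayList<>();
    static final List<String> musicScore = new ArrayList<>();

    // Append one album to every list at once so the indexes stay aligned.
    static void addAlbum(String name, String score) {
        musicName.add(name);
        musicScore.add(score);
    }

    // Rebuild a one-line description of album i from the parallel lists.
    static String describe(int i) {
        return musicName.get(i) + " (" + musicScore.get(i) + ")";
    }

    public static void main(String[] args) {
        addAlbum("Album A", "8.9"); // hypothetical sample values
        addAlbum("Album B", "");    // an empty score is still added to keep alignment
        for (int i = 0; i < musicName.size(); i++) {
            System.out.println(describe(i));
        }
    }
}
```

The important point is that every list receives exactly one value per album, even when a field is missing; otherwise later indexes would point at the wrong album.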
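getData() method
This is the main method that starts the scraping process:
- User agent: the string imitates a browser request so that the request appears to come from a real browser, which helps avoid being blocked by the site.
- Loop over pages: the method loops over 5 pages (100 items, assuming 20 items per page). In each iteration it builds the URL of the current page and calls getMusicInfo() to scrape that page.
- Sleep for 1 second: Thread.sleep(1000) adds a delay so that the site is not flooded with requests, which lowers the chance of triggering its anti-scraping measures.
- Insert into the database: after all pages have been scraped, it calls insertMusicInfoToDB() to store the collected data in the database.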
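The paging logic can be sketched on its own, without any network access. The example below only builds the page URLs by setting the start offset to the page index times 20. The class name PageUrlDemo and its methods are hypothetical helpers for illustration; the URL pattern is the one used in the full code.

```java
import java.util.ArrayList;
import java.util.List;

final class PageUrlDemo {
    static final int PAGE_SIZE = 20; // items per page, as assumed in the text

    // Build the URL of one page (0-based index) by setting the start offset.
    static String pageUrl(int pageIndex) {
        return "https://music.douban.com/tag/民谣?start=" + (pageIndex * PAGE_SIZE) + "&type=T";
    }

    // Build the URLs of the first `pages` pages.
    static List<String> pageUrls(int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < pages; i++) {
            urls.add(pageUrl(i));
        }
        return urls;
    }

    public static void main(String[] args) {
        // 5 pages x 20 items = 100 items
        for (String url : pageUrls(5)) {
            System.out.println(url);
        }
    }
}
```

With 5 pages the offsets are 0, 20, 40, 60, and 80, which covers 100 items if every page is full.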
Corresponding HTML:
Clicking a link opens the detail page of each entry:
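getMusicInfo() method
This method does the actual scraping of a given page:
- Document retrieval: the method uses Jsoup to connect to the URL and retrieve the HTML document.
- Element selection: it then selects all elements matching the selector .item, each of which represents one music record.
- Data extraction: for each record it extracts the name, URL, score, and number of ratings, plus further details such as singer, release date, type, and medium, and appends them to the corresponding lists.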
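The text processing that follows the Jsoup selection can be shown without Jsoup. The sketch below assumes the detail line has the form "singer / date / type / medium / genre / barcode", as the full code does, and fills missing trailing fields with empty strings. The class name InfoLineDemo, its methods, and the sample inputs are hypothetical.

```java
final class InfoLineDemo {
    // Return the field at index i, or "" when the line has fewer fields.
    static String field(String[] parts, int i) {
        return i < parts.length ? parts[i].trim() : "";
    }

    // Split a detail line into six slots:
    // singer, release date, type, medium, genre, barcode.
    static String[] parseInfo(String line) {
        String[] parts = line.trim().split(" / ");
        String[] result = new String[6];
        for (int i = 0; i < result.length; i++) {
            result[i] = field(parts, i);
        }
        return result;
    }

    // Strip spaces, parentheses, and the "人评价" suffix from the rating count text.
    static String parsePeople(String text) {
        return text.replace(" ", "").replace("人评价", "").replace("(", "").replace(")", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample line with only four fields
        String[] info = parseInfo("Some Singer / 2020-01-01 / Album / CD");
        System.out.println(String.join("|", info));
        System.out.println(parsePeople("( 1234 )"));
    }
}
```

Always producing six slots keeps the ten lists aligned even when a record has an incomplete detail line.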
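insertMusicInfoToDB() method
This method inserts the collected data into the database:
- Looping over data: the method iterates over all collected data (from the lists) and creates an Information object for each music record.
- Parsing data: it tries to parse the score and the number of ratings from strings into float and int. If parsing fails, it falls back to default values (0.0f for score and 0 for people).
- Inserting into the database: it then calls InformationDAO.insert(info) to insert the record. The result of each insert is stored in a Map that maps the music name to a boolean indicating whether the insert succeeded.
- Logging the result: after each insert, it prints whether the insert succeeded.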
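The parse-with-default step can be isolated as two small helpers. This is a minimal sketch of the fallback behavior described above; the class name ParseFallbackDemo and its method names are hypothetical and do not appear in the original code.

```java
final class ParseFallbackDemo {
    // Parse a score such as "8.9"; fall back to 0.0f when the text is not a number.
    static float parseScore(String text) {
        try {
            return Float.parseFloat(text);
        } catch (NumberFormatException e) {
            return 0.0f;
        }
    }

    // Parse a rating count such as "1234"; fall back to 0 when the text is not a number.
    static int parsePeople(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseScore("8.9"));
        System.out.println(parseScore(""));     // empty text falls back to the default
        System.out.println(parsePeople("1234"));
        System.out.println(parsePeople("n/a")); // non-numeric text falls back to the default
    }
}
```

One consequence of this design is that a stored score of 0.0 or a count of 0 can mean either a real zero or a value that failed to parse.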
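Summary
- Web scraping: the getData() and getMusicInfo() methods scrape the data from the target pages.
- Data collection: the data is collected into several lists.
- Database insertion: the insertMusicInfoToDB() method inserts the collected data into the database, parsing each record before it is stored.
Results
Full code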
ui---Driver
package ui;
import service.MusicService;
import java.io.IOException;
public class Driver {
public static void main(String[] args) throws IOException, InterruptedException {
MusicService.getData();
}
}
vo---Information
package vo;
public class Information {
private int id;
private String musicName;
private String singer;
private String time;
private String type;
private String medium;
private String sect;
private String barCode;
private float score;
private int people;
private String urlAddress;
public Information() {
}
public Information(int id, String musicName, String singer, String time, String type, String medium, String sect, String barCode, float score, int people, String urlAddress) {
this.id = id;
this.musicName = musicName;
this.singer = singer;
this.time = time;
this.type = type;
this.medium = medium;
this.sect = sect;
this.barCode = barCode;
this.score = score;
this.people = people;
this.urlAddress = urlAddress;
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getMusicName() {
return musicName;
}
public void setMusicName(String musicName) {
this.musicName = musicName;
}
public String getSinger() {
return singer;
}
public void setSinger(String singer) {
this.singer = singer;
}
public String getTime() {
return time;
}
public void setTime(String time) {
this.time = time;
}
public String getType() {
return type;
}
public void setType(String type) {
this.type = type;
}
public String getMedium() {
return medium;
}
public void setMedium(String medium) {
this.medium = medium;
}
public String getSect() {
return sect;
}
public void setSect(String sect) {
this.sect = sect;
}
public String getBarCode() {
return barCode;
}
public void setBarCode(String barCode) {
this.barCode = barCode;
}
public float getScore() {
return score;
}
public void setScore(float score) {
this.score = score;
}
public int getPeople() {
return people;
}
public void setPeople(int people) {
this.people = people;
}
public String getUrlAddress() {
return urlAddress;
}
public void setUrlAddress(String urlAddress) {
this.urlAddress = urlAddress;
}
@Override
public String toString() {
return "Information{" +
"id=" + id +
", musicName='" + musicName + '\'' +
", singer='" + singer + '\'' +
", time='" + time + '\'' +
", type='" + type + '\'' +
", medium='" + medium + '\'' +
", sect='" + sect + '\'' +
", barCode='" + barCode + '\'' +
", score=" + score +
", people=" + people +
", urlAddress='" + urlAddress + '\'' +
'}';
}
public static class Info {
private String singer;
private String time;
private String type;
private double medium;
}
}
dao---InformationDAO
package dao;
import util.DBConnection;
import vo.Information;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
public class InformationDAO {
// Query by music name
public static Information queryByName(String musicName) {
Connection con = null;
PreparedStatement pst = null;
ResultSet rs = null;
Information information = null;
try {
con = DBConnection.getConnection();
String sql = "SELECT * FROM music_information WHERE musicName = ?";
pst = con.prepareStatement(sql);
pst.setString(1, musicName);
rs = pst.executeQuery();
if (rs.next()) {
information = new Information();
information.setId(rs.getInt("id"));
information.setMusicName(rs.getString("musicName"));
information.setSinger(rs.getString("singer"));
information.setTime(rs.getString("time"));
information.setType(rs.getString("type"));
information.setMedium(rs.getString("medium"));
information.setSect(rs.getString("sect"));
information.setBarCode(rs.getString("barcode"));
information.setScore(rs.getFloat("score"));
information.setPeople(rs.getInt("people"));
information.setUrlAddress(rs.getString("URLaddress"));
}
} catch (SQLException e) {
throw new RuntimeException(e);
} finally {
DBConnection.close(con, pst);
}
return information;
}
public static List<Information> queryBySinger(String singer) {
List<Information> infoList = new ArrayList<>();
Connection con = null;
PreparedStatement pst = null;
ResultSet rs = null;
try {
con = DBConnection.getConnection();
String sql = "SELECT * FROM music_information WHERE singer = ?";
pst = con.prepareStatement(sql);
pst.setString(1, singer);
rs = pst.executeQuery();
while (rs.next()) {
Information info = new Information();
info.setId(rs.getInt("id"));
info.setMusicName(rs.getString("musicName"));
info.setSinger(rs.getString("singer"));
info.setTime(rs.getString("time"));
info.setType(rs.getString("type"));
info.setMedium(rs.getString("medium"));
info.setSect(rs.getString("sect"));
info.setBarCode(rs.getString("barcode"));
info.setScore(rs.getFloat("score"));
info.setPeople(rs.getInt("people"));
info.setUrlAddress(rs.getString("URLaddress"));
infoList.add(info);
}
} catch (SQLException e) {
e.printStackTrace();
} finally {
DBConnection.close(con, pst);
}
return infoList;
}
public static int getTotalPeople() {
String query = "SELECT SUM(people) AS totalPeople FROM music_information";
try (Connection conn = DBConnection.getConnection();
PreparedStatement pst = conn.prepareStatement(query);
ResultSet rs = pst.executeQuery()) {
if (rs.next()) {
return rs.getInt("totalPeople");
}
} catch (SQLException e) {
e.printStackTrace();
}
return 0;
}
public static float getAverageScore(String singer) {
String query = "SELECT AVG(score) AS averageScore FROM music_information WHERE singer = ? AND sect = '民谣'";
float averageScore = -1; // Default value, meaning no data was found
Connection con = null;
PreparedStatement pst = null;
ResultSet rs = null;
try {
con = DBConnection.getConnection();
pst = con.prepareStatement(query);
pst.setString(1, singer);
rs = pst.executeQuery();
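if (rs.next()) {
float avg = rs.getFloat("averageScore");
// AVG returns NULL when no row matches; keep the -1 default in that case
if (!rs.wasNull()) {
averageScore = avg;
}
}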
} catch (SQLException e) {
e.printStackTrace();
} finally {
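// Close resources (each may still be null if an earlier step failed)
try {
if (rs != null) rs.close();
if (pst != null) pst.close();
if (con != null) con.close();
} catch (SQLException e) {
throw new RuntimeException(e);
}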
}
return averageScore;
}
// query: search by any combination of fields
public static ArrayList<Information> query(Information information) {
Connection con = null;
PreparedStatement pst = null;
ResultSet rs = null;
ArrayList<Information> informationArrayList = new ArrayList<>();
try {
con = DBConnection.getConnection();
StringBuilder sql = new StringBuilder("SELECT * FROM music_information WHERE 1 = 1");
if (information.getId() != 0) {
sql.append(" AND id = ?");
}
if (information.getMusicName() != null) {
sql.append(" AND musicName = ?");
}
if (information.getSinger() != null) {
sql.append(" AND singer = ?");
}
if (information.getTime() != null) {
sql.append(" AND time = ?");
}
if (information.getType() != null) {
sql.append(" AND type = ?");
}
if (information.getMedium() != null) {
sql.append(" AND medium = ?");
}
if (information.getSect() != null) {
sql.append(" AND sect = ?");
}
if (information.getBarCode() != null) {
sql.append(" AND barCode = ?");
}
if (information.getScore() != 0) {
sql.append(" AND score = ?");
}
if (information.getPeople() != 0) {
sql.append(" AND people = ?");
}
if (information.getUrlAddress() != null) {
sql.append(" AND URLaddress = ?");
}
pst = con.prepareStatement(sql.toString());
int paramIndex = 1;
if (information.getId() != 0) {
pst.setInt(paramIndex++, information.getId());
}
if (information.getMusicName() != null) {
pst.setString(paramIndex++, information.getMusicName());
}
if (information.getSinger() != null) {
pst.setString(paramIndex++, information.getSinger());
}
if (information.getTime() != null) {
pst.setString(paramIndex++, information.getTime());
}
if (information.getType() != null) {
pst.setString(paramIndex++, information.getType());
}
if (information.getMedium() != null) {
pst.setString(paramIndex++, information.getMedium());
}
if (information.getSect() != null) {
pst.setString(paramIndex++, information.getSect());
}
if (information.getBarCode() != null) {
pst.setString(paramIndex++, information.getBarCode());
}
if (information.getScore() != 0) {
pst.setFloat(paramIndex++, information.getScore());
}
if (information.getPeople() != 0) {
pst.setInt(paramIndex++, information.getPeople());
}
if (information.getUrlAddress() != null) {
pst.setString(paramIndex++, information.getUrlAddress());
}
rs = pst.executeQuery();
while (rs.next()) {
Information i = new Information();
i.setId(rs.getInt("id"));
i.setMusicName(rs.getString("musicName"));
i.setSinger(rs.getString("singer"));
i.setTime(rs.getString("time"));
i.setType(rs.getString("type"));
i.setMedium(rs.getString("medium"));
i.setSect(rs.getString("sect"));
i.setBarCode(rs.getString("barcode"));
i.setScore(rs.getFloat("score"));
i.setPeople(rs.getInt("people"));
i.setUrlAddress(rs.getString("URLaddress"));
informationArrayList.add(i);
}
} catch (SQLException e) {
throw new RuntimeException(e);
} finally {
DBConnection.close(con, pst);
}
return informationArrayList;
}
//insert
public static boolean insert(Information information) {
Connection con = null;
PreparedStatement pst = null;
boolean success = false;
try {
con = DBConnection.getConnection();
String sql = "INSERT INTO music_information (id,musicName,singer,time,type,medium,sect,barcode,score,people,URLaddress)" +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?,?,?)";
pst = con.prepareStatement(sql);
pst.setInt(1,information.getId());
pst.setString(2, information.getMusicName());
pst.setString(3, information.getSinger());
pst.setString(4, information.getTime());
pst.setString(5, information.getType());
pst.setString(6, information.getMedium());
pst.setString(7, information.getSect());
pst.setString(8, information.getBarCode());
pst.setFloat(9, information.getScore());
pst.setInt(10, information.getPeople());
pst.setString(11, information.getUrlAddress());
int rowsAffected = pst.executeUpdate();
if (rowsAffected > 0) {
success = true;
}
} catch (SQLException e) {
throw new RuntimeException(e);
} finally {
DBConnection.close(con, pst);
}
return success;
}
// update: update music information
public static boolean update(Information information) {
Connection con = null;
PreparedStatement pst = null;
boolean success = false;
try {
con = DBConnection.getConnection();
String sql = "UPDATE music_information SET singer=?, time=?, type=?, medium=?, sect=?, barcode=?, score=?, people=?, URLaddress=? WHERE musicName=?";
pst = con.prepareStatement(sql);
pst.setString(1, information.getSinger());
pst.setString(2, information.getTime());
pst.setString(3, information.getType());
pst.setString(4, information.getMedium());
pst.setString(5, information.getSect());
pst.setString(6, information.getBarCode());
pst.setFloat(7, information.getScore());
pst.setInt(8, information.getPeople());
pst.setString(9, information.getUrlAddress());
pst.setString(10, information.getMusicName()); // musicName is bound last, for the WHERE clause
System.out.println("Executing SQL: " + pst.toString()); // Debug log
int rowsAffected = pst.executeUpdate();
if (rowsAffected > 0) {
success = true;
} else {
System.out.println("Update failed: no matching record was updated"); // Log the failure
}
} catch (SQLException e) {
throw new RuntimeException(e);
} finally {
DBConnection.close(con, pst);
}
return success;
}
// delete: delete music information
public static boolean delete(Information information) {
Connection con = null;
PreparedStatement pst = null;
boolean success = false;
try {
con = DBConnection.getConnection();
String sql = "DELETE FROM music_information WHERE musicName = ?";
pst = con.prepareStatement(sql);
pst.setString(1, information.getMusicName());
int rowsAffected = pst.executeUpdate();
if (rowsAffected > 0) {
success = true;
}
} catch (SQLException e) {
throw new RuntimeException(e);
} finally {
DBConnection.close(con, pst);
}
return success;
}
}
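The query method above builds its WHERE clause in one pass and binds the parameters in a second pass, so both passes have to list the conditions in exactly the same order. One way to keep them in sync is to collect each SQL fragment together with its value. The sketch below shows that idea with plain strings only; the class name WhereBuilderDemo and its and method are hypothetical and are not part of the original DAO.

```java
import java.util.ArrayList;
import java.util.List;

final class WhereBuilderDemo {
    // SQL text and the values to bind, kept in the same order.
    final StringBuilder sql = new StringBuilder("SELECT * FROM music_information WHERE 1 = 1");
    final List<Object> params = new ArrayList<>();

    // Add "AND column = ?" only when a value was supplied.
    // The column name must come from the code itself, never from user input.
    WhereBuilderDemo and(String column, Object value) {
        if (value != null) {
            sql.append(" AND ").append(column).append(" = ?");
            params.add(value);
        }
        return this;
    }

    public static void main(String[] args) {
        WhereBuilderDemo builder = new WhereBuilderDemo()
                .and("singer", "Artist Name 1") // sample value from the test data
                .and("time", null)              // skipped because no value was given
                .and("people", 1000);
        System.out.println(builder.sql);
        System.out.println(builder.params);
    }
}
```

With a real PreparedStatement, the collected values could then be bound in a loop with pst.setObject(i + 1, params.get(i)).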
util---DBConnection
package util;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
public class DBConnection {
private static String driverName;
private static String url;
private static String user;
private static String password;
// Load the driver; this only needs to run once
static{
driverName = "com.mysql.cj.jdbc.Driver";
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
}
}
// Get a connection
public static Connection getConnection(){
url = "jdbc:mysql://localhost:3306/music?useUnicode=true&characterEncoding=utf-8";
user = "root";
password = "123456";
Connection con = null;
try {
con = DriverManager.getConnection(url,user,password);
} catch (SQLException e) {
throw new RuntimeException(e);
}
return con;
}
// Close resources: the statement first, then the connection
public static void close(Connection con, PreparedStatement pst){
if(pst!=null) {
try {
pst.close();
} catch (SQLException e) {
throw new RuntimeException(e);
}
}
if(con!=null) {
try {
con.close();
} catch (SQLException e) {
throw new RuntimeException(e);
}
}
}
}
service---MusicService
package service;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import dao.InformationDAO;
import vo.Information;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class MusicService {
private static List<String> musicName = new ArrayList<>();
private static List<String> musicURLaddress = new ArrayList<>();
private static List<String> musicScore = new ArrayList<>();
private static List<String> musicPeople = new ArrayList<>();
private static List<String> musicSinger = new ArrayList<>();
private static List<String> musicTime = new ArrayList<>();
private static List<String> musicType = new ArrayList<>();
private static List<String> musicMedium = new ArrayList<>();
private static List<String> musicSect = new ArrayList<>();
private static List<String> musicBarcode = new ArrayList<>();
public static void getData() throws IOException, InterruptedException {
String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36";
for (int i = 0; i < 5; i++) { // Scrape 5 pages with 20 records each (100 in total)
String pageUrl = "https://music.douban.com/tag/民谣?start=" + (i * 20) + "&type=T";
System.out.println("Scraping page " + (i + 1) + ", URL: " + pageUrl);
getMusicInfo(pageUrl, userAgent);
Thread.sleep(1000); // Wait 1 second between requests to avoid triggering anti-scraping measures
}
// Insert into the database
insertMusicInfoToDB();
}
public static void getMusicInfo(String url, String userAgent) throws IOException {
Document document = Jsoup.connect(url).userAgent(userAgent).get();
// Select the <tr> rows with class "item"
Elements musicElements = document.select(".item");
for (Element music : musicElements) {
// Album name
String name = music.select(".pl2 a").text().replace("\n", "").replace(" ", " ").trim();
musicName.add(name);
// Album URL
String URLaddress = music.select(".pl2 a").attr("href");
musicURLaddress.add(URLaddress);
// Album score
String score;
try {
score = music.select(".rating_nums").text();
} catch (Exception e) {
score = "";
}
musicScore.add(score);
// Number of ratings
String people = music.select(".pl").get(1).text().replace(" ", "").replace("人评价", "").replace("(", "").replace(")", "");
musicPeople.add(people);
String[] musicInfos = music.select(".pl").get(0).text().trim().split(" / ");
if (musicInfos.length >= 4) {
musicSinger.add(musicInfos[0]);
musicTime.add(musicInfos[1]);
musicType.add(musicInfos[2]);
musicMedium.add(musicInfos[3]);
musicSect.add(musicInfos.length > 4 ? musicInfos[4] : "");
musicBarcode.add(musicInfos.length > 5 ? musicInfos[5] : "");
} else {
// Handle incomplete information
musicSinger.add(musicInfos[0]);
musicTime.add(musicInfos.length > 1 ? musicInfos[1] : "");
musicType.add(musicInfos.length > 2 ? musicInfos[2] : "");
musicMedium.add(musicInfos.length > 3 ? musicInfos[3] : "");
musicSect.add("");
musicBarcode.add("");
}
}
}
public static Map<String, Object> insertMusicInfoToDB() {
Map<String, Object> resultMap = new HashMap<>();
for (int i = 0; i < musicName.size(); i++) {
Information info = new Information();
info.setMusicName(musicName.get(i));
info.setSinger(musicSinger.get(i));
info.setTime(musicTime.get(i));
info.setType(musicType.get(i));
info.setMedium(musicMedium.get(i));
info.setSect(musicSect.get(i));
info.setBarCode(musicBarcode.get(i));
try {
info.setScore(Float.parseFloat(musicScore.get(i)));
} catch (NumberFormatException e) {
info.setScore(0.0f);
}
try {
info.setPeople(Integer.parseInt(musicPeople.get(i)));
} catch (NumberFormatException e) {
info.setPeople(0);
}
info.setUrlAddress(musicURLaddress.get(i));
boolean success = InformationDAO.insert(info);
resultMap.put(musicName.get(i), success); // Record the result in the Map
if (success) {
System.out.println("Inserted: " + info.getMusicName());
} else {
System.out.println("Insert failed: " + info.getMusicName());
}
}
return resultMap;
}
}
MySQL
create database music;
use music;
CREATE TABLE `music_information` (
`id` INT ,
`musicName` VARCHAR(255) PRIMARY KEY,
`singer` VARCHAR(255),
`time` varchar(50), # release date
`type` VARCHAR(255), # album type
`medium` VARCHAR(100),
`sect` varchar(50), # genre
`barcode` VARCHAR(50),
`score` DECIMAL(3, 1),
`people` INT,
`URLaddress` VARCHAR(500)
);
INSERT INTO `music_information` (`id`,`musicName`,`singer`,`time`,`type`,`medium`,`sect`,`barcode`,`score`,`people`,`URLaddress`)VALUES
('1','Song Title 1', 'Artist Name 1', '2023-01-01', 'Album Type 1','md1', '民谣', '123456789012', 4.5, 1000, 'https://example.com/song1'),
('3','st1', 'Artist Name 1', '2023-01-01', 'Album Type 1','md1', 'Pop', '123456789012', 4.5, 1000, 'https://example.com/song1'),
('4','st2', 'Artist Name 1', '2023-01-01', 'Album Type 1','md1', '民谣', '123456789012', 4.2, 1000, 'https://example.com/song1'),
('2','Song Title 2', 'Artist Name 2', '2022-05-15', 'Album Type 2','md2', 'Rock', '234567890123', 4.2, 500, 'https://example.com/song2');
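# Helper statements; run them one at a time when needed (dropping the table deletes all its data)
drop table music_information;
select * from music_information;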