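Java Supermarket Checkout System (Part 10: Web Crawler)

Introduction

This part implements the crawler feature, which is required to scrape at least 100 records from a page. The code below uses Douban Music as the example, specifically the tag page "豆瓣音乐标签: 民谣" (Douban Music tag: folk) on douban.com.

Implementation

Apart from the new crawler feature, the code follows the same principles as the earlier posts in this series. To keep things separate, we create a new database named music. As before, the vo package holds the data class, that is, the constructors and accessors that can be generated automatically for Java. The dao package holds the database functions, mainly the four basic operations: insert, delete, update, and query. The util package holds the database connection code that links Java to the database. The ui package holds the main function, which calls the other functions. The service package holds the crawler functions, which scrape data from the specified page.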

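The MusicService class defines several lists that hold the different pieces of data for the music records being scraped:

  • musicName: the name of the album.
  • musicURLaddress: the URL of the album.
  • musicScore: the album's rating (score).
  • musicPeople: the number of people who rated the album.
  • musicSinger: the name of the singer or artist.
  • musicTime: the album's release date.
  • musicType: the genre or type of the music.
  • musicMedium: the album's medium (for example, CD or vinyl).
  • musicSect: additional information about the album (optional).
  • musicBarcode: the barcode information (optional).

These lists collect the scraped data, which is then inserted into the database.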
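Because each field lives in its own list, index i only describes one album if every list receives exactly one value per record. The following is a minimal sketch of that invariant, not code from the project: the class ParallelListsSketch and its helpers addRecord and recordAt are hypothetical names, and only three of the ten lists are shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the original project.
class ParallelListsSketch {
    static final List<String> names = new ArrayList<>();
    static final List<String> scores = new ArrayList<>();
    static final List<String> singers = new ArrayList<>();

    // Adds one record to every list in a single call, so index i always
    // refers to the same album in all lists. Missing values become "".
    static void addRecord(String name, String score, String singer) {
        names.add(name == null ? "" : name);
        scores.add(score == null ? "" : score);
        singers.add(singer == null ? "" : singer);
    }

    // Rebuilds the record stored at index i from the parallel lists.
    static String recordAt(int i) {
        return names.get(i) + " | " + scores.get(i) + " | " + singers.get(i);
    }

    public static void main(String[] args) {
        addRecord("Album A", "8.7", "Artist A");   // made-up sample data
        addRecord("Album B", null, "Artist B");    // no score found on the page
        for (int i = 0; i < names.size(); i++) {
            System.out.println(recordAt(i));
        }
    }
}
```

Adding a placeholder such as "" for a missing value, rather than skipping the add, is what keeps the lists the same length.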

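The getData() method

getData() is the main method that starts the web scraping process:

  • User agent: this string imitates a browser request so that it appears to come from a real browser, which helps avoid being blocked by the site.

  • Loop over pages: the method loops over 5 pages (100 items, assuming 20 items per page). On each iteration it builds the URL of the current page and calls getMusicInfo() to scrape the data from that page.

  • Sleep for 1 second: Thread.sleep(1000) adds a delay so that the site is not flooded with requests, which also makes it less likely to trigger the site's anti-scraping measures.

  • Insert the data into the database: after scraping all pages, it calls insertMusicInfoToDB() to store the collected data in the database.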
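The paging arithmetic can be illustrated on its own, without any network access. This is a rough sketch under the assumption that the listing is paged through a start offset in steps of the page size, as in the URL used by getData(); the class PageUrlSketch and the helper buildPageUrls are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the original project.
class PageUrlSketch {
    // Builds one listing URL per page. The "start" parameter is the offset
    // of the first item on that page: 0, pageSize, 2 * pageSize, ...
    static List<String> buildPageUrls(String baseUrl, int pages, int pageSize) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < pages; i++) {
            urls.add(baseUrl + "?start=" + (i * pageSize) + "&type=T");
        }
        return urls;
    }

    public static void main(String[] args) {
        // 5 pages of 20 items cover offsets 0 to 80, i.e. 100 items in total.
        for (String url : buildPageUrls("https://music.douban.com/tag/民谣", 5, 20)) {
            System.out.println(url);
        }
    }
}
```

With 5 pages and a page size of 20, the last offset is 80, so the loop requests items 0 through 99.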

Corresponding HTML:

Clicking a link opens the detail page for each album:

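The getMusicInfo() method

This method does the actual scraping of a given page:

  • Document retrieval: the method uses Jsoup to connect to the URL and retrieve the HTML document.

  • Element selection: it then selects all elements with the class .item, each of which represents a single music record.

  • Data extraction: for each music record it extracts the name, URL, score, and number of ratings, plus further details such as the singer, release date, type, and medium, and adds them to the corresponding lists.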
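The text-handling part of the extraction step can be tried without Jsoup. The sketch below assumes, as the code in this post does, that the detail line is a single string whose fields are separated by " / " and that the rating count is wrapped like "(12345人评价)". The class InfoLineSketch, the helpers splitInfo and cleanPeople, and the sample strings are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of the original project.
class InfoLineSketch {
    // Splits "a / b / c" into exactly fieldCount fields. Fields that are
    // missing from the input are filled with "" so callers never index
    // past the end of the array.
    static String[] splitInfo(String raw, int fieldCount) {
        String[] parts = raw.trim().split(" / ");
        String[] fields = new String[fieldCount];
        Arrays.fill(fields, "");
        for (int i = 0; i < Math.min(parts.length, fieldCount); i++) {
            fields[i] = parts[i].trim();
        }
        return fields;
    }

    // Strips spaces, the "人评价" suffix, and the parentheses around the
    // rating count, leaving only what was between them.
    static String cleanPeople(String raw) {
        return raw.replace(" ", "")
                  .replace("人评价", "")
                  .replace("(", "")
                  .replace(")", "");
    }

    public static void main(String[] args) {
        // Made-up sample line with only four of the six possible fields.
        String[] fields = splitInfo("Artist A / 2010-01-01 / Album / CD", 6);
        System.out.println(Arrays.toString(fields));
        System.out.println(cleanPeople("(12345人评价)"));
    }
}
```

Padding to a fixed number of fields is one way to avoid separate branches for complete and incomplete lines.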

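The insertMusicInfoToDB() method

This method inserts the collected data into the database:

  • Looping over the data: the method iterates over all the collected data (from the lists) and creates an Information object for each music record.

  • Parsing the data: it tries to parse the score and the number of ratings from strings into the appropriate types (float and int). If parsing fails, it falls back to default values (0.0f for score and 0 for people).

  • Inserting into the database: it then calls InformationDAO.insert(info) to insert the record. The result of each insert is stored in a Map that maps the music name to a boolean indicating whether the insert succeeded.

  • Logging the result: after each insert, it prints whether the insert succeeded.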
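The parse-with-default step is easy to isolate. The following is a small sketch of that idea only; the class ParseDefaultsSketch and the helpers parseScore and parsePeople are hypothetical names, and the null check is an extra guard that the original method does not have.

```java
// Hypothetical sketch, not part of the original project.
class ParseDefaultsSketch {
    // Returns the parsed score, or 0.0f when the text is missing or not a number.
    static float parseScore(String text) {
        if (text == null) {
            return 0.0f;
        }
        try {
            return Float.parseFloat(text.trim());
        } catch (NumberFormatException e) {
            return 0.0f;
        }
    }

    // Returns the parsed rating count, or 0 when the text is missing or not a number.
    static int parsePeople(String text) {
        if (text == null) {
            return 0;
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseScore("8.7"));
        System.out.println(parseScore(""));        // empty score falls back to the default
        System.out.println(parsePeople("12345"));
        System.out.println(parsePeople("少于10"));  // non-numeric text falls back to the default
    }
}
```

One consequence of this design is that a stored score of 0 can mean either "no rating on the page" or "parsing failed", so the two cases cannot be told apart later.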

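Summary

  • Web scraping: the getData() and getMusicInfo() methods are responsible for scraping data from the target pages.
  • Data collection: the data is collected into the various lists.
  • Database insertion: the insertMusicInfoToDB() method inserts the collected data into the database, making sure each record is parsed and stored correctly.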

Results

Complete Code

ui---Driver

package ui;

import service.MusicService;

import java.io.IOException;


public class Driver {
    public static void main(String[] args) throws IOException, InterruptedException {
        MusicService.getData();
    }
}

vo---Information

package vo;

public class Information {
    private int id;
    private String musicName;
    private String singer;
    private String time;
    private String type;
    private String medium;
    private String sect;
    private String barCode;
    private float score;
    private int people;
    private String urlAddress;

    public Information() {
    }

    public Information(int id, String musicName, String singer, String time, String type, String medium, String sect, String barCode, float score, int people, String urlAddress) {
        this.id = id;
        this.musicName = musicName;
        this.singer = singer;
        this.time = time;
        this.type = type;
        this.medium = medium;
        this.sect = sect;
        this.barCode = barCode;
        this.score = score;
        this.people = people;
        this.urlAddress = urlAddress;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getMusicName() {
        return musicName;
    }

    public void setMusicName(String musicName) {
        this.musicName = musicName;
    }

    public String getSinger() {
        return singer;
    }

    public void setSinger(String singer) {
        this.singer = singer;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getMedium() {
        return medium;
    }

    public void setMedium(String medium) {
        this.medium = medium;
    }

    public String getSect() {
        return sect;
    }

    public void setSect(String sect) {
        this.sect = sect;
    }

    public String getBarCode() {
        return barCode;
    }

    public void setBarCode(String barCode) {
        this.barCode = barCode;
    }

    public float getScore() {
        return score;
    }

    public void setScore(float score) {
        this.score = score;
    }

    public int getPeople() {
        return people;
    }

    public void setPeople(int people) {
        this.people = people;
    }

    public String getUrlAddress() {
        return urlAddress;
    }

    public void setUrlAddress(String urlAddress) {
        this.urlAddress = urlAddress;
    }

    @Override
    public String toString() {
        return "Information{" +
                "id=" + id +
                ", musicName='" + musicName + '\'' +
                ", singer='" + singer + '\'' +
                ", time='" + time + '\'' +
                ", type='" + type + '\'' +
                ", medium='" + medium + '\'' +
                ", sect='" + sect + '\'' +
                ", barCode='" + barCode + '\'' +
                ", score=" + score +
                ", people=" + people +
                ", urlAddress='" + urlAddress + '\'' +
                '}';
    }



    public static class Info {
        private String singer;
        private String time;
        private String type;
        private double medium;
    }
}

dao---InformationDAO

package dao;

import util.DBConnection;
import vo.Information;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class InformationDAO {

    // Query by song name
    public static Information queryByName(String musicName) {
        Connection con = null;
        PreparedStatement pst = null;
        ResultSet rs = null;
        Information information = null;
        try {
            con = DBConnection.getConnection();
            String sql = "SELECT * FROM music_information WHERE musicName = ?";
            pst = con.prepareStatement(sql);
            pst.setString(1, musicName);
            rs = pst.executeQuery();
            if (rs.next()) {
                information = new Information();
                information.setId(rs.getInt("id"));
                information.setMusicName(rs.getString("musicName"));
                information.setSinger(rs.getString("singer"));
                information.setTime(rs.getString("time"));
                information.setType(rs.getString("type"));
                information.setMedium(rs.getString("medium"));
                information.setSect(rs.getString("sect"));
                information.setBarCode(rs.getString("barcode"));
                information.setScore(rs.getFloat("score"));
                information.setPeople(rs.getInt("people"));
                information.setUrlAddress(rs.getString("URLaddress"));
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        } finally {
            DBConnection.close(con, pst);
        }
        return information;
    }

    public static List<Information> queryBySinger(String singer) {
        List<Information> infoList = new ArrayList<>();
        Connection con = null;
        PreparedStatement pst = null;
        ResultSet rs = null;

        try {
            con = DBConnection.getConnection();
            String sql = "SELECT * FROM music_information WHERE singer = ?";
            pst = con.prepareStatement(sql);
            pst.setString(1, singer);
            rs = pst.executeQuery();

            while (rs.next()) {
                Information info = new Information();
                info.setId(rs.getInt("id"));
                info.setMusicName(rs.getString("musicName"));
                info.setSinger(rs.getString("singer"));
                info.setTime(rs.getString("time"));
                info.setType(rs.getString("type"));
                info.setMedium(rs.getString("medium"));
                info.setSect(rs.getString("sect"));
                info.setBarCode(rs.getString("barcode"));
                info.setScore(rs.getFloat("score"));
                info.setPeople(rs.getInt("people"));
                info.setUrlAddress(rs.getString("URLaddress"));
                infoList.add(info);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            DBConnection.close(con, pst);
        }
        return infoList;
    }

    public static int getTotalPeople() {
        String query = "SELECT SUM(people) AS totalPeople FROM music_information";
        try (Connection conn = DBConnection.getConnection();
             PreparedStatement pst = conn.prepareStatement(query);
             ResultSet rs = pst.executeQuery()) {
            if (rs.next()) {
                return rs.getInt("totalPeople");
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return 0;
    }

    public static float getAverageScore(String singer) {
        String query = "SELECT AVG(score) AS averageScore FROM music_information WHERE singer = ? AND sect = '民谣'";
        float averageScore = -1; // default value, meaning no data was found
        Connection con = null;
        PreparedStatement pst = null;
        ResultSet rs = null;

        try {
            con = DBConnection.getConnection();
            pst = con.prepareStatement(query);
            pst.setString(1, singer);
            rs = pst.executeQuery();

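            if (rs.next()) {
                float avg = rs.getFloat("averageScore");
                // AVG() yields NULL when no rows match; keep the -1 default in that case
                if (!rs.wasNull()) {
                    averageScore = avg;
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // Close resources, guarding against nulls if an earlier step failed
            try {
                if (rs != null) rs.close();
                if (pst != null) pst.close();
                if (con != null) con.close();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }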
        return averageScore;
    }

    // query: search by any combination of conditions
    public static ArrayList<Information> query(Information information) {
        Connection con = null;
        PreparedStatement pst = null;
        ResultSet rs = null;
        ArrayList<Information> informationArrayList = new ArrayList<>();
        try {
            con = DBConnection.getConnection();
            StringBuilder sql = new StringBuilder("SELECT * FROM music_information WHERE 1 = 1");
            if (information.getId() != 0) {
                sql.append(" AND id = ?");
            }
            if (information.getMusicName() != null) {
                sql.append(" AND musicName = ?");
            }
            if (information.getSinger() != null) {
                sql.append(" AND singer = ?");
            }
            if (information.getTime() != null) {
                sql.append(" AND time = ?");
            }
            if (information.getType() != null) {
                sql.append(" AND type = ?");
            }
            if (information.getMedium() != null) {
                sql.append(" AND medium = ?");
            }
            if (information.getSect() != null) {
                sql.append(" AND sect = ?");
            }
            if (information.getBarCode() != null) {
                sql.append(" AND barCode = ?");
            }
            if (information.getScore() != 0) {
                sql.append(" AND score = ?");
            }
            if (information.getPeople() != 0) {
                sql.append(" AND people = ?");
            }
            if (information.getUrlAddress() != null) {
                sql.append(" AND URLaddress = ?");
            }
            pst = con.prepareStatement(sql.toString());
            int paramIndex = 1;
            if (information.getId() != 0) {
                pst.setInt(paramIndex++, information.getId());
            }
            if (information.getMusicName() != null) {
                pst.setString(paramIndex++, information.getMusicName());
            }
            if (information.getSinger() != null) {
                pst.setString(paramIndex++, information.getSinger());
            }
            if (information.getTime() != null) {
                pst.setString(paramIndex++, information.getTime());
            }
            if (information.getType() != null) {
                pst.setString(paramIndex++, information.getType());
            }
            if (information.getMedium() != null) {
                pst.setString(paramIndex++, information.getMedium());
            }
            if (information.getSect() != null) {
                pst.setString(paramIndex++, information.getSect());
            }
            if (information.getBarCode() != null) {
                pst.setString(paramIndex++, information.getBarCode());
            }
            if (information.getScore() != 0) {
                pst.setFloat(paramIndex++, information.getScore());
            }
            if (information.getPeople() != 0) {
                pst.setInt(paramIndex++, information.getPeople());
            }
            if (information.getUrlAddress() != null) {
                pst.setString(paramIndex++, information.getUrlAddress());
            }
            rs = pst.executeQuery();
            while (rs.next()) {
                Information i = new Information();
                i.setId(rs.getInt("id"));
                i.setMusicName(rs.getString("musicName"));
                i.setSinger(rs.getString("singer"));
                i.setTime(rs.getString("time"));
                i.setType(rs.getString("type"));
                i.setMedium(rs.getString("medium"));
                i.setSect(rs.getString("sect"));
                i.setBarCode(rs.getString("barcode"));
                i.setScore(rs.getFloat("score"));
                i.setPeople(rs.getInt("people"));
                i.setUrlAddress(rs.getString("URLaddress"));
                informationArrayList.add(i);
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        } finally {
            DBConnection.close(con, pst);
        }
        return informationArrayList;
    }

    //insert
    public static boolean insert(Information information) {
        Connection con = null;
        PreparedStatement pst = null;
        boolean success = false;
        try {
            con = DBConnection.getConnection();
            String sql = "INSERT INTO music_information (id,musicName,singer,time,type,medium,sect,barcode,score,people,URLaddress)" +
                    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?,?,?)";
            pst = con.prepareStatement(sql);
            pst.setInt(1,information.getId());
            pst.setString(2, information.getMusicName());
            pst.setString(3, information.getSinger());
            pst.setString(4, information.getTime());
            pst.setString(5, information.getType());
            pst.setString(6, information.getMedium());
            pst.setString(7, information.getSect());
            pst.setString(8, information.getBarCode());
            pst.setFloat(9, information.getScore());
            pst.setInt(10, information.getPeople());
            pst.setString(11, information.getUrlAddress());
            int rowsAffected = pst.executeUpdate();
            if (rowsAffected > 0) {
                success = true;
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        } finally {
            DBConnection.close(con, pst);
        }
        return success;
    }

    // update: update a music record
    public static boolean update(Information information) {
        Connection con = null;
        PreparedStatement pst = null;
        boolean success = false;
        try {
            con = DBConnection.getConnection();
            String sql = "UPDATE music_information SET singer=?, time=?, type=?, medium=?, sect=?, barcode=?, score=?, people=?, URLaddress=? WHERE musicName=?";
            pst = con.prepareStatement(sql);
            pst.setString(1, information.getSinger());
            pst.setString(2, information.getTime());
            pst.setString(3, information.getType());
            pst.setString(4, information.getMedium());
            pst.setString(5, information.getSect());
            pst.setString(6, information.getBarCode());
            pst.setFloat(7, information.getScore());
            pst.setInt(8, information.getPeople());
            pst.setString(9, information.getUrlAddress());
            pst.setString(10, information.getMusicName()); // musicName is the last parameter
            System.out.println("Executing SQL: " + pst.toString()); // log for debugging
            int rowsAffected = pst.executeUpdate();
            if (rowsAffected > 0) {
                success = true;
            } else {
                System.out.println("Update failed: no matching record was updated"); // log
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        } finally {
            DBConnection.close(con, pst);
        }
        return success;
    }


    // delete: delete a music record
    public static boolean delete(Information information) {
        Connection con = null;
        PreparedStatement pst = null;
        boolean success = false;
        try {
            con = DBConnection.getConnection();
            String sql = "DELETE FROM music_information WHERE musicName = ?";
            pst = con.prepareStatement(sql);
            pst.setString(1, information.getMusicName());
            int rowsAffected = pst.executeUpdate();
            if (rowsAffected > 0) {
                success = true;
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        } finally {
            DBConnection.close(con, pst);
        }
        return success;
    }

}

util---DBConnection

package util;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DBConnection {
	private static String driverName;
	private static String url;
	private static String user;
	private static String password;

	// Load the driver; this only needs to run once
	static{
		driverName = "com.mysql.cj.jdbc.Driver";
		try {
			Class.forName(driverName);
		} catch (ClassNotFoundException e) {
			throw new RuntimeException(e);
		}
	}

	// Get a connection
	public static Connection getConnection(){
		url = "jdbc:mysql://localhost:3306/music?useUnicode=true&characterEncoding=utf-8";
		user = "root";
		password = "123456";
		Connection con = null;
		try {
			con = DriverManager.getConnection(url,user,password);
		} catch (SQLException e) {
			throw new RuntimeException(e);
		}
		return con;
	}

	// Close resources: the statement first, then the connection
	public static void close(Connection con, PreparedStatement pst){
		try {
			if(pst!=null) {
				pst.close();
			}
		} catch (SQLException e) {
			throw new RuntimeException(e);
		} finally {
			if(con!=null) {
				try {
					con.close();
				} catch (SQLException e) {
					throw new RuntimeException(e);
				}
			}
		}
	}
}

service---MusicService

package service;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import dao.InformationDAO;
import vo.Information;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MusicService {

    private static List<String> musicName = new ArrayList<>();
    private static List<String> musicURLaddress = new ArrayList<>();
    private static List<String> musicScore = new ArrayList<>();
    private static List<String> musicPeople = new ArrayList<>();
    private static List<String> musicSinger = new ArrayList<>();
    private static List<String> musicTime = new ArrayList<>();
    private static List<String> musicType = new ArrayList<>();
    private static List<String> musicMedium = new ArrayList<>();
    private static List<String> musicSect = new ArrayList<>();
    private static List<String> musicBarcode = new ArrayList<>();

    public static void getData() throws IOException, InterruptedException {
        String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36";
        for (int i = 0; i < 5; i++) {  // crawl 5 pages of 20 records each (100 records in total)
            String pageUrl = "https://music.douban.com/tag/民谣?start=" + (i * 20) + "&type=T";
            System.out.println("Crawling page " + (i + 1) + ", URL: " + pageUrl);
            getMusicInfo(pageUrl, userAgent);
            Thread.sleep(1000);  // wait 1 second between requests (to avoid triggering anti-scraping)
        }
        // Insert into the database
        insertMusicInfoToDB();
    }

    public static void getMusicInfo(String url, String userAgent) throws IOException {
        Document document = Jsoup.connect(url).userAgent(userAgent).get();
        // Get each <tr> with class "item" (one per album)
        Elements musicElements = document.select(".item");

        for (Element music : musicElements) {
            // Album name
            String name = music.select(".pl2 a").text().replace("\n", "").replace("                ", " ").trim();
            musicName.add(name);
            // Album link
            String URLaddress = music.select(".pl2 a").attr("href");
            musicURLaddress.add(URLaddress);
            // Rating score
            String score;
            try {
                score = music.select(".rating_nums").text();
            } catch (Exception e) {
                score = "";
            }
            musicScore.add(score);
            // Number of people who rated
            String people = music.select(".pl").get(1).text().replace(" ", "").replace("人评价", "").replace("(", "").replace(")", "");
            musicPeople.add(people);

            String[] musicInfos = music.select(".pl").get(0).text().trim().split(" / ");
            if (musicInfos.length >= 4) {
                musicSinger.add(musicInfos[0]);
                musicTime.add(musicInfos[1]);
                musicType.add(musicInfos[2]);
                musicMedium.add(musicInfos[3]);
                musicSect.add(musicInfos.length > 4 ? musicInfos[4] : "");
                musicBarcode.add(musicInfos.length > 5 ? musicInfos[5] : "");
            } else {
                // Handle records with incomplete information
                musicSinger.add(musicInfos[0]);
                musicTime.add(musicInfos.length > 1 ? musicInfos[1] : "");
                musicType.add(musicInfos.length > 2 ? musicInfos[2] : "");
                musicMedium.add(musicInfos.length > 3 ? musicInfos[3] : "");
                musicSect.add("");
                musicBarcode.add("");
            }
        }
    }

    public static Map<String, Object> insertMusicInfoToDB() {
        Map<String, Object> resultMap = new HashMap<>();
        for (int i = 0; i < musicName.size(); i++) {
            Information info = new Information();
            info.setMusicName(musicName.get(i));
            info.setSinger(musicSinger.get(i));
            info.setTime(musicTime.get(i));
            info.setType(musicType.get(i));
            info.setMedium(musicMedium.get(i));
            info.setSect(musicSect.get(i));
            info.setBarCode(musicBarcode.get(i));
            try {
                info.setScore(Float.parseFloat(musicScore.get(i)));
            } catch (NumberFormatException e) {
                info.setScore(0.0f);
            }
            try {
                info.setPeople(Integer.parseInt(musicPeople.get(i)));
            } catch (NumberFormatException e) {
                info.setPeople(0);
            }
            info.setUrlAddress(musicURLaddress.get(i));
            boolean success = InformationDAO.insert(info);
            resultMap.put(musicName.get(i), success); // record the result in the Map
            if (success) {
                System.out.println("Inserted: " + info.getMusicName());
            } else {
                System.out.println("Insert failed: " + info.getMusicName());
            }
        }
        return resultMap;
    }
}

mysql

create database music;
use music;

CREATE TABLE `music_information` (  
    `id` INT ,  
    `musicName` VARCHAR(255) PRIMARY KEY,  
    `singer` VARCHAR(255),  
    `time` varchar(50),    # release date
    `type` VARCHAR(255),  # album type
    `medium` VARCHAR(100),
    `sect` varchar(50),  # genre
    `barcode` VARCHAR(50),  
    `score` DECIMAL(3, 1),  
    `people` INT,  
    `URLaddress` VARCHAR(500)  
);

INSERT INTO `music_information` (`id`,`musicName`,`singer`,`time`,`type`,`medium`,`sect`,`barcode`,`score`,`people`,`URLaddress`)VALUES  
('1','Song Title 1', 'Artist Name 1', '2023-01-01', 'Album Type 1','md1', '民谣', '123456789012', 4.5, 1000, 'https://example.com/song1'), 
('3','st1', 'Artist Name 1', '2023-01-01', 'Album Type 1','md1', 'Pop', '123456789012', 4.5, 1000, 'https://example.com/song1'),
('4','st2', 'Artist Name 1', '2023-01-01', 'Album Type 1','md1', '民谣', '123456789012', 4.2, 1000, 'https://example.com/song1'),
('2','Song Title 2', 'Artist Name 2', '2022-05-15', 'Album Type 2','md2', 'Rock', '234567890123', 4.2, 500, 'https://example.com/song2');

-- Inspect the table contents
select * from music_information;

-- Drop the table only when you want to reset it
drop table music_information;