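Java Web Crawler

jsoup is a Java library for working with HTML. It offers a convenient API for extracting and manipulating data, built on DOM traversal, CSS selectors, and jQuery-like methods.

I. Basic Usage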

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

II. Example: Crawling Douban Movies

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DouBan {
    public static void main(String[] args) {
        String url = "https://movie.douban.com/top250";
        crawlMovies(url);
    }

    /**
     * Crawls one page of the list and prints each movie's rank, title, and score.
     *
     * @param url the URL of the page to crawl
     */
    public static void crawlMovies(String url) {
        try {
            // Send a GET request to the server, as a browser would, and parse the response
            Document doc = Jsoup.connect(url).get();
            // Each <li> in the ordered list holds one movie
            Elements elements = doc.select("#content > div > div.article > ol > li");
            // System.out.println(elements);
            for (Element element : elements) {
                String rank = element.select("div.pic > em").text();
                String name = element.select("div.info > div.hd > a > span:nth-child(1)").text();
                String score = element.select("div.info > div.bd > div.star > span.rating_num").text();
                System.out.println(rank + " " + name + " " + score);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

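III. What Can jsoup Do?

  • Fetch and parse HTML from a URL, a file, or a string (crawling)
  • Find and extract data using DOM traversal or CSS selectors
  • Manipulate HTML elements, attributes, and text
  • Clean user-submitted content against a safelist to prevent XSS attacks
  • Output tidy HTML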

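IV. jsoup Core Concepts

  • Document: the document object. Each HTML page is represented by one Document, the top-level structure in jsoup.
  • Element: the element object. A Document can contain many Element objects; use them to traverse nodes, extract data, or manipulate the HTML directly.
  • Elements: a collection of Element objects, similar to a List.
  • Node: the node object. Elements, text, comments, and similar pieces of a page are all nodes; Node is the base type that stores them in the document tree.
  • Class hierarchy: Document extends Element (class Document extends Element), and Element extends Node (class Element extends Node).
  • Typical workflow: obtain a Document, then locate Element objects, and finally read the data from those elements and their nodes.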
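The workflow in the last bullet can be sketched offline as follows. This is a minimal illustration, not taken from the original post: the sample HTML and the class name ConceptsDemo are made up, and the document is parsed from a string rather than fetched from a URL.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ConceptsDemo {
    // Hypothetical sample page, used only for illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
          + "<body><p id=\"greeting\" class=\"msg\">Hello, <b>jsoup</b>!</p></body></html>";

    /** Step 1: obtain the Document (here from a string instead of a URL). */
    static Document load() {
        return Jsoup.parse(HTML);
    }

    /** Steps 2 and 3: locate an Element, then read its text. */
    static String greetingText(Document doc) {
        Element p = doc.getElementById("greeting");
        return p.text();
    }

    /** Counts the direct child nodes of an element that are plain text nodes. */
    static int countTextNodes(Element element) {
        int count = 0;
        for (Node child : element.childNodes()) {
            if (child instanceof TextNode) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = load();
        System.out.println(doc.title());                                    // Demo
        System.out.println(greetingText(doc));                              // Hello, jsoup!
        System.out.println(countTextNodes(doc.getElementById("greeting"))); // 2
    }
}
```

The paragraph has three child nodes: the text "Hello, ", the <b> element, and the text "!". Only two of them are TextNode instances, which shows that an Element is one kind of Node among several.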

V. Obtaining a Document with jsoup

1. Add the jsoup dependency

<!-- jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

2. Load a Document from a URL (commonly used)

Use Jsoup.connect(String url).get() to fetch the page (only the http and https protocols are supported).

 try {
     Document document = Jsoup.connect("http://www.baidu.com").get();
     System.out.println(document);
} catch (IOException e) {
     throw new RuntimeException(e);
}

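connect(String url) creates a new Connection; calling get() or post() on it sends the request and parses the response. If an error occurs while fetching the HTML from the URL, an IOException is thrown and should be handled appropriately.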
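A Document does not have to come from the network. As a complement to the URL example, here is a small sketch that parses an HTML string with Jsoup.parse(String html, String baseUri). The sample markup, the example.com base URI, and the helper names are assumptions made for illustration. The base URI matters because it lets relative links be resolved to absolute ones.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringDemo {
    /** Parses an HTML string; baseUri is used later to resolve relative links. */
    static Document parse(String html, String baseUri) {
        return Jsoup.parse(html, baseUri);
    }

    /** Returns the absolute URL of the first link, or an empty string if there is none. */
    static String firstLink(Document doc) {
        Element a = doc.select("a[href]").first();
        return a == null ? "" : a.absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical fragment; jsoup wraps it in html/head/body automatically.
        String html = "<p>See <a href=\"/docs/start\">the docs</a></p>";
        Document doc = parse(html, "https://example.com/");
        System.out.println(firstLink(doc)); // https://example.com/docs/start
    }
}
```

Parsing from a string is also handy for testing selectors without sending repeated requests to a live site.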

3. Set request headers to mimic a browser

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseUtils {
    public static final String url = "https://www.zhaopin.com/sou/jl530/kw01L00O80EO062/p2";

    public static void main(String[] args) throws IOException {
        Document scriptHtml = Jsoup.connect(url)
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7")
                // jsoup decodes gzip and deflate responses; br and zstd are left out
                // so the server does not reply with an encoding jsoup cannot decode
                .header("Accept-Encoding", "gzip, deflate")
                .header("Accept-Language", "zh-CN,zh;q=0.9")
                .header("Cache-Control", "max-age=0")
                // The Cookie belongs to one browser session and expires;
                // replace this placeholder with the value copied from your own browser
                .header("Cookie", "<your Cookie header value>")
                .header("Priority", "u=0, i")
                .header("Sec-Ch-Ua", "\"Not/A)Brand\";v=\"8\", \"Chromium\";v=\"126\", \"Google Chrome\";v=\"126\"")
                .header("Sec-Ch-Ua-Mobile", "?0")
                .header("Sec-Ch-Ua-Platform", "\"Windows\"")
                .header("Sec-Fetch-Dest", "document")
                .header("Sec-Fetch-Mode", "navigate")
                .header("Sec-Fetch-Site", "same-origin")
                .header("Sec-Fetch-User", "?1")
                .header("Upgrade-Insecure-Requests", "1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36")
                .timeout(50000) // milliseconds
                .get();
        System.out.println(scriptHtml);
    }
}
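When the same headers are reused across requests, it can be tidier to keep them in a Map and pass them in one call. The sketch below assumes Connection.headers(Map), userAgent(String), and timeout(int) from the jsoup Connection API. The header values, the example.com URL, and the names HeaderDemo, browserHeaders, and prepare are illustrative. Nothing is sent until get() is called, so the prepared request can be inspected first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class HeaderDemo {
    /** A small, hypothetical set of browser-like headers. */
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-CN,zh;q=0.9");
        headers.put("Upgrade-Insecure-Requests", "1");
        return headers;
    }

    /** Builds the Connection without sending it; the caller decides when to call get(). */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .headers(browserHeaders())
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) {
        Connection conn = prepare("https://example.com/");
        // Inspect the prepared request; no network traffic happens here.
        System.out.println(conn.request().header("Accept-Language")); // zh-CN,zh;q=0.9
        System.out.println(conn.request().timeout());                 // 10000
        // conn.get() would send the request and return the parsed Document.
    }
}
```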

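VI. Locating and Selecting Elements

Using the DOM structure, we can find descendant elements by tag, id, class, and so on.

1. Finding elements: the following methods return an Element or Elements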

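  • getElementById(String id): finds an element by its id
  • getElementsByTag(String tag): finds elements by tag name
  • getElementsByClass(String className): finds elements by class name
  • getElementsByAttribute(String key): finds elements by attribute name, for example elements that have an href attribute
  • siblingElements(): gets the sibling elements; returns an empty list if the element has no siblings
  • firstElementSibling(): gets the first sibling element
  • lastElementSibling(): gets the last sibling element
  • nextElementSibling(): gets the next sibling element
  • previousElementSibling(): gets the previous sibling element
  • parent(): gets the parent of this element
  • children(): gets all child elements of this element
  • child(int index): gets the child element at the given index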
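Several of the methods above can be tried on a small in-memory document. The markup and the class name TraversalDemo below are made up for illustration; the point is only to show how the lookup and sibling methods relate to each other.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TraversalDemo {
    // Hypothetical markup, used only for illustration.
    static final String HTML =
            "<ul id=\"menu\">"
          + "<li class=\"item\">Home</li>"
          + "<li class=\"item active\">Blog</li>"
          + "<li class=\"item\">About</li>"
          + "</ul>";

    static Document load() {
        return Jsoup.parse(HTML);
    }

    /** Finds the first element carrying the "active" class. */
    static Element active(Document doc) {
        return doc.getElementsByClass("active").first();
    }

    public static void main(String[] args) {
        Document doc = load();
        Element menu = doc.getElementById("menu");
        Element current = active(doc);

        System.out.println(doc.getElementsByTag("li").size());       // 3
        System.out.println(menu.children().size());                  // 3
        System.out.println(menu.child(0).text());                    // Home
        System.out.println(current.previousElementSibling().text()); // Home
        System.out.println(current.nextElementSibling().text());     // About
        System.out.println(current.siblingElements().size());        // 2
        System.out.println(current.parent().id());                   // menu
    }
}
```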

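2. select(String selector): finds elements with a CSS-style selector and returns Elements. The supported selector forms include:

tagname: finds elements by tag, e.g. "a" matches <a> tags.
#id: finds an element by id, e.g. #logo matches <p id="logo">
.class: finds elements by class name, e.g. .title matches <p class="title">
ns|tag: finds elements by tag within a namespace, e.g. fb|name matches <fb:name>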

[attribute]: finds elements that have the attribute, e.g. [href] matches <a href="...">
[^attribute]: finds elements by attribute-name prefix, e.g. [^data-] matches elements with HTML5 dataset attributes
[attribute=value]: finds elements by attribute value, e.g. [width=500]
[attribute^=value], [attribute$=value], [attribute*=value]: finds elements whose attribute value starts with, ends with, or contains the given value, e.g. [href*=/path/]
[attribute~=regex]: finds elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)] matches all png, jpg, and jpeg images
*: wildcard, matches all elements

VII. Getting Data

  • attr(String key): gets a single attribute value (commonly used)
  • attributes(): gets all attributes
  • attr(String key, String value): sets an attribute value
  • text(): gets the text content (commonly used)
  • text(String value): sets the text content
  • html(): gets the HTML inside the element (commonly used)
  • html(String value): sets the HTML inside the element
  • outerHtml(): gets the HTML of the element including its own tag
  • data(): gets the data content (for example, of script and style tags)
  • id(): gets the id value
  • className(): gets the value of the class attribute as one string, which may contain several space-separated class names
  • classNames(): gets all class names as a set
  • tag(): gets the element's Tag object
  • tagName(): gets the element's tag name

VIII. A Complete Example

Crawling the Zhaopin recruitment site

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseUtils {
    public static final String url = "https://www.zhaopin.com/sou/jl530/kw01L00O80EO062/p1";

    public static void main(String[] args) throws IOException {
        Document scriptHtml = Jsoup.connect(url)
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7")
                // jsoup decodes gzip and deflate responses; br and zstd are left out
                // so the server does not reply with an encoding jsoup cannot decode
                .header("Accept-Encoding", "gzip, deflate")
                .header("Accept-Language", "zh-CN,zh;q=0.9")
                .header("Cache-Control", "max-age=0")
                // The Cookie belongs to one browser session and expires;
                // replace this placeholder with the value copied from your own browser
                .header("Cookie", "<your Cookie header value>")
                .header("Priority", "u=0, i")
                .header("Sec-Ch-Ua", "\"Not/A)Brand\";v=\"8\", \"Chromium\";v=\"126\", \"Google Chrome\";v=\"126\"")
                .header("Sec-Ch-Ua-Mobile", "?0")
                .header("Sec-Ch-Ua-Platform", "\"Windows\"")
                .header("Sec-Fetch-Dest", "document")
                .header("Sec-Fetch-Mode", "navigate")
                .header("Sec-Fetch-Site", "same-origin")
                .header("Sec-Fetch-User", "?1")
                .header("Upgrade-Insecure-Requests", "1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36")
                .timeout(5000) // milliseconds
                .get();
        // System.out.println(scriptHtml);
        // Each job posting sits in an element with the class "joblist-box__item"
        Elements content = scriptHtml.getElementsByClass("joblist-box__item");
        for (Element element : content) {
            String price = element.getElementsByClass("jobinfo__salary").text();
            String company = element.getElementsByClass("companyinfo__name").text();
            System.out.println(price + " " + company);
        }
    }
}
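The selector forms and data accessors listed earlier can be checked without touching a live site. The following sketch runs them against a made-up HTML fragment. The markup and the names SelectDemo and count are assumptions for illustration, and the regular expression is the png/jpg pattern from the selector list, written with Java string escaping.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectDemo {
    // Hypothetical markup, used only for illustration.
    static final String HTML =
            "<div id=\"gallery\">"
          + "<a href=\"/a.html\" data-id=\"1\" class=\"link primary\">First</a>"
          + "<a href=\"https://example.com/path/b.html\" data-id=\"2\">Second</a>"
          + "<img src=\"cat.PNG\" width=\"500\">"
          + "<img src=\"dog.jpeg\">"
          + "<img src=\"logo.svg\">"
          + "<script>var x = 1;</script>"
          + "</div>";

    static Document load() {
        return Jsoup.parse(HTML);
    }

    /** Returns how many elements in the document match the selector. */
    static int count(Document doc, String selector) {
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        Document doc = load();

        // Selector forms
        System.out.println(count(doc, "#gallery > a"));                    // 2
        System.out.println(count(doc, "[^data-]"));                        // 2
        System.out.println(count(doc, "[width=500]"));                     // 1
        System.out.println(count(doc, "[href*=/path/]"));                  // 1
        System.out.println(count(doc, "img[src~=(?i)\\.(png|jpe?g)]"));    // 2

        // Data accessors on the first link
        Element first = doc.select("a.link").first();
        System.out.println(first.attr("href"));   // /a.html
        System.out.println(first.text());         // First
        System.out.println(first.className());    // link primary
        System.out.println(first.classNames().size()); // 2
        System.out.println(first.tagName());      // a
        System.out.println(doc.select("script").first().data()); // var x = 1;

        // Setting an attribute changes the in-memory document only
        first.attr("target", "_blank");
        System.out.println(first.outerHtml());
    }
}
```

Note that className() returns the whole class attribute ("link primary"), while classNames() splits it into individual names.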
