Penetration Testing Toolkit: Kali Tools (Chapter 4, Part 5): Introduction to Web Crawlers

Introduction to Python Web Crawlers [spider]

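1, Exchange mechanism:

How the server and the local machine exchange data:

HTTP protocol: the way a client and a server hold a conversation.

Client -------[request]-------> Server

Client <------[response]------- Server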

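HTTP request:

The client asks the server for a resource by sending a request. HTTP defines many request methods; the two used most often here are [GET, POST].

HTTP response:

After we send a request, the server returns a Response [the web page we requested].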
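The request/response round trip described above can be sketched without touching a real website. The following is a minimal illustration using only the standard library; the `DemoHandler` class and the page it serves are hypothetical stand-ins, not part of the original tutorial. With `requests`, the client side would be `requests.get(url)`, followed by reading `.status_code` and `.text`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><head><title>demo</title></head><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
opener = build_opener(ProxyHandler({}))  # ignore any system proxy for this local demo

with opener.open(url, timeout=5) as response:  # the client sends a GET request
    status = response.status                   # status code of the response
    html = response.read().decode("utf-8")     # the page that was requested

server.shutdown()
server.server_close()

print(status)
print(html)
```

The server side is only there so the example is self-contained; in a real crawl the server is the target website.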

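Response:

status_code [status code] 200: the request succeeded, and the response carries the elements of the web page.

status_code [status code] 403/404: the request failed (403: access forbidden; 404: page not found).

To see this yourself, open Google Chrome, load a website, then click Inspect >>> Network >>> refresh the page.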
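As a quick reference for the status codes mentioned above, the standard library's `http.HTTPStatus` enum maps each numeric code to its official reason phrase. The helper name `describe` is made up for this sketch.

```python
from http import HTTPStatus


def describe(code):
    """Return 'code phrase' for a numeric HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 403, 404):
    print(describe(code))
# prints:
# 200 OK
# 403 Forbidden
# 404 Not Found
```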

2, Web page parsing:

1. Web page parsing: requires [bs4]

from bs4 import BeautifulSoup
import requests

# Fetch the page and parse its content
url = "https://www.baidu.com"
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
print(soup)

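2. Describe the location of the element to crawl:

e.g. a title [find where it sits in the page] >>> right-click >>> Copy selector

titles = soup.select('#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info > a')
print(titles)

Explanation:

#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info: the position of the tag, i.e. its selector

a: look for the a tag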

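Select by the class name of the [parent tag] instead:

titles = soup.select('div.syl_info > a')
print(titles)

Explanation:

div.syl_info: a div tag whose class name is syl_info

a: look for the a tags directly inside it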
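The difference between the two selectors can be seen on a small inline page. The HTML below is a hypothetical snippet that only mimics the structure the selectors assume; the real site's markup is not shown in the source. The built-in `html.parser` is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure behind the selectors above.
html = """
<div id="sy_load">
  <h2>News</h2>
  <ul>
    <li><div class="syl_info"><a href="/post/1">First post</a></div></li>
    <li><div class="syl_info"><a href="/post/2">Second post</a></div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Full path as copied from the browser: matches only the link in the first list item.
first = soup.select("#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info > a")

# Shorter selector based on the parent's class: matches every such link.
all_links = soup.select("div.syl_info > a")

print([a.get_text() for a in first])
print([a["href"] for a in all_links])
```

The copied selector is precise but brittle; the class-based one is usually what a crawler wants, because it collects every matching item in one call.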

3. Installing BeautifulSoup (provided by the bs4 package):

1. Install: pip install beautifulsoup4

2. Optionally install a parser:

pip install lxml [installing this one is usually enough]

pip install html5lib

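3. Usage:

from bs4 import BeautifulSoup
import requests

req_obj = requests.get('https://www.baidu.com')
soup = BeautifulSoup(req_obj.text, 'lxml')

Without BeautifulSoup, printing req_obj shows only the status code, e.g. <Response [200]>.

With BeautifulSoup, printing soup shows the site's HTML code.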

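4. Frequently used methods:

from bs4 import BeautifulSoup
import requests

a = requests.get('https://www.baidu.com')
soup = BeautifulSoup(a.text, 'lxml')

print(soup.title)            # find a tag by name; returns only the first match
print(soup.find('title'))    # same: returns only the first title tag
print(soup.find_all('div'))  # find all div tags

c = soup.div                 # the first div tag
print(c['id'])               # the tag's id attribute (raises KeyError if the tag has no id)
print(c.attrs)               # all attributes of the tag

d = soup.title               # the title tag
print(d.string)              # the string inside the tag

e = soup.head                # the head tag
print(e.title)               # get a tag, then one of its child tags

f = soup.body                # the body tag
print(f.contents)            # the tag's child nodes, returned as a list

g = soup.title               # the title tag
print(g.parent)              # the parent tag

print(soup.find_all(id='link2'))  # all tags whose id is link2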
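To see what these calls return without depending on a live site, the same methods can be run against a small inline document. The HTML here is invented for illustration, and `html.parser` is used instead of lxml so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only to make the results predictable.
html = (
    '<html><head><title>Demo page</title></head>'
    '<body><div id="main" class="box"><a id="link2" href="/b">B</a></div>'
    '<div>second</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)          # Demo page
print(soup.title.parent.name)     # head
print(len(soup.find_all("div")))  # 2

div = soup.div                    # first div in the document
print(div["id"])                  # main
print(div.attrs)                  # {'id': 'main', 'class': ['box']}
print(div.get("missing"))         # None: .get() avoids the KeyError that div["missing"] raises

print(len(soup.body.contents))    # 2 (the two div tags)
print(soup.find_all(id="link2"))  # list containing the single <a> tag
```

Note that `class` comes back as a list, because an HTML element may carry several classes.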

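3, Modules and libraries a crawler needs:

Libraries: requests, bs4

Class: BeautifulSoup (imported from bs4)

1. Fetching: requests

2. Parsing: BeautifulSoup

3. Storage:

4. Directory scanner, principle and practice: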

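import sys

import requests

url = sys.argv[1]  # base URL of the target (first command-line argument)
dic = sys.argv[2]  # path to the wordlist file (second command-line argument)

with open(dic, 'r') as f:
    for i in f.readlines():  # go through the wordlist line by line
        i = i.strip()  # strip surrounding whitespace and the trailing newline
        r = requests.get(url + i)
        if r.status_code == 200:  # the path exists on the server
            print('url:' + r.url)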
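The scanner's core idea (append each wordlist entry to the base URL and keep the ones that answer 200) can be separated from the network call so it can be tried offline. This is a sketch under assumptions: `scan`, `requests_status`, `fake_status`, and the `example.test` site table are all hypothetical names introduced here, not part of the original script. Only run a real scan against systems you are authorized to test.

```python
def scan(base_url, words, get_status):
    """Return the URLs whose status code is 200.

    get_status is any callable that maps a URL to an HTTP status code.
    """
    found = []
    for word in words:
        word = word.strip()
        if not word:  # skip blank lines in the wordlist
            continue
        target = base_url.rstrip("/") + "/" + word.lstrip("/")
        if get_status(target) == 200:
            found.append(target)
    return found


def requests_status(url):
    """Real lookup via requests (imported lazily; not used in the demo below)."""
    import requests
    return requests.get(url, timeout=5).status_code


# Offline stand-in for a server: a hypothetical table of known paths.
fake_site = {
    "http://example.test/admin": 200,
    "http://example.test/login": 200,
    "http://example.test/secret": 403,
}


def fake_status(url):
    """Look the URL up in the table; unknown paths count as 404."""
    return fake_site.get(url, 404)


wordlist = ["admin\n", "\n", "/login\n", "secret\n", "backup\n"]
print(scan("http://example.test/", wordlist, fake_status))
# prints: ['http://example.test/admin', 'http://example.test/login']
```

Passing `requests_status` instead of `fake_status` turns the same function into a live scanner. Normalizing the slash between the base URL and each word also avoids missed paths when the base URL is given without a trailing slash.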