Python Web Scraping Basics: Dealing with IP-Based Anti-Scraping

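Main content: building a proxy IP pool, with a worked example. This post only outlines the idea behind the code; read the code itself and draw your own conclusions from it.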

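1. Import the required modules

2. Set the request URL

3. Disguise the request (send a browser User-Agent)

4. Send the request

5. Parse the data

6. Check whether each proxy actually works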

# Step 1: import the required modules
import requests
import parsel
import random

# Step 2: set the request URL
url = 'https://proxy.ip3366.net/free/'
# Step 3: disguise the request with a browser User-Agent
headers = {
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
}
# Step 4: send the request (the timeout keeps a stalled connection from hanging the script)
response = requests.get(url, headers=headers, timeout=10)
print(response)

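# Step 5: parse the data
selector = parsel.Selector(response.text)
# print(selector)
# //*[@id="ipc"]/tbody
# Each table row holds one proxy. If no rows match, check whether the raw HTML
# really contains <tbody>; browsers often add it only when rendering the page.
rows = selector.xpath('//*[@id="content"]/section/div[2]/table/tbody//tr')
for row in rows:
    ip = row.xpath('td[1]/text()').get(default='').split('IP')[0].strip()
    port = row.xpath('td[2]/text()').get(default='').split('PORT')[0].strip()
    if not ip or not port:
        continue  # skip rows that carry no IP or port
    # print(ip, port)
    # requests selects a proxy by URL scheme, so the keys must be scheme names
    # such as 'http' and 'https'. Keys like 'http1' are ignored and the request
    # goes out without a proxy. Assumption: the listed proxies are plain HTTP
    # proxies, so both entries use an http:// proxy URL.
    proxise_dict = {
        'http': f'http://{ip}:{port}',
        'https': f'http://{ip}:{port}',
    }
    # print(proxise_dict)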
    # Step 6: check whether the proxy works by requesting a site through it
    try:
        # https://pic.netbian.com/4kmeinv/
        response_1 = requests.get(url='https://www.baidu.com/', proxies=proxise_dict, timeout=1)
        if response_1.status_code == 200:
            print('Proxy works', proxise_dict)
    except requests.RequestException:
        print('Current proxy', proxise_dict, 'connection timed out, check failed')
'''
Output:
Proxy works {'http1': 'http://60.188.5.153:80', 'https2': 'https://60.188.5.153:80'}
Proxy works {'http1': 'http://222.66.202.6:80', 'https2': 'https://222.66.202.6:80'}
Proxy works {'http1': 'http://114.231.82.170:8089', 'https2': 'https://114.231.82.170:8089'}
Proxy works {'http1': 'http://58.20.235.231:9002', 'https2': 'https://58.20.235.231:9002'}
Proxy works {'http1': 'http://182.34.18.44:8089', 'https2': 'https://182.34.18.44:8089'}
Proxy works {'http1': 'http://60.28.196.225:80', 'https2': 'https://60.28.196.225:80'}
Proxy works {'http1': 'http://159.226.227.87:80', 'https2': 'https://159.226.227.87:80'}
Proxy works {'http1': 'http://183.164.243.108:8089', 'https2': 'https://183.164.243.108:8089'}
Proxy works {'http1': 'http://36.6.145.60:8089', 'https2': 'https://36.6.145.60:8089'}
Proxy works {'http1': 'http://117.69.237.29:8089', 'https2': 'https://117.69.237.29:8089'}
Current proxy {'http1': 'http://60.28.196.225:80', 'https2': 'https://60.28.196.225:80'} connection timed out, check failed
Proxy works {'http1': 'http://114.231.46.157:8089', 'https2': 'https://114.231.46.157:8089'}
Proxy works {'http1': 'http://114.231.8.177:8089', 'https2': 'https://114.231.8.177:8089'}
Proxy works {'http1': 'http://36.6.144.90:8089', 'https2': 'https://36.6.144.90:8089'}
Proxy works {'http1': 'http://36.6.145.132:8089', 'https2': 'https://36.6.145.132:8089'}
Many proxies from free sites fail the check, so it is better to find a cheap provider that works. For learning, a one-week subscription is enough, or simply use IPs from a paid site.
Note: this output was recorded with the keys 'http1' and 'https2'. requests selects a proxy by scheme name ('http', 'https'), so with these keys the test requests most likely went out directly rather than through the proxies.
'''
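
Checking proxies one at a time is slow when most of them are dead, because every failure costs a full timeout. The following is a minimal sketch of running the same check in parallel with a thread pool. The helper names `build_proxies`, `check_with_requests`, and `filter_working` are made up for this illustration, and the sketch assumes the proxies are plain HTTP proxies.

```python
from concurrent.futures import ThreadPoolExecutor


def build_proxies(ip, port):
    """Return a requests-style proxies mapping for one proxy."""
    proxy_url = f'http://{ip}:{port}'  # assumption: a plain HTTP proxy
    return {'http': proxy_url, 'https': proxy_url}


def check_with_requests(proxies, test_url='https://www.baidu.com/', timeout=3):
    """Return True if test_url answers with status 200 through the proxy."""
    import requests  # imported here so the rest of the sketch runs without it

    try:
        response = requests.get(test_url, proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False


def filter_working(candidates, check=check_with_requests, max_workers=20):
    """Check all candidates in parallel and keep those for which check() is True."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as candidates
        results = list(executor.map(check, candidates))
    return [proxies for proxies, ok in zip(candidates, results) if ok]
```

To use it, build the candidate list inside the parsing loop with `build_proxies(ip, port)` and call `filter_working(candidates)` once after the loop. Passing a different `check` function makes it easy to test against another site or to try the logic without network access.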

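If none of the proxies scraped from free lists work, you can try Zdaye (站大爷), which also offers some free proxy IPs. At my current skill level I could not scrape that site, since the requests kept failing, so I entered the addresses by hand. If you are only learning, paying for a few days of proxies is fine too, because hunting for working free proxies takes a long time. (Be aware that many paid sites require you to bind your identity. I suggest looking for ones that do not. Some go further and ask you to upload your ID card. If you do not believe it, register an account with your phone number, skip the identity verification, and see whether you get a phone call from them. Mind your online security.)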
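
When the addresses are entered by hand, it can be easier to keep them as plain `ip:port` lines and convert them in one place. The sketch below shows one way to do that. The function name `parse_proxy_lines` is hypothetical, and it assumes each entry is a plain HTTP proxy.

```python
def parse_proxy_lines(text):
    """Turn lines of 'ip:port' into requests-style proxies mappings."""
    proxies_list = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and commented-out entries
        host, sep, port = line.rpartition(':')
        if not sep or not host or not port.isdigit():
            continue  # not in ip:port form
        proxy_url = f'http://{host}:{port}'  # assumption: a plain HTTP proxy
        proxies_list.append({'http': proxy_url, 'https': proxy_url})
    return proxies_list
```

Each returned mapping can be passed directly as the `proxies` argument of `requests.get`.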

import requests
import parsel
import random
proxise_dict = {'http1':'https://218.78.55.172:8089',
                'http2':'https://14.18.126.57:3128',
                'http3':'https://111.1.61.47:3128',
                'http4':'https://60.188.102.255:18080',
                'http5':'https://111.1.61.49:3128',
                #'http6':'https://60.188.102.44:3128',
                'http7':'https://120.26.0.11:8880',
                'http8':'https://218.78.55.172:8090',
                'http9':'https://120.133.37.235:1080',
                'http10':'https://101.200.243.204:204',
                'http11':'https://47.122.31.59:8081',
                'http12':'https://61.160.202.79:80',
            }
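# Check whether each proxy works
print(proxise_dict)
proxies_list = []
for IP_proxies in proxise_dict:
    # The keys above ('http1', 'http2', ...) are only labels. requests selects a
    # proxy by URL scheme, so each per-proxy dict needs 'http' and 'https' keys.
    # Note: an https:// proxy URL tells requests to speak TLS to the proxy itself;
    # for a plain HTTP proxy the URL should start with http:// instead.
    proxy_url = proxise_dict[IP_proxies]
    proxise_dict_1 = {'http': proxy_url, 'https': proxy_url}
    print(proxise_dict_1)
    try:
        response_1 = requests.get(url='https://www.baidu.com/', proxies=proxise_dict_1, timeout=1)
        if response_1.status_code == 200:
            print('Proxy works', proxy_url)
            proxies_list.append(proxise_dict_1)
    except requests.RequestException:
        print('Current proxy', proxise_dict_1, 'connection timed out, check failed')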
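# Proxies that pass the check are collected in proxies_list for later use
# Send a request through a proxy IP
# 1. Request URL
url = 'https://www.zdaye.com/'
# 2. Disguise the request with a browser User-Agent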
headers = {
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
}
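# 3. Send the request
if not proxies_list:
    raise SystemExit('No working proxy found')  # random.choice fails on an empty list
proxie = random.choice(proxies_list)
# print(proxie)
response = requests.get(url, headers=headers, proxies=proxie, timeout=10)
print(response)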
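
The code above picks one proxy at random and gives up if that request fails. A proxy pool usually goes one step further: it drops a proxy that fails and retries with another one. Here is a rough sketch of that idea. `ProxyPool` and `fetch_with_pool` are names invented for this example, not part of requests, and `fetch` stands for whatever request you want to send.

```python
import random


class ProxyPool:
    """Hold validated proxies, hand them out at random, and drop ones that fail."""

    def __init__(self, proxies_list):
        self._proxies = list(proxies_list)

    def __len__(self):
        return len(self._proxies)

    def get(self):
        if not self._proxies:
            raise LookupError('proxy pool is empty')
        return random.choice(self._proxies)

    def discard(self, proxies):
        if proxies in self._proxies:
            self._proxies.remove(proxies)


def fetch_with_pool(pool, fetch, max_attempts=3):
    """Call fetch(proxies) with proxies from the pool until one attempt succeeds.

    fetch is any callable that raises an exception on failure.
    """
    last_error = None
    for _ in range(max_attempts):
        if len(pool) == 0:
            break
        proxies = pool.get()
        try:
            return fetch(proxies)
        except Exception as error:  # broad on purpose: fetch is caller-supplied
            pool.discard(proxies)  # do not hand out a failing proxy again
            last_error = error
    raise RuntimeError('all proxy attempts failed') from last_error
```

With the variables from the code above, it could be called as `fetch_with_pool(ProxyPool(proxies_list), lambda p: requests.get(url, headers=headers, proxies=p, timeout=10))`. Dropping a proxy after a single failure is a deliberately simple policy; a real pool might allow a few failures before removing it.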