Proxy configuration for python requests

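Contents

- Preface
- Source: a free website
- Scraping
- Accessing the page with urllib
- Extracting fields with regular expressions
- Putting it together
- Results
- Conclusion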

Preface

Hi everyone, I'm yma16. This article is about proxy configuration for python requests.

What's the cure for worry? Scrape a few IPs to use.

Source: a free website

The free proxy IPs listed on the Kuaidaili (快代理) site can serve as a source:
www.kuaidaili.com/free/inha/1...

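Scraping

The urllib.request library in detail

urllib is one of Python's standard libraries for working with URLs. It offers a simple interface for tasks such as sending HTTP requests, fetching HTML content, handling cookies, and handling errors. It consists of four submodules:

urllib.request: sends HTTP/HTTPS requests and covers the operations around them;
urllib.error: contains the exceptions raised by urllib.request;
urllib.parse: parses URLs, mainly splitting and joining them;
urllib.robotparser: parses robots.txt, the file through which a site tells crawlers which pages they may fetch.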
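
As a small illustration of the last submodule, the sketch below feeds a hand-written robots.txt to urllib.robotparser and asks whether two paths may be fetched. The rules, the example.com host, the MyCrawler user agent and the build_parser helper are all made up for this example; a real crawler would more likely load the site's own file with set_url() and read().

```python
from urllib import robotparser

# Hypothetical robots.txt content, written inline so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
# can_fetch(useragent, url) answers whether the rules allow the URL.
allowed_free = parser.can_fetch("MyCrawler", "https://example.com/free/")
allowed_private = parser.can_fetch("MyCrawler", "https://example.com/private/data")
print(allowed_free, allowed_private)
```

Checking can_fetch() before each request is one way to keep a scraper within what the site permits.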

Some common use cases follow.

Sending HTTP/HTTPS requests

urllib.request.urlopen() sends an HTTP/HTTPS request and returns the response, for example:

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.read().decode('utf-8'))

Handling cookies

The HTTPCookieProcessor class in urllib.request handles cookies. An example:

import http.cookiejar, urllib.request

# Create a CookieJar to hold the cookies
cookie_jar = http.cookiejar.CookieJar()

# Create an HTTPCookieProcessor that stores cookies in the jar
cookie_processor = urllib.request.HTTPCookieProcessor(cookie_jar)

# Build an opener that uses the cookie processor
opener = urllib.request.build_opener(cookie_processor)

# Open the page through the opener (not urlopen()) so that cookies set by the server land in the jar
opener.open('http://www.baidu.com')

# Iterate over the jar to read the cookies
for cookie in cookie_jar:
    print(cookie.name, cookie.value)

Parsing URLs

The urlparse(), urlsplit() and urljoin() functions in urllib.parse take URLs apart and put them together. An example:

import urllib.parse

url = 'https://www.python.org:8080/docs/index.html;abcd?def=456#789'
result = urllib.parse.urlsplit(url)
print(result)
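
The example above only exercises urlsplit(). As a sketch of the other two functions on the same URL, the snippet below shows where urlparse() differs (it separates the ';abcd' path parameters) and how urljoin() resolves a relative page path. The relative path '2/' is an illustrative value.

```python
from urllib.parse import urljoin, urlparse, urlsplit

url = 'https://www.python.org:8080/docs/index.html;abcd?def=456#789'

# urlsplit() leaves ';abcd' inside the path ...
split_result = urlsplit(url)
# ... while urlparse() splits it off into a separate 'params' field.
parse_result = urlparse(url)
print(split_result.path)
print(parse_result.path, parse_result.params)
print(parse_result.hostname, parse_result.port)

# urljoin() resolves a relative reference against a base URL.
page_url = urljoin('https://www.kuaidaili.com/free/inha/', '2/')
print(page_url)
```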

Error handling

The URLError and HTTPError classes in urllib.error cover the errors that can occur during an HTTP request, for example:

import urllib.request, urllib.error

try:
    response = urllib.request.urlopen('http://www.pythonxyx.com')
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first
    print('The server couldn\'t fulfill the request.')
    print('Error code:', e.code)
except urllib.error.URLError as e:
    # No HTTP response at all (DNS failure, refused connection, ...)
    print('Failed to reach the server.')
    print('Reason:', e.reason)

Those are the common urllib operations; the library offers more, which can be used as needed.

Accessing the page with urllib

    rq = request.Request(baseurl, headers=headers)  # build the request with custom headers
    resp = request.urlopen(rq)  # send it
    html = resp.read().decode('utf-8')  # decode the response body
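
The fragment above relies on names defined elsewhere in the script. A self-contained sketch of the same step might look like the following; build_request is a hypothetical helper, the User-Agent value is a placeholder, and nothing goes over the network until urlopen() is called.

```python
from urllib import request

BASE_URL = 'https://www.kuaidaili.com/free/inha/'
# Placeholder value; substitute the User-Agent string of a real browser.
HEADERS = {'user-agent': 'Mozilla/5.0 (placeholder)'}

def build_request(page):
    """Build (but do not send) the request for one listing page."""
    return request.Request(BASE_URL + str(page) + '/', headers=HEADERS)

rq = build_request(1)
print(rq.full_url)
print(rq.get_header('User-agent'))  # Request stores header names capitalized

# Sending it would look like this (needs network access):
# with request.urlopen(rq, timeout=10) as resp:
#     html = resp.read().decode('utf-8')
```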

Extracting fields with regular expressions

    compile_c="<td data-title=\"IP\">(.*?)</td>"
    compile_p="<td data-title=\"PORT\">(.*?)</td>"
    compile_s="<td data-title=\"匿名度\">(.*?)</td>"
    compile_k="<td data-title=\"类型\">(.*?)</td>"
    compile_l="<td data-title=\"位置\">(.*?)</td>"
    compile_v="<td data-title=\"响应速度\">(.*?)</td>"
    compile_t="<td data-title=\"最后验证时间\">(.*?)</td>"
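
To see how such patterns behave, here is a minimal offline sketch that applies the IP and PORT patterns to a hand-written table fragment. The HTML and the addresses (taken from the 192.0.2.0/24 documentation range) are invented for the example and only assume the markup has the shape the patterns expect.

```python
import re

# Hypothetical table fragment shaped like the markup the patterns expect.
SAMPLE_HTML = '''
<tr>
<td data-title="IP">192.0.2.10</td>
<td data-title="PORT">8080</td>
</tr>
<tr>
<td data-title="IP">192.0.2.11</td>
<td data-title="PORT">3128</td>
</tr>
'''

compile_c = "<td data-title=\"IP\">(.*?)</td>"
compile_p = "<td data-title=\"PORT\">(.*?)</td>"

# findall() returns the captured group of every match, in document order.
ips = re.findall(compile_c, SAMPLE_HTML, re.S)
ports = re.findall(compile_p, SAMPLE_HTML, re.S)

# Pair the i-th IP with the i-th port.
proxies = list(zip(ips, ports))
print(proxies)
```

Because each column is extracted separately, the lists only line up if every row contains every field; a row with a missing cell would shift the pairing.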

Putting it together

To avoid 503 errors, the script waits 2 seconds before each request.

from urllib import request
import re,time,xlwt  # xlwt is a third-party package (pip install xlwt)
headers = {
    "user-agent": "****"# fill in the User-Agent string from your browser
}
url='https://www.kuaidaili.com/free/inha/'



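def auto_index():
    s=1  # next row to write; row 0 holds the header
    for loc in range(1,50):  # pages 1 to 49
        index=str(loc)+'/'
        time.sleep(2)  # pause 2 seconds before each request
        s=urllib_request(url,index,s)  # returns the next free row, which is carried into the next page
    worksheet.save("代理ip.xls")  # worksheet is the xlwt Workbook created at the bottom of the script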


def urllib_request(url,index,rows):  # fetch one page and hand its html to the parser
    baseurl=url+index
    print(baseurl)
    rq = request.Request(baseurl, headers=headers)  # build the request with custom headers
    resp = request.urlopen(rq)  # send it
    html = resp.read().decode('utf-8')
    return compile_html(html,rows)  # parse the page; returns the next free row

def compile_html(html,sheetLocation):
    style=re.compile(r"<tbody>\s(.*?)\s</table>",re.S)  # body of the proxy table
    result=re.findall(style,html)[0]  # first match; raises IndexError if the table is missing
    # print(result)
    compile_c="<td data-title=\"IP\">(.*?)</td>"
    compile_p="<td data-title=\"PORT\">(.*?)</td>"
    compile_s="<td data-title=\"匿名度\">(.*?)</td>"
    compile_k="<td data-title=\"类型\">(.*?)</td>"
    compile_l="<td data-title=\"位置\">(.*?)</td>"
    compile_v="<td data-title=\"响应速度\">(.*?)</td>"
    compile_t="<td data-title=\"最后验证时间\">(.*?)</td>"
    # IP
    ipText=re.compile(compile_c,re.S)
    ip_text = re.findall(ipText, result)
    # port
    portText = re.compile(compile_p,re.S)
    port_text= re.findall(portText, result)
    # anonymity level
    securityText = re.compile(compile_s, re.S)
    s_text = re.findall(securityText, result)
    # protocol
    kindText = re.compile(compile_k, re.S)
    k_text = re.findall(kindText, result)
    # location
    locText = re.compile(compile_l, re.S)
    l_text = re.findall(locText, result)
    # response speed
    vText = re.compile(compile_v, re.S)
    v_text = re.findall(vText, result)
    # last verified time
    tText = re.compile(compile_t, re.S)
    t_text = re.findall(tText, result)
    length=len(ip_text)
    for i in range(0,length):  # half-open range: 0 .. length-1
        print(ip_text[i], port_text[i], s_text[i], l_text[i], k_text[i],v_text[i], t_text[i])
        print(sheetLocation,i)
        proxy_excel.write(sheetLocation,0,ip_text[i])
        proxy_excel.write(sheetLocation,1,port_text[i])
        proxy_excel.write(sheetLocation,2,s_text[i])
        proxy_excel.write(sheetLocation,3,l_text[i])
        proxy_excel.write(sheetLocation,4,k_text[i])
        proxy_excel.write(sheetLocation,5,v_text[i])
        proxy_excel.write(sheetLocation,6,t_text[i])
        sheetLocation+=1  # move to the next row after writing one record
    return sheetLocation  # the next free row (not a column)
    # print(ip_text,port_text,s_text,l_text,k_text,t_text)

def write_excel(worksheet):  # add the sheet and write the header row
    proxy_excel = worksheet.add_sheet("proxySheet")
    proxy_excel.write(0, 0, 'ip')
    proxy_excel.write(0, 1, 'port')
    proxy_excel.write(0, 2, '安全性')  # anonymity level
    proxy_excel.write(0, 3, '地区')  # location
    proxy_excel.write(0, 4, '协议')  # protocol
    proxy_excel.write(0, 5, '速度')  # response speed
    proxy_excel.write(0, 6, '更新时间')  # last verified time
    return proxy_excel
worksheet = xlwt.Workbook(encoding='utf-8')  # despite the name, this is the workbook
proxy_excel=write_excel(worksheet)
auto_index()
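
The fixed 2-second pause lowers the chance of a 503 but does not handle one when it happens. One possible extension, which is not part of the original script, is to retry a failed page after a growing pause. The sketch below is written under that assumption; fetch_with_retry, its parameters and the stand-in flaky_fetch are hypothetical names, and the demonstration runs offline with the sleep function replaced by a list append.

```python
import time
import urllib.error

def fetch_with_retry(fetch, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch() and retry after a pause while it raises HTTP 503.

    fetch is any zero-argument callable that performs the request.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # Give up on other status codes or once the attempts are used up.
            if e.code != 503 or attempt == retries:
                raise
            sleep(delay * attempt)  # wait a little longer after each failure

# Offline demonstration: a stand-in fetch that fails twice, then succeeds.
calls = []

def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise urllib.error.HTTPError(
            'http://example.com', 503, 'Service Unavailable', None, None)
    return '<html>ok</html>'

waits = []  # records the pauses instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the script above, the call inside auto_index() could be wrapped as fetch_with_retry(lambda: urllib_request(url, index, s)).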

Results

735 proxy IP records were collected.
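
Once collected, an IP and port pair can be plugged into a request. A minimal sketch, using a placeholder address from the 192.0.2.0/24 documentation range rather than a working proxy: proxy_mapping is a hypothetical helper that builds the scheme-to-URL dict, which is the shape both urllib's ProxyHandler and the requests package's proxies argument accept. The network calls are left commented out, and the requests lines assume that package is installed.

```python
from urllib import request

def proxy_mapping(ip, port, scheme='http'):
    """Build a scheme-to-proxy-URL dict from one scraped record."""
    proxy_url = f'{scheme}://{ip}:{port}'
    return {'http': proxy_url, 'https': proxy_url}

# Placeholder address from the documentation range, not a working proxy.
proxies = proxy_mapping('192.0.2.10', '8080')
print(proxies)

# Standard library: route an opener's requests through the proxy.
opener = request.build_opener(request.ProxyHandler(proxies))
# html = opener.open('http://www.python.org', timeout=5).read()

# With the third-party requests package the same dict is passed per call:
# import requests
# resp = requests.get('http://www.python.org', proxies=proxies, timeout=5)
```

Free proxies tend to be short-lived, so a timeout and a check of the response are worth adding before relying on any single entry.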

Conclusion

That's all for this article. If you spot mistakes or gaps, please point them out. Thanks for reading!