Mastering Web Crawling Techniques

Part 1. Crawlers and Requests: Getting Past Anti-Crawling Checks

1. The HTTP Protocol and Web Development

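1. What are the request headers and request body, and the response headers and response body?
2. What does a URL consist of?
3. What exactly are GET and POST requests?
4. What is Content-Type?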

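1.1 Introduction

HTTP stands for Hyper Text Transfer Protocol. It is the protocol used to transfer hypertext between World Wide Web (WWW) servers and the local browser. HTTP is an application-layer protocol; because it is simple and fast, it suits distributed hypermedia information systems. It was proposed in 1990 and has been refined and extended through years of use. HTTP works on a client-server architecture: the browser, acting as the HTTP client, sends every request to the HTTP server (the web server) through a URL, and the web server sends a response back to the client based on the request it received.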

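1.2 The Request and Response Protocols

URL:

1. URL: scheme://host:port/path/.../.../.../...?query-parameters
https://www.lagou.com/wn/jobs?labelWords=&fromSearch=true&suginput=&kd=python
Scheme: HTTPS
Host: the network identifier of the server, here the domain name www.lagou.com, which DNS resolves to an IP address
Port: identifies the process on the server; the default is 80 for HTTP and 443 for HTTPS
Path: /wn/jobs
Query parameters: labelWords=&fromSearch=true&suginput=&kd=python
2. The three elements of network communication: protocol, IP address, port
3. Status codes:
1xx: informational, the request is still in progress
2xx: the request succeeded
3xx: redirection
4xx: client error, for example 404 when the resource does not exist, or access is restricted
5xx: server error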
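As a rough illustration of the URL breakdown above, the following sketch splits the same lagou.com URL with Python's standard library. The `default_ports` table is an assumption added here to show where the port comes from when the URL does not spell one out.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.lagou.com/wn/jobs?labelWords=&fromSearch=true&suginput=&kd=python"
parts = urlsplit(url)

# No explicit port in this URL, so fall back to the default for the scheme.
default_ports = {"http": 80, "https": 443}
port = parts.port or default_ports[parts.scheme]

# keep_blank_values=True keeps empty parameters such as labelWords=
query = parse_qs(parts.query, keep_blank_values=True)

print(parts.scheme)
print(parts.hostname)
print(port)
print(parts.path)
print(query)
```

Note that `parse_qs` returns a list per key, because a query string may repeat the same parameter.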

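HTTP consists of the request protocol that the browser must follow when sending data to the server and the response protocol that the server must follow when sending data back to the browser. The information exchanged over HTTP is called an HTTP message. The message from the requesting side (the client) is called the request message, and the message from the responding side (the server) is called the response message. An HTTP message is itself string text made up of multiple lines of data.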

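Request methods: GET and POST

- Data submitted with GET is placed after the URL: a ? separates the URL from the data and parameters are joined with &, as in EditBook?name=test1&id=123456. POST puts the submitted data in the request body of the HTTP message.
- The amount of data GET can submit is limited (because browsers and servers limit URL length), whereas the protocol itself places no limit on the size of a POST body, although servers usually configure their own maximum.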
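To make the difference concrete, here is a minimal sketch that builds (but never sends) one GET and one POST request with the standard library. The host example.com is a placeholder chosen for the illustration; only the EditBook parameters come from the text.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"name": "test1", "id": "123456"}
encoded = urlencode(fields)  # "name=test1&id=123456"

# GET: the encoded fields become the query string after "?"
get_req = Request("http://example.com/EditBook?" + encoded)

# POST: the same fields travel in the request body as bytes
post_req = Request("http://example.com/EditBook", data=encoded.encode("utf-8"))

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With requests, the same split is expressed through the `params=` argument (query string) and the `data=` argument (body).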

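Response status codes: the job of the status code is to describe the result of a request when the server answers the client. From the status code, the user can tell whether the server handled the request normally or an error occurred. A status such as 200 OK consists of a 3-digit number and a reason phrase.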
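A small sketch of the "number plus reason phrase" structure, using the standard library's `http.HTTPStatus`. The `describe` helper and its class labels are names invented for this example.

```python
from http import HTTPStatus

def describe(code: int) -> str:
    """Return 'code phrase (class)' for a standard HTTP status code."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return f"{status.value} {status.phrase} ({classes[code // 100]})"

for code in (200, 302, 404, 500):
    print(describe(code))
```

The first digit alone is enough to decide how a crawler should react: retry on 5xx, follow 3xx, and inspect headers or parameters on 4xx.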

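2. requests and Getting Past Anti-Crawling Checks

2.1 User-Agent Checks

Some sites reject requests whose User-Agent header does not look like a browser, so the request sends a browser-like User-Agent.
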
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
}

res = requests.get(
    "https://www.baidu.com/",
    headers=headers
)

# Save the page source to a file
with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(res.text)

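2.2 Referer Checks

Some endpoints also check the Referer header, so the request names the page it would normally be sent from.

# Pick movies: comedy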
import requests

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
    "Referer": "https://movie.douban.com/explore",
}

res = requests.get(
    "https://m.douban.com/rexxar/api/v2/movie/recommend?refresh=0&start=0&count=20&"
    "selected_categories=%7B%22%E7%B1%BB%E5%9E%8B%22:%22%E5%96%9C%E5%89%A7%22%7D&uncollect=false&tags=%E5%96%9C%E5%89%A7",
    headers=headers,
)

# print(res.text)
print(res.json().get("count"))

2.3 Cookie Checks

# -*- coding: utf-8 -*-
import requests
cookie="xq_a_token=edbee4e5d1e92f98548629214a6e17fe06486a8f; xqat=edbee4e5d1e92f98548629214a6e17fe06486a8f; xq_r_token=1bd9fe2188768570022d1a3f9e12934cdaa1dc53; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTcwODQ3NjMzNiwiY3RtIjoxNzA2MTk1NzQ0NzM1LCJjaWQiOiJkOWQwbjRBWnVwIn0.Dajzah-CDQ8ER2qN9cHnYH_TPjSiYoXzl7Ht1J_CE4TxQRbH8qEzrXe4LcT4KDd815rQOZ6DF4SORJbA1qltAQ-EmD1NiD0YX0FV-Ub-5ok2FDoLcD4_9dS3iNkpIyAQE8DNJZEMBUv4TuLl8tGh7g5l9PpcOlV-_rC5OYXTckDCklU5WNkvPRsSis2nIohnkz4up2STWsB1IowmYgAN3cTXABy5wFmpEY-KUsGYi49UGH5QSYzfAYdbOxVFO5YWOiKrzXV_GIJNRvL2G0N3wQBzMew-fpB0fopKO6BbzzdbKbY2hccxx3p27a_6b7hqED0PoMO34fUKH8z6p5yqvA; cookiesu=851706195765148; u=851706195765148; device_id=11c12c1015a4baf7b0208768b7589c02; Hm_lvt_1db88642e346389874251b5a1eded6e3=1706195767; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1706196050"
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
    "Referer": "https://xueqiu.com/",
    "Cookie": cookie
}

res = requests.get(
    "https://stock.xueqiu.com/v5/stock/chart/minute.json?symbol=SZ399001&period=1d",
    headers=headers
)
print(res.text)
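A cookie string copied from the browser can also be turned into a dict, which requests accepts through its `cookies=` argument. The helper below is a minimal sketch; the function name and the sample string (including the made-up `token` entry) are assumptions for illustration.

```python
def cookie_string_to_dict(cookie_string: str) -> dict:
    """Split 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only: values may contain "=" themselves.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies

sample = "u=851706195765148; device_id=11c12c1015a4baf7b0208768b7589c02; token=abc=="
print(cookie_string_to_dict(sample))
```

Keeping cookies in a dict makes it easier to update a single expired value than editing one long header string.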

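3. Request Parameters

requests takes two parameters for sending data: data (the request body) and params (the URL query string).

3.1 POST Requests and Request-Body Parameters

The data parameter:
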
import requests
while True:
    word = input("Enter a word to translate: ")
    url = "https://aidemo.youdao.com/trans"
    my_data = {
        "q": word,
        "from": "Auto",
        "to": "Auto"
    }
    my_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
    }
    res = requests.post(url, data=my_data, headers=my_headers)
    # print(res.text)
    print(res.json().get("translation")[0])

3.2 GET Requests and Query Parameters

The params parameter:

# GET request with query parameters
import requests
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
    "Referer": "https://movie.douban.com/explore",
}
my_params = {
    "refresh": 0,
    "start": 0,
    "count": 20,
    "tags": "悬疑",
}
res = requests.get(
    "https://m.douban.com/rexxar/api/v2/movie/recommend",
    headers=headers,
    params=my_params,
)
# print(res.text)
print(res.json())
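The params dict ends up percent-encoded in the URL. The sketch below reproduces that encoding with the standard library (requests performs an equivalent step internally) and decodes one of the escaped values that appears in the hard-coded douban URL earlier.

```python
from urllib.parse import urlencode, unquote

my_params = {"refresh": 0, "start": 0, "count": 20, "tags": "悬疑"}

# Non-ASCII values are UTF-8 encoded and then percent-escaped.
query = urlencode(my_params)
print(query)

# Going the other way: decode a percent-escaped value.
print(unquote("%E5%96%9C%E5%89%A7"))
```

Passing a dict through `params=` avoids hand-writing escaped strings like the ones in the comedy example.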

4. Crawling Images and Videos

4.1 Downloading Media Streams Directly

Image:

# -*- coding: utf-8 -*-
import requests


url = "https://pic.netbian.com/uploads/allimg/231213/233751-17024818714f51.jpg"

res = requests.get(url)
# print(res.content)

# Write the raw bytes to a file in binary mode
with open("美女.jpg", "wb") as f:
    f.write(res.content)

Video:

# -*- coding: utf-8 -*-
import requests

url = "https://apd-vlive.apdcdn.tc.qq.com/om.tc.qq.com/A2cOGJ1ZAYQyB_mkjQd9WD_pAtroyonOY92ENqLuwa9Q/B_JxNyiJmktHRgresXhfyMeiXZqnwHhIz_hST7i-68laByiTwQm8_qdRWZhBbcMHif/svp_50001/szg_1179_50001_0bf2kyaawaaafaal3yaoijqfcvwdbnlaac2a.f632.mp4?sdtfrom=v1010&guid=e765b9e5b625f662&vkey=38DF885CE72372B324B47541285404A230F61C9E12FC69B72EC8A2CF6F6809E00461165C635758EB7E7B49738D9DB608A7C855DB4E7A0B9A082A399875D82022567E1690D97ABE2A3C002ADD06D4AD5EAD4F028688C35E6D73D29DBF2D596F63C6722B78DA1EA3707EB5A7DD2F60781A45B31B693974432F649E523C08D797BA7907BFDB2562BF44E1483A3981FAAC70BEF8BD92611EF365A183621BDE70F55B2224394DB78CD7F5"
res = requests.get(url)

# Write the video bytes to a file
with open("相声.mp4", "wb") as f:
    f.write(res.content)
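Reading `res.content` loads the whole file into memory, which can hurt for large videos. One possible alternative is to write the body chunk by chunk. The sketch below shows only the writing side, fed with fake chunks so it runs without a network; with requests, the chunks would presumably come from `requests.get(url, stream=True)` followed by `res.iter_content(chunk_size=8192)`. The helper name `save_chunks` is made up for this example.

```python
import os
import tempfile
from typing import Iterable

def save_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write an iterable of byte chunks to path; return total bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total

# Stand-in for a streamed response body (hypothetical data, no network).
fake_chunks = [b"abc", b"", b"defg"]
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = save_chunks(fake_chunks, target)
print(written)
```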

4.2 Crawling in Bulk

"""1. Fetch the whole page first
   2. Then parse the data to find what you need"""
import re
import os
import requests

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
}

res = requests.get(
    "https://pic.netbian.com/4kqiche/",
    headers=headers
)
# print(res.text)

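# Extract the image URLs; regex, xpath and bs4 can all parse data out of a page
# ret = re.findall(pattern, string)
img_url_list = re.findall(r"uploads/allimg/.*?\.jpg", res.text)  # \. matches a literal dot
print(img_url_list)

os.makedirs("./imgs", exist_ok=True)  # make sure the output directory exists

for img_url in img_url_list:
    res = requests.get("https://pic.netbian.com/"+img_url)
    # print(res.content)

    # Write each image to a file named after the last path segment
    img_name = os.path.basename(img_url)
    with open("./imgs/" + img_name, "wb") as f:
        f.write(res.content)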
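In a regular expression an unescaped `.` matches any character, so a pattern ending in `.jpg` can also match text such as `xjpg`, while `\.jpg` matches only a literal dot. The sketch below compares the two on a made-up page fragment; the real page markup may differ.

```python
import re

# Hypothetical page fragment for illustration only.
html = (
    '<img src="/uploads/allimg/240101/a1.jpg">'
    '<img src="/uploads/allimg/240102/c3.jpg">'
    '<img src="/uploads/allimg/240101/b2xjpg.png">'
)

loose = re.findall("uploads/allimg/.*?.jpg", html)     # "." matches any character
strict = re.findall(r"uploads/allimg/.*?\.jpg", html)  # "\." matches a literal dot

print(loose)
print(strict)
```

The loose pattern returns a third, truncated entry for the .png file; the strict pattern returns only the two real .jpg paths.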

Part 2. Synchronous and Concurrent Crawling, and JS Reverse Engineering in Practice

1. Fetching Short Videos Synchronously





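1. Only the playback URLs are needed, so parse the JSON data and locate the list first:


2. To collect every playback URL, use a list comprehension to walk the list and take each object, then find video inside the object, the playback address (play_addr) inside video, and the URL list (url_list) inside play_addr. The URL list contains duplicates, so only the first entry is needed.

3. Download.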
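The steps above can be sketched on a tiny, made-up response. Only the field names video, play_addr and url_list come from the steps; the top-level key aweme_list is the one used in the complete code later in this document, and the URLs are placeholders.

```python
import json

# Hypothetical, heavily trimmed response body.
raw = """
{"aweme_list": [
  {"video": {"play_addr": {"url_list": ["https://example.com/a1.mp4", "https://example.com/a1-mirror.mp4"]}}},
  {"video": {"play_addr": {"url_list": ["https://example.com/b2.mp4"]}}}
]}
"""

data = json.loads(raw)
aweme_list = data.get("aweme_list") or []

# One URL per video: the first entry of each url_list.
url_list = [aweme["video"]["play_addr"]["url_list"][0] for aweme in aweme_list]
print(url_list)
```

The `or []` guard keeps the comprehension from failing when the key is missing, which can happen if the request was rejected.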

2. Fetching Short Videos Concurrently

3. JS Reverse Engineering in Practice

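3.1 Symmetric Encryption (AES)

AES is a symmetric cipher: encryption and decryption use the same secret key. Both sides must use the same key and iv.

Common symmetric ciphers: AES, DES, 3DES. AES is the one discussed here.
Installation:

pip install pycryptodome

The most commonly used AES modes are CBC and ECB; many other modes exist, and all of them are AES encryption. The difference between the two is that ECB does not need an iv (initialization vector), while CBC does.

"""
Key length (bytes)
    16: *AES-128*
    24: *AES-192*
    32: *AES-256*

MODE: the mode of operation.
    ECB and CBC are the common ones.
    ECB: the basic mode. The plaintext is split into blocks of equal length (padded if short), and each block is encrypted on its own; the outputs are joined to form the ciphertext.
    CBC: a chaining mode. Each plaintext block is XORed with the previous ciphertext block (the iv for the first block) before it is encrypted, which makes the ciphertext harder to break.
"""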

CBC encryption example (AES-128): encrypt first, then encode

from Crypto.Cipher import AES   # Crypto is the algorithm library; Cipher holds the ciphers, and AES is the one used here
from Crypto.Util.Padding import pad     # pad is the padding helper
import base64   # Base64 encoding

key = '0123456789abcdef'.encode()  # key: AES-128 requires exactly 16 bytes
iv = 'abcdabcdabcdabcd'.encode()  # iv: always 16 bytes, the AES block size
text = 'Self-improvement is a lifelong process!'  # plaintext; its byte length must end up a multiple of 16
# while len(text.encode('utf-8')) % 16 != 0:  # manual alternative: append '\0' until the length is a multiple of 16
#     text += '\0'
text = pad(text.encode(), 16)   # pad the plaintext to a multiple of block_size=16 bytes
print("padded text:", text)

aes = AES.new(key, AES.MODE_CBC, iv)  # create the AES object from the key, the mode constant (CBC) and the iv

en_text = aes.encrypt(text)  # encrypt the padded plaintext
print("AES ciphertext:", en_text)  # b"_\xf04\x7f/R\xef\xe9\x14#q\xd8A\x12\x8e\xe3\xa5\x93\x96'zOP\xc1\x85{\xad\xc2c\xddn\x86"

en_text = base64.b64encode(en_text).decode()  # Base64-encode the returned bytes so the ciphertext can be handled as plain text
print(en_text)  #  Pwhs4f1/GxersDcWwZa6fxJTS4YfeV3FoOWvcq14jSLdG+clB/H3+kqBnAfwmZ03


CBC decryption example: decode first, then decrypt

from Crypto.Cipher import AES
import base64
from Crypto.Util.Padding import unpad

key = '0123456789abcdef'.encode()
iv = 'abcdabcdabcdabcd'.encode()
aes = AES.new(key, AES.MODE_CBC, iv)

text = 'Pwhs4f1/GxersDcWwZa6fxJTS4YfeV3FoOWvcq14jSLdG+clB/H3+kqBnAfwmZ03'.encode()  # the text to decrypt
ecrypted_base64 = base64.b64decode(text)  # Base64-decode into raw bytes
source = aes.decrypt(ecrypted_base64)  # decrypt
print("AES decrypted (still padded):", source.decode())
print("AES decrypted:", unpad(source, 16).decode())

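1. When doing AES encryption and decryption in Python, the ciphertext, plaintext, key and iv passed in must all be bytes. The AES object likewise accepts only bytes when it is constructed.

2. The key must be exactly 16, 24 or 32 bytes and the iv exactly 16 bytes; plaintext whose byte length is not a multiple of 16 must be padded.

3. A CBC cipher object keeps state between calls, so it should not be reused for a second operation. To avoid this class of error, create a new AES object for every encryption or decryption, whatever the mode.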
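To show what the padding step does, here is a pure-Python sketch of PKCS#7 padding, which is the style pycryptodome's `pad` and `unpad` use by default. The function names `pkcs7_pad` and `pkcs7_unpad` are invented for this illustration and are not part of any library.

```python
def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    """Append N bytes, each of value N, so len(result) % block_size == 0."""
    n = block_size - len(data) % block_size  # 1..block_size, never 0
    return data + bytes([n]) * n

def pkcs7_unpad(padded: bytes, block_size: int = 16) -> bytes:
    """Strip PKCS#7 padding, validating it first."""
    if not padded or len(padded) % block_size:
        raise ValueError("input is not a whole number of blocks")
    n = padded[-1]
    if n < 1 or n > block_size or padded[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return padded[:-n]

text = b"Self-improvement is a lifelong process!"  # 39 bytes
padded = pkcs7_pad(text)
print(len(text), len(padded), padded[-9:])
```

Input that is already a multiple of 16 bytes still receives a full extra block of padding, which is what lets the unpad step remove it unambiguously.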

3.2 Reverse Engineering Maomaozu (毛毛租) in Python

Maomaozu platform: https://www.maomaozu.com/#/build
3.2.1 Encryption



Key trick: set a breakpoint






Find the key and iv:

3.2.2 Decryption






import json

import requests
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad, unpad
import base64

cookies = {
    'PHPSESSID': '6rhg42ce8egfeulonevjnfj4sv',
    'Hm_lvt_6cd598ca665714ffcd8aca3aafc5e0dc': '1706941076',
    'SECKEY_ABVK': 'aJ99/mmPcgDcnVNO8MQjq74LRk9XDbNZo7uGOCGoln0%3D',
    'Hm_lpvt_6cd598ca665714ffcd8aca3aafc5e0dc': '1706941479',
    'BMAP_SECKEY': 'IN6Q3NYbpjYXemaxNcEdhP7dkIvDfrO09kOcuQx3rurdS546vjNWE-mY8RexJlLiLTvJaySgMcDcsFIr0mbjJKoCPrsissHnmXCxfpEUr4az4OxDtbb-s1bmRsoQs0yz9nVTEtFnE5dWUcYecms3m4YY8bV6rl2Sj6HvoQPViznasWG2OkGUebHlE5loh2dV',
}

headers = {
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/json; charset=UTF-8',
    # 'Cookie': 'PHPSESSID=6rhg42ce8egfeulonevjnfj4sv; Hm_lvt_6cd598ca665714ffcd8aca3aafc5e0dc=1706941076; SECKEY_ABVK=aJ99/mmPcgDcnVNO8MQjq74LRk9XDbNZo7uGOCGoln0%3D; Hm_lpvt_6cd598ca665714ffcd8aca3aafc5e0dc=1706941479; BMAP_SECKEY=IN6Q3NYbpjYXemaxNcEdhP7dkIvDfrO09kOcuQx3rurdS546vjNWE-mY8RexJlLiLTvJaySgMcDcsFIr0mbjJKoCPrsissHnmXCxfpEUr4az4OxDtbb-s1bmRsoQs0yz9nVTEtFnE5dWUcYecms3m4YY8bV6rl2Sj6HvoQPViznasWG2OkGUebHlE5loh2dV',
    'Origin': 'https://www.maomaozu.com',
    'Pragma': 'no-cache',
    'Referer': 'https://www.maomaozu.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
    'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

# # Batch crawl: request pages 1 to 9
# for i in range(1, 10):
#     data = {
#         "Type": 0,
#         "expire": 1706974966977,
#         "page": i,
#     }
#
#     key = '55b3b62613aef1a0'.encode()
#     iv = '55b3b62613aef1a0'.encode()
#     text = json.dumps(data)
#
#     text = pad(text.encode(), 16)
#     print("padded text:", text)
#
#     aes = AES.new(key, AES.MODE_CBC, iv)  # create a fresh AES object
#
#     en_text = aes.encrypt(text)  # encrypt the plaintext
#     print("AES ciphertext:", en_text)
#
#     en_text = base64.b64encode(en_text).decode()
#     print(en_text)
#
#     # data = 'i1gpLEJyKvluv3sQVGr/h6RZxT9vv00IpxineW3h2Y8GGtjqGm2Gl46nX7lTrD7H'   # encrypted value
#
#     response = requests.post('https://www.maomaozu.com/index/build.json', cookies=cookies, headers=headers,
#                              data=en_text)
#     print(response.text)

# Request payload: the three fields that get encrypted
data = {
    "Type": 0,
    "expire": 1706974966977,
    "page": 5,
}

key = '55b3b62613aef1a0'.encode()
iv = '55b3b62613aef1a0'.encode()
text = json.dumps(data)

text = pad(text.encode(), 16)
print("padded text:", text)

aes = AES.new(key, AES.MODE_CBC, iv)  # create an AES object for encryption

en_text = aes.encrypt(text)  # encrypt the plaintext
print("AES ciphertext:", en_text)

en_text = base64.b64encode(en_text).decode()
print(en_text)

# data = 'i1gpLEJyKvluv3sQVGr/h6RZxT9vv00IpxineW3h2Y8GGtjqGm2Gl46nX7lTrD7H'   # encrypted value

response = requests.post('https://www.maomaozu.com/index/build.json', cookies=cookies, headers=headers, data=en_text)
print(response.text)


# Decrypt the response; its key and iv are the request key reversed
key = "0a1fea31626b3b55".encode()
iv = "0a1fea31626b3b55".encode()
aes = AES.new(key, AES.MODE_CBC, iv)

ecrypted_base64 = base64.b64decode(response.text.encode())  # Base64-decode into raw bytes
source = aes.decrypt(ecrypted_base64)  # decrypt
print("AES decrypted (still padded):", source.decode())
print("AES decrypted:", unpad(source, 16).decode())

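Part 3. Cracking the X-Bogus Value with JS Reverse Engineering

1. Cracking the X-Bogus Value in Practice

X-Bogus: starts with DFS and is 28 characters long

The parameter to focus on is X-Bogus, because every value in the request payload is packed together to generate X-Bogus.

1.1 Locating Where X-Bogus Is Generated (Request Call Stack)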





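1.1.1 Key Trick: Advanced Breakpoints (Logpoints)

Use a logpoint to check whether the X-Bogus value shows up

Even with a logpoint there are still too many requests to read through, so turn to conditional breakpoints next

1.1.2 Key Trick: Advanced Breakpoints (Conditional Breakpoints)

1.1.3 Doing the Reverse Engineering (JS)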








2. Calling JS from Python to Get the X-Bogus Value

Installation:

pip install PyExecJS

import execjs

with open("douyin.js", encoding="utf-8") as f:
    js_data = f.read()

js_compile = execjs.compile(js_data)
xb_data = js_compile.call("window.xiaoc")
print(xb_data)


import requests
import execjs

with open("douying.js") as f:
    js_code = f.read()
js_compile = execjs.compile(js_code)
url = 'https://www.douyin.com/aweme/v1/web/aweme/post/?'
user_id = "MS4wLjABAAAA2WKmM-8lEtk72YjLLI6CFWFZRDtA_WtTUmg-5p7wHqI"
params = f"device_platform=webapp&aid=6383&channel=channel_pc_web&sec_user_id={user_id}&max_cursor=0&locate_query=false&show_live_replay_strategy=1&need_time_list=1&time_list_query=0&whale_cut_token=&cut_version=1&count=18&publish_video_strategy_type=2&pc_client_type=1&version_code=170400&version_name=17.4.0&cookie_enabled=true&screen_width=1536&screen_height=864&browser_language=zh-CN&browser_platform=Win32&browser_name=Edge&browser_version=122.0.0.0&browser_online=true&engine_name=Blink&engine_version=122.0.0.0&os_name=Windows&os_version=10&cpu_core_num=12&device_memory=8&platform=PC&downlink=10&effective_type=4g&round_trip_time=100&webid=7331003658269885952&msToken=i-THFcUZPJzlfcptH7pAamO1QadvQ88RnCYldJseXyIeYmMRC7guwnHnX0z6ENz1dxnyj-1QWQQLjqp9_pHjr8lU-MqWQ9g466pOEyefDAGUGskgcu6wkKoWNzH6"
x_b = js_compile.call("window.yuan", params)
print("xb:", x_b)

new_url = url + params + "&X-Bogus=" + x_b

headers = {
    'authority': 'www.douyin.com',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'no-cache',
    'cookie': 'ttwid=1%7CRJTTuwiJjYo8a1GXOAc0ysKVOH3AoSWjoA5U6N16pHk%7C1706882315%7C10741635cb69c8b1954456e49e85f1a8b629e3fe8266996d349e0165c31a4c5d; volume_info=%7B%22isUserMute%22%3Afalse%2C%22isMute%22%3Afalse%2C%22volume%22%3A0.7%7D; passport_csrf_token=2315da18ef8f1bd1c7d818260eae36b3; passport_csrf_token_default=2315da18ef8f1bd1c7d818260eae36b3; bd_ticket_guard_client_web_domain=2; ttcid=80c91cb353504f9fa1cbfd551625091119; SEARCH_RESULT_LIST_TYPE=%22single%22; FORCE_LOGIN=%7B%22videoConsumedRemainSeconds%22%3A180%2C%22isForcePopClose%22%3A1%7D; xgplayer_device_id=88169692726; xgplayer_user_id=821848890322; pwa2=%220%7C0%7C3%7C0%22; n_mh=j2Ixqzt56zfZqN7wQu0bsTqZmHMjmKDA8OWvBWdIfNw; passport_auth_status=ca6c3e1b67cbe85fe382b36d87909e47%2C; passport_auth_status_ss=ca6c3e1b67cbe85fe382b36d87909e47%2C; _bd_ticket_crypt_doamin=2; __security_server_data_status=1; store-region=cn-gs; store-region-src=uid; s_v_web_id=verify_ls4q16wq_gd9vvmzX_APhv_4s5s_BQWF_Kx0IHQHrC3eV; d_ticket=cc21dc4dd81ccb045d9f213894826fe819301; publish_badge_show_info=%221%2C0%2C0%2C1706883171006%22; sso_uid_tt=5ce4bf94843b5ceab602e1b8265103f2; sso_uid_tt_ss=5ce4bf94843b5ceab602e1b8265103f2; toutiao_sso_user=7c738c8c4feec8e0cd87a55f80fef383; toutiao_sso_user_ss=7c738c8c4feec8e0cd87a55f80fef383; uid_tt=2ebe791b464abe496f63029d978005c6; uid_tt_ss=2ebe791b464abe496f63029d978005c6; sid_tt=fde6df3996602ac63ddb768c5ae33686; sessionid=fde6df3996602ac63ddb768c5ae33686; sessionid_ss=fde6df3996602ac63ddb768c5ae33686; LOGIN_STATUS=1; _bd_ticket_crypt_cookie=3db9c81dd0322f1ee81e8cdfcd276366; download_guide=%223%2F20240203%2F0%22; stream_player_status_params=%22%7B%5C%22is_auto_play%5C%22%3A0%2C%5C%22is_full_screen%5C%22%3A0%2C%5C%22is_full_webscreen%5C%22%3A0%2C%5C%22is_mute%5C%22%3A0%2C%5C%22is_speed%5C%22%3A1%2C%5C%22is_visible%5C%22%3A0%7D%22; 
passport_assist_user=CkFNOzqmwHfXWRP4fyG05ko4_tavdpx5sOdpxaxaZb3KerJKAvnQ_EYX2N_zqxeqcAxJbfoIF_jeWbGmfp6nO2aTsBpKCjxUSrMbw60tg9cAtd0mQvsBdPCiBm2_h2p-EF-YIhSA10b_92HZgF0oetw9H9Av8ThB2baI3o-zq5ptELsQ377IDRiJr9ZUIAEiAQMx5xgb; sid_ucp_sso_v1=1.0.0-KDYwMjEyMmJmYzk2NzlmZDAyZWQ5NzBkNzFlYzllMjQ1ZDJlMWY5NDIKIQjN9LDFnfTrAhDaxv6tBhjvMSAMMPmV9vgFOAVA-wdIAxoCbGYiIDdjNzM4YzhjNGZlZWM4ZTBjZDg3YTU1ZjgwZmVmMzgz; ssid_ucp_sso_v1=1.0.0-KDYwMjEyMmJmYzk2NzlmZDAyZWQ5NzBkNzFlYzllMjQ1ZDJlMWY5NDIKIQjN9LDFnfTrAhDaxv6tBhjvMSAMMPmV9vgFOAVA-wdIAxoCbGYiIDdjNzM4YzhjNGZlZWM4ZTBjZDg3YTU1ZjgwZmVmMzgz; sid_guard=fde6df3996602ac63ddb768c5ae33686%7C1707058010%7C5184000%7CThu%2C+04-Apr-2024+14%3A46%3A50+GMT; sid_ucp_v1=1.0.0-KDgyY2IyM2Y3NzRhODEzZjY4YTg0MjhhZjQ2Mjg5YmYwN2U3NTgxMzMKGwjN9LDFnfTrAhDaxv6tBhjvMSAMOAVA-wdIBBoCbHEiIGZkZTZkZjM5OTY2MDJhYzYzZGRiNzY4YzVhZTMzNjg2; ssid_ucp_v1=1.0.0-KDgyY2IyM2Y3NzRhODEzZjY4YTg0MjhhZjQ2Mjg5YmYwN2U3NTgxMzMKGwjN9LDFnfTrAhDaxv6tBhjvMSAMOAVA-wdIBBoCbHEiIGZkZTZkZjM5OTY2MDJhYzYzZGRiNzY4YzVhZTMzNjg2; odin_tt=15ed412c92f86046a1f66b145483d18abe524ff517a32e287576fbbfee99abf87fb4b8ab259662e67b01f22aa06d185e; __ac_nonce=065c109e900f165569cd2; __ac_signature=_02B4Z6wo00f01ziblOAAAIDAhWWIp6lmGeM4u5BAAKvyHpOmyrU6K4rSSVRojioj55mhOG-mKTbPjs-6kQzaUaBrWFogi-SAHv3zDTMm8UbiOwX1XPxgDOTr7qpvpa12HandEuX1fzU5t3DT1d; dy_swidth=1536; dy_sheight=864; csrf_session_id=17273b27de04f7773592476475360114; strategyABtestKey=%221707149804.686%22; msToken=p6X3LJVOnOsZfhXFO3WGU-vQNTBIcLa7dg4sB05ADylBkkt9qrWQevh6eaZZU652-TDp_8f80ZNGFNlUd9ap-dzG_C74z_v6u8VA1Fg156ZmACGi1fQ=; home_can_add_dy_2_desktop=%220%22; IsDouyinActive=true; stream_recommend_feed_params=%22%7B%5C%22cookie_enabled%5C%22%3Atrue%2C%5C%22screen_width%5C%22%3A1536%2C%5C%22screen_height%5C%22%3A864%2C%5C%22browser_online%5C%22%3Atrue%2C%5C%22cpu_core_num%5C%22%3A12%2C%5C%22device_memory%5C%22%3A8%2C%5C%22downlink%5C%22%3A10%2C%5C%22effective_type%5C%22%3A%5C%224g%5C%22%2C%5C%22round_trip_time%5C%22%3A150%7D%22; 
msToken=i-THFcUZPJzlfcptH7pAamO1QadvQ88RnCYldJseXyIeYmMRC7guwnHnX0z6ENz1dxnyj-1QWQQLjqp9_pHjr8lU-MqWQ9g466pOEyefDAGUGskgcu6wkKoWNzH6; FOLLOW_NUMBER_YELLOW_POINT_INFO=%22MS4wLjABAAAAaGmtHScBtHcQitX8N9xUsNBdYVG4USCnwCrcjubaRP_o8UgL_J7Gmki9xuE6bbqL%2F1707235200000%2F0%2F0%2F1707152072578%22; bd_ticket_guard_client_data=eyJiZC10aWNrZXQtZ3VhcmQtdmVyc2lvbiI6MiwiYmQtdGlja2V0LWd1YXJkLWl0ZXJhdGlvbi12ZXJzaW9uIjoxLCJiZC10aWNrZXQtZ3VhcmQtcmVlLXB1YmxpYy1rZXkiOiJCSEd3amJDZWRrRGQvRmxIYjJJU3JuVVFERDNQakt3ZTdwaDZjWWlOR3VBTy9hUWlPbittMVpZQUNpWmJzRFJMWGxOWmp3ak04c0lURElRbVludEtqNlU9IiwiYmQtdGlja2V0LWd1YXJkLXdlYi12ZXJzaW9uIjoxfQ%3D%3D; tt_scid=B6.qvhXzjKAHGQyeSuXGsra-IPcCo.sMCBwCjy7OHMNlfgpsijh9lnDDeIxlH2QXae71; passport_fe_beating_status=true',
    'pragma': 'no-cache',
    'referer': 'https://www.douyin.com/user/MS4wLjABAAAA2WKmM-8lEtk72YjLLI6CFWFZRDtA_WtTUmg-5p7wHqI',
    'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
}

response = requests.get(
    new_url,
    headers=headers)

print(response.text)

3. Downloading the Videos

Complete code:
JS reverse-engineering crawler for Douyin short videos

import requests
import execjs
import threading
with open("douying.js") as f:
    js_code = f.read()
js_compile = execjs.compile(js_code)
url = 'https://www.douyin.com/aweme/v1/web/aweme/post/?'
user_id = "MS4wLjABAAAA2WKmM-8lEtk72YjLLI6CFWFZRDtA_WtTUmg-5p7wHqI"
params = f"device_platform=webapp&aid=6383&channel=channel_pc_web&sec_user_id={user_id}&max_cursor=0&locate_query=false&show_live_replay_strategy=1&need_time_list=1&time_list_query=0&whale_cut_token=&cut_version=1&count=18&publish_video_strategy_type=2&pc_client_type=1&version_code=170400&version_name=17.4.0&cookie_enabled=true&screen_width=1536&screen_height=864&browser_language=zh-CN&browser_platform=Win32&browser_name=Edge&browser_version=122.0.0.0&browser_online=true&engine_name=Blink&engine_version=122.0.0.0&os_name=Windows&os_version=10&cpu_core_num=12&device_memory=8&platform=PC&downlink=10&effective_type=4g&round_trip_time=100&webid=7331003658269885952&msToken=i-THFcUZPJzlfcptH7pAamO1QadvQ88RnCYldJseXyIeYmMRC7guwnHnX0z6ENz1dxnyj-1QWQQLjqp9_pHjr8lU-MqWQ9g466pOEyefDAGUGskgcu6wkKoWNzH6"
x_b = js_compile.call("window.yuan", params)
print("xb:", x_b)

new_url = url + params + "&X-Bogus=" + x_b

headers = {
    'authority': 'www.douyin.com',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'no-cache',
    'cookie': 'ttwid=1%7CRJTTuwiJjYo8a1GXOAc0ysKVOH3AoSWjoA5U6N16pHk%7C1706882315%7C10741635cb69c8b1954456e49e85f1a8b629e3fe8266996d349e0165c31a4c5d; volume_info=%7B%22isUserMute%22%3Afalse%2C%22isMute%22%3Afalse%2C%22volume%22%3A0.7%7D; passport_csrf_token=2315da18ef8f1bd1c7d818260eae36b3; passport_csrf_token_default=2315da18ef8f1bd1c7d818260eae36b3; bd_ticket_guard_client_web_domain=2; ttcid=80c91cb353504f9fa1cbfd551625091119; SEARCH_RESULT_LIST_TYPE=%22single%22; FORCE_LOGIN=%7B%22videoConsumedRemainSeconds%22%3A180%2C%22isForcePopClose%22%3A1%7D; xgplayer_device_id=88169692726; xgplayer_user_id=821848890322; pwa2=%220%7C0%7C3%7C0%22; n_mh=j2Ixqzt56zfZqN7wQu0bsTqZmHMjmKDA8OWvBWdIfNw; passport_auth_status=ca6c3e1b67cbe85fe382b36d87909e47%2C; passport_auth_status_ss=ca6c3e1b67cbe85fe382b36d87909e47%2C; _bd_ticket_crypt_doamin=2; __security_server_data_status=1; store-region=cn-gs; store-region-src=uid; s_v_web_id=verify_ls4q16wq_gd9vvmzX_APhv_4s5s_BQWF_Kx0IHQHrC3eV; d_ticket=cc21dc4dd81ccb045d9f213894826fe819301; publish_badge_show_info=%221%2C0%2C0%2C1706883171006%22; sso_uid_tt=5ce4bf94843b5ceab602e1b8265103f2; sso_uid_tt_ss=5ce4bf94843b5ceab602e1b8265103f2; toutiao_sso_user=7c738c8c4feec8e0cd87a55f80fef383; toutiao_sso_user_ss=7c738c8c4feec8e0cd87a55f80fef383; uid_tt=2ebe791b464abe496f63029d978005c6; uid_tt_ss=2ebe791b464abe496f63029d978005c6; sid_tt=fde6df3996602ac63ddb768c5ae33686; sessionid=fde6df3996602ac63ddb768c5ae33686; sessionid_ss=fde6df3996602ac63ddb768c5ae33686; LOGIN_STATUS=1; _bd_ticket_crypt_cookie=3db9c81dd0322f1ee81e8cdfcd276366; download_guide=%223%2F20240203%2F0%22; stream_player_status_params=%22%7B%5C%22is_auto_play%5C%22%3A0%2C%5C%22is_full_screen%5C%22%3A0%2C%5C%22is_full_webscreen%5C%22%3A0%2C%5C%22is_mute%5C%22%3A0%2C%5C%22is_speed%5C%22%3A1%2C%5C%22is_visible%5C%22%3A0%7D%22; 
passport_assist_user=CkFNOzqmwHfXWRP4fyG05ko4_tavdpx5sOdpxaxaZb3KerJKAvnQ_EYX2N_zqxeqcAxJbfoIF_jeWbGmfp6nO2aTsBpKCjxUSrMbw60tg9cAtd0mQvsBdPCiBm2_h2p-EF-YIhSA10b_92HZgF0oetw9H9Av8ThB2baI3o-zq5ptELsQ377IDRiJr9ZUIAEiAQMx5xgb; sid_ucp_sso_v1=1.0.0-KDYwMjEyMmJmYzk2NzlmZDAyZWQ5NzBkNzFlYzllMjQ1ZDJlMWY5NDIKIQjN9LDFnfTrAhDaxv6tBhjvMSAMMPmV9vgFOAVA-wdIAxoCbGYiIDdjNzM4YzhjNGZlZWM4ZTBjZDg3YTU1ZjgwZmVmMzgz; ssid_ucp_sso_v1=1.0.0-KDYwMjEyMmJmYzk2NzlmZDAyZWQ5NzBkNzFlYzllMjQ1ZDJlMWY5NDIKIQjN9LDFnfTrAhDaxv6tBhjvMSAMMPmV9vgFOAVA-wdIAxoCbGYiIDdjNzM4YzhjNGZlZWM4ZTBjZDg3YTU1ZjgwZmVmMzgz; sid_guard=fde6df3996602ac63ddb768c5ae33686%7C1707058010%7C5184000%7CThu%2C+04-Apr-2024+14%3A46%3A50+GMT; sid_ucp_v1=1.0.0-KDgyY2IyM2Y3NzRhODEzZjY4YTg0MjhhZjQ2Mjg5YmYwN2U3NTgxMzMKGwjN9LDFnfTrAhDaxv6tBhjvMSAMOAVA-wdIBBoCbHEiIGZkZTZkZjM5OTY2MDJhYzYzZGRiNzY4YzVhZTMzNjg2; ssid_ucp_v1=1.0.0-KDgyY2IyM2Y3NzRhODEzZjY4YTg0MjhhZjQ2Mjg5YmYwN2U3NTgxMzMKGwjN9LDFnfTrAhDaxv6tBhjvMSAMOAVA-wdIBBoCbHEiIGZkZTZkZjM5OTY2MDJhYzYzZGRiNzY4YzVhZTMzNjg2; odin_tt=15ed412c92f86046a1f66b145483d18abe524ff517a32e287576fbbfee99abf87fb4b8ab259662e67b01f22aa06d185e; __ac_nonce=065c109e900f165569cd2; __ac_signature=_02B4Z6wo00f01ziblOAAAIDAhWWIp6lmGeM4u5BAAKvyHpOmyrU6K4rSSVRojioj55mhOG-mKTbPjs-6kQzaUaBrWFogi-SAHv3zDTMm8UbiOwX1XPxgDOTr7qpvpa12HandEuX1fzU5t3DT1d; dy_swidth=1536; dy_sheight=864; csrf_session_id=17273b27de04f7773592476475360114; strategyABtestKey=%221707149804.686%22; msToken=p6X3LJVOnOsZfhXFO3WGU-vQNTBIcLa7dg4sB05ADylBkkt9qrWQevh6eaZZU652-TDp_8f80ZNGFNlUd9ap-dzG_C74z_v6u8VA1Fg156ZmACGi1fQ=; home_can_add_dy_2_desktop=%220%22; IsDouyinActive=true; stream_recommend_feed_params=%22%7B%5C%22cookie_enabled%5C%22%3Atrue%2C%5C%22screen_width%5C%22%3A1536%2C%5C%22screen_height%5C%22%3A864%2C%5C%22browser_online%5C%22%3Atrue%2C%5C%22cpu_core_num%5C%22%3A12%2C%5C%22device_memory%5C%22%3A8%2C%5C%22downlink%5C%22%3A10%2C%5C%22effective_type%5C%22%3A%5C%224g%5C%22%2C%5C%22round_trip_time%5C%22%3A150%7D%22; 
msToken=i-THFcUZPJzlfcptH7pAamO1QadvQ88RnCYldJseXyIeYmMRC7guwnHnX0z6ENz1dxnyj-1QWQQLjqp9_pHjr8lU-MqWQ9g466pOEyefDAGUGskgcu6wkKoWNzH6; FOLLOW_NUMBER_YELLOW_POINT_INFO=%22MS4wLjABAAAAaGmtHScBtHcQitX8N9xUsNBdYVG4USCnwCrcjubaRP_o8UgL_J7Gmki9xuE6bbqL%2F1707235200000%2F0%2F0%2F1707152072578%22; bd_ticket_guard_client_data=eyJiZC10aWNrZXQtZ3VhcmQtdmVyc2lvbiI6MiwiYmQtdGlja2V0LWd1YXJkLWl0ZXJhdGlvbi12ZXJzaW9uIjoxLCJiZC10aWNrZXQtZ3VhcmQtcmVlLXB1YmxpYy1rZXkiOiJCSEd3amJDZWRrRGQvRmxIYjJJU3JuVVFERDNQakt3ZTdwaDZjWWlOR3VBTy9hUWlPbittMVpZQUNpWmJzRFJMWGxOWmp3ak04c0lURElRbVludEtqNlU9IiwiYmQtdGlja2V0LWd1YXJkLXdlYi12ZXJzaW9uIjoxfQ%3D%3D; tt_scid=B6.qvhXzjKAHGQyeSuXGsra-IPcCo.sMCBwCjy7OHMNlfgpsijh9lnDDeIxlH2QXae71; passport_fe_beating_status=true',
    'pragma': 'no-cache',
    'referer': 'https://www.douyin.com/user/MS4wLjABAAAA2WKmM-8lEtk72YjLLI6CFWFZRDtA_WtTUmg-5p7wHqI',
    'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
}

response = requests.get(
    new_url,
    headers=headers)

# print(response.text)

aweme_list = response.json().get("aweme_list")

# For each item take video -> play_addr -> url_list and keep the first URL
# (the list contains duplicates; step through the response in the browser's Preview tab to find the path)
url_list = [aweme.get("video").get("play_addr").get("url_list")[0] for aweme in aweme_list]
# print(url_list)


# Download the short videos

def get_one_video(url, c):
    res = requests.get(url)
    # Write the file; the ./videos directory must already exist
    with open(f"./videos/{c}.mp4", "wb") as f:  # "w" writes text, "wb" writes bytes
        f.write(res.content)
    print(f"{c}.mp4 downloaded successfully!")


c = 1
t_list = []
for url in url_list:
    t = threading.Thread(target=get_one_video, args=(url, c))
    t.start()
    t_list.append(t)
    c += 1

for t in t_list:
    t.join()    # wait for every thread in t_list to finish before moving on
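Starting one thread per URL works for a handful of videos but puts no cap on how many downloads run at once. One possible alternative, sketched below under that assumption, is a bounded pool from `concurrent.futures`. The `fake_download` function stands in for `get_one_video` and sleeps instead of touching the network; the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url: str, index: int) -> str:
    """Stand-in for get_one_video: sleeps instead of hitting the network."""
    time.sleep(0.05)
    return f"{index}.mp4"

# Hypothetical URLs for illustration only.
url_list = [f"https://example.com/video/{i}" for i in range(1, 7)]

# max_workers caps how many downloads run at once;
# map() pairs each URL with its 1-based index and preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    saved = list(pool.map(fake_download, url_list, range(1, len(url_list) + 1)))

print(saved)
```

Leaving the `with` block waits for all tasks, which plays the same role as the join loop in the threading version.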

Part 4. Image CAPTCHAs, CAPTCHA-Solving Platforms, and Running the Site's JS Encryption Locally

1. Image CAPTCHAs

import requests

res = requests.get('https://www.gushiwen.cn/RandCode.ashx')

with open("code.png", "wb") as f:
    f.write(res.content)

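2. CAPTCHA-Solving Platforms

Site: http://www.ttshitu.com/. Open the developer documentation and choose Python. The service is paid, so log in with an account and password and top up the balance when it runs out.
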
import base64
import json
import requests


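# 1. Image text types (default 3, mixed digits and letters):
# 1 : digits only
# 1001: digits only 2
# 2 : letters only
# 1002: letters only 2
# 3 : mixed digits and letters
# 1003: mixed digits and letters 2
# 4 : flashing GIF
# 7 : adaptive learning (exclusive)
# 11 : arithmetic problem
# 1005: fast arithmetic problem
# 16 : Chinese characters
# 32 : general text recognition (ID documents, receipts)
# 66: question and answer
# 49 : reCAPTCHA image recognition
# 2. Image rotation-angle types:
# 29 : rotation
#
# 3. Image coordinate click types:
# 19 : 1 coordinate
# 20 : 3 coordinates
# 21 : 3 to 5 coordinates
# 22 : 5 to 8 coordinates
# 27 : 1 to 4 coordinates
# 48 : trajectory
#
# 4. Gap recognition
# 18 : gap recognition (needs 2 images: the target image and the gap image)
# 33 : single-gap recognition (returns the X coordinate; needs only 1 image)
# 5. Jigsaw recognition
# 53: jigsaw recognition
def base64_api(uname, pwd, img, typeid):
    with open(img, 'rb') as f:
        base64_data = base64.b64encode(f.read())
        b64 = base64_data.decode()
    data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
    result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
    if result['success']:
        return result["data"]["result"]
    else:
        # NOTE: errors such as "not enough human workers" are returned here; add handling
        # (for example a retry) so the script neither hangs nor treats the message as a result
        return result["message"]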


if __name__ == "__main__":
    img_path = "code.png"
    result = base64_api(uname='stara', pwd='050611zZ', img=img_path, typeid=3)
    print(result)
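The comment in `base64_api` asks for error handling around failed recognitions. A minimal retry sketch might look like the following. The `(ok, value)` return shape, the helper name `solve_with_retry`, and the fake replies are all assumptions made for the example; they are not the platform's API.

```python
import time

def solve_with_retry(recognize, attempts=3, delay=0.0):
    """Call recognize() until it reports success or attempts run out.

    recognize is any zero-argument callable returning (ok, value).
    """
    last_message = None
    for _ in range(attempts):
        ok, value = recognize()
        if ok:
            return value
        last_message = value
        time.sleep(delay)  # back off before retrying
    raise RuntimeError(f"recognition failed after {attempts} attempts: {last_message}")

# Fake recognizer: fails twice, then succeeds (hypothetical values).
replies = iter([(False, "workers busy"), (False, "workers busy"), (True, "7y9a")])
print(solve_with_retry(lambda: next(replies)))
```

Raising after the last attempt keeps an error message from being mistaken for a CAPTCHA answer further down the script.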

3. A JS Reverse Engineering Case Study

Yipin Weike (一品威客) site: https://www.epwk.com/login.html

Packet capture analysis

Copy the request as cURL and write the crawler code

import requests

cookies = {
    'Hm_lvt_387b8f4fdb89d4ea233922bdc6466394': '1709892079',
    'PHPSESSID': 'a3c14f78e68b82a9c91d890fcc45b15d313e35f4',
    'time_diff': '1',
    'XDEBUG_SESSION': 'XDEBUG_ECLIPSE',
    'adbanner_city': '%E5%85%B0%E5%B7%9E%E5%B8%82',
    'Hm_lpvt_387b8f4fdb89d4ea233922bdc6466394': '1709894677',
    'login_fail_need_graphics': '0',
}

headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Access-Token': '',
    'App-Id': '4ac490420ac63db4',
    'App-Ver': '',
    'CHOST': 'www.epwk.com',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    # 'Cookie': 'Hm_lvt_387b8f4fdb89d4ea233922bdc6466394=1709892079; PHPSESSID=a3c14f78e68b82a9c91d890fcc45b15d313e35f4; time_diff=1; XDEBUG_SESSION=XDEBUG_ECLIPSE; adbanner_city=%E5%85%B0%E5%B7%9E%E5%B8%82; Hm_lpvt_387b8f4fdb89d4ea233922bdc6466394=1709894677; login_fail_need_graphics=0',
    'Device-Os': 'web',
    'Device-Ver': '',
    'Imei': '',
    'NonceStr': '1709894743bnhju',
    'Origin': 'https://www.epwk.com',
    'Os-Ver': '',
    'Pragma': 'no-cache',
    'Referer': 'https://www.epwk.com/login.html',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'Signature': 'lvSv16nUQ71tdqkhzS/g7l/HiQXeib5mZIAyFnBLrLhhiZkoGTiV8OXfe6aqMvgY',
    'Timestemp': '1709894743',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0',
    'X-REQUEST-ID': 'ed47d61b77278235b96b8f6c92a54810',
    'sec-ch-ua': '"Chromium";v="124", "Microsoft Edge";v="124", "Not-A.Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

data = {
    'username': 'eeeeeeeeeeeeeeeeeeee',
    'password': 'eeeeeeeeeeeeeeeeeeeeeeee',
    'code': '7y9a',
    'hdn_refer': '',
}

response = requests.post('https://www.epwk.com/api/epwk/v1/user/login', cookies=cookies, headers=headers, data=data)
print(response.text)

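3.1 Finding How the Signature Value Is Generated (Keyword Search)

Signature is presumably produced by a dedicated function, so the code should contain something shaped like Signature = someFunction(). Searching the sources for the keyword Signature is a quick way to find it.

Set a breakpoint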




3.2 Implementing the JS Reverse Engineering












JS source code:

const cryptojs = require("crypto-js")

l = {
    key: cryptojs.enc.Utf8.parse("fX@VyCQVvpdj8RCa"),
    iv: cryptojs.enc.Utf8.parse(function (t) {
        for (var e = "", i = 0; i < t.length - 1; i += 2) {
            var n = parseInt(t[i] + "" + t[i + 1], 16);
            e += String.fromCharCode(n)
        }
        return e
    }("00000000000000000000000000000000"))
}
    , v = function (data) {
    return function (data) {
        return cryptojs.AES.encrypt(data, l.key, {
            iv: l.iv,
            mode: cryptojs.mode.CBC,
            padding: cryptojs.pad.Pkcs7
        }).toString()
    }(data)
}
    , d = function (data) {
    return cryptojs.MD5(data).toString()
}
    , f = function (t) {
    var e = "";
    return Object.keys(t).sort().forEach((function (n) {
            e += n + ("object" === typeof (t[n]) ? JSON.stringify(t[n], (function (t, e) {
                    return "number" == typeof e && (e = String(e)),
                        e
                }
            )).replace(/\//g, "\\/") : t[n])
        }
    )),
        e
}

h = function (t) {
    var data = arguments.length > 1 && void 0 !== arguments[1] ? arguments[1] : {}
        , e = arguments.length > 2 && void 0 !== arguments[2] ? arguments[2] : "a75846eb4ac490420ac63db46d2a03bf"
        , n = e + f(data) + f(t) + e;
    return n = d(n),
        n = v(n)
}

U = {
    "App-Ver": "",
    "Os-Ver": "",
    "Device-Ver": "",
    "Imei": "",
    "Access-Token": "",
    "Timestemp": 1709898581,
    "NonceStr": "1709898581goio1",
    "App-Id": "4ac490420ac63db4",
    "Device-Os": "web"
}

M = {
    "username": "eeeeeeeeeeeeeeeeeeee",
    "password": "eeeeeeeeeeee",
    "code": "n5a7",
    "hdn_refer": ""
}

C = 'a75846eb4ac490420ac63db46d2a03bf'

console.log(h(U, M, C))
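The first half of h() can be mirrored in Python with only the standard library. The sketch below is an illustration, not the site's code: flatten() follows the JS f() (sort the keys, append key and value, JSON-encode nested objects with numbers turned into strings and / escaped), and signature_digest() follows the e + f(data) + f(t) + e concatenation and the MD5 step. Both function names are made up here. The final AES-CBC step (key fX@VyCQVvpdj8RCa, all-zero IV, PKCS7 padding) is left out because it would need a third-party AES library.

```python
import hashlib
import json

# Default value of `e` in the JS h() function.
APP_SECRET = "a75846eb4ac490420ac63db46d2a03bf"


def _numbers_to_str(value):
    """Mimic the JSON.stringify replacer: numbers become strings, recursively."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Assumption: integers only. str(1.0) is "1.0" in Python but "1" in JS.
        return str(value)
    if isinstance(value, dict):
        return {k: _numbers_to_str(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_numbers_to_str(v) for v in value]
    return value


def flatten(params):
    """Hypothetical Python counterpart of the JS f(): sorted key+value concatenation."""
    parts = []
    for key in sorted(params):
        value = params[key]
        if isinstance(value, (dict, list)):
            text = json.dumps(_numbers_to_str(value), separators=(",", ":"), ensure_ascii=False)
            text = text.replace("/", "\\/")
        else:
            text = str(value)
        parts.append(key + text)
    return "".join(parts)


def signature_digest(headers, form, secret=APP_SECRET):
    """MD5 hex digest of secret + f(form) + f(headers) + secret, before AES."""
    raw = secret + flatten(form) + flatten(headers) + secret
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


print(signature_digest({"Timestemp": 1709898581, "Device-Os": "web"}, {"code": "n5a7"}))
```

Comparing this digest with the value d(n) produces inside the JS, at the breakpoint, is a quick way to confirm that the parameter ordering was understood correctly before moving on to the AES step.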
python
import requests
import execjs
import json


def base64_api(base64_img, typeid=3):
    """Send a base64-encoded captcha image to the ttshitu API and return the recognized text."""
    data = {"username": "yuan0316", "password": "yuan0316", "typeid": typeid, "image": base64_img}
    result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
    if result['success']:
        return result["data"]["result"]
    # Note: on failure (for example "not enough human workers") the API returns an error
    # message. Add handling here, such as retrying, so the script does not stall.
    return result["message"]


with open("yipinweike.js") as f:
    js_code = f.read()

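js_compile = execjs.compile(js_code)

# The captcha request carries no form data, so the headers are signed over an empty dict.
data = {

}
# "fn" must be a function defined in yipinweike.js; here it is expected to return the signed request headers.
headers = js_compile.call("fn", data)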
print(headers)

url = "https://www.epwk.com/api/epwk/v1/captcha/show?channel=common_channel&base64=1"
res = requests.get(url, headers=headers)
base64_img = res.json().get("data").get("base64")
code = base64_api(base64_img)
print(code)

data = {
    "username": "121232312",
    "password": "1231232131",
    "code": code,
    "hdn_refer": "https://www.epwk.com/"
}
headers = js_compile.call("fn", data)
print(headers)

url = "https://www.epwk.com/api/epwk/v1/user/login"

res = requests.post(url, headers=headers, data=data)
print(res.text)
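The comment in base64_api asks for retry logic when recognition fails. A minimal sketch of such a wrapper follows. It assumes the recognizer either raises or returns an empty value on failure; base64_api as written returns the error message instead, so it would have to be changed to raise before being wrapped. The name recognize_with_retry and its parameters are hypothetical.

```python
import time


def recognize_with_retry(recognize, image, attempts=3, delay=1.0):
    """Call recognize(image) until it returns a non-empty result or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = recognize(image)
            if result:
                return result
            last_error = ValueError("empty recognition result")
        except Exception as exc:  # network errors, malformed JSON, and so on
            last_error = exc
        if attempt < attempts:
            time.sleep(delay)  # brief pause before the next attempt
    raise RuntimeError(f"captcha recognition failed after {attempts} attempts") from last_error


# Usage with a stand-in recognizer that fails once, then succeeds.
_calls = []


def _fake_recognizer(image):
    _calls.append(image)
    if len(_calls) < 2:
        raise ConnectionError("service busy")
    return "7y9a"


print(recognize_with_retry(_fake_recognizer, "base64-data", delay=0))
```

Capping the number of attempts keeps a permanently failing service from blocking the login flow forever.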

V. Front-end JS: environment setup and running JS with node.js and pyexecjs

1. Front-end JS basics

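  • Ternary operator
javascript
v1 = condition ? valueA : valueB;    // v1 = valueA if the condition is true, otherwise valueB

res = 1 === 1 ? 99 : 88              // res = 99
  • Logical OR returns one of its operands
javascript
v1 = 1 === 1 || 2 === 2              // true
v2 = 9 || 14                         // 9
v3 = 0 || 15                         // 15
v4 = 0 || 15 || "zhangfei"           // 15
  • Assignment inside a comparison
javascript
v1 = 11 === (n = 123)                // false: n becomes 123, then 11 === 123 is evaluated
  • Example:
javascript
v1 = 1 > (n = 2) || 1 === 1 ? 9 : 8

// Analysis: (1 > 2 || 1 === 1) is true, so
// n = 2
// v1 = 9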
javascript
var o = (null === (n = window.byted_acrawler) || void 0 === n ? void 0 : null === (a = n.sign) || void 0 === a ? void 0 : a.call(n, i)) || "";

// void 0 evaluates to undefined.
// Analysis, assuming window.byted_acrawler and window.byted_acrawler.sign are both defined:
var o = (null === (n = window.byted_acrawler) || void 0 === n ? void 0 : null === (a = n.sign) || void 0 === a ? void 0 : a.call(n, i)) || "";

// Both null/undefined checks fail, so only the final branch remains:
var o = window.byted_acrawler.sign.call(n,i) || ""

// Assuming sign returns a non-empty value, the || "" fallback is never used:
var o = window.byted_acrawler.sign.call(n,i)
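  • Calling functions
javascript
function sign(v1){
    // this is available inside the function
    console.log(v1);
}
// Plain call: in non-strict mode, this inside the function is the global object (window in a browser)
sign(123)              // prints 123

// call() binds its first argument to this (here 123) and passes the rest as parameters
sign.call(123,456)     // prints 456
javascript
// n is passed to call() and becomes this inside sign
// i is passed as the argument
var o = window.byted_acrawler.sign.call(n,i)
// Since n is window.byted_acrawler, this is equivalent to:
var o = window.byted_acrawler.sign(i)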
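  • Extra: constructor functions
javascript
// Before ES6 introduced the class syntax, JavaScript emulated classes with constructor functions
function Person(name,age){
    this.name=name;
    this.age = age
}

obj = new Person("Zhang Fei",123)
  • Function arguments
javascript
function sign(){
    console.log(arguments)
}

sign()        
sign(11,22,33)
sign(11,22,44,55)
Although sign declares no parameters, arguments can still be passed; they are available through the arguments object.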
  • Merging objects (used later to patch in a JS environment)
javascript
v1 = { k1: 123 }
v2 = { k2:99, k3:888}

Object.assign(v1,v2)    // copies every property of v2 into v1, much like dict.update in Python

console.log(v1)         // { k1: 123, k2: 99, k3: 888 }
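For readers coming from Python, the JS idioms in this list have close counterparts. The sketch below lines them up one by one. The class Acrawler and the helper names are made up for illustration. One difference to keep in mind: JS treats empty arrays and objects as truthy, while Python treats empty lists and dicts as falsy, so || and or do not always agree.

```python
# Ternary: condition ? a : b
res = 99 if 1 == 1 else 88

# || returns the first truthy operand, like Python's `or`
v2 = 9 or 14
v3 = 0 or 15
v4 = 0 or 15 or "zhangfei"

# Assignment inside an expression: (n = 123) in JS, (n := 123) in Python 3.8+
v1 = 11 == (n := 123)

# The combined example: 1 > (n = 2) || 1 === 1 ? 9 : 8
v5 = 9 if 1 > (m := 2) or 1 == 1 else 8


# sign.call(obj, arg): the receiver is passed explicitly
class Acrawler:
    def sign(self, payload):
        return f"signed:{payload}"


crawler = Acrawler()
same = Acrawler.sign(crawler, "x") == crawler.sign("x")


# The arguments object is roughly *args
def sign(*args):
    return args


# Object.assign(v1, v2) is roughly dict.update
d1 = {"k1": 123}
d1.update({"k2": 99, "k3": 888})

print(res, v2, v3, v4, v1, n, v5, same, sign(11, 22, 33), d1)
```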

2. Running JS code

2.1 Running code with node.js

  • v1.js
javascript
function func(arg) {
    return arg + 'i666';
}
let data = func("老铁");
console.log(data)
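  • Run it with node: node v1.js

  • Run the same local command, node v1.js, from Python

python
import os
import subprocess

# Adjust for your operating system. NODE_PATH plays the role of Python's sys.path: it tells node where installed modules are loaded from.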
os.environ["NODE_PATH"] = "/usr/local/lib/node_modules/"  

signature = subprocess.getoutput('node v1.js')
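subprocess.getoutput returns stdout and stderr mixed together and ignores the exit status, so a node error message can silently end up being used as the "signature". A slightly stricter variant, sketched below with the hypothetical helper run_command, uses subprocess.run so that a failing script raises instead. The demo call uses the Python interpreter as a stand-in so the sketch runs even where node is not installed.

```python
import os
import subprocess
import sys


def run_command(cmd, node_path=None, timeout=30):
    """Run cmd (a list of arguments) and return its stdout, raising on a non-zero exit."""
    env = dict(os.environ)
    if node_path:
        env["NODE_PATH"] = node_path  # where node looks for globally installed modules
    completed = subprocess.run(
        cmd,
        capture_output=True,  # keep stdout and stderr separate
        text=True,
        env=env,
        timeout=timeout,
        check=True,           # raise CalledProcessError on failure
    )
    return completed.stdout.strip()


# Stand-in command; with node installed this would be, for example:
# run_command(["node", "v1.js"], node_path="/usr/local/lib/node_modules/")
print(run_command([sys.executable, "-c", "print('ok')"]))
```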

2.2 Running code with pyexecjs

Prerequisites:

  • node.js
  • the pyexecjs module
shell
pip install pyexecjs
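Both prerequisites can be checked from Python before any JS is run. The helper below is a small sketch with a made-up name, check_js_toolchain. It only reports whether a node executable is on PATH and whether the execjs module (installed by the pyexecjs package) can be imported.

```python
import importlib.util
import shutil


def check_js_toolchain():
    """Report whether node and the execjs module are available."""
    return {
        "node": shutil.which("node") is not None,
        "execjs": importlib.util.find_spec("execjs") is not None,
    }


print(check_js_toolchain())
```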

For example:

  • v2.js
javascript
function func(arg) {
    return arg + '666';
}
  • Run the JS code
python
import execjs
import os

os.environ["NODE_PATH"] = "/usr/local/lib/node_modules/"
with open('v2.js', mode='r', encoding='utf-8') as f:
    js = f.read()

JS = execjs.compile(js)

sign = JS.call("func", "微信")
print(sign) # 微信666

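node.js: installing node.js (the JS runtime, comparable to installing the CPython interpreter) also installs npm (the third-party package manager, comparable to pip).

2.3 Browser environment

Some JS code fails when it is copied out of a site and run elsewhere, because it expects a browser environment that has to be simulated.

Prerequisites

  • node.js
  • jsdom (simulates a browser environment from JS code running on node)
shell
npm install node-gyp@latest
sudo npm explore -g npm -- npm i node-gyp@latest
npm install jsdom -g   # -g installs globally

Note: once the packages above are installed, a browser environment can be simulated. The Toutiao example later in this section also appears to need the canvas package:

shell
npm install canvas -g

Method 1: v10.js

javascript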
const jsdom = require("jsdom");
const {JSDOM} = jsdom;

const resourceLoader = new jsdom.ResourceLoader({
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
});

const html = `<!DOCTYPE html><p>Hello world</p>`;


const dom = new JSDOM(html, {
    url: "https://www.toutiao.com",
    referrer: "https://example.com/",
    contentType: "text/html",
    resources: resourceLoader,
});

console.log(dom.window.location)
console.log(dom.window.navigator.userAgent)
console.log(dom.window.document.referrer)
python
import os
import subprocess

# Adjust for your operating system. NODE_PATH plays the role of Python's sys.path: it tells node where installed modules are loaded from.
os.environ["NODE_PATH"] = "/usr/local/lib/node_modules/"  

res = subprocess.getoutput('node v10.js')

Method 2: patching globals by hand when the environment cannot be supplied otherwise

javascript
const jsdom = require("jsdom");
const {JSDOM} = jsdom;

const resourceLoader = new jsdom.ResourceLoader({
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
});

const html = `<!DOCTYPE html><p>Hello world</p>`;
const dom = new JSDOM(html, {
    url: "https://www.toutiao.com",
    referrer: "https://example.com/",
    contentType: "text/html",
    resources: resourceLoader,
});

/*
console.log(dom.window.location)
console.log(dom.window.navigator.userAgent)
console.log(dom.window.document.referrer)
*/

window = global;

const params = {
    location: {
        hash: "",
        host: "www.toutiao.com",
        hostname: "www.toutiao.com",
        href: "https://www.toutiao.com",
        origin: "https://www.toutiao.com",
        pathname: "/",
        port: "",
        protocol: "https:",
        search: "",
    },
    navigator: {
        appCodeName: "Mozilla",
        appName: "Netscape",
        appVersion: "5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
        cookieEnabled: true,
        deviceMemory: 8,
        doNotTrack: null,
        hardwareConcurrency: 4,
        language: "zh-CN",
        languages: ["zh-CN", "zh"],
        maxTouchPoints: 0,
        onLine: true,
        platform: "MacIntel",
        product: "Gecko",
        productSub: "20030107",
        userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
        vendor: "Google Inc.",
        vendorSub: "",
        webdriver: false
    }
};

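Object.assign(global,params)    // location and navigator are now global variables


// Code that follows can use location.href and navigator.appCodeName directly.
// Because of window = global above, window.location.href and window.navigator.appCodeName resolve as well.

Note: node.js provides a built-in global object named global; properties set on it become global variables.

javascript
v1 = 123;  // an undeclared assignment creates a global variable, equivalent to global.v1 = 123
console.log(global);
javascript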
global.v1 = 123
global.v2 = 123
global.navigator = {
	...
}
console.log(v1,v2);

navigator.userAgent

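3. Toutiao (implemented with node.js)

3.1 Analyzing the request


Sending the request directly returns the result:

python
import requests
# The _signature query parameter at the end of the URL is the signature to reproduce: _02B4Z6wo009010IJgRwAAIDDtGCIOlEVa8tCLYWAALV5CV7lvAp2MWxOhC9EGgecK8orbBZu.elV57IoxY70Cqa8TI2XW0z.U3dOc84bBFDE83277HsB4oykmNYgkYd-9NbV8enDst.RVEBu76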
res = requests.get(
    url="https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web&_signature=_02B4Z6wo009010IJgRwAAIDDtGCIOlEVa8tCLYWAALV5CV7lvAp2MWxOhC9EGgecK8orbBZu.elV57IoxY70Cqa8TI2XW0z.U3dOc84bBFDE83277HsB4oykmNYgkYd-9NbV8enDst.RVEBu76",
    headers={
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
    }
)

print(res.text)
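Splitting the URL into its query parameters makes it easier to see which parts are fixed and which one is the signature. The sketch below uses only urllib.parse. The helper name split_request and the placeholder value EXAMPLE_SIGNATURE are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def split_request(url):
    """Return the path and a dict of query parameters (first value of each)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    params = {name: values[0] for name, values in query.items()}
    return parts.path, params


url = (
    "https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395"
    "&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24"
    "&app_name=toutiao_web&_signature=EXAMPLE_SIGNATURE"  # placeholder signature
)

path, params = split_request(url)
print(path)
for name, value in params.items():
    print(f"{name} = {value}")
```

Everything except _signature can be copied as-is from the browser; only the signature has to be regenerated.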

3.2 Locating _signature (the signature expires, so an old value may stop working, for example after switching to the Sports channel)

javascript
var o = (null === (n = window.byted_acrawler) || void 0 === n || null === (a = n.sign) || void 0 === a ? void 0 : a.call(n, o)) || ""


javascript
n=undefined;
i={url:"https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web"}
var o = window.byted_acrawler.sign.call(n,i);

Simplified once more:

javascript
i={url:"https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web"}
var o = window.byted_acrawler.sign(i);
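  • Try to locate the sign algorithm and read its internal implementation (this turns out to be a dead end).
  • Some JS file must assign byted_acrawler to a global variable.
  • So run it as a whole instead: paste that JS file in and, once it has loaded, use the value it assigned.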

3.3 Verifying that the signature works

Concatenate the pieces: url + "&_signature=" + signature

text
https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web&_signature=_02B4Z6wo00f01uuvg2AAAIDCHcaKRkpygJbrh4fAAN8p4e
python
import requests

res = requests.get(
    url="https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web&_signature=_02B4Z6wo00f01uuvg2AAAIDCHcaKRkpygJbrh4fAAN8p4e",
    headers={
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
    }
)

print(res.text)
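Building the signed URL by hand makes it easy to leave a stray space or a second _signature parameter in it. A small sketch of doing the concatenation with urllib.parse instead follows. The helper name with_signature is hypothetical, and the signature values are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_signature(url, signature):
    """Return url with exactly one _signature parameter, set to signature."""
    parts = urlsplit(url)
    # Drop any signature already present, then append the new one.
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "_signature"
    ]
    query.append(("_signature", signature))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.toutiao.com/api/pc/list/feed?offset=0&aid=24&_signature=old"
print(with_signature(base, "new.sig-1"))
```

The returned string can be passed straight to requests.get as the url argument.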

3.4 Running with a patched environment


v20.js

javascript
const jsdom = require("jsdom");
const {JSDOM} = jsdom;

const resourceLoader = new jsdom.ResourceLoader({
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
});

const html = `<!DOCTYPE html><p>Hello world</p>`;
const dom = new JSDOM(html, {
    url: "https://www.toutiao.com",
    referrer: "https://example.com/",
    contentType: "text/html",
    resources: resourceLoader,
});

/*
console.log(dom.window.location)
console.log(dom.window.navigator.userAgent)
console.log(dom.window.document.referrer)
*/

// Added after an error: the script reads referrer from document, so expose document as a global variable
document = dom.window.document



window = global;

const params = {
    location: {
        hash: "",
        host: "www.toutiao.com",
        hostname: "www.toutiao.com",
        href: "https://www.toutiao.com",
        origin: "https://www.toutiao.com",
        pathname: "/",
        port: "",
        protocol: "https:",
        search: "",
    },
    navigator: {
        appCodeName: "Mozilla",
        appName: "Netscape",
        appVersion: "5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
        cookieEnabled: true,
        deviceMemory: 8,
        doNotTrack: null,
        hardwareConcurrency: 4,
        language: "zh-CN",
        languages: ["zh-CN", "zh"],
        maxTouchPoints: 0,
        onLine: true,
        platform: "MacIntel",
        product: "Gecko",
        productSub: "20030107",
        userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
        vendor: "Google Inc.",
        vendorSub: "",
        webdriver: false
    }
};

Object.assign(global,params)

// Everything above patches the environment by hand


// var glb;
// (glb = "undefined" == typeof window ? global : window)
//  In a browser, the function below creates a global variable byted_acrawler; under node it ends up on global
window._$jsvmprt = function(b, e, f) {
    function a() {
        if ("undefined" == typeof Reflect || !Reflect.construct)
            return !1;
        if (Reflect.construct.sham)
            return !1;
        if ("function" == typeof Proxy)
            return !0;
        try {
            return Date.prototype.toString.call(Reflect.construct(Date, [], (function() {}
            ))),
            !0
        } catch (b) {
            return !1
        }
    }
    function d(b, e, f) {
        return (d = a() ? Reflect.construct : function(b, e, f) {
            var a = [null];
            a.push.apply(a, e);
            var d = new (Function.bind.apply(b, a));
            return f && c(d, f.prototype),
            d
        }
        ).apply(null, arguments)
    }
    function c(b, e) {
        return (c = Object.setPrototypeOf || function(b, e) {
            return b.__proto__ = e,
            b
        }
        )(b, e)
    }
    function n(b) {
        return function(b) {
            if (Array.isArray(b)) {
                for (var e = 0, f = new Array(b.length); e < b.length; e++)
                    f[e] = b[e];
                return f
            }
        }(b) || function(b) {
            if (Symbol.iterator in Object(b) || "[object Arguments]" === Object.prototype.toString.call(b))
                return Array.from(b)
        }(b) || function() {
            throw new TypeError("Invalid attempt to spread non-iterable instance")
        }()
    }
    for (var i = [], r = 0, t = [], o = 0, l = function(b, e) {
        var f = b[e++]
          , a = b[e]
          , d = parseInt("" + f + a, 16);
        if (d >> 7 == 0)
            return [1, d];
        if (d >> 6 == 2) {
            var c = parseInt("" + b[++e] + b[++e], 16);
            return d &= 63,
            [2, c = (d <<= 8) + c]
        }
        if (d >> 6 == 3) {
            var n = parseInt("" + b[++e] + b[++e], 16)
              , i = parseInt("" + b[++e] + b[++e], 16);
            return d &= 63,
            [3, i = (d <<= 16) + (n <<= 8) + i]
        }
    }, u = function(b, e) {
        var f = parseInt("" + b[e] + b[e + 1], 16);
        return f = f > 127 ? -256 + f : f
    }, s = function(b, e) {
        var f = parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3], 16);
        return f = f > 32767 ? -65536 + f : f
    }, p = function(b, e) {
        var f = parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3] + b[e + 4] + b[e + 5] + b[e + 6] + b[e + 7], 16);
        return f = f > 2147483647 ? 0 + f : f
    }, y = function(b, e) {
        return parseInt("" + b[e] + b[e + 1], 16)
    }, v = function(b, e) {
        return parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3], 16)
    }, g = g || this || window, h = Object.keys || function(b) {
        var e = {}
          , f = 0;
        for (var a in b)
            e[f++] = a;
        return e.length = f,
        e
    }
    , m = (b.length,
    0), I = "", C = m; C < m + 16; C++) {
        var q = "" + b[C++] + b[C];
        q = parseInt(q, 16),
        I += String.fromCharCode(q)
    }
    if ("HNOJ@?RC" != I)
        throw new Error("error magic number " + I);
    m += 16;
    parseInt("" + b[m] + b[m + 1], 16);
    m += 8,
    r = 0;
    for (var w = 0; w < 4; w++) {
        var S = m + 2 * w
          , R = "" + b[S++] + b[S]
          , x = parseInt(R, 16);
        r += (3 & x) << 2 * w
    }
    m += 16,
    m += 8;
    var z = parseInt("" + b[m] + b[m + 1] + b[m + 2] + b[m + 3] + b[m + 4] + b[m + 5] + b[m + 6] + b[m + 7], 16)
      , O = z
      , E = m += 8
      , j = v(b, m += z);
    j[1];
    m += 4,
    i = {
        p: [],
        q: []
    };
    for (var A = 0; A < j; A++) {
        for (var D = l(b, m), T = m += 2 * D[0], $ = i.p.length, P = 0; P < D[1]; P++) {
            var U = l(b, T);
            i.p.push(U[1]),
            T += 2 * U[0]
        }
        m = T,
        i.q.push([$, i.p.length])
    }
    var _ = {
        5: 1,
        6: 1,
        70: 1,
        22: 1,
        23: 1,
        37: 1,
        73: 1
    }
      , k = {
        72: 1
    }
      , M = {
        74: 1
    }
      , H = {
        11: 1,
        12: 1,
        24: 1,
        26: 1,
        27: 1,
        31: 1
    }
      , J = {
        10: 1
    }
      , N = {
        2: 1,
        29: 1,
        30: 1,
        20: 1
    }
      , B = []
      , W = [];
    function F(b, e, f) {
        for (var a = e; a < e + f; ) {
            var d = y(b, a);
            B[a] = d,
            a += 2;
            k[d] ? (W[a] = u(b, a),
            a += 2) : _[d] ? (W[a] = s(b, a),
            a += 4) : M[d] ? (W[a] = p(b, a),
            a += 8) : H[d] ? (W[a] = y(b, a),
            a += 2) : J[d] ? (W[a] = v(b, a),
            a += 4) : N[d] && (W[a] = v(b, a),
            a += 4)
        }
    }
    return K(b, E, O / 2, [], e, f);
    function G(b, e, f, a, c, l, m, I) {
        null == l && (l = this);
        var C, q, w, S = [], R = 0;
        m && (C = m);
        var x, z, O = e, E = O + 2 * f;
        if (!I)
            for (; O < E; ) {
                var j = parseInt("" + b[O] + b[O + 1], 16);
                O += 2;
                var A = 3 & (x = 13 * j % 241);
                if (x >>= 2,
                A < 1) {
                    A = 3 & x;
                    if (x >>= 2,
                    A > 2)
                        (A = x) > 10 ? S[++R] = void 0 : A > 1 ? (C = S[R--],
                        S[R] = S[R] >= C) : A > -1 && (S[++R] = null);
                    else if (A > 1) {
                        if ((A = x) > 11)
                            throw S[R--];
                        if (A > 7) {
                            for (C = S[R--],
                            z = v(b, O),
                            A = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            O += 4,
                            S[R--][A] = C
                        } else
                            A > 5 && (S[R] = h(S[R]))
                    } else if (A > 0) {
                        (A = x) > 8 ? (C = S[R--],
                        S[R] = typeof C) : A > 6 ? S[R] = --S[R] : A > 4 ? S[R -= 1] = S[R][S[R + 1]] : A > 2 && (q = S[R--],
                        (A = S[R]).x === G ? A.y >= 1 ? S[R] = K(b, A.c, A.l, [q], A.z, w, null, 1) : (S[R] = K(b, A.c, A.l, [q], A.z, w, null, 0),
                        A.y++) : S[R] = A(q))
                    } else {
                        if ((A = x) > 14)
                            z = s(b, O),
                            (U = function e() {
                                var f = arguments;
                                return e.y > 0 ? K(b, e.c, e.l, f, e.z, this, null, 0) : (e.y++,
                                K(b, e.c, e.l, f, e.z, this, null, 0))
                            }
                            ).c = O + 4,
                            U.l = z - 2,
                            U.x = G,
                            U.y = 0,
                            U.z = c,
                            S[R] = U,
                            O += 2 * z - 2;
                        else if (A > 12)
                            q = S[R--],
                            w = S[R--],
                            (A = S[R--]).x === G ? A.y >= 1 ? S[++R] = K(b, A.c, A.l, q, A.z, w, null, 1) : (S[++R] = K(b, A.c, A.l, q, A.z, w, null, 0),
                            A.y++) : S[++R] = A.apply(w, q);
                        else if (A > 5)
                            C = S[R--],
                            S[R] = S[R] != C;
                        else if (A > 3)
                            C = S[R--],
                            S[R] = S[R] * C;
                        else if (A > -1)
                            return [1, S[R--]]
                    }
                } else if (A < 2) {
                    A = 3 & x;
                    if (x >>= 2,
                    A < 1) {
                        if ((A = x) > 9)
                            ;
                        else if (A > 7)
                            C = S[R--],
                            S[R] = S[R] & C;
                        else if (A > 5)
                            z = y(b, O),
                            O += 2,
                            S[R -= z] = 0 === z ? new S[R] : d(S[R], n(S.slice(R + 1, R + z + 1)));
                        else if (A > 3) {
                            z = s(b, O);
                            try {
                                if (t[o][2] = 1,
                                1 == (C = G(b, O + 4, z - 3, [], c, l, null, 0))[0])
                                    return C
                            } catch (m) {
                                if (t[o] && t[o][1] && 1 == (C = G(b, t[o][1][0], t[o][1][1], [], c, l, m, 0))[0])
                                    return C
                            } finally {
                                if (t[o] && t[o][0] && 1 == (C = G(b, t[o][0][0], t[o][0][1], [], c, l, null, 0))[0])
                                    return C;
                                t[o] = 0,
                                o--
                            }
                            O += 2 * z - 2
                        }
                    } else if (A < 2) {
                        if ((A = x) > 12)
                            S[++R] = u(b, O),
                            O += 2;
                        else if (A > 10)
                            C = S[R--],
                            S[R] = S[R] << C;
                        else if (A > 8) {
                            for (z = v(b, O),
                            A = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            O += 4,
                            S[R] = S[R][A]
                        } else
                            A > 6 && (q = S[R--],
                            C = delete S[R--][q])
                    } else if (A < 3) {
                        (A = x) < 2 ? S[++R] = C : A < 4 ? (C = S[R--],
                        S[R] = S[R] <= C) : A < 11 ? (C = S[R -= 2][S[R + 1]] = S[R + 2],
                        R--) : A < 13 && (C = S[R],
                        S[++R] = C)
                    } else {
                        if ((A = x) > 12)
                            S[++R] = l;
                        else if (A > 5)
                            C = S[R--],
                            S[R] = S[R] !== C;
                        else if (A > 3)
                            C = S[R--],
                            S[R] = S[R] / C;
                        else if (A > 1) {
                            if ((z = s(b, O)) < 0) {
                                I = 1,
                                F(b, e, 2 * f),
                                O += 2 * z - 2;
                                break
                            }
                            O += 2 * z - 2
                        } else
                            A > -1 && (S[R] = !S[R])
                    }
                } else if (A < 3) {
                    A = 3 & x;
                    if (x >>= 2,
                    A > 2)
                        (A = x) > 7 ? (C = S[R--],
                        S[R] = S[R] | C) : A > 5 ? (z = y(b, O),
                        O += 2,
                        S[++R] = c["$" + z]) : A > 3 && (z = s(b, O),
                        t[o][0] && !t[o][2] ? t[o][1] = [O + 4, z - 3] : t[o++] = [0, [O + 4, z - 3], 0],
                        O += 2 * z - 2);
                    else if (A > 1) {
                        if ((A = x) < 2) {
                            for (z = v(b, O),
                            C = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                C += String.fromCharCode(r ^ i.p[P]);
                            S[++R] = C,
                            O += 4
                        } else if (A < 4)
                            if (S[R--])
                                O += 4;
                            else {
                                if ((z = s(b, O)) < 0) {
                                    I = 1,
                                    F(b, e, 2 * f),
                                    O += 2 * z - 2;
                                    break
                                }
                                O += 2 * z - 2
                            }
                        else
                            A < 6 ? (C = S[R--],
                            S[R] = S[R] % C) : A < 8 ? (C = S[R--],
                            S[R] = S[R]instanceof C) : A < 15 && (S[++R] = !1)
                    } else if (A > 0) {
                        (A = x) < 1 ? S[++R] = g : A < 3 ? (C = S[R--],
                        S[R] = S[R] + C) : A < 5 ? (C = S[R--],
                        S[R] = S[R] == C) : A < 14 && (C = S[R - 1],
                        q = S[R],
                        S[++R] = C,
                        S[++R] = q)
                    } else {
                        (A = x) < 2 ? (C = S[R--],
                        S[R] = S[R] > C) : A < 9 ? (z = v(b, O),
                        O += 4,
                        q = R + 1,
                        S[R -= z - 1] = z ? S.slice(R, q) : []) : A < 11 ? (z = y(b, O),
                        O += 2,
                        C = S[R--],
                        c[z] = C) : A < 13 ? (C = S[R--],
                        S[R] = S[R] >> C) : A < 15 && (S[++R] = s(b, O),
                        O += 4)
                    }
                } else {
                    A = 3 & x;
                    if (x >>= 2,
                    A > 2)
                        (A = x) > 13 ? (S[++R] = p(b, O),
                        O += 8) : A > 11 ? (C = S[R--],
                        S[R] = S[R] >>> C) : A > 9 ? S[++R] = !0 : A > 7 ? (z = y(b, O),
                        O += 2,
                        S[R] = S[R][z]) : A > 0 && (C = S[R--],
                        S[R] = S[R] < C);
                    else if (A > 1) {
                        (A = x) > 10 ? (z = s(b, O),
                        t[++o] = [[O + 4, z - 3], 0, 0],
                        O += 2 * z - 2) : A > 8 ? (C = S[R--],
                        S[R] = S[R] ^ C) : A > 6 && (C = S[R--])
                    } else if (A > 0) {
                        if ((A = x) < 3) {
                            var D = 0
                              , T = S[R].length
                              , $ = S[R];
                            S[++R] = function() {
                                var b = D < T;
                                if (b) {
                                    var e = $[D++];
                                    S[++R] = e
                                }
                                S[++R] = b
                            }
                        } else
                            A < 5 ? (z = y(b, O),
                            O += 2,
                            C = c[z],
                            S[++R] = C) : A < 7 ? S[R] = ++S[R] : A < 9 && (C = S[R--],
                            S[R] = S[R]in C)
                    } else {
                        if ((A = x) > 13)
                            C = S[R],
                            S[R] = S[R - 1],
                            S[R - 1] = C;
                        else if (A > 4)
                            C = S[R--],
                            S[R] = S[R] === C;
                        else if (A > 2)
                            C = S[R--],
                            S[R] = S[R] - C;
                        else if (A > 0) {
                            for (z = v(b, O),
                            A = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            A = +A,
                            O += 4,
                            S[++R] = A
                        }
                    }
                }
            }
        if (I)
            for (; O < E; ) {
                j = B[O];
                O += 2;
                A = 3 & (x = 13 * j % 241);
                if (x >>= 2,
                A < 1) {
                    var U;
                    A = 3 & x;
                    if (x >>= 2,
                    A < 1) {
                        if ((A = x) > 14)
                            z = W[O],
                            (U = function e() {
                                var f = arguments;
                                return e.y > 0 ? K(b, e.c, e.l, f, e.z, this, null, 0) : (e.y++,
                                K(b, e.c, e.l, f, e.z, this, null, 0))
                            }
                            ).c = O + 4,
                            U.l = z - 2,
                            U.x = G,
                            U.y = 0,
                            U.z = c,
                            S[R] = U,
                            O += 2 * z - 2;
                        else if (A > 12)
                            q = S[R--],
                            w = S[R--],
                            (A = S[R--]).x === G ? A.y >= 1 ? S[++R] = K(b, A.c, A.l, q, A.z, w, null, 1) : (S[++R] = K(b, A.c, A.l, q, A.z, w, null, 0),
                            A.y++) : S[++R] = A.apply(w, q);
                        else if (A > 5)
                            C = S[R--],
                            S[R] = S[R] != C;
                        else if (A > 3)
                            C = S[R--],
                            S[R] = S[R] * C;
                        else if (A > -1)
                            return [1, S[R--]]
                    } else if (A < 2) {
                        (A = x) < 4 ? (q = S[R--],
                        (A = S[R]).x === G ? A.y >= 1 ? S[R] = K(b, A.c, A.l, [q], A.z, w, null, 1) : (S[R] = K(b, A.c, A.l, [q], A.z, w, null, 0),
                        A.y++) : S[R] = A(q)) : A < 6 ? S[R -= 1] = S[R][S[R + 1]] : A < 8 ? S[R] = --S[R] : A < 10 && (C = S[R--],
                        S[R] = typeof C)
                    } else if (A < 3) {
                        if ((A = x) > 11)
                            throw S[R--];
                        if (A > 7) {
                            for (C = S[R--],
                            z = W[O],
                            A = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            O += 4,
                            S[R--][A] = C
                        } else
                            A > 5 && (S[R] = h(S[R]))
                    } else {
                        (A = x) < 1 ? S[++R] = null : A < 3 ? (C = S[R--],
                        S[R] = S[R] >= C) : A < 12 && (S[++R] = void 0)
                    }
                } else if (A < 2) {
                    A = 3 & x;
                    if (x >>= 2,
                    A > 2)
                        (A = x) > 12 ? S[++R] = l : A > 5 ? (C = S[R--],
                        S[R] = S[R] !== C) : A > 3 ? (C = S[R--],
                        S[R] = S[R] / C) : A > 1 ? O += 2 * (z = W[O]) - 2 : A > -1 && (S[R] = !S[R]);
                    else if (A > 1) {
                        (A = x) < 2 ? S[++R] = C : A < 4 ? (C = S[R--],
                        S[R] = S[R] <= C) : A < 11 ? (C = S[R -= 2][S[R + 1]] = S[R + 2],
                        R--) : A < 13 && (C = S[R],
                        S[++R] = C)
                    } else if (A > 0) {
                        if ((A = x) < 8)
                            q = S[R--],
                            C = delete S[R--][q];
                        else if (A < 10) {
                            for (z = W[O],
                            A = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            O += 4,
                            S[R] = S[R][A]
                        } else
                            A < 12 ? (C = S[R--],
                            S[R] = S[R] << C) : A < 14 && (S[++R] = W[O],
                            O += 2)
                    } else {
                        if ((A = x) < 5) {
                            z = W[O];
                            try {
                                if (t[o][2] = 1,
                                1 == (C = G(b, O + 4, z - 3, [], c, l, null, 0))[0])
                                    return C
                            } catch (m) {
                                if (t[o] && t[o][1] && 1 == (C = G(b, t[o][1][0], t[o][1][1], [], c, l, m, 0))[0])
                                    return C
                            } finally {
                                if (t[o] && t[o][0] && 1 == (C = G(b, t[o][0][0], t[o][0][1], [], c, l, null, 0))[0])
                                    return C;
                                t[o] = 0,
                                o--
                            }
                            O += 2 * z - 2
                        } else
                            A < 7 ? (z = W[O],
                            O += 2,
                            S[R -= z] = 0 === z ? new S[R] : d(S[R], n(S.slice(R + 1, R + z + 1)))) : A < 9 && (C = S[R--],
                            S[R] = S[R] & C)
                    }
                } else if (A < 3) {
                    A = 3 & x;
                    if (x >>= 2,
                    A < 1)
                        (A = x) < 2 ? (C = S[R--],
                        S[R] = S[R] > C) : A < 9 ? (z = W[O],
                        O += 4,
                        q = R + 1,
                        S[R -= z - 1] = z ? S.slice(R, q) : []) : A < 11 ? (z = W[O],
                        O += 2,
                        C = S[R--],
                        c[z] = C) : A < 13 ? (C = S[R--],
                        S[R] = S[R] >> C) : A < 15 && (S[++R] = W[O],
                        O += 4);
                    else if (A < 2) {
                        (A = x) < 1 ? S[++R] = g : A < 3 ? (C = S[R--],
                        S[R] = S[R] + C) : A < 5 ? (C = S[R--],
                        S[R] = S[R] == C) : A < 14 && (C = S[R - 1],
                        q = S[R],
                        S[++R] = C,
                        S[++R] = q)
                    } else if (A < 3) {
                        if ((A = x) < 2) {
                            for (z = W[O],
                            C = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                C += String.fromCharCode(r ^ i.p[P]);
                            S[++R] = C,
                            O += 4
                        } else
                            A < 4 ? S[R--] ? O += 4 : O += 2 * (z = W[O]) - 2 : A < 6 ? (C = S[R--],
                            S[R] = S[R] % C) : A < 8 ? (C = S[R--],
                            S[R] = S[R]instanceof C) : A < 15 && (S[++R] = !1)
                    } else {
                        (A = x) > 7 ? (C = S[R--],
                        S[R] = S[R] | C) : A > 5 ? (z = W[O],
                        O += 2,
                        S[++R] = c["$" + z]) : A > 3 && (z = W[O],
                        t[o][0] && !t[o][2] ? t[o][1] = [O + 4, z - 3] : t[o++] = [0, [O + 4, z - 3], 0],
                        O += 2 * z - 2)
                    }
                } else {
                    A = 3 & x;
                    if (x >>= 2,
                    A > 2)
                        (A = x) > 13 ? (S[++R] = W[O],
                        O += 8) : A > 11 ? (C = S[R--],
                        S[R] = S[R] >>> C) : A > 9 ? S[++R] = !0 : A > 7 ? (z = W[O],
                        O += 2,
                        S[R] = S[R][z]) : A > 0 && (C = S[R--],
                        S[R] = S[R] < C);
                    else if (A > 1) {
                        (A = x) > 10 ? (z = W[O],
                        t[++o] = [[O + 4, z - 3], 0, 0],
                        O += 2 * z - 2) : A > 8 ? (C = S[R--],
                        S[R] = S[R] ^ C) : A > 6 && (C = S[R--])
                    } else if (A > 0) {
                        if ((A = x) > 7)
                            C = S[R--],
                            S[R] = S[R]in C;
                        else if (A > 5)
                            S[R] = ++S[R];
                        else if (A > 3)
                            z = W[O],
                            O += 2,
                            C = c[z],
                            S[++R] = C;
                        else if (A > 1) {
                            D = 0,
                            T = S[R].length,
                            $ = S[R];
                            S[++R] = function() {
                                var b = D < T;
                                if (b) {
                                    var e = $[D++];
                                    S[++R] = e
                                }
                                S[++R] = b
                            }
                        }
                    } else {
                        if ((A = x) < 2) {
                            for (z = W[O],
                            A = "",
                            P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            A = +A,
                            O += 4,
                            S[++R] = A
                        } else
                            A < 4 ? (C = S[R--],
                            S[R] = S[R] - C) : A < 6 ? (C = S[R--],
                            S[R] = S[R] === C) : A < 15 && (C = S[R],
                            S[R] = S[R - 1],
                            S[R - 1] = C)
                    }
                }
            }
        return [0, null]
    }
    function K(b, e, f, a, d, c, n, i) {
        var r, t;
        null == c && (c = this),
        d && !d.d && (d.d = 0,
        d.$0 = d,
        d[1] = {});
        var o = {}
          , l = o.d = d ? d.d + 1 : 0;
        for (o["$" + l] = o,
        t = 0; t < l; t++)
            o[r = "$" + t] = d[r];
        for (t = 0,
        l = o.length = a.length; t < l; t++)
            o[t] = a[t];
        return i && !B[e] && F(b, e, 2 * f),
        B[e] ? G(b, e, f, 0, o, c, null, 1)[1] : G(b, e, f, 0, o, c, null, 0)[1]
    }
};

// (glb = "undefined" == typeof window ? global : window)
const v1 = "";
window._$jsvmprt(v1, [, ,  void 0, "undefined" != typeof module ? module : void 0, "undefined" != typeof define ? define : void 0, "undefined" != typeof Object ? Object : void 0, void 0, "undefined" != typeof TypeError ? TypeError : void 0, "undefined" != typeof document ? document : void 0,  void 0,  void 0, "undefined" != typeof Date ? Date : void 0, "undefined" != typeof Math ? Math : void 0, "undefined" != typeof navigator ? navigator : void 0, "undefined" != typeof location ? location : void 0, "undefined" != typeof history ? history : void 0, "undefined" != typeof Image ? Image : void 0, "undefined" != typeof console ? console : void 0, "undefined" != typeof PluginArray ? PluginArray : void 0, "undefined" != typeof indexedDB ? indexedDB : void 0, "undefined" != typeof DOMException ? DOMException : void 0, "undefined" != typeof parseInt ? parseInt : void 0, "undefined" != typeof String ? String : void 0, "undefined" != typeof Array ? Array : void 0, "undefined" != typeof Error ? Error : void 0, "undefined" != typeof JSON ? JSON : void 0, "undefined" != typeof Promise ? Promise : void 0, "undefined" != typeof WebSocket ? WebSocket : void 0, "undefined" != typeof eval ? eval : void 0, "undefined" != typeof setTimeout ? setTimeout : void 0, "undefined" != typeof encodeURIComponent ? encodeURIComponent : void 0, "undefined" != typeof encodeURI ? encodeURI : void 0, "undefined" != typeof Request ? Request : void 0, "undefined" != typeof Headers ? Headers : void 0, "undefined" != typeof decodeURIComponent ? decodeURIComponent : void 0, "undefined" != typeof RegExp ? RegExp : void 0]);

The script now runs without errors. Next, check whether byted_acrawler has been created and whether it can produce a signature.



3.4.1 Getting the signature value from Python
python
# -*- coding: utf-8 -*-

# (venv) E:\crawler(爬虫)\day04> node v20.js "https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web"
# _02B4Z6wo00f01vu4PgAAAIDBBEfB.ftCImL7mjqAANsre0

import os
import subprocess

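# Adjust this for your own system: NODE_PATH tells Node.js where to find
# globally installed modules (much like sys.path does for Python).
os.environ["NODE_PATH"] = "D:\\Nodejs\\node_global\\node_modules"

url = "https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web"
# Pass the arguments as a list so the URL needs no shell quoting.
result = subprocess.run(["node", "v20.js", url], stdout=subprocess.PIPE, check=True)
# Strip the trailing newline from the script's output.
signature = result.stdout.decode('utf-8').strip()

print(signature)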


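Once you have the signature, append it to the URL as the _signature parameter and send the request again with the signed URL; the response should then return the comment data.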
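
The concatenation step can be sketched without any network access. The helper below, `append_signature`, is a hypothetical name and not part of the original script; it adds a `_signature` query parameter to a URL. The signature value in the example is a placeholder, since a real one has to come from `v20.js`.

```python
from urllib.parse import quote


def append_signature(url: str, signature: str) -> str:
    """Return url with a _signature query parameter appended."""
    # Use "&" when the URL already carries a query string, "?" otherwise.
    separator = "&" if "?" in url else "?"
    # Percent-encode the value so characters such as "+" or "/" survive transport.
    return f"{url}{separator}_signature={quote(signature.strip(), safe='')}"


if __name__ == "__main__":
    base = "https://www.toutiao.com/api/pc/list/feed?offset=0&aid=24"
    # Placeholder value for illustration only.
    print(append_signature(base, "PLACEHOLDER_SIGNATURE\n"))
```

Stripping the value inside the helper guards against the trailing newline that a command-line script usually prints after its result.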

4. Signing with pyexecjs

python
import requests
import execjs
import os

os.environ["NODE_PATH"] = "D:\\Nodejs\\node_global\\node_modules"
with open('v20.js', mode='r', encoding='utf-8') as f:
    js = f.read()
JS = execjs.compile(js)

url = "https://www.toutiao.com/api/pc/list/feed?offset=0&channel_id=94349549395&max_behot_time=0&category=pc_profile_channel&disable_raw_data=true&aid=24&app_name=toutiao_web"
signature = JS.call("get_sign", url)

final_url = f"{url}&_signature={signature}"

res = requests.get(
    url=final_url,
    headers={
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
)

print(res.text)

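VI. Batch crawling CNKI data and visualizing it

1. Batch crawling CNKI data

  • lxml: a powerful, easy-to-use Python library for processing XML and HTML. Its compact API makes it convenient to parse, build, and modify documents, and it is commonly used for tasks such as parsing web pages and reading configuration files.
  • openpyxl: a Python library for working with Excel files in .xlsx format. It can read, modify, and create workbooks, including worksheets, cell contents, and styles, which makes it handy for data processing and report generation.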
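
As a minimal sketch of how the two libraries fit together, the example below parses a small inline HTML fragment with lxml and writes the extracted rows to a workbook with openpyxl. The HTML fragment, the function names `extract_rows` and `save_rows`, and the output file name `demo.xlsx` are made up for illustration; they only loosely mirror the structure of a search result page.

```python
import os
import tempfile

from lxml import etree
from openpyxl import Workbook, load_workbook

# Made-up HTML fragment used only for this illustration.
HTML = """
<div class="list-item"><a class="left" title="Paper A" href="//example.com/a">A</a></div>
<div class="list-item"><a class="left" title="Paper B" href="//example.com/b">B</a></div>
"""


def extract_rows(html):
    """Return one [title, link] row per result item."""
    tree = etree.HTML(html)
    rows = []
    for item in tree.xpath('//div[@class="list-item"]'):
        # xpath() returns a list here, so join it into a single string.
        title = "".join(item.xpath('./a[@class="left"]/@title'))
        href = "".join(item.xpath('./a[@class="left"]/@href'))
        rows.append([title, "https:" + href])
    return rows


def save_rows(rows, path):
    """Write a header row plus the data rows to an .xlsx file."""
    wb = Workbook()
    ws = wb.active
    ws.append(["title", "link"])
    for row in rows:
        ws.append(row)
    wb.save(path)


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "demo.xlsx")
    save_rows(extract_rows(HTML), path)
    # Read the file back to confirm what was written.
    for values in load_workbook(path).active.iter_rows(values_only=True):
        print(values)
```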


python
import requests
from lxml import etree
from openpyxl import Workbook

base_url = 'http://search.cnki.com.cn/Search/ListResult'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
}


def get_page_text(url, headers, search_word, page_num):
    data = {
        'searchType': 'MulityTermsSearch',
        'ArticleType': '',
        'ReSearch': '',
        'ParamIsNullOrEmpty': 'false',
        'Islegal': 'false',
        'Content': search_word,
        'Theme': '',
        'Title': '',
        'KeyWd': '',
        'Author': '',
        'SearchFund': '',
        'Originate': '',
        'Summary': '',
        'PublishTimeBegin': '',
        'PublishTimeEnd': '',
        'MapNumber': '',
        'Name': '',
        'Issn': '',
        'Cn': '',
        'Unit': '',
        'Public': '',
        'Boss': '',
        'FirstBoss': '',
        'Catalog': '',
        'Reference': '',
        'Speciality': '',
        'Type': '',
        'Subject': '',
        'SpecialityCode': '',
        'UnitCode': '',
        'Year': '',
        'AcefuthorFilter': '',
        'BossCode': '',
        'Fund': '',
        'Level': '',
        'Elite': '',
        'Organization': '',
        'Order': '1',
        'Page': str(page_num),
        'PageIndex': '',
        'ExcludeField': '',
        'ZtCode': '',
        'Smarts': '',
    }

    response = requests.post(url=url, headers=headers, data=data)
    page_text = response.text
    return page_text


def list_to_str(my_list):
    my_str = "".join(my_list)
    return my_str


def get_abstract(url):
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    abstract = tree.xpath('//div[@class="xx_font"]//text()')
    return abstract


def parse_page_text(page_text):
    tree = etree.HTML(page_text)
    item_list = tree.xpath('//div[@class="list-item"]')
    page_info = []
    for item in item_list:
        # Title
        title = list_to_str(item.xpath(
            './p[@class="tit clearfix"]/a[@class="left"]/@title'))
        # Link (prepend the scheme to the href)
        link = 'https:' + \
               list_to_str(item.xpath(
                   './p[@class="tit clearfix"]/a[@class="left"]/@href'))
        # Author
        author = list_to_str(item.xpath(
            './p[@class="source"]/span[1]/@title'))
        # Publication date
        date = list_to_str(item.xpath(
            './p[@class="source"]/span[last()-1]/text() | ./p[@class="source"]/a[2]/span[1]/text() '))
        # Keywords
        keywords = list_to_str(item.xpath(
            './div[@class="info"]/p[@class="info_left left"]/a[1]/@data-key'))
        # Abstract (fetched from the detail page, one extra request per item)
        abstract = list_to_str(get_abstract(url=link))
        # Source publication
        paper_source = list_to_str(item.xpath(
            './p[@class="source"]/span[last()-2]/text() | ./p[@class="source"]/a[1]/span[1]/text() '))
        # Document type
        paper_type = list_to_str(item.xpath(
            './p[@class="source"]/span[last()]/text()'))
        # Download count
        download = list_to_str(item.xpath(
            './div[@class="info"]/p[@class="info_right right"]/span[@class="time1"]/text()'))
        # Citation count
        refer = list_to_str(item.xpath(
            './div[@class="info"]/p[@class="info_right right"]/span[@class="time2"]/text()'))

        item_info = [i.strip() for i in
                     [title, author, paper_source, paper_type, date, abstract, keywords, download, refer, link]]
        page_info.append(item_info)
        # Print each item as it is parsed, rather than the whole growing list
        print(item_info)
    return page_info


def write_to_excel(info, search_word):
    wb = Workbook()
    ws = wb.active  # use the workbook's default worksheet
    ws.title = search_word
    title = ['title', 'author', 'paper_source', 'paper_type', 'date', 'abstract', 'keywords', 'download', 'refer',
             'link']  # header row
    ws.append(title)

    for row in info:
        ws.append(row)

    wb.save('data.xlsx')
    return True


# Fetch page 1 of the search results
page_text = get_page_text(base_url, headers, 'YOLOV5', 1)

# Parse the page
page_info = parse_page_text(page_text)
# # To crawl pages 1 through 10 instead, loop over the page numbers
# page_info_list = []
# for page_num in range(1, 11):
#     page_text = get_page_text(base_url, headers, 'YOLOV5', page_num)
#     page_info = parse_page_text(page_text)
#     page_info_list.extend(page_info)

# Write the single page of results to an Excel file
write_to_excel(page_info, 'YOLOV5')

# Or write the full multi-page list to the Excel file
# write_to_excel(page_info_list, 'YOLOV5')
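
The commented-out loop above requests ten pages back to back. A slightly more defensive version, sketched below, pauses between pages and keeps going when a single page fails. `crawl_pages`, `fetch_page`, and `parse_page` are hypothetical names; in this script `fetch_page` would wrap `get_page_text` and `parse_page` would be `parse_page_text`. The one-second default delay is an arbitrary choice, not a value taken from the site.

```python
import time


def crawl_pages(fetch_page, parse_page, first=1, last=10, delay=1.0):
    """Fetch and parse pages first..last, returning all rows in one list."""
    all_rows = []
    for page_num in range(first, last + 1):
        try:
            rows = parse_page(fetch_page(page_num))
        except Exception as exc:  # keep going if a single page fails
            print(f"page {page_num} failed: {exc}")
            rows = []
        all_rows.extend(rows)
        if page_num < last:
            time.sleep(delay)  # pause so requests are not sent back to back
    return all_rows


if __name__ == "__main__":
    # Stand-in functions so the sketch runs without network access.
    def fake_fetch(page_num):
        if page_num == 2:
            raise RuntimeError("simulated network error")
        return f"page-{page_num}"

    def fake_parse(page_text):
        return [[page_text, "row"]]

    print(crawl_pages(fake_fetch, fake_parse, first=1, last=3, delay=0))
```

With the functions from the script, the call would look like `crawl_pages(lambda n: get_page_text(base_url, headers, 'YOLOV5', n), parse_page_text)`.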

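2. Data visualization

Visualize first authors, download counts, and citation counts.

  • pandas: a powerful data analysis tool used mainly for cleaning, processing, analyzing, and modeling data. Its fast, flexible data structures make structured data easy to work with.
  • matplotlib.pyplot: the part of Matplotlib used to create charts such as line plots, scatter plots, bar charts, and pie charts. It can plot data prepared with pandas, which makes patterns and relationships easier to see and results easier to present and share.
python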
import pandas as pd
import matplotlib.pyplot as plt

# Use a font with Chinese glyphs so that Chinese text (titles, author names) renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Read the data from the Excel file
df = pd.read_excel('data.xlsx')

# Extract the first author of each paper
df['first_author'] = df['author'].str.split(';').str[0]

# Count how often each first author appears
first_author_counts = df['first_author'].value_counts()

# Horizontal bar chart of first authors
plt.figure(figsize=(12, 8))
plt.barh(range(len(first_author_counts)), first_author_counts.values, align='center', alpha=0.5, color='skyblue')
plt.yticks(range(len(first_author_counts)), first_author_counts.index)
plt.xlabel('Number of papers as first author')
plt.title('First authors', pad=20)
plt.gca().invert_yaxis()
plt.tight_layout(pad=2.0, rect=[0, 0.03, 1, 0.95])
plt.show()

# Horizontal bar chart of download counts
plt.figure(figsize=(8, 6))
plt.barh(df['title'], df['download'], color='lightgreen')
plt.ylabel('Paper title')
plt.xlabel('Downloads', labelpad=20)
plt.xticks([])          # hide the x-axis ticks
plt.title('Downloads', pad=20)
plt.tight_layout(pad=2.0, rect=[0, 0.03, 1, 0.95])
plt.show()

# Horizontal bar chart of citation counts
plt.figure(figsize=(8, 6))
plt.barh(df['title'], df['refer'], color='gold')
plt.xlabel('Citations')
plt.title('Citations', pad=20)
plt.tight_layout(pad=2.0, rect=[0, 0.03, 1, 0.95])
plt.show()
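
The crawler writes every field to the workbook as text, so `download` and `refer` may reach pandas as strings rather than numbers, in which case the bar lengths would not reflect the counts. A small sketch of a cleanup step follows. `to_count` is a hypothetical helper, and it assumes each cell holds at most one integer, possibly surrounded by other characters.

```python
import re


def to_count(value):
    """Return the first integer found in value, or 0 if there is none."""
    # Cells read back from Excel may be str, int, float (including NaN) or None.
    match = re.search(r"\d+", str(value).replace(",", ""))
    return int(match.group()) if match else 0


if __name__ == "__main__":
    for raw in ["128", "Downloads (1,024)", "", None]:
        print(repr(raw), "->", to_count(raw))
```

Applied to the DataFrame before plotting, this would be `df['download'] = df['download'].map(to_count)`, and likewise for `refer`.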