Environment
PyCharm 2023.2.1
Python 3.11
Detailed steps
1. Install the requests, parsel, and bs4 (Beautiful Soup) libraries
Open the Terminal tab in PyCharm and enter:
pip install requests
pip install parsel
pip install bs4
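To confirm that the three installs landed in the interpreter PyCharm is actually using, a quick check along the following lines may help. This is only a sketch: `REQUIRED_MODULES` and `find_missing` are names made up for illustration, and the check only looks for importable module names (`requests`, `parsel`, `bs4`), not specific versions.

```python
import importlib.util

# Module names as they are imported in the script (hypothetical helper list).
REQUIRED_MODULES = ["requests", "parsel", "bs4"]


def find_missing(modules):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    # Install each missing module with: pip install <name>
    print("Not installed in this interpreter:", ", ".join(missing))
else:
    print("requests, parsel and bs4 are all importable.")
```

Run it from the same PyCharm project; if a module is reported as missing even though pip said it was installed, pip and the project are probably using different interpreters.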
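2. Write the Python code in PyCharm and make sure it runs there
The retrieval method differs from site to site. This walkthrough uses one of the Biquge (笔趣阁) sites as an example: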
https://www.biqugen.net
Search for the book you want to retrieve. On the book's page, the address bar shows the book ID (for example, the ID of 诛仙 is 1416).
Open the first chapter in the table of contents. The address bar shows the page number of the book's starting chapter (for example, the prologue's page number is 1467648).
Open the last chapter in the table of contents. The address bar shows the page number of the book's ending chapter (in this example, 1467950).
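The three numbers collected above are all the script needs to build its URLs. A minimal sketch of that mapping, using the URL pattern from the sample code (`BASE_URL`, `book_url` and `chapter_urls` are illustrative names, not part of the site or of any library):

```python
BASE_URL = "https://www.biqugen.net"


def book_url(book_id):
    """URL of the book's main page, e.g. book ID 1416."""
    return f"{BASE_URL}/book/{book_id}/"


def chapter_urls(book_id, start_page, end_page):
    """One URL per page number from start_page to end_page, inclusive."""
    return [
        f"{BASE_URL}/book/{book_id}/{page}.html"
        for page in range(start_page, end_page + 1)
    ]


print(book_url(1416))
for url in chapter_urls(1416, 1467648, 1467650):
    print(url)
```

Note that this walks every integer in the range, so it assumes the chapter pages are numbered more or less consecutively. Numbers that do not correspond to a chapter simply produce a "not found" line when the script runs.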
Sample code. Running it in PyCharm retrieves the content from this site:
import requests
from bs4 import BeautifulSoup
from parsel import Selector


def get_novel_title(url):
    try:
        # Send the request (the timeout keeps a stalled connection from hanging forever)
        response = requests.get(url, timeout=10)
        # Check whether the request succeeded
        response.raise_for_status()
        # Extract the novel title with parsel
        title_selector = Selector(text=response.text)
        title = title_selector.css('h1::text').get()
        return title
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        # Return None if the request fails
        return None


def get_chapter_title_and_text(url):
    try:
        # Send the request (the timeout keeps a stalled connection from hanging forever)
        response = requests.get(url, timeout=10)
        # Check whether the request succeeded
        response.raise_for_status()
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the chapter title from the page's <title> tag
        title = soup.title.string
        # Extract the body text
        content = soup.find('div', {'id': 'content'})
        if content:
            # The text is split up by <br> tags; put each fragment on its own line
            text_with_linebreaks = ''
            for element in content.stripped_strings:
                text_with_linebreaks += element + '\n'
            return title, text_with_linebreaks
        return None, None
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        # Return None for both values if the request fails
        return None, None


# Read user input in a loop so the book ID can be re-entered
while True:
    book_id = input("Enter the ID of the book to retrieve: ")
    book_url = f'https://www.biqugen.net/book/{book_id}/'
    book_title = get_novel_title(book_url)
    if book_title:
        print(f"Book found: {book_title}")
        break
    else:
        print("Could not retrieve the book information, please try again!")
while True:
    try:
        start_chapter = int(input("Enter the page number of the starting chapter: "))
        end_chapter = int(input("Enter the page number of the ending chapter: "))
        output_file_path = input("Enter the output file path\n(for example C:\\Users\\Abit\\Desktop\\biqugen.txt): ")
        print("Chapter retrieval status:")
        # Open the file in write mode
        with open(output_file_path, 'w', encoding='utf-8') as output_file:
            # Build each URL and fetch every page in the range
            for chapter_number in range(start_chapter, end_chapter + 1):
                url = f'https://www.biqugen.net/book/{book_id}/{chapter_number}.html'
                title, result = get_chapter_title_and_text(url)
                # Write the title and text of each successfully fetched chapter to the file
                if result:
                    output_file.write(f"{title}:\n{result}\n{'='*50}\n")
                    print(f"{chapter_number} {title}")
                else:
                    print(f"{chapter_number} no content found")
        break
    except ValueError:
        print("Chapter page numbers must be integers, please try again!")
    except Exception as e:
        print(f"An error occurred: {e}")
When the script runs, the console lists each page number with its chapter title (or a "no content found" message), and the chapter text is written to the output file.
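The layout of the output file follows directly from the `write` call in the script: a title line ending in a colon, the chapter text, then a line of 50 equals signs. The sketch below reproduces that layout and shows one possible way to split the file back into chapters. `format_chapter` and `split_chapters` are hypothetical helpers, and the split assumes that no chapter text itself contains a line of exactly 50 equals signs.

```python
SEPARATOR = "=" * 50


def format_chapter(title, text):
    """Build the block the script writes for one chapter."""
    return f"{title}:\n{text}\n{SEPARATOR}\n"


def split_chapters(file_text):
    """Split the output file's text back into one string per chapter."""
    blocks = file_text.split(f"\n{SEPARATOR}\n")
    # The text after the final separator is empty, so drop it.
    return [block for block in blocks if block]


sample = format_chapter("Chapter 1", "First line\nSecond line\n")
sample += format_chapter("Chapter 2", "Another line\n")
for block in split_chapters(sample):
    print(repr(block))
```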
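Notes
If you see the following error:
pip : The term 'pip' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
Check your environment variables: the Scripts folder under your Python installation directory must be added to Path, for example C:\Program Files\Python311\Scripts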
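If you are unsure which folder to add, Python can report it. The sketch below prints the scripts directory of the interpreter that runs it and checks whether that directory is already on the search path. `scripts_dir` and `is_on_path` are illustrative names; the check compares normalized path strings only, so it may miss entries written with environment-variable placeholders.

```python
import os
import sys
import sysconfig


def scripts_dir():
    """Directory where this interpreter installs command-line tools such as pip."""
    return sysconfig.get_path("scripts")


def is_on_path(directory, path_value=None):
    """Return True if directory is one of the entries in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normcase(os.path.normpath(directory))
    entries = [
        os.path.normcase(os.path.normpath(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


print("Interpreter:", sys.executable)
print("Scripts directory:", scripts_dir())
print("Already on PATH:", is_on_path(scripts_dir()))
```

Running pip through the interpreter, as in `python -m pip install requests`, also avoids the lookup of a separate pip command, provided `python` itself can be found.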