A Simple Python Web Crawler

codereasy, 2023-08-11 15:39

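For web crawling, one of the most commonly used Python libraries is requests, which sends HTTP requests and retrieves page content. Here is a simple example that uses requests to fetch the content of a web page: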

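import requests

# URL of the page to crawl (placeholder; replace it with the page you want to fetch)
url = "https://example.com"

# Send an HTTP GET request and get the response
response = requests.get(url)

if response.status_code == 200:
    print(response.text)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")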

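The code above shows how to send an HTTP GET request and retrieve a page's content. In real crawling, however, you may also need to parse the page, navigate its structure, handle exceptions, and set request headers to mimic a browser. A more complete crawler example might look like this: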

import requests
from bs4 import BeautifulSoup

# URL of the page to crawl (placeholder; replace it with the page you want to fetch)
url = "https://example.com"

# Request headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Example: find all top-level headings
    titles = soup.find_all('h1')
    for title in titles:
        print(title.text)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

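This example uses the requests library to send the HTTP GET request and the BeautifulSoup library to parse the HTML. It also adds a request header so that the request resembles one sent by a browser. BeautifulSoup has to be installed separately, which you can do with the following command: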

pip install beautifulsoup4
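
The examples so far check only the status code and do not show the exception handling mentioned earlier. Below is a minimal sketch of one way to add it. The function name fetch_html, the 10-second timeout, and the short User-Agent string are assumptions made for illustration, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    # Illustrative User-Agent string (an assumption, not a real browser's)
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        # Base class for connection errors, timeouts, invalid URLs and HTTP errors
        print(f"Request failed: {exc}")
        return None
    return response.text


if __name__ == "__main__":
    html = fetch_html("https://example.com")  # placeholder URL
    if html is not None:
        print(html[:200])
```

Returning None on failure keeps the calling code simple; a larger crawler might prefer to log the error and retry instead.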

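When writing a crawler, you must comply with the target site's terms and conditions and follow good crawling practices. Improper crawling can lead to legal problems or place an undue load on the target site.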
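
One common way to put such practices into code is to respect the site's robots.txt rules and to pause between requests. The sketch below uses only the standard library. The robots.txt content, the candidate URLs, the helper names, and the one-second delay are all hypothetical values chosen for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that the robots.txt rules allow for user_agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt content; a real crawler would download it from the site
sample_robots = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = build_robot_parser(sample_robots)

# Hypothetical candidate URLs
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]

for url in allowed_urls(parser, candidates):
    print("OK to fetch:", url)
    time.sleep(1)  # pause between requests to limit load (delay is an assumption)
```

For a live site, RobotFileParser can also load the rules itself through its set_url and read methods instead of taking the lines directly.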