[Fifty Small Python Web-Scraping Cases] Scraping City Weather from China Weather Network (中国天气网)

Blog home: 小馒头学python

Column: Fifty Small Python Web-Scraping Cases

About the column: fifty small Python web-scraping case studies

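Introduction

Weather data matters in many areas, such as forecasting, travel, and health. Scraping the public data on a weather website is a convenient way to get current conditions for many places. This tutorial shows how to use Python to scrape city weather data from China Weather Network (中国天气网, weather.com.cn).

Why scrape weather data

Scraped weather data can feed personalized weather services and other applications. For example, you can display the forecast for the next few days or use it to give more precise travel advice.

Advantages of Python for scraping

Python is widely used for scraping and data processing because the language is concise and has a rich set of third-party libraries. Libraries such as requests and BeautifulSoup handle HTTP requests and HTML parsing, which makes a crawler quick and simple to write.

Goal of this article: scrape city weather data from China Weather Network with Python

Starting from scratch, we build a crawler step by step that fetches weather data for a given city from China Weather Network and saves it as a CSV file. At the end we also touch on ways to keep the crawler from being blocked, so that it keeps running smoothly.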

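Preparation

Requirements

First make sure Python is installed (Python 3.6 or later is recommended, since the code uses f-strings). Then install the following libraries: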

pip install requests beautifulsoup4 pandas

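requests: sends HTTP requests and retrieves page content.

BeautifulSoup (installed as beautifulsoup4): parses HTML and extracts data from it.

pandas: stores and processes data, and can write it to a CSV file.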

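Getting a city's weather page on China Weather Network

The weather pages on China Weather Network have a simple structure, which makes them well suited to scraping. Each city has a unique city code, and that code is part of the URL of the city's weather page.

For example, the weather page for Beijing is:

http://www.weather.com.cn/weather1d/101010100.shtml

This page shows Beijing's temperature, weather condition, wind, and other information. Our crawler retrieves this data.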
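As a small illustration of how the city code maps to a page URL, here is a minimal sketch. The helper name build_weather_url and its page parameter are hypothetical and only follow the URL pattern shown above.

```python
BASE_URL = "http://www.weather.com.cn"


def build_weather_url(city_code, page="weather1d"):
    """Return the weather page URL for a city code.

    `page` is the path segment; "weather1d" is the one used in the
    example URL above. This helper is illustrative, not part of any library.
    """
    code = str(city_code).strip()
    if not code.isdigit():
        raise ValueError(f"city code must be numeric, got {city_code!r}")
    return f"{BASE_URL}/{page}/{code}.shtml"


print(build_weather_url("101010100"))
# http://www.weather.com.cn/weather1d/101010100.shtml
```

Validating the code before building the URL avoids sending a request for an obviously malformed address.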

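How a crawler works

The basic HTTP request flow

In short, a crawler sends a request to a web page, receives its HTML, parses that HTML, and extracts the data we need. The process has three steps:

Send an HTTP request and get the page content.
Parse the HTML page and extract the target data.
Store the extracted data or analyze it further.

Parsing the data

China Weather Network serves its weather pages as HTML, which we parse with BeautifulSoup. BeautifulSoup lets us pull out the contents of specific tags on the page.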
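To make "pulling out the contents of specific tags" concrete, the sketch below runs BeautifulSoup on a small hand-written snippet. The markup is a hypothetical stand-in that borrows the class names (sky, wea, tem) looked up by the listings in this article. The real page is much larger, and its structure may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
SAMPLE_HTML = """
<ul>
  <li class="sky">
    <h1>Day 1</h1>
    <p class="wea">Sunny</p>
    <p class="tem"><span>25</span>/<i>13℃</i></p>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None if nothing matches
item = soup.find("li", class_="sky")

# get_text(strip=True) joins the text inside the tag and trims whitespace
condition = item.find("p", class_="wea").get_text(strip=True)
temperature = item.find("p", class_="tem").get_text(strip=True)

print(condition, temperature)
# Sunny 25/13℃
```

Because find() returns None when a tag is missing, chained calls on a changed page raise AttributeError, which is worth guarding against in a real crawler.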

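Which data to extract (city weather)

In this tutorial we extract the following:

City name
Current weather
Temperature (high and low)

Writing the crawler

Step 1: Send the request and get the page content

We use the requests library to send an HTTP request to the target city's weather page. Here is a simple example: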

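import requests

def fetch_weather(city_code):
    # Build the one-day weather page URL from the city code
    url = f'http://www.weather.com.cn/weather1d/{city_code}.shtml'
    response = requests.get(url, timeout=10)  # do not wait forever on a stalled connection
    if response.status_code == 200:
        response.encoding = 'utf-8'  # decode as UTF-8 to avoid garbled Chinese text
        return response.text
    else:
        print("Request failed, status code:", response.status_code)
        return None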

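The code above defines a fetch_weather function. It takes a city code (for example, 101010100 for Beijing) and returns the HTML of that city's page, or None if the response status is not 200.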
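Checking the status code covers HTTP errors, but requests.get can also raise an exception when the network is down or the server does not respond. The following is one possible sketch of a more defensive fetch. The name fetch_page and the injectable get parameter are assumptions made for illustration, so that the function can be exercised without network access.

```python
import requests


def fetch_page(url, timeout=10, get=requests.get):
    """Return the page HTML, or None if the request fails.

    `get` defaults to requests.get; passing a stand-in makes the
    function testable offline. (Illustrative helper, not from the article.)
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, and HTTP errors raised by requests
        print("Request failed:", exc)
        return None
    response.encoding = "utf-8"
    return response.text
```

A call such as fetch_page('http://www.weather.com.cn/weather1d/101010100.shtml') would then return either the page text or None, without an unhandled exception stopping the script.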

Step 2: Parse the page and extract the weather data

We use BeautifulSoup to parse the HTML page and extract the data we need.

from bs4 import BeautifulSoup

def parse_weather(page_content):
    soup = BeautifulSoup(page_content, 'html.parser')
    weather_data = {}

    # City name: taken from the last segment of the breadcrumb trail
    crumbs = soup.find('div', class_='crumbs fl')
    weather_data['city'] = crumbs.text.split('>')[-1].strip() if crumbs else None

    # Weather condition for the day (the 'wea' paragraph, not the temperature)
    wea = soup.find('p', class_='wea')
    weather_data['weather'] = wea.text.strip() if wea else None

    # Temperature text (the 'tem' paragraph)
    tem = soup.find('p', class_='tem')
    weather_data['temperature'] = tem.text.strip() if tem else None

    return weather_data
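The temperature comes back as one string, while the goal stated earlier is to have both the high and the low. A minimal sketch of splitting it is shown below. It assumes the text looks like "25/13℃" (two numbers separated by a slash), which is an assumption about the page and not something guaranteed. The name parse_temperature is hypothetical.

```python
import re


def parse_temperature(text):
    """Extract (high, low) as integers from text such as '25/13℃'.

    Returns None when fewer than two numbers are found, since a single
    value cannot be labeled high or low with confidence.
    """
    numbers = [int(n) for n in re.findall(r"-?\d+", text)]
    if len(numbers) < 2:
        return None
    first, second = numbers[:2]
    return max(first, second), min(first, second)


print(parse_temperature("25/13℃"))     # (25, 13)
print(parse_temperature("-3℃/-11℃"))   # (-3, -11)
print(parse_temperature("13℃"))        # None
```

Returning None for an incomplete reading keeps the caller from treating a lone value as a high or a low by accident.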

Step 3: Process the data and save it locally

We use pandas to store the scraped data in a CSV file:

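import pandas as pd

def save_to_csv(weather_data):
    # weather_data is a list of dicts, one dict per row
    df = pd.DataFrame(weather_data)
    # utf-8-sig writes a BOM so that Excel detects the encoding of Chinese text
    df.to_csv('weather_data.csv', index=False, encoding='utf-8-sig')
    print("Data saved to weather_data.csv")

# Example: scrape the weather for Beijing
city_code = '101010100'  # city code for Beijing
page_content = fetch_weather(city_code)

if page_content:
    weather_data = parse_weather(page_content)
    save_to_csv([weather_data])  # wrap the dict in a list so it becomes one row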

Step 4: Write the data to the CSV file

Calling save_to_csv writes the scraped city weather data to weather_data.csv, ready for later processing.
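The call in Step 3 passes a list containing one dict to save_to_csv. That list matters: pandas builds one row per dict, whereas a bare dict of scalar values makes pd.DataFrame raise a ValueError. The sketch below shows the same idea with two rows and an in-memory buffer in place of a file. The rows are made-up sample values, not scraped data.

```python
import io

import pandas as pd

# Hypothetical rows shaped like the dict returned by parse_weather
rows = [
    {"city": "Beijing", "weather": "Sunny", "temperature": "25/13℃"},
    {"city": "Shanghai", "weather": "Cloudy", "temperature": "22/17℃"},
]

df = pd.DataFrame(rows)      # one row per dict, one column per key
buffer = io.StringIO()       # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)

print(buffer.getvalue())
```

The same pattern extends to several cities: collect one dict per city in a list, then write the list once.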

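Complete source code

The complete source code follows. It requests the multi-day forecast page (the /weather/ path, not /weather1d/) over HTTPS and sends a browser User-Agent header.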

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_weather_data(city_code):
    url = f'https://www.weather.com.cn/weather/{city_code}.shtml'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

    # Fetch the page and set the encoding explicitly to avoid garbled text
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = 'utf-8'  # make sure the text is decoded correctly
    soup = BeautifulSoup(response.text, 'html.parser')

    # Current weather
    current_weather = None
    current_condition = None
    try:
        # These lookups raise AttributeError if the page structure has changed
        current_weather = soup.find('div', class_='t').find('p', class_='tem').get_text()
        current_condition = soup.find('div', class_='wea').get_text()
    except AttributeError:
        print("Could not find the current weather; the page structure may have changed")

    # Forecast for the coming days
    future_weather = []
    for item in soup.find_all('li', class_='sky'):
        date = item.find('h1').get_text(strip=True)
        weather_condition = item.find('p', class_='wea').get_text(strip=True)
        temperature = item.find('p', class_='tem').get_text(strip=True)
        future_weather.append([date, weather_condition, temperature])

    return current_weather, current_condition, future_weather


# Example: scrape the weather data for Beijing
city_code = '101010100'  # city code for Beijing
current_weather, current_condition, future_weather = get_weather_data(city_code)

# Print the current weather
if current_weather and current_condition:
    print(f"Current weather: {current_weather}, condition: {current_condition}")
else:
    print("Could not retrieve the current weather")

# Put the forecast into a DataFrame and display it
weather_df = pd.DataFrame(future_weather, columns=["Date", "Condition", "Temperature"])
print(weather_df)

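Output: the script prints the current-weather line (or the fallback message) and then the forecast DataFrame.

Saving the data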

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_weather_data(city_code):
    url = f'https://www.weather.com.cn/weather/{city_code}.shtml'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

    # Fetch the page and set the encoding explicitly to avoid garbled text
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = 'utf-8'  # make sure the text is decoded correctly
    soup = BeautifulSoup(response.text, 'html.parser')

    # Current weather
    current_weather = None
    current_condition = None
    try:
        # These lookups raise AttributeError if the page structure has changed
        current_weather = soup.find('div', class_='t').find('p', class_='tem').get_text()
        current_condition = soup.find('div', class_='wea').get_text()
    except AttributeError:
        print("Could not find the current weather; the page structure may have changed")

    # Forecast for the coming days
    future_weather = []
    for item in soup.find_all('li', class_='sky'):
        date = item.find('h1').get_text(strip=True)
        weather_condition = item.find('p', class_='wea').get_text(strip=True)
        temperature = item.find('p', class_='tem').get_text(strip=True)
        future_weather.append([date, weather_condition, temperature])

    return current_weather, current_condition, future_weather


# Example: scrape the weather data for Beijing
city_code = '101010100'  # city code for Beijing
current_weather, current_condition, future_weather = get_weather_data(city_code)

# Print the current weather
if current_weather and current_condition:
    print(f"Current weather: {current_weather}, condition: {current_condition}")
else:
    print("Could not retrieve the current weather")

# Put the forecast into a DataFrame
weather_df = pd.DataFrame(future_weather, columns=["Date", "Condition", "Temperature"])

# Save as a CSV file (utf-8-sig writes a BOM so that Excel detects the encoding)
weather_df.to_csv('weather_data.csv', index=False, encoding='utf-8-sig')

print("Weather data saved to weather_data.csv")
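The introduction mentions keeping the crawler from being blocked. Beyond the User-Agent header already sent above, one common precaution is to pause for a random interval between requests when scraping several cities. The sketch below is one way to do that. The names polite_pause and crawl, the delay range, and the injectable fetch, pause and sleep parameters are all assumptions made for illustration.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds]; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(city_codes, fetch, pause=polite_pause):
    """Call fetch(code) for each city code, pausing between requests.

    `fetch` is any function that takes a city code, for example the
    get_weather_data function from the listing above.
    """
    results = {}
    for index, code in enumerate(city_codes):
        if index > 0:
            pause()  # no pause before the first request
        results[code] = fetch(code)
    return results
```

With the listing above, crawl(['101010100'], get_weather_data) would return a dict keyed by city code. Adding more codes to the list spaces the requests out automatically.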

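Summary

This article showed how to write a Python crawler that scrapes city weather data from China Weather Network.

If you are interested, visit and subscribe to my column, Python爬虫五十个小案例 (Fifty Small Python Web-Scraping Cases): https://blog.csdn.net/null18/category_12840403.html?fromshare=blogcolumn&sharetype=blogcolumn&sharerId=12840403&sharerefer=PC&sharesource=null18&sharefrom=from_link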