Python Scrapling Anti-Scraping Tips: The Referer Header

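What is the Referer?

Referer (the HTTP referrer) is an HTTP request header that tells the server which page the user was on when they clicked the link that led to the current page. Think of it as a record of origin on the web: it tells the site "this is where I came from".

Note: the correct English spelling is "referrer", but a misspelling in the original HTTP specification stuck, so the header name has been "Referer" ever since.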

How the Referer works (a shopping site example)

Suppose you are browsing an e-commerce site. Here is how the Referer comes into play:

<!-- 1. On the product list page (https://www.example-shop.com/products) -->
<html>
<body>
  <h1>Popular Products</h1>

  <!-- The user clicks one of these links -->
  <a href="/product/iphone-15-pro">iPhone 15 Pro</a>
  <a href="/product/macbook-air">MacBook Air</a>
  <a href="/product/airpods-pro">AirPods Pro</a>
</body>
</html>

When the user clicks the "iPhone 15 Pro" link:

# HTTP request sent by the browser
# The Referer line is the key part: it tells the server "I came from the product list page"
GET /product/iphone-15-pro HTTP/1.1
Host: www.example-shop.com
Referer: https://www.example-shop.com/products
User-Agent: Mozilla/5.0...

The server now knows that this user arrived by clicking a link on the product list page.
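To see the same header from the client side, here is a minimal sketch that builds the request without sending it. It uses the standard library's urllib.request rather than any scraping library, and the two URL constants simply mirror the shop example above.

```python
from urllib.request import Request

# URLs taken from the shop example above (illustrative only).
LIST_PAGE = "https://www.example-shop.com/products"
PRODUCT_PAGE = "https://www.example-shop.com/product/iphone-15-pro"

# Build the request object only; nothing is sent over the network.
request = Request(PRODUCT_PAGE, headers={"Referer": LIST_PAGE})

print(request.get_method())           # GET
print(request.get_header("Referer"))  # https://www.example-shop.com/products
```

Inspecting the prepared request this way is a cheap check that a crawler is attaching the header it is supposed to.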

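Practical uses in web scraping

1. Getting past basic anti-scraping checks

Many sites inspect the Referer and treat a request as a bot if the header is missing or implausible: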

import requests

# ❌ Fragile: no Referer is sent
response = requests.get('https://example.com/restricted-page')
# May return 403 Forbidden if the site rejects requests without a Referer

# ✅ Better: send a plausible Referer
headers = {
    'Referer': 'https://example.com/home',  # Present the request as coming from the home page
    'User-Agent': 'Mozilla/5.0...'
}
response = requests.get('https://example.com/restricted-page', headers=headers)
# More likely to succeed
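It helps to picture what the server-side check might look like. The function below is a hypothetical sketch, not any particular site's implementation: referer_looks_valid and the allowed host set are made-up names, and real sites usually combine this signal with many others.

```python
from urllib.parse import urlparse

def referer_looks_valid(headers, allowed_hosts):
    """Return True when the Referer header points at one of the allowed hosts."""
    referer = headers.get("Referer")
    if not referer:
        # This strict sketch treats a missing Referer as suspicious.
        return False
    return urlparse(referer).hostname in allowed_hosts

allowed = {"example.com", "www.example.com"}  # hypothetical allow-list

print(referer_looks_valid({}, allowed))                                       # False
print(referer_looks_valid({"Referer": "https://example.com/home"}, allowed))  # True
print(referer_looks_valid({"Referer": "https://other.test/"}, allowed))       # False
```

A check like this is why a Referer from the site's own domain tends to work better than an arbitrary one.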

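2. Simulating a real user's browsing path

Real users rarely land on a deep page directly; they usually get there through a series of clicks: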

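from urllib.parse import urljoin
from scrapling.fetchers import Fetcher

# Simulate the full path a user takes to reach a product page.
# Note: Fetcher.get(), .css(), .attrib and .status follow Scrapling's documented
# interface; check the names against the version you have installed.
BASE_URL = 'https://www.example-shop.com/'

# Step 1: visit the home page
homepage = Fetcher.get(BASE_URL)
print(f"Home page status: {homepage.status}")

# Step 2: go from the home page to the category page
category_href = homepage.css('a.category-electronics')[0].attrib['href']
category_url = urljoin(BASE_URL, category_href)  # hrefs are often relative
headers = {
    'Referer': BASE_URL  # Tell the server "I came from the home page"
}
category_page = Fetcher.get(category_url, headers=headers)

# Step 3: go from the category page to the product detail page
product_href = category_page.css('.product-list a:first-child')[0].attrib['href']
product_url = urljoin(category_url, product_href)
headers['Referer'] = category_url  # The Referer is now the category page
product_page = Fetcher.get(product_url, headers=headers)

print("Reached the product page with a consistent Referer chain")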

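3. Handling hotlink protection (images and other resources)

Many sites protect their images and only serve them when the request comes from their own pages: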

<!-- An image on the site, protected against hotlinking -->
<img src="/images/product.jpg" alt="Product image">

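import requests

# ❌ Downloading the image directly may fail
response = requests.get('https://example.com/images/product.jpg')
# May return 403, or a "no hotlinking" placeholder image

# ✅ Better: send a Referer that points at one of the site's own pages
headers = {
    'Referer': 'https://example.com/product/123',  # Tell the server "I came from your product page"
    'User-Agent': 'Mozilla/5.0...'
}
response = requests.get('https://example.com/images/product.jpg', headers=headers)
response.raise_for_status()  # Stop on 403/404 instead of saving an error page

# Save the image
with open('product.jpg', 'wb') as f:
    f.write(response.content)
print("Image downloaded")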
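Because a blocked download can come back as an HTML error page rather than an image, it can be worth checking the response before writing it to disk. The helper below is a small sketch under that assumption; looks_like_image is a hypothetical name, and it only recognizes JPEG, PNG and GIF signatures. It cannot tell a real product photo from a "no hotlinking" placeholder, which would need a size or hash comparison.

```python
def looks_like_image(status, content_type, body):
    """Cheap sanity check before saving a downloaded image."""
    if status != 200:
        return False
    if not content_type.lower().startswith("image/"):
        return False
    # Leading magic bytes for JPEG, PNG and GIF files.
    signatures = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")
    return body.startswith(signatures)

print(looks_like_image(200, "image/jpeg", b"\xff\xd8\xff\xe0..."))  # True
print(looks_like_image(403, "text/html", b"<html>Forbidden"))        # False
print(looks_like_image(200, "image/png", b"<html>blocked"))          # False
```

With requests, the three arguments would come from response.status_code, response.headers.get('Content-Type', '') and response.content.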

Referer types and policies

1. Full Referer

Contains the complete URL, including the path and query string:

Referer: https://www.example.com/category/electronics?sort=price

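2. Origin only

Under the default policy of modern browsers (strict-origin-when-cross-origin), or with stricter privacy settings, cross-origin requests carry only the origin: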

Referer: https://www.example.com/

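3. No Referer

The header may be absent in the following cases:

The user types the URL directly into the address bar
The user opens the page from a bookmark
Privacy settings, extensions, or a no-referrer policy strip the header
The navigation goes from HTTPS to HTTP (the header is dropped on a security downgrade)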
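The three cases above can be approximated in a few lines. This is a simplified sketch of the strict-origin-when-cross-origin behaviour, not a full implementation of the Referrer Policy rules: referer_for and origin_of are hypothetical helper names, and details such as credentials in the URL and default ports are ignored.

```python
from urllib.parse import urlsplit

def origin_of(url):
    """Return scheme://host[:port] for a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def referer_for(source_url, target_url):
    """Approximate the Referer sent under strict-origin-when-cross-origin."""
    if not source_url:
        return None  # typed URL or bookmark: there is no previous page
    if urlsplit(source_url).scheme == "https" and urlsplit(target_url).scheme == "http":
        return None  # HTTPS to HTTP downgrade: send nothing
    if origin_of(source_url) == origin_of(target_url):
        return source_url.split("#", 1)[0]  # same origin: full URL without fragment
    return origin_of(source_url) + "/"      # cross origin: origin only

page = "https://www.example.com/category/electronics?sort=price"
print(referer_for(page, "https://www.example.com/product/1"))  # full URL
print(referer_for(page, "https://cdn.example.net/app.js"))      # https://www.example.com/
print(referer_for(page, "http://legacy.example.org/"))          # None
print(referer_for(None, "https://www.example.com/"))            # None
```

A crawler that follows the same rules produces Referer values that match what a browser would send for each kind of navigation.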

Best practices in real scraping projects

Example: crawling a site that expects sequential browsing

import time
import random
from urllib.parse import urljoin
from scrapling.fetchers import StealthySession

class SequentialCrawler:
    """Crawler that simulates a user browsing from page to page."""

    def __init__(self, session, base_url):
        self.session = session      # an already-open StealthySession
        self.base_url = base_url
        self.current_url = None     # no previous page before the first request
        self.history = []           # browsing history

    def navigate_to(self, path):
        """Navigate to a new page and manage the Referer automatically."""
        # Resolve relative links against the page we are currently on
        target_url = urljoin(self.current_url or self.base_url, path)

        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        }
        if self.current_url:
            headers['Referer'] = self.current_url  # always the previous page

        print(f"Navigating from [{self.get_page_name(self.current_url)}] to [{self.get_page_name(target_url)}]")
        print(f"Referer: {self.current_url}")

        # Assumption: session.fetch() accepts extra_headers, and google_search=False
        # keeps Scrapling from substituting a Google-search Referer. Check these
        # argument names against your installed Scrapling version.
        response = self.session.fetch(target_url, extra_headers=headers, google_search=False)

        # Record the step
        self.history.append({
            'from': self.current_url,
            'to': target_url,
            'status': response.status
        })

        # Update the current page
        self.current_url = target_url

        # Human-like delay
        time.sleep(random.uniform(1, 3))

        return response

    def get_page_name(self, url):
        """Derive a readable page name from the URL."""
        if not url:
            return 'direct entry'
        if 'product' in url:
            return 'product page'
        elif 'category' in url:
            return 'category page'
        elif 'cart' in url:
            return 'cart'
        elif 'checkout' in url:
            return 'checkout page'
        else:
            return 'home page'

    def simulate_shopping(self):
        """Simulate a complete shopping flow."""
        print("=== Starting simulated shopping flow ===\n")

        # 1. Home page
        response = self.navigate_to('/')

        # 2. From the home page to the electronics category
        electronics_link = response.css('a[href*="electronics"]')[0].attrib['href']
        response = self.navigate_to(electronics_link)

        # 3. Click the first product
        first_product = response.css('.product-card:first-child a')[0].attrib['href']
        response = self.navigate_to(first_product)

        # 4. Add it to the cart
        add_to_cart = response.css('button.add-to-cart')[0].attrib['data-action']
        response = self.navigate_to(f'/cart/add?product={add_to_cart}')

        # 5. View the cart
        response = self.navigate_to('/cart')

        # 6. Go to checkout
        checkout_link = response.css('a.checkout')[0].attrib['href']
        response = self.navigate_to(checkout_link)

        print("\n=== Flow complete ===")
        print(f"Visited {len(self.history)} pages in total")
        print("Browsing history:")
        for i, step in enumerate(self.history, 1):
            print(f"  {i}. {self.get_page_name(step['from'])} → {self.get_page_name(step['to'])}")

# Usage example: the session is opened and closed by the with block
with StealthySession(headless=True) as session:
    crawler = SequentialCrawler(session, 'https://www.example-shop.com')
    crawler.simulate_shopping()

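Common questions and solutions

Question 1: When is a Referer needed?

Scenario	Referer needed?	Reason
Visiting the home page directly	Optional	The home page is a normal entry point
Following an internal link	Recommended	A real browser normally sends one
Submitting a form	Usually required	Login and registration forms in particular are often checked
AJAX requests	Usually required	Servers often validate Referer or Origin as a CSRF defense
Downloading resources	Usually required	Needed to pass hotlink protection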

Question 2: How do I choose a suitable Referer?

from urllib.parse import urlparse

def get_smart_referer(current_url, target_url):
    """Pick a Referer based on how the current and target hosts relate."""
    current = urlparse(current_url)
    target = urlparse(target_url)

    # Same host: use the current page
    if current.netloc == target.netloc:
        return current_url

    # Different host, but a known related one (static/image hosts of the same site)
    elif target.netloc in ['static.example.com', 'img.example.com']:
        return f'https://{current.netloc}/'  # origin of the main site

    # Unrelated host: send no Referer
    else:
        return None

# Usage example
current = 'https://www.example.com/product/123'
target = 'https://static.example.com/images/123.jpg'

referer = get_smart_referer(current, target)
print(f"Chosen Referer: {referer}")  # https://www.example.com/

Question 3: What if the site blocks my requests?

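import random
import time
from scrapling.fetchers import StealthyFetcher

class RobustCrawler:
    """Crawler that retries with a different Referer when a request is rejected."""

    def __init__(self):
        self.referer_pool = [
            'https://www.google.com/',
            'https://www.baidu.com/',
            'https://www.bing.com/',
            'https://www.example.com/',
            None,  # Sometimes sending no Referer is the right choice
        ]

    def fetch_with_retry(self, url, max_retries=3):
        """Fetch a URL, retrying with another Referer after a 403."""
        for attempt in range(max_retries):
            try:
                # Pick a Referer at random; None means "send no Referer header"
                referer = random.choice(self.referer_pool)
                headers = {'Referer': referer} if referer else {}

                # Assumption: StealthyFetcher.fetch() accepts extra_headers, and
                # google_search=False stops it from overriding our Referer.
                # Check these argument names against your Scrapling version.
                response = StealthyFetcher.fetch(url, extra_headers=headers, google_search=False)

                if response.status == 403:
                    print(f"Attempt {attempt+1}: rejected, trying a different Referer...")
                    time.sleep(random.uniform(1, 3))  # brief pause before retrying
                    continue

                return response

            except Exception as e:
                print(f"Attempt {attempt+1} failed: {e}")

        raise RuntimeError(f"Fetch failed after {max_retries} attempts: {url}")

# Usage example
crawler = RobustCrawler()
response = crawler.fetch_with_retry('https://protected-site.com/data')
# html_content is assumed to hold the page's HTML as a string
print(f"Fetched {len(response.html_content)} characters")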

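Privacy and security considerations

Information the Referer can leak:

Browsing history: the site learns which page you came from
Search keywords: links followed from a search results page may carry the query in the URL
User identifiers: the URL may contain a user ID or session ID

Ways to protect privacy: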

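As a scraper operator, you can:
1. Send a Referer only when it is needed
2. Avoid sending a Referer that contains sensitive information
3. Clear stored crawl history regularly

As a site developer, you should:
1. Validate the Referer (or Origin) as one layer of CSRF defense, alongside CSRF tokens
2. Avoid logging the Referer for sensitive pages
3. Use the Referrer-Policy response header to control what browsers send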
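For the point about not sending a Referer that contains sensitive information, one option is to strip risky query parameters before the URL is used as a Referer. The sketch below assumes a hypothetical set of parameter names (SENSITIVE_PARAMS) and a hypothetical helper called sanitize_referer; adjust the set to whatever the target site actually puts in its URLs.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names that should never leave the crawler.
SENSITIVE_PARAMS = {"token", "session", "sessionid", "sid", "user_id"}

def sanitize_referer(url):
    """Drop sensitive query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    # Rebuild the URL with the filtered query and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(sanitize_referer("https://www.example.com/orders?sid=abc123&page=2#top"))
# https://www.example.com/orders?page=2
```

Running every outgoing Referer through a filter like this keeps session identifiers out of third-party server logs.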