Feapder UpdateItem Tips: How to Elegantly Update Only the Fields That Have Values

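Introduction

When collecting data with the Feapder crawler framework, a common pattern is to scrape a few fields from a list page and the full record from a detail page. If the data is written with UpdateItem, every field is updated by default, including fields whose value is None, so an update can overwrite valid data that is already stored in the database.

This article analyzes how Feapder's UpdateItem works and, using a typical e-commerce product crawl as the example, presents several ways to update only the fields that have values.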

Background

Suppose we want to crawl product information from an e-commerce platform, with the following data structure:

class ProductItem(UpdateItem):
    __table_name__ = "products"
    __unique_key__ = ["product_id"]

    def __init__(self, *args, **kwargs):
        # Fields are declared for readability: the code doubles as documentation
        self.product_id = None  # product ID
        self.title = None  # product title
        self.price = None  # product price
        self.brand = None  # brand
        self.category = None  # category
        self.shop_name = None  # shop name
        self.description = None  # product description
        self.specifications = None  # specifications
        self.image_urls = None  # product images
        self.stock_count = None  # stock count
        self.sales_count = None  # sales count
        self.comment_count = None  # comment count
        self.rating_score = None  # rating
        # ... more fields

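Why define fields in the Item?

1. Code as documentation

The field definitions are the best documentation: a developer can see at a glance what data the Item holds
IDEs can offer completion for the attribute names
Team collaboration is easier, since new members can grasp the data structure quickly

2. Better readability

Explicit field definitions make the code clearer
Comments explain the meaning and purpose of each field
Later maintenance and changes are easier

3. Easier debugging

All fields are visible while debugging
Setting breakpoints and inspecting values is straightforward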

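Typical crawling scenarios

Scenario 1: product list page

Only basic information is available: product ID, title, price, brand
Detailed information is not available: description, specifications, sales count, etc.

Scenario 2: product detail page

The full information is available: description, specifications, sales count, comment count, etc.
The basic information seen on the list page may have changed in the meantime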

The problem in practice

When parsing the product list page:

def parse_list_page(self, request, response):
    """Parse the product list page"""
    for product in response.xpath('//div[@class="product-item"]'):
        item = ProductItem()
        
        # Only basic information is available on the list page
        item.product_id = product.xpath('.//@data-product-id').extract_first()
        item.title = product.xpath('.//div[@class="product-title"]//text()').extract_first()
        item.price = product.xpath('.//div[@class="product-price"]//text()').extract_first()
        item.brand = product.xpath('.//div[@class="product-brand"]//text()').extract_first()
        
        # All other fields stay None
        # item.description = None  # only on the detail page
        # item.specifications = None  # only on the detail page
        # item.sales_count = None  # only on the detail page
        
        yield item

When parsing the product detail page:

def parse_detail_page(self, request, response):
    """Parse the product detail page"""
    item = ProductItem()
    
    # The detail page provides the full information
    item.product_id = request.meta.get('product_id')
    item.description = response.xpath('//div[@class="product-description"]/text()').extract_first()
    item.specifications = response.xpath('//div[@class="product-specs"]//text()').extract()
    item.sales_count = response.xpath('//div[@class="sales-count"]//text()').extract_first()
    item.comment_count = response.xpath('//div[@class="comment-count"]//text()').extract_first()
    item.rating_score = response.xpath('//div[@class="rating-score"]//text()').extract_first()
    
    # The basic information from the list page may have changed, so refresh it
    item.title = response.xpath('//h1[@class="product-title"]/text()').extract_first()
    item.price = response.xpath('//span[@class="current-price"]/text()').extract_first()
    
    yield item

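The problem: if the database already holds a record for this product and some of its columns have values (description, specifications, and so on), updating through UpdateItem writes the item's None values over that valid data.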

How Feapder's UpdateItem works

Core principle

Feapder's UpdateItem relies on MySQL's INSERT ... ON DUPLICATE KEY UPDATE:

sql 复制代码

INSERT INTO products (product_id, title, price, description, specifications) 
VALUES ('12345', 'iPhone 15', '5999', NULL, NULL) 
ON DUPLICATE KEY UPDATE 
    title=VALUES(title), 
    price=VALUES(price), 
    description=VALUES(description),  -- this overwrites the existing data!
    specifications=VALUES(specifications);  -- so does this!
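
To make the effect of the column list concrete, here is a minimal sketch of how such a statement could be assembled from a dictionary of fields. The helper `build_upsert_sql` is hypothetical and written for illustration only; it is not Feapder's own SQL builder. It shows that once the None entries are dropped from the dictionary, the corresponding columns no longer appear in the statement at all, so MySQL leaves them untouched.

```python
def build_upsert_sql(table, row, unique_keys):
    """Build a parameterized MySQL upsert for one row (illustrative helper)."""
    columns = list(row)
    placeholders = ", ".join(["%s"] * len(columns))
    # Unique-key columns identify the row, so they are not reassigned on conflict.
    updates = ", ".join(
        f"{col}=VALUES({col})" for col in columns if col not in unique_keys
    )
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )
    return sql, [row[col] for col in columns]


full_row = {
    "product_id": "12345",
    "title": "iPhone 15",
    "price": "5999",
    "description": None,
    "specifications": None,
}
# Same row with the None entries removed before building the statement.
non_null_row = {key: value for key, value in full_row.items() if value is not None}

sql_all, params_all = build_upsert_sql("products", full_row, ["product_id"])
sql_some, params_some = build_upsert_sql("products", non_null_row, ["product_id"])

print(sql_all)   # mentions description and specifications
print(sql_some)  # mentions only product_id, title and price
```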

Source code analysis

The to_dict property in feapder/network/item.py:

@property
def to_dict(self):
    propertys = {}
    for key, value in self.__dict__.items():
        if key not in ("__name__", "__table_name__", "__name_underline__", "__update_key__", "__unique_key__"):
            if key.startswith(f"_{self.__class__.__name__}"):
                key = key.replace(f"_{self.__class__.__name__}", "")
            propertys[key] = value  # None values are included
    return propertys

Key finding: to_dict returns every field, including those set to None, which is why None values end up written to the database.
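
The behaviour can be reproduced without a database. The sketch below uses a stand-in base class, `FakeItem`, that copies the loop from the excerpt above; it is an assumption made so the example runs on its own, not Feapder's actual Item class.

```python
class FakeItem:
    """Stand-in for an item base class; NOT feapder's actual Item."""

    INTERNAL_KEYS = (
        "__name__",
        "__table_name__",
        "__name_underline__",
        "__update_key__",
        "__unique_key__",
    )

    @property
    def to_dict(self):
        result = {}
        prefix = f"_{self.__class__.__name__}"
        for key, value in self.__dict__.items():
            if key in self.INTERNAL_KEYS:
                continue
            if key.startswith(prefix):
                key = key.replace(prefix, "")
            result[key] = value  # None is kept, as in the excerpt
        return result


class ProductItem(FakeItem):
    def __init__(self):
        self.product_id = None
        self.title = None
        self.description = None


item = ProductItem()
item.product_id = "12345"
item.title = "iPhone 15"

print(item.to_dict)
# {'product_id': '12345', 'title': 'iPhone 15', 'description': None}
```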

Solutions

Option 1: override to_dict (recommended)

class ProductItem(UpdateItem):
    __table_name__ = "products"
    __unique_key__ = ["product_id"]

    def __init__(self, *args, **kwargs):
        # Fields are declared for readability: the code doubles as documentation
        self.product_id = None  # product ID
        self.title = None  # product title
        self.price = None  # product price
        self.brand = None  # brand
        self.category = None  # category
        self.shop_name = None  # shop name
        self.description = None  # product description
        self.specifications = None  # specifications
        self.image_urls = None  # product images
        self.stock_count = None  # stock count
        self.sales_count = None  # sales count
        self.comment_count = None  # comment count
        self.rating_score = None  # rating

    @property
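    def to_dict(self):
        """
        Override to_dict so that it returns only the fields whose value is not None.
        UpdateItem then updates only those fields and leaves existing non-empty columns alone.
        """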
        propertys = {}
        for key, value in self.__dict__.items():
            # Skip the framework's internal attributes
            if key in (
                "__name__",
                "__table_name__", 
                "__name_underline__",
                "__update_key__",
                "__unique_key__",
            ):
                continue
            
            # Strip the name-mangling prefix (_ClassName) from private attributes
            if key.startswith(f"_{self.__class__.__name__}"):
                key = key.replace(f"_{self.__class__.__name__}", "")
            
            # Keep only the fields whose value is not None
            if value is not None:
                propertys[key] = value
        
        return propertys
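
One detail of the `value is not None` check is worth keeping in mind. It keeps 0 and False, which is usually what you want (a stock count of 0 is real data), but it also keeps empty strings and empty lists, and in parsel-style selectors an extract() call that matches nothing returns an empty list rather than None. The sketch below compares the plain None check with a slightly stricter predicate; `is_missing` and `keep_present` are hypothetical helper names, and whether empty values should count as missing is a choice that depends on your data.

```python
def is_missing(value):
    """Return True for None and for empty strings or containers; keep 0 and False."""
    if value is None:
        return True
    if isinstance(value, (str, bytes, list, tuple, dict, set)):
        return len(value) == 0
    return False


def keep_present(fields):
    """Drop entries whose value counts as missing under is_missing()."""
    return {key: value for key, value in fields.items() if not is_missing(value)}


scraped = {
    "product_id": "12345",
    "stock_count": 0,       # a real value: the product is sold out
    "description": "",      # nothing useful was extracted
    "specifications": [],   # e.g. an extract() call that matched no nodes
    "brand": None,
}

only_not_none = {key: value for key, value in scraped.items() if value is not None}

print(sorted(only_not_none))
# ['description', 'product_id', 'specifications', 'stock_count']
print(keep_present(scraped))
# {'product_id': '12345', 'stock_count': 0}
```

If you adopt a predicate like this, it would replace the `value is not None` test inside to_dict.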

Option 2: a class decorator

def filter_none_fields(cls):
    """
    Class decorator: make an UpdateItem update only its non-None fields
    """
    @property
    def to_dict(self):
        propertys = {}
        for key, value in self.__dict__.items():
            if key in (
                "__name__",
                "__table_name__", 
                "__name_underline__",
                "__update_key__",
                "__unique_key__",
            ):
                continue
            
            if key.startswith(f"_{self.__class__.__name__}"):
                key = key.replace(f"_{self.__class__.__name__}", "")
            
            if value is not None:
                propertys[key] = value
        
        return propertys
    
    cls.to_dict = to_dict
    return cls

@filter_none_fields
class ProductItem(UpdateItem):
    __table_name__ = "products"
    __unique_key__ = ["product_id"]

    def __init__(self, *args, **kwargs):
        # Fields are declared for readability: the code doubles as documentation
        self.product_id = None  # product ID
        self.title = None  # product title
        self.price = None  # product price
        self.brand = None  # brand
        self.description = None  # product description
        self.specifications = None  # specifications
        self.sales_count = None  # sales count
        self.comment_count = None  # comment count
        self.rating_score = None  # rating

Option 3: subclass and extend UpdateItem

class SmartUpdateItem(UpdateItem):
    """
    Smart UpdateItem: updates only non-None fields
    """
    
    @property
    def to_dict(self):
        """Return only the fields whose value is not None"""
        propertys = {}
        for key, value in self.__dict__.items():
            if key in (
                "__name__",
                "__table_name__", 
                "__name_underline__",
                "__update_key__",
                "__unique_key__",
            ):
                continue
            
            if key.startswith(f"_{self.__class__.__name__}"):
                key = key.replace(f"_{self.__class__.__name__}", "")
            
            if value is not None:
                propertys[key] = value
        
        return propertys

class ProductItem(SmartUpdateItem):
    __table_name__ = "products"
    __unique_key__ = ["product_id"]

    def __init__(self, *args, **kwargs):
        # Fields are declared for readability: the code doubles as documentation
        self.product_id = None  # product ID
        self.title = None  # product title
        self.price = None  # product price
        self.brand = None  # brand
        self.description = None  # product description
        self.specifications = None  # specifications
        self.sales_count = None  # sales count
        self.comment_count = None  # comment count
        self.rating_score = None  # rating
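
All three options repeat the framework's list of internal keys. A possible variation, sketched here under the assumption that the parent class exposes to_dict as a property, is to call the parent's to_dict through super() and filter its result, so the subclass does not need to know about those keys. `BaseItem` below is a stand-in written for this example, not Feapder's UpdateItem.

```python
class BaseItem:
    """Stand-in parent with a to_dict property; NOT feapder's UpdateItem."""

    @property
    def to_dict(self):
        # Simplified: expose every instance attribute that is not a dunder name.
        return {
            key: value
            for key, value in self.__dict__.items()
            if not key.startswith("__")
        }


class SmartUpdateItem(BaseItem):
    """Reuses the parent's to_dict and drops the None entries."""

    @property
    def to_dict(self):
        return {
            key: value
            for key, value in super().to_dict.items()
            if value is not None
        }


class ProductItem(SmartUpdateItem):
    def __init__(self):
        self.product_id = None
        self.title = None
        self.description = None


item = ProductItem()
item.product_id = "12345"
item.title = "iPhone 15"

print(item.to_dict)
# {'product_id': '12345', 'title': 'iPhone 15'}
```

The trade-off is a dependency on the parent's behaviour: if the parent's to_dict changes in a later release, the subclass follows it automatically, for better or worse.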

Practical example

A complete spider

import feapder
from feapder import UpdateItem

class ProductItem(UpdateItem):
    __table_name__ = "products"
    __unique_key__ = ["product_id"]

    def __init__(self, *args, **kwargs):
        # Fields are declared for readability: the code doubles as documentation
        self.product_id = None  # product ID
        self.title = None  # product title
        self.price = None  # product price
        self.brand = None  # brand
        self.category = None  # category
        self.shop_name = None  # shop name
        self.description = None  # product description
        self.specifications = None  # specifications
        self.image_urls = None  # product images
        self.stock_count = None  # stock count
        self.sales_count = None  # sales count
        self.comment_count = None  # comment count
        self.rating_score = None  # rating

    @property
    def to_dict(self):
        """Return only the fields whose value is not None"""
        propertys = {}
        for key, value in self.__dict__.items():
            if key in (
                "__name__",
                "__table_name__", 
                "__name_underline__",
                "__update_key__",
                "__unique_key__",
            ):
                continue
            
            if key.startswith(f"_{self.__class__.__name__}"):
                key = key.replace(f"_{self.__class__.__name__}", "")
            
            if value is not None:
                propertys[key] = value
        
        return propertys

class ProductSpider(feapder.BatchSpider):
    """Product spider"""

    def start_requests(self, task):
        """Build list-page requests from the task table"""
        task_id, list_url = task
        yield feapder.Request(list_url, task_id=task_id, callback=self.parse_list_page)

    def parse_list_page(self, request, response):
        """Parse the product list page"""
        for product in response.xpath('//div[@class="product-item"]'):
            item = ProductItem()

            # Only basic information is available on the list page
            item.product_id = product.xpath('.//@data-product-id').extract_first()
            item.title = product.xpath('.//div[@class="product-title"]//text()').extract_first()
            item.price = product.xpath('.//div[@class="product-price"]//text()').extract_first()
            item.brand = product.xpath('.//div[@class="product-brand"]//text()').extract_first()
            
            # Only fields with values are written, so description, specifications, etc. in the database are not overwritten
            yield item

            # Also request the detail page
            detail_url = product.xpath('.//div[@class="product-title"]//a/@href').extract_first()
            if detail_url:
                yield feapder.Request(
                    detail_url, 
                    callback=self.parse_detail_page,
                    meta={'product_id': item.product_id}
                )

    def parse_detail_page(self, request, response):
        """Parse the product detail page"""
        item = ProductItem()

        # The detail page provides the full information
        item.product_id = request.meta.get('product_id')
        item.description = response.xpath('//div[@class="product-description"]/text()').extract_first()
        item.specifications = response.xpath('//div[@class="product-specs"]//text()').extract()
        item.sales_count = response.xpath('//div[@class="sales-count"]//text()').extract_first()
        item.comment_count = response.xpath('//div[@class="comment-count"]//text()').extract_first()
        item.rating_score = response.xpath('//div[@class="rating-score"]//text()').extract_first()
        
        # Refresh basic information that may have changed
        item.title = response.xpath('//h1[@class="product-title"]/text()').extract_first()
        item.price = response.xpath('//span[@class="current-price"]/text()').extract_first()
        
        # Only fields with values are written; other columns are left untouched
        yield item

Testing

def test_update_logic():
    """Test the update logic"""
    # What the existing database row looks like (for reference; not used by the assertions)
    existing_data = {
        'product_id': '12345',
        'title': 'iPhone 15',
        'price': '5999',
        'description': "Apple's latest phone",
        'specifications': '6.1-inch display, A17 chip',
        'sales_count': '1000+'
    }

    # Build the update item (simulating list-page data)
    item = ProductItem()
    item.product_id = '12345'  # unique key
    item.title = 'iPhone 15 Pro'  # title changed
    item.price = '6999'  # price changed
    # description, specifications and sales_count stay None

    # Check that to_dict contains only the non-None fields
    result_dict = item.to_dict
    assert 'product_id' in result_dict
    assert 'title' in result_dict
    assert 'price' in result_dict
    assert 'description' not in result_dict  # None value filtered out
    assert 'specifications' not in result_dict  # None value filtered out
    assert 'sales_count' not in result_dict  # None value filtered out

    print("Test passed: only fields with values are updated")
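
The test above checks the output of to_dict but never touches existing_data. To see the effect on the stored row, the upsert can be imitated with a plain dictionary merge. This is only a model of ON DUPLICATE KEY UPDATE for a single row (every incoming column overwrites the stored one), and `apply_upsert` is a hypothetical helper, not part of Feapder.

```python
def apply_upsert(existing_row, incoming):
    """Model an upsert on one row: every incoming column overwrites the stored value."""
    merged = dict(existing_row)
    merged.update(incoming)
    return merged


existing = {
    "product_id": "12345",
    "title": "iPhone 15",
    "price": "5999",
    "description": "Apple's latest phone",
    "sales_count": "1000+",
}

# What the default to_dict would send for a list-page item.
unfiltered = {
    "product_id": "12345",
    "title": "iPhone 15 Pro",
    "price": "6999",
    "description": None,
    "sales_count": None,
}
# What the overridden to_dict would send.
filtered = {key: value for key, value in unfiltered.items() if value is not None}

after_unfiltered = apply_upsert(existing, unfiltered)
after_filtered = apply_upsert(existing, filtered)

print(after_unfiltered["description"])  # None: the stored text was lost
print(after_filtered["description"])    # Apple's latest phone
```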

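Best practices

1. Prefer option 1

Advantages:

Simple and direct: only one method to override
Lightweight: no extra decorator or base class to maintain
Easy to understand: the logic is clear and easy to maintain
Matches the business logic: "only update the fields that have values" falls out naturally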

2. Benefits of defining fields in the Item

Code as documentation:

class ProductItem(UpdateItem):
    def __init__(self, *args, **kwargs):
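        # The field definitions are the best documentation
        self.product_id = None  # product ID: unique identifier
        self.title = None  # product title: used for display and search
        self.price = None  # product price: current price
        self.brand = None  # brand: used for brand filtering
        self.description = None  # product description: detailed text
        self.specifications = None  # specifications: technical parameters
        self.sales_count = None  # sales count: sales statistics
        self.comment_count = None  # comment count: user feedback
        self.rating_score = None  # rating: user reviews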

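Higher development efficiency:

IDEs can offer code completion
Setting breakpoints and debugging is easier
New team members can grasp the data structure quickly
Later maintenance and changes are easier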

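3. Collecting data in stages

For data that has to be collected in stages (for example list page plus detail page), we suggest:

List page: collect the basic information and create the record quickly
Detail page: fill in the detailed information and refresh basic information that may have changed
Use UpdateItem: records are not duplicated, and only the fields that carry values are updated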
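
One point worth verifying in your own setup: items are typically written in batches, and a writer that builds a single multi-row statement needs every row to have the same columns. If list-page and detail-page items with different field sets land in the same batch, the missing columns might be padded with NULL, which would bring the original problem back. Whether this happens depends on the Feapder version and pipeline in use, so treat it as an assumption to check rather than a statement about the framework. The sketch below shows one way a custom writer could sidestep the issue by grouping rows by their exact column set; `group_by_columns` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_columns(rows):
    """Group row dicts by their exact set of column names."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(sorted(row))].append(row)
    return dict(groups)


rows = [
    {"product_id": "1", "title": "A", "price": "10"},               # list page
    {"product_id": "2", "title": "B", "price": "20"},               # list page
    {"product_id": "1", "description": "text", "sales_count": "5"}, # detail page
]

groups = group_by_columns(rows)
for columns, batch in groups.items():
    # Each group can be written with one statement whose column list fits every row.
    print(columns, len(batch))
```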

4. Monitoring and logging

from feapder.utils.log import log  # feapder's shared logger

def parse_list_page(self, request, response):
    """Parse the product list page"""
    for product in response.xpath('//div[@class="product-item"]'):
        item = ProductItem()

        # List-page data
        item.product_id = product.xpath('.//@data-product-id').extract_first()
        item.title = product.xpath('.//div[@class="product-title"]//text()').extract_first()
        item.price = product.xpath('.//div[@class="product-price"]//text()').extract_first()

        # Log what was collected
        log.info(f"List page product: ID={item.product_id}, title={item.title}, price={item.price}")

        yield item
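
Since the point of the override is that some fields are silently left out, it can also help to log which ones were skipped. The following sketch uses the standard logging module so that it runs on its own; `split_fields`, `log_item` and the logger name are hypothetical and would need adapting to your spider.

```python
import logging

logger = logging.getLogger("product_spider")  # hypothetical logger name


def split_fields(fields):
    """Return (fields to write, sorted names of fields skipped because they are None)."""
    kept = {key: value for key, value in fields.items() if value is not None}
    skipped = sorted(key for key, value in fields.items() if value is None)
    return kept, skipped


def log_item(fields):
    """Log which columns will be written and which are left untouched."""
    kept, skipped = split_fields(fields)
    logger.info(
        "product %s: writing %s, leaving untouched %s",
        fields.get("product_id"),
        sorted(kept),
        skipped,
    )
    return kept


kept = log_item(
    {"product_id": "12345", "title": "iPhone 15", "description": None, "brand": None}
)
```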

5. Field naming conventions

class ProductItem(UpdateItem):
    def __init__(self, *args, **kwargs):
        # Use snake_case, consistent with the database column names
        self.product_id = None  # product ID
        self.product_title = None  # product title
        self.current_price = None  # current price
        self.brand_name = None  # brand name
        self.category_id = None  # category ID
        self.shop_id = None  # shop ID
        self.product_description = None  # product description
        self.technical_specs = None  # technical specifications
        self.image_url_list = None  # list of image URLs
        self.stock_quantity = None  # stock quantity
        self.total_sales = None  # total sales
        self.comment_total = None  # total number of comments
        self.average_rating = None  # average rating

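Summary

By overriding UpdateItem's to_dict, we can update only the fields that have values. This approach:

Matches the business logic: only fields with values are updated
Keeps code as documentation: defining fields in the Item improves readability and maintainability
Keeps the code concise: business code does not have to decide which fields to update
Is cheap: the only added cost is a None check per field
Is easy to maintain: the logic is clear and easy to follow
Applies widely: e-commerce, news, recruitment and other crawls that collect data in stages

The result is spider code that reads naturally, in keeping with the idea that the framework should behave the way the business code is written. Whether the data is product information, news articles or job postings, the same pattern can be used to collect and update it.

Key points:

Fields are defined in the Item for readability: the code is the documentation
Overriding to_dict is what makes only fields with values get updated
Predefining every field is not required, but it helps others understand and maintain the code
The pattern suits crawls that collect data in stages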