Fixing the "xlrd 2.0+ only supports the xls format" error

Background

While processing Excel files with Python recently, I ran into a common error:

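Processing failed: failed to process file: Your version of xlrd is 2.0.2. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.

The cause is that xlrd removed support for the .xlsx format in version 2.0 and now reads only the legacy .xls format. Any code that still routes .xlsx files through xlrd, directly or via another library, fails with this message.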

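Analysis

Why the change?

History: xlrd is a long-standing library for reading Excel files
Format differences: .xlsx is an open, XML-based format (a ZIP archive of XML parts), whereas .xls is a binary format
Maintenance cost: maintaining parsers for two unrelated formats is expensive
Project direction: the maintainers decided to focus on .xls and recommend other libraries for .xlsx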

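Who is affected

Code that reads .xlsx files with xlrd directly
Code that reads .xlsx files through pandas or another library that delegates to xlrd
Projects that need to handle both .xls and .xlsx files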

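Solutions

Option 1: install openpyxl (recommended)

This is what the error message itself recommends, and it is the most widely used approach today.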

# Install with pip
pip install openpyxl

# Install with conda
conda install openpyxl

# If you use requirements.txt,
# add the line: openpyxl>=3.0.0
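
After installing, it can help to confirm that the engines you need are importable in the interpreter you actually run. The following is a minimal sketch using only the standard library; the names ENGINE_PACKAGES, is_installed and missing_engines are made up for this example and are not part of pandas, openpyxl or xlrd.

```python
from importlib.util import find_spec

# Which package handles which extension (as used throughout this article).
ENGINE_PACKAGES = {
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xls": "xlrd",
}


def is_installed(package_name):
    """Return True if the top-level package can be found in this environment."""
    return find_spec(package_name) is not None


def missing_engines(extensions):
    """Return the sorted package names needed for `extensions` but not installed."""
    needed = {
        ENGINE_PACKAGES[ext.lower()]
        for ext in extensions
        if ext.lower() in ENGINE_PACKAGES
    }
    return sorted(pkg for pkg in needed if not is_installed(pkg))


missing = missing_engines([".xlsx", ".xls"])
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required Excel engines are installed.")
```

Running this with the same Python executable as your application avoids the common case where pip installed openpyxl into a different environment.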

Option 2: downgrade xlrd

If you have legacy code that you do not want to change yet, you can downgrade xlrd instead.

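# Downgrade to version 1.2.0
pip install xlrd==1.2.0

Note: this is not recommended as a long-term fix, because the old release may lack security updates and new features. Recent pandas releases may also require a newer xlrd than 1.2.0.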
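
Before deciding between the two options, you may want to check programmatically which xlrd you have. This is a small sketch built on the standard library's importlib.metadata; the helper names installed_version and xlrd_can_read_xlsx are invented for illustration, and the rule they encode is simply the one from the error message (.xlsx support ended with 2.0).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def xlrd_can_read_xlsx(xlrd_version):
    """xlrd dropped .xlsx support in 2.0, so only releases below 2.0 can read it."""
    major = int(xlrd_version.split(".")[0])
    return major < 2


current = installed_version("xlrd")
if current is None:
    print("xlrd is not installed")
else:
    print(f"xlrd {current}: reads .xlsx = {xlrd_can_read_xlsx(current)}")
```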

Code migration guide

1. Reading Excel files with pandas

import pandas as pd
import os

def read_excel_auto(file_path):
    """
    Pick the read engine automatically from the file extension.
    """
    _, ext = os.path.splitext(file_path)
    
    if ext.lower() == '.xlsx':
        # Read .xlsx files with openpyxl
        return pd.read_excel(file_path, engine='openpyxl')
    elif ext.lower() == '.xls':
        # Read .xls files with xlrd
        return pd.read_excel(file_path, engine='xlrd')
    elif ext.lower() == '.xlsm':
        # .xlsm files also go through openpyxl
        return pd.read_excel(file_path, engine='openpyxl')
    else:
        raise ValueError(f"Unsupported file format: {ext}")

# Usage example
try:
    df = read_excel_auto('data.xlsx')
    print("File read successfully!")
    print(df.head())
except Exception as e:
    print(f"Read failed: {e}")

2. Using openpyxl directly

If you need finer-grained control, you can use openpyxl directly:

from openpyxl import load_workbook

def read_excel_detailed(file_path):
    """
    Read an Excel file in detail with openpyxl.
    """
    # Load the workbook
    wb = load_workbook(filename=file_path, 
                       read_only=False,  # set to True to speed up reading large files
                       data_only=True)   # return cached values instead of formulas
    
    # List all worksheet names
    sheet_names = wb.sheetnames
    print(f"Worksheets: {sheet_names}")
    
    # Use the active worksheet (usually, but not always, the first one)
    ws = wb.active
    
    # Read a single cell value
    cell_value = ws['A1'].value
    print(f"Value of cell A1: {cell_value}")
    
    # Iterate over the first 10 rows
    data = []
    for row in ws.iter_rows(min_row=1, max_row=10, values_only=True):
        data.append(row)
    
    return data

# Usage example
data = read_excel_detailed('example.xlsx')

3. Batch-processing multiple files

import pandas as pd
import glob
from pathlib import Path

def batch_process_excel_files(folder_path, output_path=None):
    """
    Batch-process every Excel file in a folder.
    """
    # Supported file formats
    extensions = ['*.xlsx', '*.xls', '*.xlsm']
    
    all_dataframes = {}
    
    for ext in extensions:
        pattern = str(Path(folder_path) / ext)
        files = glob.glob(pattern)
        
        for file in files:
            try:
                # Compare the suffix case-insensitively so .XLS is handled too
                if Path(file).suffix.lower() == '.xls':
                    df = pd.read_excel(file, engine='xlrd')
                else:
                    df = pd.read_excel(file, engine='openpyxl')
                
                # Note: files sharing a stem (a.xls, a.xlsx) overwrite each other here
                filename = Path(file).stem
                all_dataframes[filename] = df
                print(f"Read successfully: {file}")
                
            except Exception as e:
                print(f"Read failed {file}: {e}")
    
    # If an output path was given, save everything into one workbook
    if output_path and all_dataframes:
        with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
            for name, df in all_dataframes.items():
                df.to_excel(writer, sheet_name=name[:31])  # sheet names are limited to 31 characters
        print(f"Results saved to: {output_path}")
    
    return all_dataframes
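
Truncating a file stem to 31 characters can still yield duplicate sheet names, or names containing characters Excel rejects. The sketch below shows one possible way to sanitize and de-duplicate names before writing; unique_sheet_name is a hypothetical helper written for this article, not a pandas or openpyxl function, and it only covers the length limit, the forbidden characters [ ] : * ? / \ and case-insensitive uniqueness.

```python
import re

# Characters Excel does not allow in worksheet names.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LENGTH = 31


def unique_sheet_name(name, used):
    """Return a valid sheet name not yet in `used`, and record it in `used`."""
    base = _INVALID_SHEET_CHARS.sub("_", name)[:MAX_SHEET_NAME_LENGTH] or "Sheet"
    candidate = base
    counter = 1
    # Excel compares sheet names case-insensitively, so `used` holds lowercase names.
    while candidate.lower() in used:
        suffix = f"_{counter}"
        candidate = base[:MAX_SHEET_NAME_LENGTH - len(suffix)] + suffix
        counter += 1
    used.add(candidate.lower())
    return candidate


used_names = set()
for stem in ["report", "report", "a" * 40, "a" * 40, "Q1/Q2"]:
    print(unique_sheet_name(stem, used_names))
```

In the batch function above, this could replace name[:31] by keeping one used set per output workbook and passing each stem through the helper.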

A practical example

Example: processing a sales report

import pandas as pd
from pathlib import Path
import warnings
# Note: this silences every warning; narrow the filter in production code
warnings.filterwarnings('ignore')

class ExcelProcessor:
    def __init__(self):
        self.engine_map = {
            '.xlsx': 'openpyxl',
            '.xlsm': 'openpyxl',
            '.xls': 'xlrd'
        }
    
    def get_engine(self, file_path):
        """Return the engine that matches the file extension."""
        ext = Path(file_path).suffix.lower()
        return self.engine_map.get(ext, 'openpyxl')
    
    def read_sales_report(self, file_path):
        """Read a sales report."""
        try:
            engine = self.get_engine(file_path)
            
            # Read the Excel file
            df = pd.read_excel(
                file_path,
                engine=engine,
                sheet_name=0,  # first worksheet
                header=0,      # first remaining row holds the column names
                skiprows=1     # skip one title row above the header; remove if there is none
            )
            
            # Clean the data
            df = self.clean_data(df)
            
            # Analyze the data
            analysis = self.analyze_sales(df)
            
            return df, analysis
            
        except Exception as e:
            # Chain the original exception so the traceback is preserved
            raise RuntimeError(f"Failed to read sales report: {e}") from e
    
    def clean_data(self, df):
        """Clean the data."""
        # Drop rows that are entirely empty
        df = df.dropna(how='all')
        
        # Fill missing values in numeric columns with 0
        numeric_cols = df.select_dtypes(include=['number']).columns
        df[numeric_cols] = df[numeric_cols].fillna(0)
        
        return df
    
    def analyze_sales(self, df):
        """Analyze the sales data."""
        # '销售额' = sales amount, '产品名称' = product name (column headers in the source file)
        analysis = {
            'total_sales': df['销售额'].sum(),
            'average_sales': df['销售额'].mean(),
            'max_sales': df['销售额'].max(),
            'min_sales': df['销售额'].min(),
            'product_count': df['产品名称'].nunique(),
            'record_count': len(df)
        }
        
        return analysis

# Usage example
processor = ExcelProcessor()
try:
    df, analysis = processor.read_sales_report('sales_report.xlsx')
    print("Analysis results:")
    for key, value in analysis.items():
        print(f"{key}: {value}")
except Exception as e:
    print(f"Processing failed: {e}")
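
The analysis step assumes the sheet contains the columns 销售额 and 产品名称, and fails with a KeyError if the header row was misread. A small pre-check can produce a clearer message. This is only a sketch: REQUIRED_COLUMNS and missing_columns are hypothetical names introduced here, and a plain list stands in for df.columns so the snippet runs without pandas.

```python
# Column headers the analysis relies on: sales amount, product name.
REQUIRED_COLUMNS = ("销售额", "产品名称")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# With pandas you would pass df.columns; a plain list stands in for it here.
header = ["日期", "产品名称", "数量"]
problems = missing_columns(header)
if problems:
    print("Missing required columns: " + ", ".join(problems))
```

If the check reports every column as missing, the skiprows and header settings are the first thing worth revisiting.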

FAQ

Q1: How do I check the installed xlrd version?

import xlrd
print(f"Current xlrd version: {xlrd.__version__}")

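Q2: Should I uninstall xlrd entirely?

Uninstalling xlrd entirely is not recommended, because:

Some older projects may still need it
It is still required for reading .xls files
Other libraries may depend on it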

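Q3: Does openpyxl support every Excel feature?

openpyxl supports most Excel features, including:

Reading and writing .xlsx and .xlsm files
Formulas, charts, and styles
Data validation and conditional formatting
Merged cells, and more

For particularly complex files, however, you may need to combine it with other libraries. For example, openpyxl stores formulas but does not evaluate them, so data_only=True returns the values cached by the application that last saved the file.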

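Q4: Are there other alternative libraries?

Yes, there are other options:

Library	Formats	Notes
openpyxl	.xlsx, .xlsm	Suggested by the error message; full-featured
xlrd	.xls	Read-only, fast
xlsxwriter	.xlsx	Write-only, for creating files
pandas	Many formats	High-level data manipulation
pyexcel	Many formats	Unified API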

Q5: How do I write compatibility code?

def read_excel_compatible(file_path, **kwargs):
    """
    Read an Excel file in a way that works for both .xls and .xlsx.
    """
    import pandas as pd
    from pathlib import Path
    
    # Determine the file format
    suffix = Path(file_path).suffix.lower()
    
    # Choose an engine unless the caller already specified one
    if 'engine' not in kwargs:
        if suffix in ['.xls']:
            kwargs['engine'] = 'xlrd'
        else:
            kwargs['engine'] = 'openpyxl'
    
    try:
        return pd.read_excel(file_path, **kwargs)
    except ImportError as e:
        print(f"Failed to import engine: {e}")
        print("Install the required libraries: pip install openpyxl xlrd")
        raise

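Best practices

Prefer openpyxl: in new projects, use openpyxl for all .xlsx and .xlsm files, and keep xlrd only for legacy .xls files
Specify the engine explicitly: always pass the engine parameter when reading with pandas
Version management: declare minimum versions in requirements.txt (use == if you need exact pins)
```
pandas>=1.3.0
openpyxl>=3.0.9
xlrd>=2.0.1
```
Error handling: add appropriate exception handling
File format checks: validate the file format before processing
Performance: for large files, use openpyxl's read_only mode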
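
File extensions can be wrong (for instance an .xlsx file renamed to .xls), so one way to do the format check mentioned above is to look at the first bytes of the file. The sketch below relies on two well-known container signatures: legacy .xls files are OLE2 compound files, and .xlsx/.xlsm files are ZIP archives. sniff_excel_engine is a hypothetical helper, and a matching signature only identifies the container, not a valid workbook (other OLE2 and ZIP files share these signatures).

```python
import tempfile
from pathlib import Path

# Leading bytes ("magic numbers") of the two container formats.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_SIGNATURE = b"PK\x03\x04"  # .xlsx / .xlsm (ZIP archive)


def sniff_excel_engine(file_path):
    """Guess the pandas engine from the file's first bytes; None if unrecognised."""
    with open(file_path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(OLE2_SIGNATURE):
        return "xlrd"
    if header.startswith(ZIP_SIGNATURE):
        return "openpyxl"
    return None


# Demo with stand-in files that contain only the signatures, not real workbooks.
with tempfile.TemporaryDirectory() as tmp:
    fake_xls = Path(tmp) / "renamed.xlsx"  # OLE2 content behind an .xlsx name
    fake_xls.write_bytes(OLE2_SIGNATURE + b"\x00" * 16)
    fake_csv = Path(tmp) / "data.xls"  # plain text behind an .xls name
    fake_csv.write_bytes(b"a,b,c\n1,2,3\n")
    print(sniff_excel_engine(fake_xls))
    print(sniff_excel_engine(fake_csv))
```

A None result is a signal to stop early and report an unsupported or corrupted file instead of letting the engine raise a less readable error.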

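Summary

Dropping .xlsx support in xlrd 2.0+ is a significant change, but migrating to openpyxl gives you an actively maintained library with both read and write support for .xlsx. This article covered several fixes and practical techniques to help you make the transition.

Key points to remember:

Install openpyxl: pip install openpyxl
Specify the engine explicitly: pd.read_excel(file, engine='openpyxl')
Stay backward compatible: choose the engine from the file extension

I hope this article helps you sort out your Excel processing problems. If you have any questions, feel free to leave a comment.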