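Eight-Week Python Intensive Plan (Part 7)

Hands-On Project (Part 1): A Text Processing Tool, the Word Frequency Counter

This is the first complete small project in the course. It draws on everything from the first six weeks (file handling, data structures, functions, modules, the standard library) and introduces command-line argument parsing (argparse) and project structure, with the aim of building engineering habits.

🗓 Week 7 topic: Hands-On Project (Part 1): Building a Word Frequency Tool

🎯 Project Goals (Beginner-Friendly Edition)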

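Build a command-line word counter (entry point main.py) that does the following:

Reads a text file (for example article.txt)
Cleans the text (lowercase, punctuation removed)
Counts word frequencies (with collections.Counter)
Prints the top N most frequent words (N defaults to 10)
Accepts command-line arguments: python main.py article.txt --top 5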

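🕒 Lesson 13: Project Analysis and Core Module Development (60 minutes)

⏱ Suggested time allocation

Requirements breakdown (10 minutes)
Core function design and coding (35 minutes)
Modular organization (10 minutes)
Q&A (5 minutes)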

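📚 1. Project structure (shown up front)

text_analyzer/
│
├── main.py                 # Entry point: parses the command line, calls the logic
├── text_utils.py           # Core functions: read, clean, count
└── sample.txt              # Sample text file (optional)

✅ Follows the single-responsibility principle: main.py handles user interaction, text_utils.py handles the business logic.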

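📚 2. Core functions, step by step (in `text_utils.py`)

Step 1: Read the text file

# text_utils.py
def read_text_file(filepath):
    """Read a text file as UTF-8; print a message and return "" on failure."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
        print(f"❌ File {filepath} not found!")
        return ""
    except Exception as e:
        # Any other failure, e.g. a permission or decoding error
        print(f"❌ Error reading file: {e}")
        return ""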

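Step 2: Clean the text (strip punctuation, lowercase)

import string

def clean_text(text):
    """Remove ASCII punctuation and convert to lowercase."""
    # Translation table that deletes every character in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    clean = text.translate(translator)
    return clean.lower()

🔍 string.punctuation contains: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~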
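
To see what this step does and does not handle, here is a small sketch. The sample strings are made up for illustration, and clean_text is repeated only so the snippet runs on its own. Because string.punctuation covers ASCII characters only, apostrophes inside words are deleted ("Don't" becomes "dont") while full-width punctuation such as "，" is left in place.

```python
import string

def clean_text(text):
    """Remove ASCII punctuation and convert to lowercase."""
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator).lower()

# ASCII punctuation, including the apostrophe, is removed
print(clean_text("Hello, World! Don't panic."))  # hello world dont panic

# Full-width (non-ASCII) punctuation is not in string.punctuation
print(clean_text("你好，世界！"))  # 你好，世界！
```

If the input text mixes languages, the punctuation set would need to be extended; that is beyond what this lesson's tool attempts.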

Step 3: Count word frequencies (with `Counter`)

from collections import Counter

def count_words(text):
    """Count word frequencies in whitespace-separated text."""
    # split() with no arguments never yields empty strings,
    # so no extra filtering is needed
    words = text.split()
    return Counter(words)

Step 4: Get the top N words

def get_top_words(counter, n=10):
    """Return the n most frequent words as a list of (word, count) tuples."""
    return counter.most_common(n)
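
One detail worth knowing when checking results by hand: when several words have the same count, most_common lists them in the order they were first seen (documented behavior since Python 3.7). The sketch below shows this on a made-up word list. The helper top_words_sorted is a hypothetical alternative, not part of the lesson's code, that breaks ties alphabetically instead.

```python
from collections import Counter

counter = Counter("b a b c a d".split())

# Ties keep first-seen order: "b" was counted before "a"
print(counter.most_common(2))  # [('b', 2), ('a', 2)]

def top_words_sorted(counter, n=10):
    """Hypothetical variant: highest count first, ties in alphabetical order."""
    ranked = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]

print(top_words_sorted(counter, 2))  # [('a', 2), ('b', 2)]
```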

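Step 5: Format the output

def format_output(top_words):
    """Format the (word, count) pairs as a readable, numbered string."""
    lines = []
    for i, (word, count) in enumerate(top_words, 1):
        # Rank right-aligned to 2 columns, word left-aligned to 15
        lines.append(f"{i:2}. {word:<15} ({count})")
    return "\n".join(lines)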
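
The fixed width of 15 keeps the counts aligned only while every word is 15 characters or shorter. As a possible refinement, the sketch below pads to the longest word in the list instead. The name format_output_auto is an assumption introduced here, not part of the lesson's text_utils.py.

```python
def format_output_auto(top_words):
    """Hypothetical variant of format_output: pad to the longest word."""
    # default=0 keeps max() from failing on an empty list
    width = max((len(word) for word, _ in top_words), default=0)
    lines = [
        f"{i:2}. {word:<{width}} ({count})"
        for i, (word, count) in enumerate(top_words, 1)
    ]
    return "\n".join(lines)

print(format_output_auto([("hello", 3), ("programming", 1)]))
```

The nested `{width}` inside the format spec is standard f-string syntax, so the column width can be computed at run time.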

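📚 3. Testing the core module (good for a live demo)

# Temporary test code (moves to main.py later)
if __name__ == '__main__':
    text = read_text_file("sample.txt")
    clean = clean_text(text)
    counter = count_words(clean)
    top = get_top_words(counter, 5)
    print(format_output(top))

✅ Key point: if __name__ == '__main__' lets a module test itself when run directly, and the block is skipped when the module is imported.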
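
The self-test above depends on sample.txt being present. As a sketch of an alternative, the same pipeline can be checked against an in-memory string with assert statements. The sample sentence is made up for illustration, and clean_text and count_words are repeated only so the snippet runs on its own.

```python
import string
from collections import Counter

def clean_text(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator).lower()

def count_words(text):
    return Counter(text.split())

# Made-up sample text; no file is needed
sample = "The cat and the hat. The end!"
counter = count_words(clean_text(sample))

assert counter["the"] == 3   # "The", "the", "The" all normalize to "the"
assert counter["hat"] == 1   # the trailing "." was stripped
assert counter.most_common(1) == [("the", 3)]
print("self-test passed")
```

Checks like these fail loudly when a later change breaks the cleaning or counting step, which a print-based demo would not do.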

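✍️ In-class exercise (35 minutes)

Task: create text_utils.py and implement the five functions above.

Use a sample.txt with the following content for testing:

Hello world! This is a sample text.
Hello Python, hello programming.
World of code, code of world.

✅ Expected output (top 3). After cleaning, "world" appears three times; "of" and "code" tie at 2, and most_common lists ties in first-seen order, so "of" comes first:

 1. hello           (3)
 2. world           (3)
 3. of              (2)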

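🕒 Lesson 14: Command-Line Interface and Project Integration (60 minutes)

⏱ Time allocation

argparse walkthrough (15 minutes)
Writing main.py (25 minutes)
Project testing and debugging (15 minutes)
Code organization review (5 minutes)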

📚 1. Parsing command-line arguments with `argparse`

Goal: make the tool support python main.py input.txt --top 5

# main.py
import argparse
from text_utils import read_text_file, clean_text, count_words, get_top_words, format_output

def main():
    parser = argparse.ArgumentParser(description="Word frequency counter")
    parser.add_argument("filepath", help="Path to the text file to analyze")
    parser.add_argument("--top", type=int, default=10, help="Show the top N words (default: 10)")

    args = parser.parse_args()

    # Run the analysis pipeline
    text = read_text_file(args.filepath)
    if not text:
        return  # missing, unreadable, or empty file

    clean = clean_text(text)
    counter = count_words(clean)
    top_words = get_top_words(counter, args.top)

    print(f"\n📊 Word frequencies for {args.filepath} (Top {args.top}):")
    print("-" * 40)
    print(format_output(top_words))

if __name__ == '__main__':
    main()

✅ argparse provides -h help automatically:

python main.py -h
usage: main.py [-h] [--top TOP] filepath
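
The parser can also be exercised without a terminal, because parse_args accepts an explicit list of strings in place of sys.argv. The sketch below relies on that and adds an optional validation step for --top. The names build_parser and positive_int are assumptions introduced for illustration; they are not part of the main.py shown above.

```python
import argparse

def positive_int(value):
    """Hypothetical argparse type: accept only integers >= 1."""
    number = int(value)  # a ValueError here is reported by argparse as invalid
    if number < 1:
        raise argparse.ArgumentTypeError(f"{value} is not a positive integer")
    return number

def build_parser():
    """Hypothetical helper: build the same parser as main(), with validation."""
    parser = argparse.ArgumentParser(description="Word frequency counter")
    parser.add_argument("filepath", help="Path to the text file to analyze")
    parser.add_argument("--top", type=positive_int, default=10,
                        help="Show the top N words (default: 10)")
    return parser

parser = build_parser()

# Pass the arguments as a list instead of reading sys.argv
args = parser.parse_args(["sample.txt", "--top", "5"])
print(args.filepath, args.top)  # sample.txt 5

# Omitting --top falls back to the default
print(parser.parse_args(["sample.txt"]).top)  # 10
```

With this type in place, an invocation such as --top 0 makes argparse print a usage error and exit, rather than silently producing an empty report.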

📚 2. Full project test

# Assumes sample.txt is in the current directory
python main.py sample.txt --top 5

✅ Sample output:

📊 Word frequencies for sample.txt (Top 5):
----------------------------------------
 1. hello           (3)
 2. world           (3)
 3. of              (2)
 4. code            (2)
 5. this            (1)

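📚 3. Hardening (optional extensions)

Skip stop words (such as "the", "is") → load a stop-word list
Support writing to a file → add an --output parameter
Handle large files → read line by line with a generator (Week 2 material)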
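
As one way to approach the large-file item above, the sketch below counts words line by line with a generator, so the whole file is never held in memory at once. The names iter_words and count_words_in_file are assumptions introduced here, and io.StringIO stands in for a real file in the demo.

```python
import io
import string
from collections import Counter

# Build the translation table once instead of once per line
_TRANSLATOR = str.maketrans('', '', string.punctuation)

def iter_words(lines):
    """Yield cleaned, lowercased words from an iterable of lines."""
    for line in lines:
        yield from line.translate(_TRANSLATOR).lower().split()

def count_words_in_file(filepath):
    """Hypothetical helper: count words while reading the file lazily."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return Counter(iter_words(f))  # a file object iterates line by line

# Demo on an in-memory "file"
fake_file = io.StringIO("Hello world!\nhello Python.\n")
counts = Counter(iter_words(fake_file))
print(counts.most_common(1))  # [('hello', 2)]
```

Because words are split on whitespace and newlines count as whitespace, processing per line should give the same counts as cleaning the whole text at once.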

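✍️ In-class exercise (25 minutes)

Task: complete main.py and integrate text_utils.py so the tool can be run from the command line.

Requirements:

Handles a nonexistent file (with a friendly message)

Accepts a --top count

Produces neatly formatted output

✅ Hint: the main.py code above runs as-is!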

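🧠 Core engineering practices this week

Practice	Purpose
Modularization (main.py + text_utils.py)	Decoupled code that is easier to test and maintain
Command-line interface (argparse)	Lets the script be called from automation
Error handling	Makes the tool more robust
Combining standard-library modules	Shows off Python's "batteries included" advantage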

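📝 Homework (groundwork for Week 8)

Extensions (pick 1-2):

Add an --output result.txt parameter that saves the results to a file

Add simple stop-word filtering (for example "a", "the", "is")

Replace print with the logging module (preparation for Week 8)

Add docstrings to the functions

Sample stop-word list:
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "is", "are", "was", "were"}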
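
For anyone unsure where to start on the first two extensions, here is a minimal sketch under stated assumptions. The helpers remove_stop_words and save_report are hypothetical names, not part of the lesson's code, and other designs (for example filtering words before counting) would work equally well.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "but", "is", "are", "was", "were"}

def remove_stop_words(counter, stop_words=STOP_WORDS):
    """Hypothetical helper: return a new Counter without the stop words."""
    return Counter({word: count for word, count in counter.items()
                    if word not in stop_words})

def save_report(report, output_path):
    """Hypothetical helper: write the formatted report to a UTF-8 text file."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(report + "\n")

counter = Counter("the cat and the hat".split())
filtered = remove_stop_words(counter)
print(filtered.most_common())  # [('cat', 1), ('hat', 1)]
```

In main.py, one option is to register the flag with parser.add_argument("--output") and call save_report only when args.output is set, since an optional argument defaults to None.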
