Python: Word Frequency Counting Workflow and Worked Examples

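In text analysis, log processing, and natural language processing (NLP), word frequency counting is one of the most basic and most common data processing tasks.

Word frequency counting means counting how many times a word, character, or pattern occurs in a text, and using those counts to analyze the structure, topic, or information distribution of the text.

In Python, word frequency counting is usually built from several steps: string processing, text cleaning, tokenization, and counting.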

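I. Typical processing steps in word frequency counting

Whether the text is English or Chinese, word frequency counting usually follows a similar workflow.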

    Original Text
       ↓
    Normalization
       ↓
    Cleaning
       ↓
    Tokenization
       ↓
    Filtering
       ↓
    Counting
       ↓
    Ranking

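1. Text normalization

Normalization brings the text into a uniform form, so that the same term is not counted as different words because of case or character differences.

Common string methods include:

• str.lower(): converts the string to lowercase

• str.casefold(): more aggressive case folding, intended for caseless comparison

• str.strip(), str.lstrip(), str.rstrip(): remove leading and/or trailing whitespace (whitespace inside the string is left alone)

For example:

    text = text.lower()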
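The difference between str.lower() and str.casefold() only shows up for certain non-ASCII characters. The following is a minimal sketch of that difference; the sample strings are made up for illustration.

```python
# Compare lower() and casefold() on a word containing the German sharp s.
a = "Straße"
b = "STRASSE"

lower_match = a.lower() == b.lower()            # lower() keeps "ß", so the forms differ
casefold_match = a.casefold() == b.casefold()   # casefold() maps "ß" to "ss", so they match
print(lower_match, casefold_match)

# strip() only touches the ends of the string; inner whitespace is untouched.
raw = "  Hello   World \n"
normalized = raw.strip().casefold()
print(repr(normalized))
```

For plain ASCII English text the two methods give the same result, so lower() is usually enough there.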

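2. Text cleaning

Cleaning removes punctuation, noise characters, and other irrelevant symbols.

Common string methods:

• str.replace(): simple substring replacement

• str.translate() with str.maketrans(): replace or delete many characters in one pass

Relevant constant in the string module:

• string.punctuation: a string containing the ASCII punctuation characters

For example:

    import string

    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)

Relevant function in the re module:

• re.sub(): regular-expression substitution for more complex cleaning

For example (this assumes the text has already been lowercased):

    import re

    text = re.sub(r"[^a-z\s]", " ", text)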
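The two cleaning approaches above do not produce the same tokens: translate() deletes punctuation, while the re.sub() pattern replaces every non-letter with a space. A small sketch on a made-up sentence shows the effect on hyphenated words, apostrophes, and digits.

```python
import re
import string

text = "it's a well-known fact: python 3 is fun!"

# Option 1: delete every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
deleted = text.translate(table).split()

# Option 2: replace anything that is not a-z or whitespace with a space.
spaced = re.sub(r"[^a-z\s]", " ", text).split()

print(deleted)  # "well-known" is fused into "wellknown"; the digit "3" survives
print(spaced)   # "well-known" becomes two tokens; "it's" leaves a stray "s"; "3" is dropped
```

Which behavior is preferable depends on the text; neither handles contractions well without extra rules.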

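3. Text splitting (tokenization)

Tokenization splits a continuous string into countable units (words or tokens).

Common string methods:

• str.split(): splits on runs of whitespace, or on a given separator

• str.splitlines(): splits the text into lines

For example:

    words = text.split()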
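Calling split() with no argument behaves differently from passing an explicit separator, which matters when the text contains repeated spaces, tabs, or newlines. A short sketch with an invented string:

```python
text = "alpha  beta\tgamma\ndelta "

by_whitespace = text.split()     # any run of whitespace is one separator
by_space = text.split(" ")       # every single space is a separator
by_line = text.splitlines()      # split at line boundaries only

print(by_whitespace)
print(by_space)   # contains empty strings and leaves the tab and newline inside a token
print(by_line)
```

For word counting, the argument-free form is normally the one you want, since it never yields empty tokens.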

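Chinese word segmentation commonly uses the third-party library Jieba:

• jieba.lcut(): segments Chinese text and returns the tokens as a list

For example:

    import jieba

    words = jieba.lcut(text)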

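4. Filtering

Before counting, it is usually necessary to filter out stopwords, numbers, and meaningless tokens.

Common string methods:

• str.isalpha(): True if the string is non-empty and every character is a letter (this includes non-ASCII letters and Chinese characters)

• str.isdigit(): True if the string is non-empty and every character is a digit

• str.isalnum(): True if the string is non-empty and every character is a letter or a digit

A set can also be used for stopword filtering.

For example:

    filtered_words = [w for w in words if w not in stopwords]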
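The string predicates and the stopword set can be combined in a single comprehension. A minimal sketch, in which the token list and the stopword set are hypothetical:

```python
words = ["python", "is", "3", "fast", "and", "py3", "the", "fun"]
stopwords = {"is", "and", "the"}  # hypothetical stopword set for illustration

# Keep purely alphabetic tokens that are not stopwords.
filtered_words = [w for w in words if w.isalpha() and w not in stopwords]
print(filtered_words)
```

Storing stopwords in a set rather than a list keeps each membership test at roughly constant cost, which helps on long texts.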

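5. Counting

Counting is usually done either with a plain dictionary or with the Counter class from the collections module.

Common dictionary method:

• dict.get(): returns the value for a key, or a default if the key is missing, which makes it easy to increment a count without a KeyError

For example:

    freq[w] = freq.get(w, 0) + 1

Common Counter tools:

• collections.Counter(): counts the elements of an iterable

• Counter.most_common(): returns the elements with the highest counts

For example:

    from collections import Counter

    freq = Counter(words)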
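The two counting approaches give the same counts. The sketch below runs both on a small made-up word list and compares them.

```python
from collections import Counter

words = ["python", "data", "python", "code", "data", "python"]

# Manual counting with dict.get()
freq_dict = {}
for w in words:
    freq_dict[w] = freq_dict.get(w, 0) + 1

# The same counts with Counter
freq_counter = Counter(words)

same = dict(freq_counter) == freq_dict
missing = freq_counter["missing"]  # Counter returns 0 for absent keys instead of raising KeyError
print(freq_dict, same, missing)
```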

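6. Sorting and output

After counting, the results are usually sorted by frequency and the most frequent terms are printed. Sorting makes the high-frequency vocabulary easy to spot, which gives a first impression of the main topic or information distribution of the text.

A common approach is to sort the counts with the built-in sorted() function, or to use the most_common() method of collections.Counter to get the top terms directly.

Relevant functions:

• sorted(): sorts an iterable and returns a new list

• dict.items(): returns the key-value pairs of a dictionary

• Counter.most_common(): returns the elements with the highest counts

For example:

    result = sorted(freq.items(), key=lambda item: item[1], reverse=True)

In this code:

• freq.items() yields pairs of the form (word, count)

• item[1] is the count, so it is used as the sort key

• reverse=True sorts in descending order

The sorted result can then be printed in a loop:

    for word, count in result:
        print(word, count)

With collections.Counter, the top terms are available directly:

    from collections import Counter

    freq = Counter(words)
    top_words = freq.most_common(10)

This returns the most frequent terms directly, which is useful for keyword analysis or a first look at the topic of a text.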
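When several words share the same count, their relative order depends on the sort key. One possible way to make the output deterministic is to sort on a tuple, as sketched here with an invented word list.

```python
from collections import Counter

words = ["pear", "apple", "pear", "fig", "apple", "kiwi"]
freq = Counter(words)

# Sort by count descending, then alphabetically among equal counts.
result = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
print(result)

# most_common() orders equal counts by first occurrence instead.
top_two = freq.most_common(2)
print(top_two)
```

Negating the count inside the key avoids reverse=True, which would also reverse the alphabetical tie-break.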

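Combining these string methods, dictionary methods, and standard library functions is enough to build a complete word frequency workflow. From simple text statistics to more complex NLP tasks, these tools form the foundation of text analysis in Python.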

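II. Worked examples

The examples below apply this workflow to three scenarios: basic English text, Chinese text, and English text read from a file.

1. Basic English word frequency example

This is the most basic English example. It uses only string methods and a dictionary.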

    text = """Python is powerful and easy to learn.
    Python is widely used in data analysis.
    Many developers use Python for machine learning."""

    # 1. Normalize the text
    text = text.lower()

    # 2. Simple cleaning
    text = text.replace(".", "")  # only periods are removed here

    # 3. Tokenize
    words = text.split()

    # 4. Count word frequencies
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1

    # 5. Sort and print
    result = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    for word, count in result:
        print(word, count)

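2. Basic Chinese word frequency example

Chinese text normally has no spaces between words, so a segmentation tool is needed.

A widely used Chinese segmentation library for Python is Jieba.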

    import jieba

    text = """人工智能正在改变世界。
    人工智能的发展推动了机器学习和深度学习的发展。
    人工智能技术已经广泛应用于自动驾驶、语音识别和自然语言处理等领域。"""

    # 1. Segment the text
    words = jieba.lcut(text)

    # 2. Stopwords
    stopwords = {"的", "了", "和", "是", "在", "于", "等"}

    # 3. Filter tokens
    filtered_words = []
    for w in words:
        if w in stopwords:
            continue
        if len(w.strip()) <= 1:  # drops single characters, punctuation, and whitespace tokens
            continue
        filtered_words.append(w)

    # 4. Count word frequencies
    freq = {}
    for w in filtered_words:
        freq[w] = freq.get(w, 0) + 1

    # 5. Sort and print the top 10
    result = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    for word, count in result[:10]:
        print(word, count)
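The example above removes Chinese punctuation only indirectly, through the length check. Because string.punctuation covers ASCII only, full-width marks such as "，" and "！" are not affected by it. One possible alternative, sketched below using only the standard library, is to strip everything outside the basic CJK range before segmentation. The range U+4E00 to U+9FFF is an assumption that covers common Chinese characters but not every CJK extension, and the sample sentence is made up.

```python
import re
import string

text = "人工智能，正在改变世界！AI is here."

# string.punctuation is ASCII-only, so the full-width marks survive.
table = str.maketrans("", "", string.punctuation)
ascii_cleaned = text.translate(table)

# Assumption: keep only characters in U+4E00..U+9FFF and
# replace every other run of characters with a single space.
cjk_only = re.sub(r"[^\u4e00-\u9fff]+", " ", text).strip()

print(ascii_cleaned)
print(cjk_only)
```

Note that this also discards Latin words and digits, so it is only suitable when those are not of interest.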

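3. Advanced English word frequency example

In real text analysis, it is usually necessary to:

• read the text from a file

• clean punctuation and noise characters

• use an efficient counting tool

Assume there is a text file named sample.txt.

Code example: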

    import re
    import string
    from collections import Counter

    # 1. Read the text file
    with open("sample.txt", "r", encoding="utf-8") as f:
        text = f.read()

    # 2. Normalize the text
    text = text.lower()

    # 3. Delete ASCII punctuation
    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)

    # 4. Regex cleaning (keep only a-z and whitespace)
    text = re.sub(r"[^a-z\s]", " ", text)

    # 5. Tokenize
    words = text.split()

    # 6. Remove stopwords
    stopwords = {"the", "and", "is", "to", "of", "in"}
    filtered_words = [w for w in words if w not in stopwords]

    # 7. Count word frequencies
    freq = Counter(filtered_words)

    # 8. Print the 10 most frequent words
    for word, count in freq.most_common(10):
        print(word, count)
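The same steps can be wrapped in a function so they can be reused and tested on an in-memory string. This is a sketch under stated assumptions rather than a drop-in replacement: the name top_words and its parameters are hypothetical, and it deliberately swaps the punctuation-deletion step for space replacement, so a hyphenated word such as "well-known" is counted as two tokens.

```python
import re
from collections import Counter


def top_words(text, stopwords=frozenset(), n=10):
    """Return the n most frequent (word, count) pairs in text."""
    text = text.casefold()
    # Replace every character that is not a-z or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [w for w in text.split() if w not in stopwords]
    return Counter(words).most_common(n)


sample = "The cat sat. The cat ran! A well-known cat."
ranking = top_words(sample, stopwords={"the", "a"}, n=3)
print(ranking)
```

To process a file, read it as in the example above and pass the resulting string to the function.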

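📘 Summary

Word frequency counting is a basic operation in text analysis and natural language processing. Its core workflow usually consists of normalization, cleaning, tokenization, filtering, and counting. In Python, a basic analysis can start from plain string methods and a dictionary; more complex scenarios typically combine the string module, regular expressions from re, collections.Counter, and a segmentation tool such as Jieba into a complete text processing pipeline. Together, these tools handle word frequency counting for both English and Chinese text efficiently.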