NLP---文本前期预处理的几个步骤

1、读取文本

python 复制代码
text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""
print("原文:\n", text1)

2、去除换行符

python 复制代码
text = text1.replace("\n", "")
print("去除原文中的换行符:\n", text)

3、分句

python 复制代码
import nltk
sents = nltk.sent_tokenize(text)
print("将文本进行分句:\n", sents)

4、分词

python 复制代码
import string
punctuation_tokens = []
for sent in sents:
    for word in nltk.word_tokenize(sent):
        punctuation_tokens.append(word)
print("将每个句子进行分词:\n", punctuation_tokens)

5、过滤标点符号

python 复制代码
tokens = []
for word in punctuation_tokens:
    if word not in string.punctuation:
        tokens.append(word)
print("将分词结果去除标点符号:\n", tokens)

6、过滤停用词

python 复制代码
from nltk.corpus import stopwords
fltered = [w for w in tokens if w not in stopwords.words("english")]
print("过滤完停用词之后:\n", fltered)

7、剩下有用的单词进行计数

python 复制代码
from collections import Counter
count = Counter(fltered)
print("对最终清洗好的单词进行计数:\n", count)
相关推荐
灵光通码1 小时前
自然语言处理开源框架全面分析
人工智能·自然语言处理·开源
这张生成的图像能检测吗1 小时前
(论文速读)从语言模型到通用智能体
人工智能·计算机视觉·语言模型·自然语言处理·多模态·智能体
MarkHD2 小时前
大语言模型入门指南:从原理到实践应用
人工智能·语言模型·自然语言处理
A尘埃2 小时前
NLP(自然语言处理, Natural Language Processing)
人工智能·自然语言处理·nlp
minhuan4 小时前
构建AI智能体:二十八、大语言模型BERT:原理、应用结合日常场景实践全面解析
人工智能·语言模型·自然语言处理·bert·ai大模型·rag
西柚小萌新8 小时前
【从零开始的大模型原理与实践教程】--第一章:NLP基础概念
人工智能·自然语言处理
研梦非凡20 小时前
CVPR 2025|基于视觉语言模型的零样本3D视觉定位
人工智能·深度学习·计算机视觉·3d·ai·语言模型·自然语言处理
金井PRATHAMA1 天前
AI赋能训诂学:解码古籍智能新纪元
人工智能·自然语言处理·知识图谱
2401_828890641 天前
使用 BERT 实现意图理解和实体识别
人工智能·python·自然语言处理·bert·transformer
scx_link1 天前
Word2Vec词嵌入技术和动态词嵌入技术
人工智能·自然语言处理·word2vec