NLP: The Basic Steps of Text Preprocessing

饿了就干饭 2023-11-15 13:15

1. Read the text

text1 = """
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""
print("Original text:\n", text1)

2. Remove line breaks

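# This works because each line of the sample ends with a trailing space; without it, words would be glued together.
text = text1.replace("\n", "")
print("Text with line breaks removed:\n", text)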
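
Deleting "\n" outright relies on every line of the sample ending with a space. As a small sketch of a more defensive alternative (the helper name `normalize_whitespace` is hypothetical and not part of the original post), splitting on any whitespace and rejoining with single spaces avoids merging the last word of one line with the first word of the next:

```python
def normalize_whitespace(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces,
    # so leading and trailing whitespace disappears as well.
    return " ".join(raw.split())


# Short hand-written sample with no trailing spaces before the line breaks.
sample = "kicking a ball\nto score a goal.\n"
print(normalize_whitespace(sample))
```

With this variant the result no longer depends on how the source text happens to be wrapped.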

3. Split into sentences

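import nltk
# Needs the Punkt tokenizer data; if it is missing, run nltk.download("punkt") once ("punkt_tab" on recent NLTK releases).
sents = nltk.sent_tokenize(text)
print("Text split into sentences:\n", sents)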

4. Tokenize into words

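import string  # used in the next step to filter punctuation
# Despite its name, this list holds every token, punctuation included.
punctuation_tokens = []
for sent in sents:
    # word_tokenize returns a list of tokens for one sentence
    punctuation_tokens.extend(nltk.word_tokenize(sent))
print("Tokens from each sentence:\n", punctuation_tokens)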

5. Filter out punctuation

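# `in string.punctuation` is a substring test; it is reliable for single-character tokens such as "," or "(".
tokens = [word for word in punctuation_tokens if word not in string.punctuation]
print("Tokens with punctuation removed:\n", tokens)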
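
Because `word not in string.punctuation` is a substring test, a multi-character punctuation token such as `...` or `--` is not found in `string.punctuation` and passes through the filter. A minimal sketch of a stricter check follows; the helper `is_punctuation` is hypothetical, and the token list is written by hand for illustration rather than taken from real tokenizer output:

```python
import string

# Set of individual ASCII punctuation characters for fast membership tests.
PUNCT_CHARS = set(string.punctuation)


def is_punctuation(token: str) -> bool:
    """Return True when the token is non-empty and consists only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


# Hand-written tokens for illustration; not actual nltk.word_tokenize output.
demo_tokens = ["football", "(", "soccer", ")", "...", "--", "known", "."]
kept = [tok for tok in demo_tokens if not is_punctuation(tok)]
print(kept)
```

Checking character by character keeps hyphenated words such as "well-known" intact, since they contain letters as well as punctuation.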

6. Filter out stop words

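from nltk.corpus import stopwords
# Build the set once; run nltk.download("stopwords") first if the corpus is missing.
stop_words = set(stopwords.words("english"))
# The NLTK list is lowercase, so compare lowercased tokens (this also catches "These").
filtered = [w for w in tokens if w.lower() not in stop_words]
print("After filtering stop words:\n", filtered)

7. Count the remaining useful words

from collections import Counter
# Counter maps each distinct word to its number of occurrences.
count = Counter(filtered)
print("Word counts for the cleaned tokens:\n", count)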
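
Printing the whole `Counter` can get long on real text. As a brief sketch, `Counter.most_common(n)` returns only the `n` most frequent entries. The word list below is a hand-written stand-in for the cleaned tokens, and lowercasing before counting is an extra assumption not made in the steps above:

```python
from collections import Counter

# Hand-written sample standing in for the cleaned word list.
words = ["football", "rugby", "football", "Football", "league", "rugby", "football"]

# Lowercase first so "Football" and "football" are counted together.
counts = Counter(w.lower() for w in words)

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Whether to lowercase before counting depends on the task; for proper nouns or acronyms it may be preferable to keep the original casing.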