[Natural Language Processing] 02 Text Normalization

[Author homepage] Francek Chen

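[About this column] ⌈Natural Language Processing⌋ Natural language processing (NLP) is an important research area within artificial intelligence and an interdisciplinary field that draws on computer science, artificial intelligence, and linguistics. Using techniques such as syntactic analysis and semantic understanding, it enables applications such as machine translation, question answering, and sentiment analysis. It comprises two main areas, natural language understanding and natural language generation, covers several levels of text (characters, words, phrases, sentences, paragraphs, and documents), and serves as a bridge between machine language and human language.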

[GitCode] The resources for this column are stored in my GitCode repository: https://gitcode.com/Morse_Chen/Natural_language_processing

I. Tokenization

(1) Tokenization based on spaces and punctuation

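In the Indo-European language family, of which English is representative, most languages use the space character to separate words. A very simple way to tokenize is therefore to split on spaces: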

sentence = "I learn natural language processing with dongshouxueNLP, too."
tokens = sentence.split(' ')
print(f'Input sentence: {sentence}')
print(f"Tokens: {tokens}")

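As the output of the code above shows, the simplest space-based tokenization cannot separate a word from the punctuation mark that follows it. If punctuation is not important for the downstream task (for example, text classification), it can be removed before tokenizing: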

# Import the regular expression module
import re
sentence = "I learn natural language processing with dongshouxueNLP, too."
print(f'Input sentence: {sentence}')

# Remove "," and "." from the sentence
sentence = re.sub(r'\,|\.','',sentence)
tokens = sentence.split(' ')
print(f"Tokens: {tokens}")
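The substitution above removes only "," and ".". As a hedged generalization, the sketch below strips every ASCII punctuation character listed in Python's `string.punctuation` by using `str.translate`. The helper name `split_without_punctuation` is illustrative and does not come from the original text.

```python
import string

def split_without_punctuation(text):
    """Delete ASCII punctuation, then split on runs of whitespace."""
    # str.maketrans('', '', chars) builds a table that deletes every char in chars
    table = str.maketrans('', '', string.punctuation)
    # split() with no argument also collapses repeated spaces
    return text.translate(table).split()

sentence = "I learn natural language processing with dongshouxueNLP, too."
tokens = split_without_punctuation(sentence)
print(tokens)
```

Note that this also deletes characters such as "-" and "'", so "pre-print" would become "preprint". Regular expressions give finer control over which punctuation is kept.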

(2) Tokenization based on regular expressions

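A regular expression uses a single string (usually called a "pattern") to describe and match all substrings of a text that satisfy a given rule. We can also use regular expressions to implement space-based tokenization: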

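import re
sentence = "Did you spend $3.4 on arxiv.org for your pre-print?"+\
    " No, it's free! It's ..."
# \w matches letters, digits, and "_". With the re.ASCII flag it is equivalent to
# [a-zA-Z0-9_]; by default, Python 3 also matches Unicode word characters.
# + matches the preceding expression one or more times, so \w+ matches
# one or more word characters.
pattern = r"\w+"
print(re.findall(pattern, sentence))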
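One detail worth checking: in Python 3, `\w` in a `str` pattern matches Unicode word characters by default, and is limited to `[a-zA-Z0-9_]` only when the `re.ASCII` flag is set. A minimal sketch of the difference, using an illustrative string that is not from the original text:

```python
import re

# "naïve café 2024", written with escapes for the precomposed accented letters
text = "na\u00efve caf\u00e9 2024"

# Default (Unicode) matching: accented letters count as word characters
unicode_tokens = re.findall(r"\w+", text)
# ASCII-only matching: accented letters break the tokens apart
ascii_tokens = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_tokens)
print(ascii_tokens)
```

For the English example sentence used in this section, both settings give the same tokens.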

Handling punctuation:

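# \S matches any non-whitespace character (\s matches a whitespace character such as
# a space, tab, or newline, and \S is its complement).
# | is alternation and * matches the preceding expression zero or more times, so \S\w*
# matches one non-whitespace character followed by zero or more \w characters.
pattern = r"\w+|\S\w*"
print(re.findall(pattern, sentence))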

Handling hyphens:

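# [-'] matches a hyphen or an apostrophe, and (?:[-']\w+)* matches the pattern inside the
# parentheses zero or more times. (?:...) groups a pattern so that it can be combined with
# quantifiers such as + and *. The ?: makes the group non-capturing: its contents are not
# saved, which matters here because re.findall returns the group contents instead of the
# whole match when the pattern contains a capturing group.
pattern = r"\w+(?:[-']\w+)*"
print(re.findall(pattern, sentence))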
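To see why the non-capturing form `(?:...)` matters, the sketch below compares it with a capturing group on a short illustrative string. When a pattern contains one capturing group, `re.findall` returns the text of that group rather than the whole match.

```python
import re

text = "it's a pre-print"

# Capturing group: findall returns only what the group captured last
# ('' when the group did not take part in the match)
captured = re.findall(r"\w+([-']\w+)*", text)
# Non-capturing group: findall returns the whole match
whole = re.findall(r"\w+(?:[-']\w+)*", text)

print(captured)
print(whole)
```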

Combining this with the earlier punctuation pattern \S\w* gives a regular expression that handles both punctuation and hyphens:

pattern = r"\w+(?:[-']\w+)*|\S\w*"
print(re.findall(pattern, sentence))
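For a side-by-side view, the following sketch runs the patterns introduced so far on a short sentence. The sentence and the dictionary name `patterns` are chosen here for demonstration only.

```python
import re

text = "No, it's pre-print!"
patterns = {
    "words only": r"\w+",
    "words + punctuation": r"\w+|\S\w*",
    "words + hyphens": r"\w+(?:[-']\w+)*",
    "combined": r"\w+(?:[-']\w+)*|\S\w*",
}

# Apply every pattern to the same text and collect the token lists
results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, tokens in results.items():
    print(f"{name:20s} {tokens}")
```

Only the combined pattern keeps "it's" and "pre-print" intact while still emitting "," and "!" as separate tokens.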

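English abbreviations and URLs often contain '.', which is the same character as the English full stop. A regular expression that matches this case is: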

Regular expression pattern: (\w+\.)+\w+(\.)*
Examples of strings that match:
- U.S.A., arxiv.org
Examples of strings that do not match:
- $3.4, ...

# New pattern: matches abbreviations and URLs
new_pattern = r"(?:\w+\.)+\w+(?:\.)*"
pattern = new_pattern +r"|"+pattern
print(re.findall(pattern, sentence))

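Note that the character "." in a regular expression matches any character (except a newline, by default), so to match a literal dot you need to put the escape character "\" in front of it, i.e. "\.". Likewise, the special characters "+", "?", "(", ")", and "$" must be preceded by "\" to be matched literally.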
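A small sketch of the effect of escaping, on an illustrative string. It also uses `re.escape` from the standard library, which inserts the backslashes automatically when a literal string has to be embedded in a pattern.

```python
import re

text = "a.b axb a+b"

# Unescaped: "." matches any character, so "a.b", "axb" and "a+b" are all found
unescaped = re.findall(r"a.b", text)
# Escaped by hand: only the literal "a.b" is found
escaped = re.findall(r"a\.b", text)
# re.escape adds the backslash in front of the special character "+"
literal_plus = re.escape("a+b")
plus_matches = re.findall(literal_plus, text)

print(unescaped)
print(escaped)
print(literal_plus, plus_matches)
```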

In many languages, currency and percent signs are attached directly to numbers. A regular expression that matches this case is:

Regular expression pattern: \$?\d+(\.\d+)?%?
Examples of strings that match:
- $3.40, 3.5%
Examples of strings that do not match:
- $.4, 1.4.0, 1%%

# New pattern: matches prices and percentages
new_pattern2 = r"\$?\d+(?:\.\d+)?%?"
pattern = new_pattern2 +r"|" + new_pattern +r"|"+pattern
print(re.findall(pattern, sentence))

Here \d matches any digit character, and ? matches the preceding pattern zero or one time.
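The matching and non-matching examples listed for the two patterns can be checked with `re.fullmatch`, which succeeds only if the entire string fits the pattern. The helper `fully_matches` below is an illustrative name, not part of the original code.

```python
import re

abbreviation = re.compile(r"(?:\w+\.)+\w+(?:\.)*")
amount = re.compile(r"\$?\d+(?:\.\d+)?%?")

def fully_matches(regex, candidates):
    """Map each candidate string to whether the regex matches it in full."""
    return {s: regex.fullmatch(s) is not None for s in candidates}

abbr_results = fully_matches(abbreviation, ["U.S.A.", "arxiv.org", "$3.4", "..."])
amount_results = fully_matches(amount, ["$3.40", "3.5%", "$.4", "1.4.0", "1%%"])
print(abbr_results)
print(amount_results)
```

Note that `re.findall` searches for substrings, so the abbreviation pattern on its own would still pick up the "3.4" inside "$3.4". Placing the amount pattern earlier in the alternation, as the combined pattern does, lets it claim "$3.4" first.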

An ellipsis carries meaning of its own, so it should be kept as a token. A regular expression that matches it is:

Regular expression pattern: \.\.\.
Examples of strings that match:
- ...

# New pattern: matches an ellipsis
new_pattern3 = r"\.\.\." 
pattern = new_pattern3 +r"|" + new_pattern2 +r"|" +\
    new_pattern +r"|"+pattern
print(re.findall(pattern, sentence))
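After these steps the pattern is a long concatenation that is hard to read. As one possible presentation, the same alternatives can be written in the same order as a single commented pattern using the `re.VERBOSE` flag. The name `token_pattern` is introduced here for illustration.

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored and "#" starts a comment
token_pattern = re.compile(r"""
    \.\.\.                  # ellipsis
  | \$?\d+(?:\.\d+)?%?      # prices and percentages, e.g. $3.4 or 3.5%
  | (?:\w+\.)+\w+(?:\.)*    # abbreviations and URLs, e.g. U.S.A. or arxiv.org
  | \w+(?:[-']\w+)*         # words, optionally with hyphens or apostrophes
  | \S\w*                   # any other non-whitespace character (punctuation)
""", re.VERBOSE)

sentence = ("Did you spend $3.4 on arxiv.org for your pre-print?"
            " No, it's free! It's ...")
tokens = token_pattern.findall(sentence)
print(tokens)
```

The order of the alternatives matters: the regex engine tries them from top to bottom at each position, so the more specific patterns must come before the generic word and punctuation patterns.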

NLTK is a Python-based NLP toolkit, and it can also be used to implement the regular-expression-based tokenization described above.

import re
import nltk
# Import the NLTK tokenizers
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize

# sentence and pattern are the ones defined in the previous examples
tokens = regexp_tokenize(sentence,pattern)
print(tokens)

(3) Subword-based tokenization

This section walks through a token learner based on byte pair encoding (BPE).

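Given an initial vocabulary containing all characters (e.g., {A, B, C, D, ..., a, b, c, d, ...}), the token learner builds the vocabulary by repeating the following steps: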

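(1) Find the two symbols that most frequently occur next to each other in the training corpus, called $C_1$ and $C_2$ here;
(2) Add the newly combined symbol $C_1C_2$ to the vocabulary;
(3) Replace every adjacent occurrence of $C_1$ and $C_2$ in the training corpus with $C_1C_2$;
(4) Repeat the steps above $k$ times.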

Suppose we have a training corpus containing the pinyin of some directions and Chinese place names:

nan nan nan nan nan nanjing nanjing beijing beijing beijing beijing beijing beijing dongbei dongbei dongbei bei bei

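First, we split the corpus into tokens on spaces, then append the special symbol "_" to each token as an end-of-word marker. This makes it easier to tell apart words that contain similar substrings (for example, the "al" in "formal" versus the "al" in "almost").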

Step 1: build the initial vocabulary from the corpus:

corpus = "nan nan nan nan nan nanjing nanjing beijing beijing "+\
    "beijing beijing beijing beijing dongbei dongbei dongbei bei bei"
tokens = corpus.split(' ')

# Build the initial character-level vocabulary
vocabulary = set(corpus) 
vocabulary.remove(' ')
vocabulary.add('_')
vocabulary = sorted(list(vocabulary))

# Build a dictionary of word counts and symbol splits from the corpus
corpus_dict = {}
for token in tokens:
    key = token+'_'
    if key not in corpus_dict:
        corpus_dict[key] = {"split": list(key), "count": 0}
    corpus_dict[key]['count'] += 1

print("Corpus:")
for key in corpus_dict:
    print(corpus_dict[key]['count'], corpus_dict[key]['split'])
print(f"Vocabulary: {vocabulary}")
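Step (1) of the learner, finding the most frequent adjacent pair, can be looked at in isolation. The sketch below counts adjacent character pairs in the same corpus with `collections.Counter`. It keys pairs by tuple rather than by concatenated string, which is a choice made here for illustration.

```python
from collections import Counter

corpus = ("nan nan nan nan nan nanjing nanjing beijing beijing "
          "beijing beijing beijing beijing dongbei dongbei dongbei bei bei")

# Word frequencies, with "_" appended as the end-of-word marker
word_counts = Counter(word + "_" for word in corpus.split())

# Count every pair of adjacent symbols, weighted by the word frequency
pair_counts = Counter()
for word, count in word_counts.items():
    symbols = list(word)
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += count

top_pairs = pair_counts.most_common(3)
print(top_pairs)
```

Three pairs tie for first place with a count of 11, so which one is merged first depends on how ties are broken.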

Step 2: the token learner iteratively merges symbols and adds each new symbol to the vocabulary:

for step in range(9):
    # To print the result of every iteration, change max_print_step to 999
    max_print_step = 3
    if step < max_print_step or step == 8: 
        print(f"Iteration {step+1}")
    split_dict = {}
    for key in corpus_dict:
        splits = corpus_dict[key]['split']
        # Iterate over all adjacent symbols and count them
        for i in range(len(splits)-1):
            # Combine two adjacent symbols into a candidate new symbol
            current_group = splits[i]+splits[i+1]
            if current_group not in split_dict:
                split_dict[current_group] = 0
            split_dict[current_group] += corpus_dict[key]['count']

    group_hist=[(k, v) for k, v in sorted(split_dict.items(), \
        key=lambda item: item[1],reverse=True)]
    if step < max_print_step or step == 8:
        print(f"Top 5 most frequent symbol pairs: {group_hist[:5]}")
    
    merge_key = group_hist[0][0]
    if step < max_print_step or step == 8:
        print(f"Symbol merged in this iteration: {merge_key}")
    for key in corpus_dict:
        if merge_key in key:
            new_splits = []
            splits = corpus_dict[key]['split']
            i = 0
            while i < len(splits):
                if i+1>=len(splits):
                    new_splits.append(splits[i])
                    i+=1
                    continue
                if merge_key == splits[i]+splits[i+1]:
                    new_splits.append(merge_key)
                    i+=2
                else:
                    new_splits.append(splits[i])
                    i+=1
            corpus_dict[key]['split']=new_splits
            
    vocabulary.append(merge_key)
    if step < max_print_step or step == 8:
        print()
        print("Corpus after this iteration:")
        for key in corpus_dict:
            print(corpus_dict[key]['count'], corpus_dict[key]['split'])
        print(f"Vocabulary: {vocabulary}")
        print()
        print('-------------------------------------')

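Once the vocabulary has been learned, the BPE token segmenter processes a new sentence by greedily merging characters in the order in which the symbols were learned. For example, given the input "nanjing beijing" and the vocabulary from the example above, it first merges "n" and "g" into "ng", then forms "be", "bei", and so on, until the final tokenization is: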

ordered_vocabulary = {key: x for x, key in enumerate(vocabulary)}
sentence = "nanjing beijing"
print(f"Input sentence: {sentence}")
tokens = sentence.split(' ')
tokenized_string = []
for token in tokens:
    key = token+'_'
    splits = list(key)
    # Used to leave the loop once no further merge is possible
    flag = 1
    while flag:
        flag = 0
        split_dict = {}
        # Iterate over all adjacent symbols
        for i in range(len(splits)-1): 
            # Combine two adjacent symbols into a candidate symbol
            current_group = splits[i]+splits[i+1] 
            # Skip the candidate if it is not in the vocabulary
            if current_group not in ordered_vocabulary:
                continue
            if current_group not in split_dict:
                # Record the candidate together with its position in the vocabulary
                split_dict[current_group] = ordered_vocabulary[current_group] 
                flag = 1
        if not flag:
            break
            
        # Sort the candidates by priority (ascending, so earlier-learned symbols come first)
        group_hist=[(k, v) for k, v in sorted(split_dict.items(),\
            key=lambda item: item[1])] 
        # The candidate with the highest priority
        merge_key = group_hist[0][0] 
        new_splits = []
        i = 0
        # Produce the new split by merging the highest-priority candidate
        while i < len(splits):
            if i+1>=len(splits):
                new_splits.append(splits[i])
                i+=1
                continue
            if merge_key == splits[i]+splits[i+1]:
                new_splits.append(merge_key)
                i+=2
            else:
                new_splits.append(splits[i])
                i+=1
        splits=new_splits
    tokenized_string+=splits

print(f"Tokens: {tokenized_string}")
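The learner and the segmenter above can also be condensed into three small functions. This is a sketch under stated assumptions rather than a reference implementation: the function names `merge_pair`, `learn_bpe`, and `bpe_tokenize` are invented here, pairs are keyed by tuple instead of by concatenated string, and ties are broken in favor of the pair seen first, which mirrors the stable sort used above.

```python
def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus, num_merges):
    """Return the list of merged pairs, in the order they were learned."""
    words = {}  # word (with "_" marker) -> [current symbols, frequency]
    for token in corpus.split():
        entry = words.setdefault(token + "_", [list(token + "_"), 0])
        entry[1] += 1
    merges = []
    for _ in range(num_merges):
        pair_counts = {}
        for symbols, count in words.values():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] = pair_counts.get(pair, 0) + count
        if not pair_counts:
            break
        # max() keeps the first pair seen among ties
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        for entry in words.values():
            entry[0] = merge_pair(entry[0], best)
    return merges

def bpe_tokenize(sentence, merges):
    """Apply the learned merges to each word, earliest-learned merge first."""
    rank = {pair: i for i, pair in enumerate(merges)}
    output = []
    for token in sentence.split():
        symbols = list(token + "_")
        while True:
            candidates = [p for p in zip(symbols, symbols[1:]) if p in rank]
            if not candidates:
                break
            symbols = merge_pair(symbols, min(candidates, key=rank.get))
        output.extend(symbols)
    return output

corpus = ("nan nan nan nan nan nanjing nanjing beijing beijing "
          "beijing beijing beijing beijing dongbei dongbei dongbei bei bei")
merges = learn_bpe(corpus, 9)
learned_symbols = ["".join(pair) for pair in merges]
tokens = bpe_tokenize("nanjing beijing", merges)
print(learned_symbols)
print(tokens)
```

Keeping the merges as an ordered list makes the priority explicit: a pair's index in `merges` is its rank at tokenization time.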

II. Word Normalization

(1) Case folding

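Case folding is the process of converting all uppercase English letters to lowercase. In search scenarios, users tend to type in lowercase, but to a computer uppercase and lowercase letters are different characters. When a user searches for a proper noun that contains capital letters, such as a person's name or a place name, the correct results may therefore be hard to match.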

# Case Folding
sentence = "Let's study Hands-on-NLP"
print(sentence.lower())
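For English text `lower()` is sufficient. For text that may contain other scripts, Python also provides `str.casefold()`, a more aggressive folding intended for caseless matching. The German example below is illustrative and not from the original text.

```python
# "Straße" is written with an escape for the sharp s
words = ["Stra\u00dfe", "STRASSE", "strasse"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)
print(folded)
# Number of distinct forms left after each normalization
print(len(set(lowered)), len(set(folded)))
```

With `lower()` two distinct forms remain, while `casefold()` maps all three spellings to the same string.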

(2) Lemmatization

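In languages such as English, many words change form depending on the subject, context, tense, and so on, while the meanings expressed by these forms are close or even identical. For example, am, is, and are can all be reduced to be, and the noun cat appears as cat, cats, cat's, cats', and so on. These variants have relatively little effect on the meaning of the text, but they greatly enlarge the vocabulary and therefore raise the cost of building NLP models. For some text-processing problems, all words are therefore lemmatized, that is, reduced to their base form (lemma). People learning these languages can look up a word's base form in a dictionary; similarly, a computer can perform lemmatization by building a dictionary: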

# Build the lemma dictionary
lemma_dict = {'am': 'be','is': 'be','are': 'be','cats': 'cat',\
    "cats'": 'cat',"cat's": 'cat','dogs': 'dog',"dogs'": 'dog',\
    "dog's": 'dog', 'chasing': "chase"}

sentence = "Two dogs are chasing three cats"
words = sentence.split(' ')
print(f'Before lemmatization: {words}')
lemmatized_words = []
for word in words:
    if word in lemma_dict:
        lemmatized_words.append(lemma_dict[word])
    else:
        lemmatized_words.append(word)

print(f'After lemmatization: {lemmatized_words}')
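The dictionary lookup above is case-sensitive, so a capitalized "Dogs" or "ARE" would pass through unchanged. A minimal sketch that combines case folding with the dictionary lookup; the helper name `normalize` and the mixed-case input sentence are illustrative assumptions.

```python
lemma_dict = {'am': 'be', 'is': 'be', 'are': 'be', 'cats': 'cat',
              "cats'": 'cat', "cat's": 'cat', 'dogs': 'dog', "dogs'": 'dog',
              "dog's": 'dog', 'chasing': 'chase'}

def normalize(words, lemmas):
    """Lowercase each word, then map it to its lemma if the dictionary has one."""
    return [lemmas.get(word.lower(), word.lower()) for word in words]

normalized = normalize("Two Dogs ARE chasing three cats".split(), lemma_dict)
print(normalized)
```

`dict.get` with a default replaces the explicit if/else branch: words that are missing from the dictionary are kept in their lowercased form.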

Alternatively, the dictionary bundled with NLTK can be used for lemmatization:

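import nltk
# Import the NLTK tokenizer and lemmatizer, plus wordnet for the part-of-speech constants
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download the tokenizer data and the WordNet data.
# Recent NLTK releases may look for 'punkt_tab' instead of 'punkt'; download that
# resource as well if word_tokenize reports a missing resource.
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)


lemmatizer = WordNetLemmatizer()
sentence = "Two dogs are chasing three cats"

words = word_tokenize(sentence)
print(f'Before lemmatization: {words}')
lemmatized_words = []
for word in words:
    # Every token is lemmatized as a verb here (wordnet.VERB)
    lemmatized_words.append(lemmatizer.lemmatize(word, wordnet.VERB))

print(f'After lemmatization: {lemmatized_words}')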

III. Sentence Segmentation

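In many practical scenarios we need to process very long texts, such as news articles, financial reports, and logs. Having a computer process an entire text at once is very difficult, so the text is split into sentences that are processed separately. The most common approach to sentence segmentation is to split on punctuation such as "!", "?", and ".". In some languages, however, certain sentence delimiters are ambiguous. For example, the English period "." also serves as an abbreviation marker (as in "Inc.", "Ph.D.", and "Mr.") and as a decimal point (as in "3.5" and ".3%"). These ambiguities make sentence segmentation difficult. A common solution is to tokenize first, using a regular-expression-based or machine-learning-based tokenizer to break the text into tokens, and then determine sentence boundaries from the delimiter tokens. For example: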

# regexp_tokenize and pattern are reused from the regular-expression tokenization section
sentence_spliter = set([".","?",'!','...'])
sentence = "Did you spend $3.4 on arxiv.org for your pre-print? " + \
    "No, it's free! It's ..."

tokens = regexp_tokenize(sentence,pattern)

sentences = []
boundary = [0]
for token_id, token in enumerate(tokens):
    # Check whether the token is a sentence boundary
    if token in sentence_spliter:
        # If it is, add the tokens of the finished sentence to the result
        sentences.append(tokens[boundary[-1]:token_id+1]) 
        # Record the start position of the next sentence in boundary
        boundary.append(token_id+1) 

if boundary[-1]!=len(tokens):
    sentences.append(tokens[boundary[-1]:])

print("Sentence segmentation result:")
for seg_sentence in sentences:
    print(seg_sentence)
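The code above depends on NLTK and on the `pattern` variable built up earlier. For readers who want to try the idea on its own, here is a self-contained sketch that uses only the standard `re` module. The names `TOKEN_PATTERN`, `SENTENCE_END`, and `split_sentences`, as well as the test sentence, are assumptions made for this example.

```python
import re

# The combined tokenization pattern from the regular-expression section
TOKEN_PATTERN = re.compile(
    r"\.\.\.|\$?\d+(?:\.\d+)?%?|(?:\w+\.)+\w+(?:\.)*|\w+(?:[-']\w+)*|\S\w*")
SENTENCE_END = {".", "?", "!", "..."}

def split_sentences(text):
    """Tokenize `text`, then cut the token list after each sentence-final token."""
    sentences, current = [], []
    for token in TOKEN_PATTERN.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final delimiter
        sentences.append(current)
    return sentences

text = "I paid $3.4 on arxiv.org today. Is 3.5% a lot? No"
result = split_sentences(text)
for tokens in result:
    print(tokens)
```

Because "$3.4", "arxiv.org", and "3.5%" are each kept as a single token, the dots inside them do not trigger a split. The approach still has limits: an abbreviation such as "Mr." is tokenized as "Mr" followed by ".", so it would be treated as a sentence end, and a period directly after a URL-like token is absorbed into that token. Handling such cases typically requires an abbreviation list or a learned model.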
