From TF-IDF to BM25

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.

1. TF

The number of times a given word occurs in a document, divided by the total number of words in that document.


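$$\text{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}$$

Here $d$ is a single document; when there are several documents, TF is computed separately for each one.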
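
As a minimal sketch of the formula above, the snippet below computes the term frequency of one word in a single tokenized document. The function name `term_frequency` and the sample sentence are illustrative assumptions, not part of the original code.

```python
def term_frequency(term, tokens):
    """Return count(term) / len(tokens) for one tokenized document."""
    if not tokens:  # an empty document has no term frequencies
        return 0.0
    return tokens.count(term) / len(tokens)


doc = ["my", "dog", "chases", "another", "dog"]
print(term_frequency("dog", doc))  # 2 occurrences / 5 words = 0.4
```

Dividing by the document length keeps long documents from winning simply because they repeat every word more often.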

2. IDF

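The total number of documents divided by the number of documents containing the term, followed by taking the logarithm of the quotient.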

The most common formula for the inverse document frequency (IDF) is $\text{IDF} = \log\frac{N}{DF}$, where:
- $N$ is the total number of documents. For example, for a collection of 100 articles, $N = 100$.
- $DF$ (Document Frequency) is the number of documents containing a particular term. For example, if the term "artificial intelligence" appears in 10 of the 100 articles, then $DF = 10$ for that term.
In this example, the IDF of "artificial intelligence" is $\text{IDF} = \log\frac{100}{10} = \log 10$.

To avoid a zero denominator, the formula is usually smoothed by adding one to both the numerator and the denominator: $\text{IDF} = \log\frac{N + 1}{DF + 1}$
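
The two IDF variants above can be compared with a short sketch. The text does not fix the base of the logarithm; the natural log is assumed here, and the function names `idf` and `smoothed_idf` are illustrative.

```python
import math


def idf(n_docs, doc_freq):
    """Plain IDF: log(N / DF). Raises ZeroDivisionError when doc_freq is 0."""
    return math.log(n_docs / doc_freq)


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: log((N + 1) / (DF + 1)), safe when doc_freq is 0."""
    return math.log((n_docs + 1) / (doc_freq + 1))


print(idf(100, 10))           # log(10), about 2.303
print(smoothed_idf(100, 10))  # log(101 / 11), about 2.217
print(smoothed_idf(100, 0))   # log(101), about 4.615; no division by zero
```

The smoothed value is slightly lower than the plain one, but it stays defined for a term that appears in no document.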

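The fewer documents contain a term t, the larger its IDF, which indicates that the term discriminates well between documents. Conversely, the more documents a term appears in, the less important it is.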

3. TF-IDF

The TF-IDF weight of a term in a document is the product of the two quantities above: $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{IDF}(t)$. A term scores highly when it is frequent within the document but rare across the collection.

4. Applications

(1) search engines; (2) keyword extraction; (3) text similarity; (4) text summarization

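5. Advantages and disadvantages

Advantages:
- Simple and easy to understand
Disadvantages:
- Ignores semantics: documents are treated as bags of words, so word order and semantic relations are lost, and results suffer on tasks that require understanding meaning.
- Sensitive to low-frequency terms: because of the inverse document frequency, rare terms can receive excessive weight, so misspellings and obscure words can distort the results.
- No context awareness: differences in a word's meaning across contexts cannot be taken into account, which limits its use in tasks that need fine-grained semantic understanding.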
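
The sensitivity to low-frequency terms can be illustrated with a small sketch. The collection size and the document frequencies below are made-up numbers chosen only for illustration, and the plain IDF with the natural log is assumed.

```python
import math

N = 1000  # hypothetical collection size

# Hypothetical document frequencies. The misspelling "retreival" is rare
# by accident, not because it carries more information.
doc_freq = {"search": 300, "retrieval": 40, "retreival": 1}

# Plain IDF, log(N / DF), for each term
idf = {term: math.log(N / df) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term:<10} df={doc_freq[term]:<4} idf={value:.3f}")
```

The misspelled term ends up with the largest IDF of the three (about 6.908 versus 3.219 and 1.204), which is exactly the distortion described above.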

BM25

Formula

The BM25 formula:
$$\text{BM25}(q, D) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{\text{avgDL}} \right)}$$

where:

- $q$ is the query, $D$ is a document, and $q_i$ is the $i$-th of the $n$ terms in the query.
- $f(q_i, D)$ is the frequency of term $q_i$ in document $D$ (Term Frequency, TF).
- $|D|$ is the length of document $D$ (the total number of words in it).
- $\text{avgDL}$ is the average length of the documents in the collection.
- $k_1$ and $b$ are two tuning parameters, with typical values:
  - $k_1 \in [1.2, 2.0]$, which controls how strongly term frequency affects the score.
  - $b \in [0, 1]$, which controls document-length normalization.
- $\text{IDF}$ is the inverse document frequency, computed as:

$$\text{IDF}(q_i) = \log \left( \frac{N - \text{df}(q_i) + 0.5}{\text{df}(q_i) + 0.5} + 1.0 \right)$$

where:
- $N$ is the total number of documents in the collection.
- $\text{df}(q_i)$ is the number of documents containing the term $q_i$ (the document frequency).
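
As a rough sketch, the IDF formula above can be evaluated for a few document frequencies. The function name `bm25_idf` and the collection size of 100 are assumptions made for this example; the natural log is used.

```python
import math


def bm25_idf(n_docs, doc_freq):
    """BM25 IDF with the +1.0 inside the log, as in the formula above."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


N = 100  # hypothetical collection size
for df in (1, 10, 50, 100):
    print(df, round(bm25_idf(N, df), 3))
```

The value falls from about 4.21 for a term found in one document to about 0.005 for a term found in all 100, yet it never goes negative. Without the +1.0, the logarithm would be negative for any term that occurs in more than half of the documents.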

Improvements

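Non-linear term-frequency (TF) function: in TF-IDF, the TF component grows linearly with the number of times a word occurs in a document. BM25 replaces it with a saturating non-linear function: as a word's frequency rises, each additional occurrence contributes less to the relevance score, and the TF factor can never exceed $k_1 + 1$. The parameter $k_1$ controls how quickly this saturation sets in.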
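
The saturation effect can be seen by evaluating the TF factor of the BM25 formula on its own. This is a minimal sketch that assumes b = 0, so the length term drops out; the name `tf_component` is illustrative.

```python
def tf_component(freq, k1=1.2):
    """BM25 term-frequency factor with length normalization switched off (b = 0)."""
    return freq * (k1 + 1) / (freq + k1)


for freq in (1, 2, 5, 10, 100):
    print(freq, round(tf_component(freq), 3))
```

With k1 = 1.2 the factor goes from 1.0 at one occurrence to about 1.964 at ten and about 2.174 at a hundred, approaching but never reaching the limit k1 + 1 = 2.2. A linear TF would instead have grown a hundredfold.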

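Document-length normalization: the parameter $b$ adjusts how strongly document length affects the score, so that long documents do not score higher merely because they contain more words.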
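
To make the role of b concrete, the sketch below evaluates the full TF factor of the BM25 formula for the same raw count in documents of different lengths. The function name `bm25_tf`, the count of 3, and the lengths are assumptions chosen for illustration.

```python
def bm25_tf(freq, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency factor including document-length normalization."""
    norm = 1 - b + b * doc_len / avg_len
    return freq * (k1 + 1) / (freq + k1 * norm)


# Same raw count (3 occurrences) in a short, an average, and a long document.
for doc_len in (50, 100, 400):
    print(doc_len, round(bm25_tf(3, doc_len, avg_len=100), 3))

# With b = 0 the document length has no effect at all.
print(bm25_tf(3, 50, 100, b=0) == bm25_tf(3, 400, 100, b=0))  # True
```

With b = 0.75 the factor is about 1.76 for the short document, 1.571 for the average one, and 0.957 for the long one, so three matches in a short text count for more than three matches in a long one.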

Improved IDF: common words are penalized more heavily, which reduces their contribution to relevance.

Flexible parameter tuning: adjusting $k_1$ and $b$ lets the model be tuned for different application scenarios.

Code implementation

TF-IDF:


import math


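# English stop-word list; extend it as needed, or swap in a Chinese list when processing Chinese text
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is", "are", "was", "were", "it", "this", "that", "for",
              "on", "at", "by", "with", "from", "as", "but", "not", "if", "you", "he", "she", "they", "we", "my", "your",
              "his", "her", "its", "our", "their"}


def preprocess_text(text, remove_stop_words=False):
    """
    Preprocess a tokenized document (a list of words).
    Filtering is off by default, so the tokens are returned unchanged;
    pass remove_stop_words=True to lowercase every word and drop stop words.
    """
    if remove_stop_words:
        return [word.lower() for word in text if word.lower() not in STOP_WORDS]
    return text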


def compute_tf_for_term(documents, term):
    """
    Compute the term frequency (TF) of the given term in every document
    """
    tf_dict = {}
    for doc_index, doc in enumerate(documents):
        doc = preprocess_text(doc)
        term_count = doc.count(term)
        total_word_count = len(doc)
        if total_word_count > 0:
            tf_dict[doc_index] = term_count / total_word_count
        else:
            tf_dict[doc_index] = 0
    return tf_dict


def compute_idf_for_term(documents, term):
    """
    Compute the inverse document frequency (IDF) of the given term
    """
    doc_num = len(documents)
    doc_count_with_term = 0
    for doc in documents:
        doc = preprocess_text(doc)
        if term in doc:
            doc_count_with_term += 1
    if doc_count_with_term > 0:
        return math.log(doc_num / doc_count_with_term)
    return 0


def compute_tf_idf_for_term(documents, term):
    """
    Compute the TF-IDF value of the given term for every document
    """
    tf_dict = compute_tf_for_term(documents, term)
    idf = compute_idf_for_term(documents, term)
    tf_idf_dict = {}
    for doc_index in tf_dict:
        tf_idf_dict[doc_index] = tf_dict[doc_index] * idf
    return tf_idf_dict


if __name__ == '__main__':
    documents = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    term = "dog"  # the term whose TF-IDF values are computed; replace as needed
    tf_idf_result = compute_tf_idf_for_term(documents, term)
    for doc_index, value in tf_idf_result.items():
        print(f"TF-IDF of term {term} in document {doc_index + 1}: {value}")

BM25:


import math
import re
from collections import defaultdict


class BM25:
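    def __init__(self, documents, k1=1.2, b=0.75):
        """
        Initialize the BM25 model
        :param documents: the document collection; each element is one document given as a string (it is preprocessed internally)
        :param k1: BM25 tuning parameter controlling term-frequency saturation, default 1.2
        :param b: BM25 tuning parameter controlling document-length normalization, default 0.75
        """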
        self.documents = documents
        self.k1 = k1
        self.b = b
        self.avg_doc_length = self._compute_avg_doc_length()
        self.freq_matrix = self._build_freq_matrix()
        self.idf_dict = self._compute_idf()

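    def _preprocess_text(self, text):
        """
        Preprocess text: lowercase it, strip punctuation, and tokenize (a simple whitespace split, which can be replaced with a dedicated tokenizer such as jieba)
        :param text: the input text string
        :return: the list of processed words
        """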
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        words = text.split()
        return words

    def _compute_avg_doc_length(self):
        """
        Compute the average document length over the collection
        :return: the average document length
        """
        total_length = sum(len(self._preprocess_text(doc)) for doc in self.documents)
        return total_length / len(self.documents) if len(self.documents) > 0 else 0

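    def _build_freq_matrix(self):
        """
        Build the term-frequency matrix, recording how often each word occurs in each document
        :return: the term-frequency matrix (a dict keyed by word; each value is an inner dict mapping document index to term frequency)
        """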
        freq_matrix = defaultdict(lambda: defaultdict(int))
        for doc_index, doc in enumerate(self.documents):
            words = self._preprocess_text(doc)
            for word in words:
                freq_matrix[word][doc_index] += 1
        return freq_matrix

    def _compute_idf(self):
        """
        Compute the inverse document frequency (IDF) of every word
        :return: the IDF dict (word -> IDF value)
        """
        doc_num = len(self.documents)
        idf_dict = defaultdict(int)
        word_doc_count = defaultdict(int)
        for word in self.freq_matrix:
            for doc_index in self.freq_matrix[word]:
                word_doc_count[word] += 1
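            # The +1.0 inside the log matches the IDF formula given earlier and keeps
            # the value positive even for words that occur in most documents
            idf_dict[word] = math.log(
                (doc_num - word_doc_count[word] + 0.5) / (word_doc_count[word] + 0.5) + 1.0)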
        return idf_dict

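    def get_score(self, query):
        """
        Compute the BM25 score of the query against every document in the collection
        :param query: the list of query terms (string terms are normalized the same way as the documents)
        :return: the BM25 score dict (document index -> BM25 score)
        """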
        score_dict = defaultdict(float)
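        # Normalize string terms like the documents; a term that is empty after
        # preprocessing (e.g. pure punctuation) is dropped instead of raising IndexError
        query = [token for word in query
                 for token in (self._preprocess_text(word) if isinstance(word, str) else [word])]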
        for word in query:
            if word in self.idf_dict:
                idf = self.idf_dict[word]
                for doc_index in range(len(self.documents)):
                    # .get avoids inserting zero entries into the defaultdict on lookup
                    freq = self.freq_matrix[word].get(doc_index, 0)
                    doc_length = len(self._preprocess_text(self.documents[doc_index]))
                    score = idf * (freq * (self.k1 + 1) / (
                            freq + self.k1 * (1 - self.b + self.b * doc_length / self.avg_doc_length)))
                    score_dict[doc_index] += score
        return score_dict


if __name__ == "__main__":
    # Sample document collection
    documents = [
        "This is an article about natural language processing.",
        "Natural language processing techniques are very important in today's society.",
        "The article mainly introduces some applications of natural language processing."
    ]

    # Create a BM25 instance from the document collection
    bm25 = BM25(documents)

    # Sample query terms
    query = ["natural", "language", "processing"]

    # Score the query against every document
    scores = bm25.get_score(query)

    # Pair each document index with its score so they can be sorted
    doc_score_tuples = [(doc_index, score) for doc_index, score in scores.items()]

    # Sort the documents by score, highest first
    doc_score_tuples = sorted(doc_score_tuples, key=lambda x: x[1], reverse=True)

    for doc_index, score in doc_score_tuples:
        print(f"BM25 score of document {doc_index}: {score}")
