【HuggingFace LLM】Semantic Search with FAISS

```python
from datasets import load_dataset

issues_dataset = load_dataset("susuahi/github-issues", split="train")
issues_dataset

# Repo card metadata block was not found. Setting CardData to empty.
# WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.

#Dataset({
#    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'type', 'active_lock_reason', 'draft', 'pull_request', 'body', 'closed_by', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'sub_issues_summary', 'issue_dependencies_summary', 'is_pull_request'],
#    num_rows: 7815
#})
```

With `datasets.load_dataset`, passing the `account/dataset_name` identifier is all it takes to load the dataset.
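Only a handful of these columns are used in the rest of this walkthrough (the explode steps below touch `title`, `body`, `html_url`, and `comments`), so it is convenient to drop everything else first. A minimal sketch, assuming those four are the only columns you want to keep:

```python
# Keep only the columns this walkthrough actually uses; drop the rest.
columns_to_keep = {"title", "body", "html_url", "comments"}
columns_to_remove = set(issues_dataset.column_names) - columns_to_keep
issues_dataset = issues_dataset.remove_columns(list(columns_to_remove))
```

Trimming the columns also keeps the batched `Dataset.map` explode in the second method below simple, since it only duplicates these four fields.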

Data augmentation

When the fields in a sample are not in one-to-one correspondence, for example:

```python
{'html_url': 'https://github.com/huggingface/datasets/issues/7879',
 'title': 'python core dump when downloading dataset',
 'comments': ["Hi @hansewetz I'm curious, for me it works just fine. Are you still observing the issue?",
  "Yup ... still the same issue.\nHowever, after adding a ```sleep(1)```call after the ```for```loop by accident during debugging, the program terminates properly (not a good solution though ... :-) ).\nAre there some threads created that handles the download that are still running when the program exits?\nHaven't had time yet to go through the code in ```iterable_dataset.py::IterableDataset```\n",
  "Interesting, I was able to reproduce it, on a jupyter notebook the code runs just fine, as a Python script indeed it seems to never finish running (which is probably leading to the core dumped error). I'll try and take a look at the source code as well to see if I can figure it out.",
  'Hi @hansewetz ,\nIf possible can I be assigned with this issue?\n\n',
  "```If possible can I be assigned with this issue?```\nHi, I don't know how assignments work here and who can take decisions about assignments ... ",
  "Hi @hansewetz and @Aymuos22, I have made some progress:\n\n1) Confirmed last working version is 3.1.0\n\n2) From 3.1.0 to 3.2.0, there was a change in how parquet files are read (see [here]([https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/parquet/parquet.py/#168).\n\nThe](https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/parquet/parquet.py/#168\).\n\nThe) issue seems to be the following code:\n\n```\nparquet_fragment.to_batches(\n                                batch_size=batch_size,\n                                columns=self.config.columns,\n                                filter=filter_expr,\n                                batch_readahead=0,\n                                fragment_readahead=0,\n                            )\n```\n\nAdding a `use_threads=False` parameter to the `to_batches` call solves the bug. However, this seems far from an optimal solution, since we'd like to be able to use multiple threads for reading the fragments. \n\nI'll keep investigating to see if there's a better solution.",
  "Hi @lhoestq, may I ask if the current behaviour was expected by you folks and you don't think it needs solving, or should I keep on investigating a compromise between using multithreading / avoid unexpected behaviour? Thanks in advance :) ",
  'Having the same issue. the code never stops executing. Using datasets 4.4.1\nTried with "islice" as well. When the streaming flag is True, the code doesn\'t end execution. On vs-code.',
  'The issue on pyarrow side is here: https://github.com/apache/arrow/issues/45214 and the original issue in `datasets` here: https://github.com/huggingface/datasets/issues/7357\n\nIt would be cool to have a fix on the pyarrow side',
  "Thank you very much @lhoestq, I'm reading the issue thread in pyarrow and realizing you've been raising awareness around this for a long time now. When I have some time I'll look at @pitrou's PR to see if I can get a better understanding of what's going on on pyarrow. "],
 'body': '### Describe the bug\n\nWhen downloading a dataset in streamed mode and exiting the program before the download completes, the python program core dumps when exiting:\n\n```\nterminate called without an active exception\nAborted (core dumped)\n```\n\nTested with python 3.12.3, python 3.9.21\n\n\n\n### Steps to reproduce the bug\n\nCreate python venv:\n\n```bash\npython -m venv venv\n./venv/bin/activate\npip install datasets==4.4.1\n```\n\nExecute the following program:\n\n```\nfrom datasets import load_dataset\nds = load_dataset("HuggingFaceFW/fineweb-2", \'hrv_Latn\', split="test", streaming=True)\nfor sample in ds:\n    break\n```\n\n\n### Expected behavior\n\nClean program exit\n\n### Environment info\n\ndescribed above\n\n**note**: the example works correctly when using ```datasets==3.1.0```'}

```

Here the `comments` field is in a many-to-one relationship with fields such as `body`: one issue carries a whole list of comments. Converting this into a one-to-one relationship (one comment per row) expands the number of samples.

There are two ways to do this:

  1. Use the `explode` method from pandas.
```python
issues_dataset.set_format("pandas")
df = issues_dataset[:]
# df["comments"][0].tolist()

comments_df = df.explode("comments", ignore_index=True)
# comments_df.head(4)

from datasets import Dataset
comments_dataset = Dataset.from_pandas(comments_df)
# comments_dataset
```

Calling `explode` on the field that holds multiple values splits each list entry into its own row, so it ends up one-to-one with the other fields.
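One thing to watch: issues that had no comments at all come out of `explode` as rows with a missing `comments` value. A small hedged sketch for dropping them after converting back to a `Dataset`:

```python
# Rows whose comment list was empty become None after explode + from_pandas; drop them.
comments_dataset = comments_dataset.filter(lambda x: x["comments"] is not None)
```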

  2. Use the `Dataset.map` method directly.
```python
import itertools

def flatten_key(nested_list):
  return list(itertools.chain(*nested_list))

def comments_explode(example):
  # print('initial example is:',example)
  for idx, comment in enumerate(example['comments']):
    # print(comment)
    example["title"][idx] = [example["title"][idx] for _ in range(len(comment))]
    example["html_url"][idx] = [example["html_url"][idx] for _ in range(len(comment))]
    example["body"][idx] = [example["body"][idx] for _ in range(len(comment))]
  # print('medium example is:',example)
  for k,v in example.items():
    example[k] = flatten_key(v)
  # print('final example is:',example)
  return example

issues_dataset = issues_dataset.map(comments_explode, batched=True)
issues_dataset
```

With `batched=True`, the map function may return more rows than it receives: the other fields are returned duplicated once per comment, so a single input sample yields as many output rows as it has comments.

Semantic matching with FAISS

FAISS (Facebook AI Similarity Search) is a library that provides efficient algorithms for quickly searching and clustering embedding vectors; 🤗 Datasets exposes it as a special index structure that you can attach to a dataset column.

At its core, FAISS makes querying high-dimensional vectors fast by building a specialized index over them.

The workflow has two main steps: 1. build the index; 2. match the query vector against it.
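The embedding code below reads from a `text` column which, as the summary at the end of this post describes, is the concatenation of `title`, `body`, and the individual comment. A minimal sketch of that step, assuming the exploded `comments_dataset` from the previous section:

```python
# Concatenate title, body, and the individual comment into a single "text" field.
def concatenate_text(example):
    title = example["title"] or ""
    body = example["body"] or ""
    comment = example["comments"] or ""
    return {"text": title + " \n " + body + " \n " + comment}

comments_dataset = comments_dataset.map(concatenate_text)
```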

```python
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

import torch
device = torch.device("cuda")
model.to(device)

def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

# embedding = get_embeddings(comments_dataset["text"][0])
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)
```

Before FAISS can do any matching, each sample has to be tokenized and converted into a high-dimensional embedding vector.

multi-qa-mpnet-base-dot-v1 is trained so that the overall meaning of the whole sentence is pooled into the first token, `[CLS]`.

Taking `[:, 0]` therefore selects the embedding vector of the first token after tokenization, e.g. `[1, 20, 768]` -> `[1, 768]`.

Also, FAISS works on float32 NumPy arrays rather than CUDA tensors with gradients attached, so `.detach().cpu().numpy()` is needed to drop the autograd graph, move the embeddings to the CPU, and convert them to NumPy.
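As a quick sanity check (a small sketch reusing the `get_embeddings` helper defined above), you can confirm that CLS pooling leaves one 768-dimensional vector per input text:

```python
# Embed two texts and confirm the pooled shape: (batch_size, hidden_size) = (2, 768).
sample_embeddings = get_embeddings(comments_dataset["text"][:2])
print(sample_embeddings.shape)  # expected: torch.Size([2, 768])
```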

Building the index
```python
embeddings_dataset.add_faiss_index(column="embeddings")
```

Calling `add_faiss_index` builds a FAISS index over the `embeddings` column.
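The FAISS index lives in memory alongside the dataset rather than inside the Arrow table, so if you want to reuse it across sessions you can serialize it separately. A sketch, with a hypothetical file name:

```python
# Save the FAISS index to disk (the file name here is just an example).
embeddings_dataset.save_faiss_index("embeddings", "issues_embeddings.faiss")
# ...later, after rebuilding the dataset without the index, reattach it:
# embeddings_dataset.load_faiss_index("embeddings", "issues_embeddings.faiss")
```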

Question matching
```python
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape  # shape of the query embedding

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

import pandas as pd
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
```

`get_nearest_examples` runs a similarity search for the `question_embedding` query vector and returns the top-k nearest samples together with their scores.
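To inspect what came back, you can walk over the sorted DataFrame; a small sketch (the field names match the columns kept earlier):

```python
# Print the top matches together with their similarity scores.
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
```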

【Merge the `title`, `body`, and `comments` fields into a `text` column】

-> 【Tokenize and embed the `text` column to produce an `embeddings` column】

-> 【Build an index over the `embeddings` column with `add_faiss_index`】

-> 【Match the input vector against the index with `get_nearest_examples`】
