【HuggingFace LLM】FAISS语义搜索

python 复制代码

from datasets import load_dataset

issues_dataset = load_dataset("susuahi/github-issues", split="train")
issues_dataset

# Repo card metadata block was not found. Setting CardData to empty.
# WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.

#Dataset({
#    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'type', 'active_lock_reason', 'draft', 'pull_request', 'body', 'closed_by', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'sub_issues_summary', 'issue_dependencies_summary', 'is_pull_request'],
#    num_rows: 7815
#})

在datasets.load_dataset中，直接输入账户/数据集名称即可加载数据集。

数据增强

当在样本中存在不对照关系时，例如

复制代码

{'html_url': '[https://github.com/huggingface/datasets/issues/7879](https://github.com/huggingface/datasets/issues/7879)',
 'title': 'python core dump when downloading dataset',
 'comments': ["Hi @hansewetz I'm curious, for me it works just fine. Are you still observing the issue?",
  "Yup ... still the same issue.\nHowever, after adding a ```sleep(1)```call after the ```for```loop by accident during debugging, the program terminates properly (not a good solution though ... :-) ).\nAre there some threads created that handles the download that are still running when the program exits?\nHaven't had time yet to go through the code in ```iterable_dataset.py::IterableDataset```\n",
  "Interesting, I was able to reproduce it, on a jupyter notebook the code runs just fine, as a Python script indeed it seems to never finish running (which is probably leading to the core dumped error). I'll try and take a look at the source code as well to see if I can figure it out.",
  'Hi @hansewetz ,\nIf possible can I be assigned with this issue?\n\n',
  "```If possible can I be assigned with this issue?```\nHi, I don't know how assignments work here and who can take decisions about assignments ... ",
  "Hi @hansewetz and @Aymuos22, I have made some progress:\n\n1) Confirmed last working version is 3.1.0\n\n2) From 3.1.0 to 3.2.0, there was a change in how parquet files are read (see [here]([https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/parquet/parquet.py/#168).\n\nThe](https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/parquet/parquet.py/#168\).\n\nThe) issue seems to be the following code:\n\n```\nparquet_fragment.to_batches(\n                                batch_size=batch_size,\n                                columns=self.config.columns,\n                                filter=filter_expr,\n                                batch_readahead=0,\n                                fragment_readahead=0,\n                            )\n```\n\nAdding a `use_threads=False` parameter to the `to_batches` call solves the bug. However, this seems far from an optimal solution, since we'd like to be able to use multiple threads for reading the fragments. \n\nI'll keep investigating to see if there's a better solution.",
  "Hi @lhoestq, may I ask if the current behaviour was expected by you folks and you don't think it needs solving, or should I keep on investigating a compromise between using multithreading / avoid unexpected behaviour? Thanks in advance :) ",
  'Having the same issue. the code never stops executing. Using datasets 4.4.1\nTried with "islice" as well. When the streaming flag is True, the code doesn\'t end execution. On vs-code.',
  'The issue on pyarrow side is here: [https://github.com/apache/arrow/issues/45214](https://github.com/apache/arrow/issues/45214) and the original issue in `datasets` here: [https://github.com/huggingface/datasets/issues/7357\n\nIt](https://github.com/huggingface/datasets/issues/7357/n/nIt) would be cool to have a fix on the pyarrow side',
  "Thank you very much @lhoestq, I'm reading the issue thread in pyarrow and realizing you've been raising awareness around this for a long time now. When I have some time I'll look at @pitrou's PR to see if I can get a better understanding of what's going on on pyarrow. "],
 'body': '### Describe the bug\n\nWhen downloading a dataset in streamed mode and exiting the program before the download completes, the python program core dumps when exiting:\n\n```\nterminate called without an active exception\nAborted (core dumped)\n```\n\nTested with python 3.12.3, python 3.9.21\n\n\n\n### Steps to reproduce the bug\n\nCreate python venv:\n\n```bash\npython -m venv venv\n./venv/bin/activate\npip install datasets==4.4.1\n```\n\nExecute the following program:\n\n```\nfrom datasets import load_dataset\nds = load_dataset("HuggingFaceFW/fineweb-2", \'hrv_Latn\', split="test", streaming=True)\nfor sample in ds:\n    break\n```\n\n\n### Expected behavior\n\nClean program exit\n\n### Environment info\n\ndescribed above\n\n**note**: the example works correctly when using ```datasets==3.1.0```'}

其中comments字段和body字段明显是多对一 关系，可以通过转化为一对一 关系扩充样本数量。

有两种方法可以实现：

使用pandas中的explode方法。

python 复制代码

issues_dataset.set_format("pandas")
df = issues_dataset[:]
# df["comments"][0].tolist()

comments_df = df.explode("comments", ignore_index=True)
# comments_df.head(4)

from datasets import Dataset
comments_dataset = Dataset.from_pandas(comments_df)
# comments_dataset

对更多的那个字段进行explode后，直接使之分裂为与其他字段一对一。

直接使用Dataset的map方法

python 复制代码

import itertools

def flatten_key(nested_list):
  return list(itertools.chain(*nested_list))

def comments_explode(example):
  # print('initial example is:',example)
  for idx, comment in enumerate(example['comments']):
    # print(comment)
    example["title"][idx] = [example["title"][idx] for _ in range(len(comment))]
    example["html_url"][idx] = [example["html_url"][idx] for _ in range(len(comment))]
    example["body"][idx] = [example["body"][idx] for _ in range(len(comment))]
  # print('medium example is:',example)
  for k,v in example.items():
    example[k] = flatten_key(v)
  # print('final example is:',example)
  return example

issues_dataset = issues_dataset.map(comments_explode, batched=True)
issues_dataset

通过返回duplicated的其他字段，使得输入一个样本，返回对应comments长度的多个字段。

FAISS语义匹配

FAISS（Facebook AI Similarity Search ）是🤗 Datasets 中的一种特殊数据结构。它是一个库，提供了高效的算法来快速搜索和聚类嵌入向量。

FAISS的核心就是通过构造特殊索引来实现高维向量的查询。

主要分为两步：1. 制作索引；2. 与查询向量进行匹配。

python 复制代码

from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

import torch
device = torch.device("cuda")
model.to(device)

def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

# embedding = get_embeddings(comments_dataset["text"][0])
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

在FAISS匹配前，还需要把样本进行Tokenizer转化为高维向量。

在训练multi-qa-mpnet-base-dot-v1时，会把整句句子的综合语义 放在第一个token即[cls]中。

因此这里取[:,0]就是在取分词后，第一个 token的embedding vector。（[1, 20, 786]->[1, 786]）

另外，由于FAISS仅支持float32的向量对比，因此需要使用.detach().cpu().numpy()。

索引制作

python 复制代码

embeddings_dataset.add_faiss_index(column="embeddings")

使用add_faiss_index的方法可以为embedding列制作索引。

问题匹配

python 复制代码

question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape # 输入问题

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

import pandas as pd
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

使用get_nearest_examples方法可以对question_embedding输入向量进行相似查询并返回Top_k。

【合并title、body、comments为text列】

->【将text列进行tokenizer化形成embedding列】

->【使用add_fiass_index为embedding列构建索引】

->【使用get_nearest_examples方法对输入向量进行匹配】