M3E/OpenAi+vearch内容查重实践 | 京东云技术团队

一、实践背景介绍

1、业务背景

京东健康内容中台H2有一个目标就是需要替换两家CP内容(总体内容体量百万级),我们现在的逻辑是想按照PV热度优先高热去新生产和替换。替换后可以极大的节省cp内容引入的成本。

第一步:这么多内容,我们的生产逻辑需要按照学科和索引归类和分配,进而批量生产,靠人工一篇篇补索引,效率会很低。希望借助算法的能力,如果现在还不是非常准确,也可以算法+人工修正,

第二步:按索引归类好之后,我们和库内非CP但主题相似内容进行比对,已经有的就不做重复生产。最后剩下来的进行批量生产和替换。

2、技术背景

M3E(M3E(Multimodal Multitask Meta-Embedding)是一个开源的中文嵌入模型

Vearch 是对大规模深度学习向量进行高性能相似搜索的弹性分布式系统。也是京东自研开源的项目,具有强大的相似搜索的弹性分布式能力。

OpenAI的迅速发展对算法成本产生了重大影响。随着技术的进步和研究的不断推进,OpenAI已经取得了许多突破,使得算法的开发和部署成本大大降低。OpenAI的Chat模式和Embedding模式是OpenAI API中的两种不同的使用方式。

1、Chat模式: Chat模式是OpenAI API的一种使用方式,旨在支持对话式的人机交互。在Chat模式下,您可以通过向API发送一系列的用户消息来与模型进行交互,模型将逐条回复每个消息。这种交互式的方式使得您可以与模型进行对话,提出问题、请求解释、寻求建议等。

ini 复制代码
import openai

response = openai.Completion.create(
  engine="davinci",
  prompt="What is the capital of France?",
  max_tokens=100,
  n=1,
  stop=None,
  temperature=0.7
)

print(response.choices[0].text.strip())

2、Embedding模式: Embedding模式是OpenAI API的另一种使用方式,旨在获取文本的嵌入表示。在Embedding模式下,您可以将一段文本传递给API,并获取该文本的高维向量表示,也称为嵌入向量。这些嵌入向量可以用于计算文本之间的相似度、聚类、分类等任务。

ini 复制代码
import openai

response = openai.Embed.create(
  model="text-embedding-ada-002",
  documents=["Once upon a time", "In a land far, far away"],
)
embedding1 = response.embeddings[0]
embedding2 = response.embeddings[1]

# 进行嵌入向量的相似度计算等其它操作

本次实践主要使用了Embedding,具体实践如下文。

二、实践流程

1、总体流程

(1)、总体流程图

(2)、OpenAi/M3E向量生成部分代码实践

python 复制代码
async def embed_and_store_with_limit_and_check(
        self, semaphore, id, vector_store,  text_future_func = None, text: Union[str, list[str]] = "", **additional_properties
    ):
        async with semaphore:
            retry_count = (
                3  # Task failed with exception Response payload is not completed
            )
            retry_count_doubled = False
            retry = 1
            last_error = None
            while retry <= retry_count:  # Retry up to 3 times.
                try:
                    try:
                        data = await vector_store.get(vector_id=id)
                        id = data.id
                        embedding = data.result.embedding.feature
                        return (id, embedding)
                    except VearchRouterGetNotFoundError:
                        try:
                            return await self.embed_and_store(
                                text=text,
                                id=id,
                                vector_store=vector_store,
                                text_future_func=text_future_func,
                                **additional_properties,
                            )
                        except asyncio.TimeoutError:
                            logger.error(
                                f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - Timeout during embed_and_store()"
                            )
                            raise
                except Exception as error:
                    error_message = f"{error}" or f"{error.__class__} {error.__doc__}"
                    logger.error(
                        f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - failed with exception {error_message}, retry {retry}"
                    )
                    if isinstance(error, VearchRouterStatusError):
                        if error.reason == "partition_not_leader":
                          logger.info(
                              f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - {error_message}, retry {retry} asyncio.sleep(10) doubled"
                          )
                          await asyncio.sleep(10)  # Response payload is not completed
                          if not retry_count_doubled:
                              retry_count = retry_count * 2
                              retry_count_doubled = True
                    if isinstance(error, aiohttp.client_exceptions.ClientPayloadError):
                        await asyncio.sleep(5)  # Response payload is not completed
                        if not retry_count_doubled:
                            retry_count = retry_count * 2
                            retry_count_doubled = True
                    else:
                        await asyncio.sleep(1)  # Wait for 1 second before retrying
                    retry = retry + 1
                    last_error = error

            raise VearchRouterClientRetryError(
                retry_count,
                f"embed_and_store_with_limit_and_check - id {id} #[{vector_store.space_name} {vector_store.db_name}] - completely failed with exception {last_error} - retried {retry_count} times",
                error=last_error,
            )

(3)、vearch向量存储及相似度搜索部分代码实

python 复制代码
async def score_similarity(
        self, vector_store, embedding=None, id=None, **search_properties
    ):
        """Find the most similar word and the similarity score for a given word in the document"""
        if not isinstance(embedding, list):
            try:
                results_with_scores = await vector_store.search_by_ids(ids=[id])
                # embedding = response.result.embedding.feature
                return results_with_scores.results[0].hits.hits
            except VearchRouterStatusError as error:
                raise error
                # if error.found == False:
                #   query_result = await embeddings.embed_query(word)

        results_with_scores = await vector_store.search(
            feature=embedding, **search_properties
        )

        return results_with_scores.hits.hits

2、OpenAi实现查重的局限性

(1)、成本

以目前100万数据量为例,如果使用目前OpenAi的开放接口实现,每篇内容由于token等限制进出一次需要0.007美元,100万篇内容需要7000美元才可以完成数据特征提取和向量生成,依照目前的内容体量和运用,这个成本还是高于预期,在成本方面没有比其他方案有优势。

(2)、效率

同样以100万数据为例,一篇内容特征提取和向量生成的时间由于国内各种限制,时间最快也在6-9s,即便是在并发以及多token的情况下,那100万内容执行完成最少也大于30天,这在实效性方面相比于其他方案也不占优势。

3、M3E模型引入

(1)、模型调研介绍

M3E(Moka Massive Mixed Embedding)是一个开源的中文嵌入模型,具有以下优势:

多模态支持:M3E模型能够同时处理多种模态的数据,如文本、图像、语音等。这种多模态的支持使得模型能够更好地处理复杂的现实场景,提供更全面的语义理解。

多任务学习:M3E模型支持同时学习多个任务,而不需要针对每个任务单独训练一个模型。通过共享模型的参数和特征表示,M3E能够将不同任务之间的知识相互传递和共享,提高学习效率和泛化能力。

元嵌入学习:M3E模型采用元学习的思想,通过在训练过程中模拟快速学习新任务的过程,使模型能够更好地适应新任务。这种元学习的能力使得M3E模型在面对新任务时能够从少量样本中快速学习并取得良好的性能。

中文语义理解:M3E模型专注于中文语义理解任务,具有针对中文语言特点的优化。这使得M3E模型在处理中文文本时能够更好地捕捉语义信息,提供更准确的嵌入表示。

开源和可定制性:M3E模型是开源的,可以根据具体需求进行定制和扩展。开放源代码使得用户可以自由地修改和优化模型,以适应不同的应用场景。

模型对比:

参数数量 维度 中文 英文 s2s s2p s2c 开源 兼容性 s2s Acc s2p ndcg@10
m3e-small 24M 512 0.5834 0.7262
m3e-base 110M 768 0.6157 0.8004
text2vec 110M 768 0.5755 0.6346
openai-ada-002 未知 1536 0.5956 0.7786

(2)、M3E选择的必要

a、实践过程中在不牺牲准确度的情况下向量维度长度短,节省存储空间和带宽,且在和vearch向量库结合使用的过程中发现768维度的向量生在查询和存储时表现的更优越。

b、模型非商业开源并且可以本地微调模型,有效结合业务场景进行

c、可以有针对性的根据数据规模和场景优化和分配资源,定时高效的达到业务预期效果目标。

d、兼容性,代表了模型在开源社区中各种项目被支持的程度,由于 m3e 和 text2vec 都可以直接通过 sentence-transformers 直接使用,所以和 openai 在社区的支持度上相当

e、使用场景主要是中文,少量英文的情况,建议使用 m3e 系列的模型,M3E 在大规模句对数据集上的训练,包含中文百科,金融,医疗,法律,新闻,学术等多个领域共计 2200W 句对样本,数据集详见 M3E 数据集

f、模型持续优化中,开发过程中可以持续提高数据质量,后续可期待更加优秀的模型。

(3)、运用

less 复制代码
pip3 install -i https://mirrors.jd.com/pypi/simple sentence-transformers==2.2.2

#### Download m3e-base
python3 -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('moka-ai/m3e-base'); print(model.encode(['Hello World!', '你好,世界!']))"

#### Save m3e-base to local path
python3 -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('moka-ai/m3e-base'); model.save('m3e-base-model/')"
python 复制代码
代码示例:
async def embed (self, text_or_documents):
      if isinstance(text_or_documents, list):
          documents = text_or_documents
      else:
          """Split the text_or_documents, embed the documents and insert the embedding into the online vector storage"""
          text_splitter = LocalTextSplitter().get_instance() # self._tokenizer = spacy.load(pipeline)
          documents = text_splitter.split_text(text_or_documents)

      
      embedding_return = await self._async_***_with_****(documents=documents)
      if len(embedding_return) > 1:
          # Compute the mean vector
           **************

          # Normalizing the mean vector
           *************
      
      return embedding

向量生成示例:
{"_index":"content_gpt_db","_type":"content_space_m3e","_id":867602,"found":true,"_source":{"content_type":1,"embedding":{"feature":[0.04050827,0.021327972,-0.0051002502,0.017009735,-0.016672134,-0.01061821,0.026807785,-0.018224716,-0.03107071,-0.0053977966,0.043376923,0.028705597,0.004207611,-0.020687103,-0.0447731,-0.009578705,0.05571747,0.06632233,-0.051948547,-0.013450623,-0.032985687,-0.008350372,-0.043361664,-0.02400589,-0.019294739,-0.023269653,0.005455017,0.0059661865,0.008682251,-0.023887634,0.046310425,-0.036338806,-0.0020313263,0.0062503815,0.05295372,0.026079178,0.011068344,-0.028791428,0.029096603,0.030740738,0.026367188,0.052009583,-0.009216309,-0.004173279,0.0009822845,0.018190384,0.033262253,0.05126381,0.012481689,0.005584717,-0.011810303,0.35385132,-0.043067932,0.0099105835,-0.014457703,0.038978577,0.022174835,-0.039844513,-0.012966156,-0.011081696,0.009370804,-0.024477005,-0.01061058,0.0028133392,-0.009471893,-0.027820587,-0.041484833,0.011547089,0.009700775,-0.05132675,0.06669235,-0.06849289,-0.0129470825,0.004447937,0.074913025,0.008506775,-0.033031464,0.017101288,0.045627594,-0.009830475,0.02917099,0.030750275,-0.017490387,-0.016429901,-0.042669296,-0.014154434,0.0004749298,0.049741745,0.07151413,-0.012218475,-0.013538361,-0.016918182,0.016963959,-0.015842438,-0.03572464,-0.034015656,0.046806335,-0.001625061,-0.006690979,0.040275574,-0.035312653,0.008182526,-0.024295807,-0.047908783,0.023643494,0.054634094,-0.07056427,0.04160309,-0.014863968,0.00399971,0.025701523,-0.0082912445,-0.022632599,0.0016212463,-0.059513092,-0.022808075,-0.008533478,-0.052440643,0.037700653,-0.045360565,0.0012359619,0.06803894,-0.04005432,-0.02885437,-0.032421112,0.010250092,-0.0092430115,0.055828094,-0.05140686,-0.0019073486,0.012435913,-0.04206848,-0.08063507,-0.016105652,-0.00031280518,-0.005180359,0.002243042,-0.009155273,-0.044174194,-0.007598877,0.015665054,0.015577316,0.006883621,-0.031778336,-0.017795563,0.016918182,0.019405365,0.0077323914,-0.012916565,-0.007698059,0.031211853,-0.048286438,0.017166138,0.0033416748,-0.02381897,0.03614807,-0.014591217,0.06523514,-0.04491043,-0.05462265,0.029396057,0.03844452,0.011238098,-0.051124573,-0.024749756,0.0068511963,0.0137786865,-0.033081055,-0.0028033257,0.0011496544,-0.012090206,-0.013271809,-0.018554688,-0.019104004,-0.004699707,-0.11206055,0.007501602,0.0144023895,-0.019788742,0.028829575,-0.03552246,0.028182983,-0.027923584,0.014785767,-0.032590866,-0.0011997223,0.003458023,0.036985397,-0.012435913,-0.040542603,-0.034469604,-0.0028839111,-0.014625549,0.014442444,0.06880951,0.01688385,-0.044792175,-0.014442444,-0.01712799,0.024909973,0.036842346,-0.015365601,0.032600403,-0.023117065,-0.017802238,-0.011162758,0.021027565,-0.0071382523,0.0023880005,0.016410828,-0.07878876,-0.033210754,0.029317856,0.037729263,-0.013490677,0.01420784,-0.076553345,0.03074646,0.020904541,-0.016113281,-0.008716583,-0.058559418,-0.03612137,-0.029781342,-0.03557396,-0.026613235,-0.0034923553,0.033971786,0.01530838,0.019039154,0.05249405,-0.06877518,-0.05325699,-0.054332733,0.022380829,0.0017127991,-0.00060653687,0.003200531,-0.05033493,0.031169891,-0.027420044,0.07209778,0.03919983,0.023788452,-0.03340912,0.038368225,-0.011619568,-0.049583435,0.023187637,-0.031404495,0.001543045,0.011007309,0.03263092,0.0027999878,-0.029151917,-0.03868866,-0.01224041,-0.006829262,-0.014925957,-0.008881569,-0.0025873184,0.012497902,0.018328667,0.0066041946,-0.03035736,-0.0110321045,-0.03830719,-0.026245117,-0.03142929,-0.007991791,0.019321442,-0.021755219,-0.008829117,-0.050519943,0.010892868,-0.015569687,0.0134391785,0.02917862,0.00075912476,-0.09794235,0.011421204,0.04624176,0.066841125,-0.0044174194,-0.019325256,-0.0010528564,-0.03643036,-0.025726318,-0.014377594,-0.024211884,-0.03343582,0.020572662,0.027690887,0.0475502,0.03835678,-0.043956757,-0.00034713745,0.048107147,0.025608063,-0.014255524,0.028633118,-0.07511139,-0.048667908,0.0210495,0.06496048,0.013729095,-0.0051841736,0.016643524,-0.022533417,0.0012626648,0.034671783,-0.029605865,-0.011131287,0.0044937134,-0.065330505,0.019874573,-0.05259323,0.00045394897,-0.008098602,0.01354599,0.05250168,0.07034683,-0.0058631897,0.07423782,0.011419296,-0.037618637,0.01867485,0.000062942505,0.004085541,0.038211823,0.019878387,-0.0754509,0.0065402985,0.0045223236,0.030115128,0.0017757416,-0.014886856,-0.011007309,0.026533127,0.033769608,-0.051013947,0.035007477,0.05788803,-0.049877167,-0.037107468,0.0016613007,0.015481949,-0.02353859,-0.039718628,-0.04598999,-0.044052124,0.010528564,-0.028961182,-0.016166687,0.0015945435,-0.013336182,0.032533646,0.018568039,0.03763771,0.025045395,-0.052635193,-0.051948547,-0.062217712,0.08403778,0.0012397766,-0.0012321472,0.056552887,-0.027065277,0.04188156,-0.03208542,0.06875229,0.0647316,-0.013954163,-0.022972107,0.11660004,0.032203674,-0.031936646,0.0020599365,-0.020370483,-0.06651306,0.0062942505,-0.049430847,0.04660797,0.020118713,-0.031578064,-0.005180359,-0.053260803,-0.027565002,-0.031951904,-0.041366577,-0.0025939941,-0.008529663,0.012207031,-0.06890869,0.01940918,0.039123535,-0.008434296,0.033107758,0.0352211,0.020793915,0.0071353912,-0.028520584,-0.030920029,-0.008180618,0.070114136,-0.014175415,-0.0012359619,0.000045776367,0.08629227,-0.051700592,-0.07754135,-0.016498566,-0.015331268,-0.044864655,-0.04217148,-0.005420685,-0.008460999,-0.038154602,0.05747223,0.020240784,0.007413864,0.009027481,0.026922226,-0.018918991,0.012096405,0.04254532,-0.05728531,-0.010662079,0.02876091,-0.019536972,0.01614952,-0.0005931854,0.044952393,-0.00390625,0.02508545,0.03439331,0.008852005,0.022172928,-0.00008201599,-0.0032863617,-0.05140686,0.005859375,0.053024292,0.025146484,-0.019942284,-0.011334419,0.01258564,0.015990257,-0.02166748,0.036453247,0.039978027,-0.033798218,0.00076675415,-0.005138397,0.004749298,0.029026031,0.0323925,-0.025564194,0.025335312,-0.030546188,-0.04391861,0.018421173,-0.011249542,0.04883194,0.01543808,0.02312851,-0.032764435,-0.026203156,0.019647598,0.018751144,-0.009168625,0.048986435,0.015720367,0.021831512,-0.03219223,-0.026844025,0.0060043335,-0.026107788,-0.046318054,-0.04046631,0.035526276,0.0024375916,-0.05537033,-0.02425003,-0.04340744,-0.0066947937,0.0019111633,-0.019908905,0.0008430481,-0.038669586,-0.034023285,-0.0014533997,0.00793457,-0.045150757,-0.03302002,-0.020614624,-0.005558014,0.069065094,-0.039173126,-0.00825119,0.03167534,0.018571854,-0.006723404,0.015237808,-0.021053314,-0.016643524,-0.02035141,0.009143829,0.00017166138,0.04996872,0.08148575,-0.008792877,0.018224716,0.01874733,0.008649826,-0.026594162,-0.032094955,0.039243698,0.03283882,0.027730942,0.030176163,-0.04026985,0.015901566,0.033468246,0.013085365,-0.0065927505,0.011677742,-0.013127327,-0.02519226,0.04988098,-0.013015747,0.015609741,0.014896393,0.023586273,0.016117096,0.040584564,0.01984787,0.004398346,-0.0089530945,-0.03900528,-0.0024147034,0.037326813,-0.008106232,-0.052898407,-0.0038452148,-0.05821228,-0.02015686,-0.001739502,-0.013622284,-0.017688751,-0.05283737,0.020702362,-0.050605774,0.027381897,0.0316391,0.0024490356,-0.055805206,-0.056484222,0.023387909,-0.02993393,0.019495964,-0.012732506,-0.008210182,0.01850605,-0.04762268,0.081466675,0.005874634,-0.010238647,0.019134521,-0.004508972,-0.012359619,0.025794983,0.04028511,0.025411606,-0.03328514,0.0031719208,-0.01725769,-0.051498413,-0.035949707,0.010955811,0.008583069,0.06630707,-0.005821228,-0.0024795532,0.03709793,0.013637543,0.022525787,-0.06563187,0.053359985,0.0039367676,-0.060836792,0.04824829,0.027780533,0.03645134,0.013780594,0.02977562,0.017705917,-0.00057029724,-0.034914017,-0.019468307,-0.026908875,0.067222595,0.05558014,-0.021064758,0.031835556,-0.04665947,0.051054,-0.00028038025,0.029193878,0.003993988,-0.07110214,0.06306076,0.014007568,-0.01714325,0.035003662,-0.004722595,0.014993668,0.03897667,-0.023054123,-0.006303787,-0.017751694,0.002111435,-0.008413315,0.017080307,-0.06581879,-0.008491516,0.12903595,-0.006996155,0.05880356,-0.02943039,0.020183563,-0.018550873,0.06975937,0.03355789,0.03824997,0.04037857,-0.046398163,0.006954193,-0.029689789,0.029582977,0.07313156,-0.005428314,-0.045841217,-0.025279999,0.0048294067,0.013130188,0.059028625,0.022529602,0.031074524,-0.011817932,-0.0047683716,-0.014060974,0.031232834,-0.0031795502,-0.018915176,-0.015424728,0.04899597,-0.0131073,-0.023361206,-0.046707153,-0.012523651,-0.0008125305,0.08478165,-0.062747955,-0.026260376,-0.060684204,0.011657715,0.013763428,-0.009056091,0.05002594,-0.004814148,0.0046463013,-0.0072250366,-0.015556335,-0.037773132,0.0308609,0.012107849,0.032539368,0.03591156,-0.0512619,-0.048412323,-0.012073517,-0.005519867,-0.072574615,-0.041452408,-0.040891647,-0.017946243,0.019388199,0.018611908,0.028507233,0.041683197,0.019443512,-0.019191742,0.035518646,-0.017742157,0.07847214,-0.040740967,0.031051636,-0.035736084,0.010360718,0.03430748,0.008317947,0.044736862,-0.0071315765,-0.01648426,-0.008883476,-0.020913124,-0.005423546,-0.009973526,-0.02460289,-0.044252396,-0.032361984,0.054714203,0.00091934204,0.059459686,0.0034065247,0.06443405,-0.027736664,0.003993988,0.036701202,-0.035736084,0.018554688,0.029144287,-0.019836426,0.069698334,0.021060944,0.012462616,0.023517609,0.0021858215,0.02639389,0.031742096,-0.033161163,-0.034664154,-0.084918976,0.027759552,0.030056,0.00016021729,0.008415222,-0.02822113,0.084098816,-0.034959793,-0.024831772,0.020299911,-0.029752731,-0.044506073,0.004787445,0.017642975,0.01127243,0.055496216,0.01977539,-0.038375854,0.013122559,0.035747528,-0.003780365,-0.0005226135,-0.016674042,-0.045539856,-0.039131165,-0.024177551,0.0366745,-0.049545288,0.010528564,0.033737183,-0.04852295,-0.03115654,-0.049951553,-0.017721176,-0.00032234192],"source":""}}}

4、vearch数据库向量存储

(1)、vearch详细介绍

Vearch 是对大规模深度学习向量进行高性能相似搜索的弹性分布式系统。具有以下功能:

1、支持CPU与GPU两种版本。

2、支持实时添加数据到索引。

3、支持单个文档定义多个向量字段, 添加、搜索批量操作。

4、支持数值字段范围过滤与string字段标签过滤。

5、支持IVFPQ、HNSW、二进制等索引方式(HNSW、二进制方式4月下旬发布)。

6、支持Python SDK本地快速开发验证。

7、支持机器学习算法插件方便系统部署使用。

Vearch京东自研开源的项目,具有强大的相似搜索的弹性分布式能力。

(2)、向量存储

python 复制代码
vearch_instance = VearchInstance(vearch_llm_instance=vearch_llm_instance)

import random
async def embed_content (      
      content_generator,
      concurrent_task_limit,
      vearch_instance,
      pbar
      ):
    semaphore = asyncio.Semaphore(concurrent_task_limit)

@handle_error_and_log
    @handle_client_response_type_check
    async def insert(
        self, db_name, space_name, vector_id, **vector_properties
    ) -> VearchRouterOperationResponse:
        if "feature" in vector_properties:
            properties = {**vector_properties}
            del properties["feature"]
            return await self.router.insert(
                db_name,
                space_name,
                vector_id,
                embedding={  # NOTE/FUTURE hard coded
                    "feature": vector_properties["feature"]
                },
                **properties,
            )
        return await self.router.insert(
            db_name, space_name, vector_id, **vector_properties
        )

(3)、相似度查询

bash 复制代码
查询语句:
http://jdh-content-gpt-vector-router.vectorbase.svc.ht09.n.jd.local/content_gpt_db/content_space_m3e/_search
{
    "query":{
        "ids":[
            580670
        ],
        "sum":[
            {
                "field":"embedding",
                "feature":[

                ]
            }
        ]
    },
    "retrieval_param":{
        "parallel_on_queries":1,
        "recall_num":100,
        "nprobe":80,
        "metric_type":"InnerProduct"
    },
    "is_brute_search":0,
    "online_log_level":"debug",
    "quick":false,
    "vector_value":false,
    "client_type":"leader",
    "l2_sqrt":true,
    "size":10
}

三、查重结果及M3E、OpenAi查重相似度效果比较

1、查重相似度验证结果展示

python 复制代码
import asyncio
import aiofiles
import os
import openpyxl
import json
import sys
import re

# from langchain.document_loaders import TextLoader
# from langchain.schema import Document
import numpy as np
import aiohttp
import logging
import asyncio

# Get the directory containing the current file
current_dir = os.path.dirname(os.path.abspath(__file__))

# Get the parent directory (project root directory)
project_root_dir = os.path.dirname(current_dir)

# Add it to sys.path
sys.path.append(project_root_dir)

from shared.VearchInstance import VearchInstance

logger = logging.getLogger(__name__)

async def async_os_walk(root_dir):
    """A simple, async version of os.walk."""
    for root, dirs, files in os.walk(root_dir):
        for filename in files:
            yield root, filename


"""Main execution function"""
from shared.TerminalColor import bcolors

async def main():
    # from shared.VearchOpenAI import VearchOpenAI
    from shared.VearchM3e import VearchM3e
    vearch_instance = VearchInstance(VearchM3e)
    content_vector_store = vearch_instance.content_vector_store

    root_logger = logging.getLogger("")
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    force_recreate_spaces = True
    await vearch_instance.client.prewarm()
    if force_recreate_spaces:
      await vearch_instance.ensure_empty()
    else:
      await vearch_instance.ensure()

    # Limit to 10 concurrent tasks.
    concurrent_task_limit = 16
    semaphore = asyncio.Semaphore(concurrent_task_limit)

    vearch_instance.log_configurations(
        "===Concurrency===",
        f"max concurrent requests: {concurrent_task_limit}",
        f"semaphore: {semaphore}",
        root_logger=root_logger,
    )

    # Define paths to the index and content data, and the label file
    new_content_to_analyze_dir = os.path.join(current_dir, "./data/content/")

    # Load the label data from the Excel file
    wb = openpyxl.load_workbook(os.path.join(current_dir, "./data/content.xlsx"))
    sheet = wb.active

    content_id_dict = {}

    async def process_row(row):
        preowned_article_body = row[0].value
        preowned_article_id = vearch_instance.generate_id(preowned_article_body)

        root_logger.info(
            "preowned_article_body: {}, preowned_article_id: {}".format(
                preowned_article_body[:20], preowned_article_id
            )
        )

        content_id_dict[preowned_article_id] = preowned_article_body[:50]

        return await vearch_instance.llm.embed_and_store_with_limit_and_check(
            semaphore=semaphore,
            text=preowned_article_body,
            id=preowned_article_id,
            vector_store=content_vector_store,
            content_type=vearch_instance.content_type_look_up("preowned_article"),
        )

    await asyncio.gather(*[process_row(row) for row in sheet.iter_rows()])

    # Asynchronously walk through every file in the root directory
    async for dirpath, filename in async_os_walk(new_content_to_analyze_dir):
        # Asynchronously build the search index for the document with filename in the dirpath
        content_file_path = os.path.join(dirpath, filename)
        match = re.search(r"(\d+)", filename)
        if match:
            content_id = int(match.group(1))
        else:
            content_id = vearch_instance.generate_id(content_file_path)

        root_logger.info("filename: {}, content_id: {}".format(filename, content_id))
        
        text = await vearch_instance.llm.load_file(file_path=content_file_path)
        embedding = await vearch_instance.llm.embed(text)

        # Asynchronously get the most similar texts and their similarity score for the label
        search_result = await vearch_instance.llm.score_similarity(
            embedding=embedding, vector_store=content_vector_store, min_score=-0.1
        )

        sorted_search_result = sorted(
            search_result, key=lambda hit: hit.score, reverse=True
        )

        for preowned_article in sorted_search_result:
            if preowned_article.id in content_id_dict:
              text = content_id_dict[preowned_article.id]
              root_logger.info(
                  f"{filename}: {bcolors.OKBLUE} score {preowned_article.score}{bcolors.ENDC}: ({bcolors.UNDERLINE}{text[0:100]}{bcolors.ENDC})"
              )

from shared.AsyncThread import start_asyncio_in_new_thread

# Running the main function using asyncio
if __name__ == "__main__":
    async_thread = start_asyncio_in_new_thread()
    async_thread.run(main())

2、M3E、OpenAi查重相似度效果比较

利用M3E和OpenAi不同模型提取的特征生成向量后计算的相似度基本上一致,且M3E提取的特征对中文的支持更好,更细化,导致最终计算分值以后也更加直观,能够快速的验证定位出相似度界限,对于内容查重业务更加友好,且在成本和效率上更具有优势。

四、总结

经过实践,本次处理47万篇内容,经过多轮优化,最终达到向量生成、验证及插入在使用规格配置32c50g的机器同时启用三个线程派发任务,32个进程共享内存的情况下,可在5小时内完成的。相似度搜索及存储到mysql可在20分钟内完成30万数据的处理。

OpenAI在算法研究方面的创新推动了成本的降低。通过引入更高效的算法和模型架构,OpenAI能够在相同的计算资源下取得更好的性能。这意味着开发者可以更快地训练和部署模型,减少了算法开发的时间和成本。但介于目前的技术环境及规则限制,选择一些开源的像M3E之类的模型才是更贴近我们目前的业务需求和日常使用。

利用M3E模型提取的特征对中文的支持也挺好,也更加细化,尤其除了基本的服务器和开发成本外在不需要额外的支出,效率也可以通过并发和增加资源的手段优化,成本和效率方面具有明显优势。768纬度的向量和vearch结合的也更优越。

作者:京东健康 刘继帅

来源:京东云开发者社区 转载请注明来源

相关推荐
时光追逐者8 小时前
一款免费、简单、高效的在线数据库设计工具
数据库·mysql·oracle·sql server
another heaven8 小时前
【软考 2026 最新版 NoSQL 数据库全分类】
数据库·nosql
满天星83035778 小时前
【MySQL】表的操作
linux·服务器·数据库·mysql
yashuk8 小时前
Ubuntu 系统下安装 Nginx
数据库·nginx·ubuntu
F1FJJ8 小时前
VS Code 里管理 PostgreSQL,有哪些选择?主流扩展横向对比
网络·数据库·postgresql·容器
Bdygsl8 小时前
MySQL(8)—— 事务
数据库·mysql
IvorySQL8 小时前
直播回顾| PostgreSQL 18.3 x IvorySQL 5.3:开启 AI 数据库新纪元
数据库·postgresql·开源
编程之升级打怪8 小时前
数据库的实时同步和异步同步
数据库
AI成长日志8 小时前
【GitHub开源项目专栏】强化学习开源框架解析——Ray RLlib vs Stable Baselines3设计哲学对比
开源·github
captain3768 小时前
MySQL增删改查
数据库·mysql