Knowledge Graph: 6. Public Knowledge Graphs

6. Public Knowledge Graphs

6.1 Knowledge Graph Creation

6.1.1 CyC

Lesson learned no. 1: Trading efforts against accuracy

6.1.2 Freebase

Community based

Like Wikipedia, but more structured

Lesson learned no. 2: Trading formality against number of users

6.1.3 Wikidata

Goal: centralize data from Wikipedia languages

Collaborative editing

Imports other datasets

Includes rich provenance

Lesson learned no. 3: There is not one truth (but allowing for plurality adds complexity)

6.1.4 DBpedia & YAGO

Extraction from Wikipedia using mappings & heuristics

Wikipedia as a Knowledge Graph

Lesson learned no. 4: Heuristics help increase coverage (at the cost of accuracy)

6.1.5 NELL

Lesson learned no. 5: Quality cannot be maximized without human intervention

6.1.6 Summary of Trade-Offs

  • (Manual) effort vs. accuracy and completeness
  • User involvement (or usability) vs. degree of formality
  • Simplicity vs. support for plurality and provenance

6.2 Non-Public Knowledge Graphs

Many companies have their own private knowledge graphs.

6.3 Comparison of Knowledge Graphs

"KG1 contains less of X than KG2" can mean either of two things:

-- it actually contains fewer instances of X

-- it contains equally many or more instances, but they are not typed with X
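
The two readings above can be told apart by counting typed instances separately from instances that are merely present. The following is a minimal sketch, assuming each knowledge graph is given as a dictionary from entity to its set of types and that a reference set of known instances of X is available. The function name `class_coverage`, the reference set, and the toy data are hypothetical and serve only as illustration.

```python
def class_coverage(kg, cls, reference_instances):
    """Count how many reference instances of `cls` are typed, present, or untyped in `kg`.

    kg: dict mapping entity -> set of type names (assumed representation).
    reference_instances: set of entities known to be instances of `cls` (assumed gold list).
    """
    typed = {entity for entity, types in kg.items() if cls in types}
    present = reference_instances & set(kg)
    return {
        "typed": len(typed),
        "present": len(present),
        "untyped": len(present - typed),
    }


# Toy data: KG1 knows three cities but types only one of them.
kg1 = {"Berlin": {"City"}, "Paris": set(), "Rome": set()}
kg2 = {"Berlin": {"City"}, "Paris": {"City"}}
cities = {"Berlin", "Paris", "Rome"}

coverage_kg1 = class_coverage(kg1, "City", cities)
coverage_kg2 = class_coverage(kg2, "City", cities)
```

In this toy case KG1 has fewer entities typed as `City` than KG2, yet it contains more city instances overall, which is the second reading.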

6.4 Overlap of Knowledge Graphs

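Measure 1 assesses the overlap of two knowledge graphs relative to the smaller one. Here, "overlap" means the number of classes or entities that the two knowledge graphs share. The smaller knowledge graph may be a subset of the other, or a related graph of smaller size. Measuring this overlap shows the potential gain from integrating new knowledge into the smaller graph: if the two graphs overlap on some classes, integrating the information on those classes can make the smaller graph richer and more comprehensive.

Measure 2 assesses the overlap relative to the number of explicit links between the two knowledge graphs. Here, "explicit links" are relations that directly connect two entities or classes. Comparing the measured overlap with the number of explicit links indicates the quality of the explicit links and the importance of latent connections between the graphs. If there are many explicit links but the overlap is low, the quality of the links may need to be improved, or other ways of increasing the connectivity should be considered.

The link generation mechanism does not need to be highly accurate. Different measures and thresholds produce different link sets, but the reliability shown by the intraclass correlation coefficient makes this variety of generation mechanisms acceptable. In other words, to estimate the overlap between link sets reliably, link generation does not have to be very accurate; it has to be statistically reliable.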
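
The two measures can be sketched in a few lines. This is only an illustration under simplifying assumptions: entities of both graphs are assumed to be already mapped to shared identifiers (for example by an automatic link generation step), so the overlap is a plain set intersection, and explicit links are given as a set of identifier pairs. The function names and the toy data are hypothetical.

```python
def overlap_vs_smaller(entities_a, entities_b):
    """Measure 1 (sketch): shared entities relative to the size of the smaller KG."""
    shared = entities_a & entities_b
    smaller_size = min(len(entities_a), len(entities_b))
    return len(shared) / smaller_size if smaller_size else 0.0


def overlap_vs_explicit_links(entities_a, entities_b, explicit_links):
    """Measure 2 (sketch): shared entities relative to the number of explicit links."""
    shared = entities_a & entities_b
    if not explicit_links:
        return float("inf")  # overlap exists (or not) but nothing is linked explicitly
    return len(shared) / len(explicit_links)


# Toy data with entities already normalized to shared identifiers (assumption).
kg_a = {"Berlin", "Paris", "Rome", "Madrid"}
kg_b = {"Berlin", "Paris", "Rome", "Vienna", "Oslo", "Lima"}
explicit_links = {("Berlin", "Berlin")}  # only one entity pair is explicitly linked

measure_1 = overlap_vs_smaller(kg_a, kg_b)
measure_2 = overlap_vs_explicit_links(kg_a, kg_b, explicit_links)
```

A value of `measure_2` well above 1 means that far more entities overlap than are explicitly linked, which matches the finding that existing interlinks are insufficient.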

Summary findings:

DBpedia and YAGO cover roughly the same instances (not very surprising)

NELL is the most complementary to the others

Existing interlinks are insufficient for out-of-the-box parallel usage

6.5 Intermezzo: Knowledge Graph Creation Cost

A few metrics for evaluating KGs: size, degree, interlinking, quality, licensing, ...

Case: manual curation & automatic/heuristic creation

6.6 Entity Extraction from List Pages

Lists form (shallow) hierarchies

Idea: align with category graph

Not all entities on a list page belong to the same category

Idea: Learn a classifier to tell subject entities from non-subject entities

Distant learning (distant supervision) approach

  • Positive examples: Entities that are in the corresponding category
  • Negative examples: Entities that are in a category which is disjoint with it, e.g., Book <> Writer
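
The labeling step of this approach can be sketched as follows. This is a minimal illustration, not the original implementation: the function `label_list_entities`, the dictionary-based category lookup, and the toy data are assumptions made for the example. Entities that are neither in the target category nor in a disjoint one stay unlabeled and would not be used for training.

```python
def label_list_entities(list_entities, entity_categories, target_category, disjoint_categories):
    """Derive training labels for entities on a list page.

    Returns a dict entity -> 1 (positive) or 0 (negative); entities that
    cannot be labeled from the category information are omitted.
    """
    labels = {}
    for entity in list_entities:
        categories = entity_categories.get(entity, set())
        if target_category in categories:
            labels[entity] = 1  # entity is in the corresponding category
        elif categories & disjoint_categories:
            labels[entity] = 0  # entity is in a category disjoint with the target
    return labels


# Toy list page about writers (hypothetical data).
entity_categories = {
    "Jane Austen": {"Writer"},
    "Emma (novel)": {"Book"},
    "Bath": {"City"},
}
labels = label_list_entities(
    list_entities=["Jane Austen", "Emma (novel)", "Bath", "Unknown Person"],
    entity_categories=entity_categories,
    target_category="Writer",
    disjoint_categories={"Book"},
)
```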

YAGO uses categories for types but does not analyze them further.

6.7 Cat2Ax: Axiomatizing Wikipedia Categories

CaLiGraph example: entities co-occur in surface patterns; co-occurring entities share semantic patterns; existing entities co-occur with new entities

6.8 Beyond List Pages

Learning descriptive rules for listings, e.g.

topSection("Discography") → artist.{PageEntity}

Learning across pages to mitigate small data problems

Metrics

Support: no. of listings covered by rule antecedent

Confidence: frequency of rule consequent over all covered listings

Consistency: mean absolute deviation of overall confidence and listing confidence, i.e., does the rule work equally well across all covered listings?
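
The three metrics can be computed as in the sketch below. It assumes that each listing covered by the rule antecedent is given as a list of booleans, one per entity in the listing, stating whether the rule consequent holds for that entity, and that confidence is counted over entities. This representation, the function name `rule_metrics`, and the toy data are assumptions for illustration.

```python
def rule_metrics(covered_listings):
    """Compute support, confidence, and the deviation used for consistency.

    covered_listings: list of listings matched by the rule antecedent; each
    listing is a list of booleans (True = consequent holds for that entity).
    """
    listings = [listing for listing in covered_listings if listing]  # skip empty listings
    support = len(listings)
    if support == 0:
        return {"support": 0, "confidence": 0.0, "deviation": 0.0}

    total_entities = sum(len(listing) for listing in listings)
    total_hits = sum(sum(listing) for listing in listings)
    confidence = total_hits / total_entities

    # Mean absolute deviation between overall and per-listing confidence.
    listing_confidences = [sum(listing) / len(listing) for listing in listings]
    deviation = sum(abs(confidence - c) for c in listing_confidences) / support
    return {"support": support, "confidence": confidence, "deviation": deviation}


# Toy data: the rule fits the first listing fully and the second one only half.
metrics = rule_metrics([
    [True, True, True, True],
    [True, True, False, False],
])
```

A small deviation means the rule behaves similarly on every covered listing; a large one means the overall confidence hides listings where the rule works poorly.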

Entity Disambiguation

6.9 From DBpedia to DBkWik

One (but not the only!) possible source of coverage bias

Collecting Data from a Multitude of Wikis

Differences to DBpedia

DBpedia has manually created mappings to an ontology

Wikipedia has one page per subject and global infobox conventions (more or less)

Challenges

On-the-fly ontology creation

Instance matching

Schema matching

DBpedia is a general-purpose knowledge graph project

DBkWik, in contrast, is a knowledge graph project focused on specific domains

6.10 WebIsALOD

Background: Web table interpretation

Most approaches need typing information, but DBpedia etc. have too little coverage of the long tail. Wanted: an extensive type database

Main challenge

  1. Original dataset is quite noisy (<10% correct statements). Recap: coverage vs. accuracy
  2. Simple thresholding removes too much knowledge

Approach

  1. Train a RandomForest model for predicting correct vs. wrong statements
  2. Use all the provenance information we have
  3. Use the model to compute confidence scores

Ongoing research: Using transformers and a larger training set

Current challenges and works in progress

  1. Distinguishing instances and classes
    i.e., subclass-of vs. instance-of relations
  2. Splitting instances
    Bauhaus is a goth band
    Bauhaus is a German school
  3. Knowledge extraction from pre- and post-modifiers
    Bauhaus is a goth band → genre(Bauhaus, Goth)
    Bauhaus is a German school → location(Bauhaus, Germany)
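
The pre-modifier case from the two Bauhaus examples can be sketched with a simple pattern and two lookup tables. This is a toy illustration only: the tables `GENRE_MODIFIERS` and `LOCATION_MODIFIERS`, the regular expression, and the function name are hypothetical, and a real system would need much larger lexicons or a learned model. Post-modifiers are not handled here.

```python
import re

# Hypothetical lexicons mapping a pre-modifier to a relation value.
GENRE_MODIFIERS = {"goth": "Goth"}
LOCATION_MODIFIERS = {"german": "Germany"}

# Matches statements of the form "<subject> is a/an <modifiers> <head noun>".
ISA_PATTERN = re.compile(r"^(?P<subject>.+?) is an? (?P<phrase>.+)$")


def extract_modifier_facts(statement):
    """Return (relation, subject, value) triples derived from pre-modifiers."""
    match = ISA_PATTERN.match(statement.strip())
    if not match:
        return []
    subject = match.group("subject")
    # Everything before the last word is treated as a pre-modifier of the head noun.
    *modifiers, _head = match.group("phrase").split()
    facts = []
    for modifier in modifiers:
        key = modifier.lower()
        if key in GENRE_MODIFIERS:
            facts.append(("genre", subject, GENRE_MODIFIERS[key]))
        if key in LOCATION_MODIFIERS:
            facts.append(("location", subject, LOCATION_MODIFIERS[key]))
    return facts


band_facts = extract_modifier_facts("Bauhaus is a goth band")
school_facts = extract_modifier_facts("Bauhaus is a German school")
```

Note that both statements share the subject string "Bauhaus", so the extracted facts are only meaningful once the instance has been split into its two senses, as described in item 2.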