Knowledge Graph知识图谱—6. Public Knowledge Graphs

6. Public Knowledge Graphs

6.1 Knowledge Graph Creation

6.1.1 CyC

Lesson learned no. 1: Trading efforts against accuracy

6.1.2 Freebase

Community based

Like Wikipedia, but more structured

Lesson learned no. 2: Trading formality against number of users

6.1.3 Wikidata

Goal: centralize data from Wikipedia languages

Collaborative editing

Imports other datasets

Includes rich provenance

Lesson learned no. 3: There is not one truth (but allowing for plurality adds complexity)

6.1.4 DBpedia & YAGO

Extraction from Wikipedia using mappings & heuristics

Wikipedia as a Knowledge Graph

Lesson learned no. 4: Heuristics help increasing coverage (at the cost of accuracy)

6.1.5 NELL

Lesson learned no. 5: Quality cannot be maximized without human intervention

6.1.6 Summary of Trade-Offs

  • (Manual) effort vs. accuracy and completeness
  • User involvement (or usability) vs. degree of formality
  • Simplicity vs. support for plurality and provenance

6.2 Non-Public Knowledge Graphs

Many companies have their own private knowledge graphs.

6.3 Comparison of Knowledge Graphs

"KG1 contains less of X than KG2" can mean

-- it actually contains less instances of X

-- it contains equally many or more instances,

but they are not typed with X

6.4 Overlap of Knowledge Graphs

度量 1 的目标是评估知识图谱的重叠度,相对于一个较小的知识图谱。在这里,"重叠度"指的是两个知识图谱之间共享的类别或实体的数量。较小的知识图谱可能是一个子集或者一个相关但规模较小的知识图谱。通过测量两者之间的重叠,研究人员可以了解将新知识集成到较小知识图谱中可能会带来的潜在收益。如果两个知识图谱在一些类别上有重叠,那么将这些类别的信息整合到较小知识图谱中可能会增加其丰富性和全面性。

度量 2 的目标是评估知识图谱的重叠度,相对于两个知识图谱之间显式链接的数量。在这里,"显式链接"指的是直接连接两个实体或类别的关系。通过测量两个知识图谱之间的重叠度,并将其与显式链接的数量进行比较,研究人员可以了解到显式链接的质量和知识图谱之间潜在关联的重要性。如果两个知识图谱之间有很多显式链接,但重叠度较低,那么可能需要改进链接的质量或者考虑其他方式来提高关联性。

链接生成机制并不需要过于准确。即使不同的度量和阈值导致了不同的链接集,但通过类内相关系数的可靠性,可以接受生成机制的多样性。这意味着,为了可靠地估计链接集之间的重叠度,链接生成并不一定需要非常准确,而是需要在统计上可靠。

Summary findings:

DBpedia and YAGO cover roughly the same instances (not much surprising)

NELL is the most complementary to the others

Existing interlinks are insufficient for out-of-the-box parallel usage

6.5 Intermezzo: Knowledge Graph Creation Cost

A few metrics for evaluating KGs: size, degree, interlinking, quality, licensing, ...

Case: manual curation & automatic/heuristic creation

6.6 Entity Extraction from List Pages

Lists form (shallow) hierarchies

Idea: align with category graph

Not all entities on a list page belong to the same category

Idea: Learn classifier to tell subject entities from non-subject entities
Distant learning approach

  • Positive examples: Entities that are in the
    corresponding category
    Negative examples: Entities that are in a category which is disjoint, e.g., Book <> Writer

YAGO uses categories for types but does not analyze them further.

6.7 Cat2Ax: Axiomatizing Wikipedia Categories

CaLiGraph Example: entities co-occur in surface patterns; co-occuring entities share semantic patterns; existing entities co-occur with new entities

6.8 Beyond List Pages

Learning descriptive rules for listings, e.g.

topSection("Discography") → artist.{PageEntity}

Learning across pages to mitigate small data problems
Metrics

Support: no. of listings covered by rule antecedent

Confidence: frequency of rule consequent over all covered listings

Consistency: mean absolute deviation of overall confidence and listing confidence. i.e., does the rule work equally well across all covered listings

Entity Disambiguation

6.9 From DBpedia to DBkWik

One (but not the only!) possible source of coverage bias

Collecting Data from a Multitude of Wikis

Differences to DBpedia

DBpedia has manually created mappings to an ontology

Wikipedia has one page per subject and global infobox conventions (more or less)
Challenges

On-the-fly ontology creation

Instance matching

Schema matching

DBpedia 是一个通用性较强的知识图谱项目

DBkWik 则是专注于特定领域的知识图谱项目

6.10 WebIsALOD

Background: Web table interpretation

Most approaches need typing information, DBpedia etc. have too little coverage on the long tail. Wanted: extensive type database

Main challenge

  1. Original dataset is quite noisy (<10% correct statements) Recap: coverage vs. accuracy
  2. Simple thresholding removes too much knowledge

Approach

  1. Train RandomForest model for predicting correct vs. wrong statements
  2. Using all the provenance information we have
  3. Use model to compute confidence scores

Current ongoing research: Using transformers and a larger training set

Current challenges and works in progress

  1. Distinguishing instances and classes
    i.e.: subclass vs. instance of relations
  2. Splitting instances
    Bauhaus is a goth band
    Bauhaus is a German school
  3. Knowledge extraction from pre and post modifiers
    Bauhaus is a goth band → genre(Bauhaus, Goth)
    Bauhaus is a German school → location(Bauhaus, Germany)
相关推荐
xiandong201 小时前
240929-CGAN条件生成对抗网络
图像处理·人工智能·深度学习·神经网络·生成对抗网络·计算机视觉
innutritious2 小时前
车辆重识别(2020NIPS去噪扩散概率模型)论文阅读2024/9/27
人工智能·深度学习·计算机视觉
橙子小哥的代码世界2 小时前
【深度学习】05-RNN循环神经网络-02- RNN循环神经网络的发展历史与演化趋势/LSTM/GRU/Transformer
人工智能·pytorch·rnn·深度学习·神经网络·lstm·transformer
985小水博一枚呀4 小时前
【深度学习基础模型】神经图灵机(Neural Turing Machines, NTM)详细理解并附实现代码。
人工智能·python·rnn·深度学习·lstm·ntm
SEU-WYL5 小时前
基于深度学习的任务序列中的快速适应
人工智能·深度学习
OCR_wintone4215 小时前
中安未来 OCR—— 开启高效驾驶证识别新时代
人工智能·汽车·ocr
matlabgoodboy5 小时前
“图像识别技术:重塑生活与工作的未来”
大数据·人工智能·生活
最近好楠啊6 小时前
Pytorch实现RNN实验
人工智能·pytorch·rnn
OCR_wintone4216 小时前
中安未来 OCR—— 开启文字识别新时代
人工智能·深度学习·ocr
学步_技术6 小时前
自动驾驶系列—全面解析自动驾驶线控制动技术:智能驾驶的关键执行器
人工智能·机器学习·自动驾驶·线控系统·制动系统