Knowledge Graph: 6. Public Knowledge Graphs

6. Public Knowledge Graphs

6.1 Knowledge Graph Creation

6.1.1 Cyc

Lesson learned no. 1: Trading effort against accuracy

6.1.2 Freebase

Community based

Like Wikipedia, but more structured

Lesson learned no. 2: Trading formality against number of users

6.1.3 Wikidata

Goal: centralize data from Wikipedia languages

Collaborative editing

Imports other datasets

Includes rich provenance

Lesson learned no. 3: There is not one truth (but allowing for plurality adds complexity)

6.1.4 DBpedia & YAGO

Extraction from Wikipedia using mappings & heuristics

Wikipedia as a Knowledge Graph

Lesson learned no. 4: Heuristics help increase coverage (at the cost of accuracy)

6.1.5 NELL

Lesson learned no. 5: Quality cannot be maximized without human intervention

6.1.6 Summary of Trade-Offs

  • (Manual) effort vs. accuracy and completeness
  • User involvement (or usability) vs. degree of formality
  • Simplicity vs. support for plurality and provenance

6.2 Non-Public Knowledge Graphs

Many companies have their own private knowledge graphs.

6.3 Comparison of Knowledge Graphs

"KG1 contains fewer X than KG2" can mean either

-- it actually contains fewer instances of X, or

-- it contains equally many or more instances, but they are not typed with X

6.4 Overlap of Knowledge Graphs

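Measure 1 evaluates the overlap of two knowledge graphs relative to the smaller knowledge graph. Here, "overlap" means the number of classes or entities that the two knowledge graphs share. The smaller knowledge graph may be a subset of the other, or a related graph of smaller size. Measuring the overlap shows the potential gain of integrating new knowledge into the smaller graph: if the two graphs overlap on some classes, integrating the information about those classes into the smaller graph may make it richer and more comprehensive.

Measure 2 evaluates the overlap relative to the number of explicit links between the two knowledge graphs. Here, an "explicit link" is a relation that directly connects two entities or classes. Comparing the measured overlap with the number of explicit links tells researchers about the quality of the explicit links and the importance of potential associations between the graphs. If two knowledge graphs have many explicit links but low overlap, the link quality may need to be improved, or other ways of increasing relatedness may need to be considered.

The link generation mechanism does not need to be highly accurate. Different measures and thresholds produce different link sets, but the reliability indicated by the intraclass correlation coefficient makes this diversity of generation mechanisms acceptable. In other words, to estimate the overlap reliably, link generation need not be very accurate; it needs to be statistically reliable.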
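The two measures can be illustrated with a minimal sketch. It assumes a deliberately simple link generation heuristic (exact match on normalized labels) and toy data; the function names, URIs, and labels are hypothetical and not taken from the original study.

```python
def normalize(label: str) -> str:
    """Lowercase a label and collapse whitespace (assumed normalization)."""
    return " ".join(label.lower().split())


def match_by_label(kg_a: dict, kg_b: dict) -> set:
    """Heuristic link generation: pair instances whose normalized labels are equal."""
    index = {}
    for uri, label in kg_b.items():
        index.setdefault(normalize(label), []).append(uri)
    return {
        (uri_a, uri_b)
        for uri_a, label in kg_a.items()
        for uri_b in index.get(normalize(label), [])
    }


def overlap_vs_smaller(matches: set, kg_a: dict, kg_b: dict) -> float:
    """Measure 1: share of the smaller graph's instances that have a match."""
    position = 0 if len(kg_a) <= len(kg_b) else 1
    matched = {pair[position] for pair in matches}
    return len(matched) / min(len(kg_a), len(kg_b))


def overlap_vs_links(matches: set, explicit_links: set) -> float:
    """Measure 2: estimated overlap divided by the number of explicit links."""
    return len(matches) / len(explicit_links)


# Toy graphs: URI -> label (hypothetical data)
kg_a = {"a:Berlin": "Berlin", "a:Paris": "Paris", "a:Rome": "Rome", "a:Oslo": "Oslo"}
kg_b = {"b:berlin": "berlin", "b:paris": " Paris ", "b:Madrid": "Madrid"}
explicit_links = {("a:Berlin", "b:berlin")}

matches = match_by_label(kg_a, kg_b)
measure_1 = overlap_vs_smaller(matches, kg_a, kg_b)
measure_2 = overlap_vs_links(matches, explicit_links)
```

In this toy setting, two of the three instances in the smaller graph are matched, and the estimated overlap is twice the number of explicit links, which would suggest that the existing interlinks are incomplete.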

Summary findings:

DBpedia and YAGO cover roughly the same instances (not very surprising)

NELL is the most complementary to the others

Existing interlinks are insufficient for out-of-the-box parallel usage

6.5 Intermezzo: Knowledge Graph Creation Cost

A few metrics for evaluating KGs: size, degree, interlinking, quality, licensing, ...

Case: manual curation & automatic/heuristic creation

6.6 Entity Extraction from List Pages

Lists form (shallow) hierarchies

Idea: align with category graph

Not all entities on a list page belong to the same category

Idea: Learn a classifier to tell subject entities from non-subject entities

Distant supervision approach

  • Positive examples: Entities that are in the corresponding category
  • Negative examples: Entities that are in a category which is disjoint, e.g., Book <> Writer
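The labeling scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the data structures, and the example entities are hypothetical, and entities with neither kind of evidence are simply left unlabeled.

```python
def label_entities(list_entities, entity_categories, target_category, disjoint_with):
    """Split the entities of a list page into positive, negative, and unlabeled examples.

    positives: entities already in the category aligned with the list page
    negatives: entities in a category declared disjoint with that category
    unlabeled: everything else (no training signal)
    """
    disjoint = disjoint_with.get(target_category, set())
    positives, negatives, unlabeled = [], [], []
    for entity in list_entities:
        categories = entity_categories.get(entity, set())
        if target_category in categories:
            positives.append(entity)
        elif categories & disjoint:
            negatives.append(entity)
        else:
            unlabeled.append(entity)
    return positives, negatives, unlabeled


# Hypothetical list page about writers that also mentions a book
list_entities = ["Jane Austen", "Emma (novel)", "Franz Kafka", "Unknown Author"]
entity_categories = {
    "Jane Austen": {"Writer"},
    "Emma (novel)": {"Book"},
    "Franz Kafka": {"Writer"},
}
disjoint_with = {"Writer": {"Book"}}

positives, negatives, unlabeled = label_entities(
    list_entities, entity_categories, "Writer", disjoint_with
)
```

The positive and negative sets would then serve as training data for the classifier, which is applied to the unlabeled entities.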

YAGO uses categories for types but does not analyze them further.

6.7 Cat2Ax: Axiomatizing Wikipedia Categories

CaLiGraph Example: entities co-occur in surface patterns; co-occurring entities share semantic patterns; existing entities co-occur with new entities

6.8 Beyond List Pages

Learning descriptive rules for listings, e.g.

topSection("Discography") → artist.{PageEntity}

Learning across pages to mitigate small data problems

Metrics

Support: no. of listings covered by rule antecedent

Confidence: frequency of rule consequent over all covered listings

Consistency: mean absolute deviation between overall confidence and per-listing confidence, i.e., does the rule work equally well across all covered listings?
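As a rough sketch of how these three metrics might be computed for a rule such as topSection("Discography") → artist.{PageEntity}: the `Listing` class, the boolean per-entity annotation, and the toy data are assumptions made for illustration. Consistency is computed here literally as the mean absolute deviation, so lower values mean the rule behaves more evenly across listings.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    top_section: str        # title of the top-level section containing the listing
    consequent_holds: list  # per entity: does the rule consequent hold (from known facts)?


def rule_metrics(listings, section):
    """Return (support, confidence, consistency) for the rule topSection(section) -> ..."""
    # Antecedent: the listing sits under the given top section (empty listings are skipped)
    covered = [l for l in listings if l.top_section == section and l.consequent_holds]
    support = len(covered)
    if support == 0:
        return 0, 0.0, 0.0
    total = sum(len(l.consequent_holds) for l in covered)
    hits = sum(sum(l.consequent_holds) for l in covered)
    confidence = hits / total
    per_listing = [sum(l.consequent_holds) / len(l.consequent_holds) for l in covered]
    # Mean absolute deviation of per-listing confidence from the overall confidence
    consistency = sum(abs(c - confidence) for c in per_listing) / support
    return support, confidence, consistency


listings = [
    Listing("Discography", [True, True, True, True]),
    Listing("Discography", [True, True, False, False]),
    Listing("Filmography", [False, False]),
]
support, confidence, consistency = rule_metrics(listings, "Discography")
```

Here the rule covers two listings, holds for six of the eight covered entities, and the per-listing confidences (1.0 and 0.5) each deviate by 0.25 from the overall value.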

Entity Disambiguation

6.9 From DBpedia to DBkWik

One (but not the only!) possible source of coverage bias

Collecting Data from a Multitude of Wikis

Differences to DBpedia

DBpedia has manually created mappings to an ontology

Wikipedia has one page per subject and global infobox conventions (more or less)

Challenges

On-the-fly ontology creation

Instance matching

Schema matching

DBpedia is a fairly general-purpose knowledge graph project

DBkWik, by contrast, is a knowledge graph project focused on specific domains

6.10 WebIsALOD

Background: Web table interpretation

Most approaches need typing information, but DBpedia etc. have too little coverage on the long tail. Wanted: an extensive type database

Main challenge

  1. Original dataset is quite noisy (<10% correct statements). Recap: coverage vs. accuracy
  2. Simple thresholding removes too much knowledge

Approach

  1. Train RandomForest model for predicting correct vs. wrong statements
  2. Using all the provenance information we have
  3. Use model to compute confidence scores

Ongoing research: using transformers and a larger training set

Current challenges and works in progress

  1. Distinguishing instances and classes
    i.e., subclass-of vs. instance-of relations
  2. Splitting instances
    Bauhaus is a goth band
    Bauhaus is a German school
  3. Knowledge extraction from pre- and post-modifiers
    Bauhaus is a goth band → genre(Bauhaus, Goth)
    Bauhaus is a German school → location(Bauhaus, Germany)
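The pre-modifier case can be sketched with a small lookup table. This is only an illustration: the regular expression, the `PRE_MODIFIERS` lexicon, and the function name are hypothetical, post-modifiers are not handled, and a real system would have to learn such mappings rather than hard-code them.

```python
import re

# Hypothetical lexicon: pre-modifier -> (relation, value)
PRE_MODIFIERS = {
    "goth": ("genre", "Goth"),
    "german": ("location", "Germany"),
}

# Matches statements of the form "<subject> is a/an <modifier> <head noun>"
PATTERN = re.compile(r"^(?P<subject>.+?) is an? (?P<modifier>\w+) (?P<head>.+)$")


def extract_from_premodifier(statement: str):
    """Return a relation string such as 'genre(Bauhaus, Goth)', or None if nothing applies."""
    match = PATTERN.match(statement)
    if match is None:
        return None
    entry = PRE_MODIFIERS.get(match.group("modifier").lower())
    if entry is None:
        return None
    relation, value = entry
    return f"{relation}({match.group('subject')}, {value})"


band_fact = extract_from_premodifier("Bauhaus is a goth band")
school_fact = extract_from_premodifier("Bauhaus is a German school")
no_fact = extract_from_premodifier("Bauhaus is a band")
```

Note that both statements share the subject string "Bauhaus"; deciding that they describe two different instances is the separate splitting problem from item 2.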