Knowledge Graph: 6. Public Knowledge Graphs

6. Public Knowledge Graphs

6.1 Knowledge Graph Creation

6.1.1 Cyc

Lesson learned no. 1: Trading effort against accuracy

6.1.2 Freebase

Community based

Like Wikipedia, but more structured

Lesson learned no. 2: Trading formality against number of users

6.1.3 Wikidata

Goal: centralize data from the different Wikipedia language editions

Collaborative editing

Imports other datasets

Includes rich provenance

Lesson learned no. 3: There is not one truth (but allowing for plurality adds complexity)

6.1.4 DBpedia & YAGO

Extraction from Wikipedia using mappings & heuristics

Wikipedia as a Knowledge Graph

Lesson learned no. 4: Heuristics help increase coverage (at the cost of accuracy)

6.1.5 NELL

Lesson learned no. 5: Quality cannot be maximized without human intervention

6.1.6 Summary of Trade-Offs

  • (Manual) effort vs. accuracy and completeness
  • User involvement (or usability) vs. degree of formality
  • Simplicity vs. support for plurality and provenance

6.2 Non-Public Knowledge Graphs

Many companies have their own private knowledge graphs.

6.3 Comparison of Knowledge Graphs

"KG1 contains fewer X than KG2" can mean either of two things:

-- it actually contains fewer instances of X

-- it contains equally many or more instances, but they are not typed with X
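
The two readings can be told apart by counting typed instances and, separately, instances that are present but lack the type. The following is a minimal sketch, assuming each knowledge graph is available as a dict from instance identifier to a set of type names and that both graphs share identifiers. The representation, the function names `count_typed` and `missing_type`, and the toy data are all hypothetical.

```python
def count_typed(kg, type_name):
    """Number of instances in kg that are explicitly typed with type_name."""
    return sum(1 for types in kg.values() if type_name in types)


def missing_type(kg, reference_kg, type_name):
    """Instances that exist in kg, are typed with type_name in reference_kg,
    but do not carry that type in kg."""
    return sorted(
        inst
        for inst, types in reference_kg.items()
        if type_name in types and inst in kg and type_name not in kg[inst]
    )


# Toy data (illustrative only): both graphs contain the same three instances.
kg1 = {"Berlin": {"Place"}, "Paris": {"City"}, "Rome": set()}
kg2 = {"Berlin": {"City"}, "Paris": {"City"}, "Rome": {"City"}}

print(count_typed(kg1, "City"), count_typed(kg2, "City"))
print(missing_type(kg1, kg2, "City"))
```

In this toy case KG1 has fewer instances typed as City, yet it contains every instance; the gap comes from typing, not from coverage.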

6.4 Overlap of Knowledge Graphs

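Measure 1 assesses the overlap of two knowledge graphs relative to the smaller one. Here, "overlap" is the number of classes or entities that the two knowledge graphs share. The smaller knowledge graph may be a subset of the larger one, or a related graph of smaller size. Measuring this overlap shows the potential gain from integrating new knowledge into the smaller knowledge graph: if the two graphs overlap on some classes, merging in the information for those classes could make the smaller graph richer and more complete.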

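Measure 2 assesses the overlap of two knowledge graphs relative to the number of explicit links between them. Here, an "explicit link" is a relation that directly connects two entities or classes. Comparing the measured overlap with the number of explicit links indicates the quality of those links and how much of the potential connection between the graphs they capture. If two knowledge graphs have many explicit links but low overlap, the link quality may need improvement, or other ways of increasing their connectedness should be considered.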

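The link generation mechanism does not need to be highly accurate. Different metrics and thresholds yield different link sets, but the reliability indicated by the intraclass correlation coefficient makes this variety of generation mechanisms acceptable. In other words, to estimate the overlap reliably, link generation does not have to be very precise; it has to be statistically reliable.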
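
The two measures can be sketched in a few lines. This is an illustrative sketch only: it assumes overlap is estimated by a deliberately simple heuristic (equal labels after normalization), which is not necessarily the link generation mechanism used in the original study. All function names and the toy labels are hypothetical.

```python
def normalize(label):
    """Lowercase a label, turn underscores into spaces, collapse whitespace."""
    return " ".join(label.lower().replace("_", " ").split())


def estimate_overlap(labels_a, labels_b):
    """Heuristic link generation: two instances match if their
    normalized labels are identical."""
    return {normalize(a) for a in labels_a} & {normalize(b) for b in labels_b}


def overlap_vs_smaller(n_overlap, size_a, size_b):
    """Measure 1: overlap relative to the size of the smaller graph."""
    smaller = min(size_a, size_b)
    return n_overlap / smaller if smaller else 0.0


def overlap_vs_links(n_overlap, n_explicit_links):
    """Measure 2: overlap relative to the number of explicit links."""
    return n_overlap / n_explicit_links if n_explicit_links else float("inf")


# Toy data (illustrative only).
kg_a = ["Berlin", "Paris", "New_York", "Rome"]
kg_b = ["berlin", "new york", "Tokyo"]
explicit_links = [("Berlin", "berlin")]

matched = estimate_overlap(kg_a, kg_b)
m1 = overlap_vs_smaller(len(matched), len(kg_a), len(kg_b))
m2 = overlap_vs_links(len(matched), len(explicit_links))
print(sorted(matched), m1, m2)
```

A value of measure 2 above 1 means the heuristic finds more shared instances than the explicit links cover, which would hint that the existing interlinks are incomplete.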

Summary findings:

DBpedia and YAGO cover roughly the same instances (not very surprising)

NELL is the most complementary to the others

Existing interlinks are insufficient for out-of-the-box parallel usage

6.5 Intermezzo: Knowledge Graph Creation Cost

A few metrics for evaluating KGs: size, degree, interlinking, quality, licensing, ...

Case: manual curation & automatic/heuristic creation

6.6 Entity Extraction from List Pages

Lists form (shallow) hierarchies

Idea: align with category graph

Not all entities on a list page belong to the same category

Idea: Learn a classifier to tell subject entities from non-subject entities

Distant supervision approach

  • Positive examples: Entities that are in the corresponding category
  • Negative examples: Entities that are in a category which is disjoint with it, e.g., Book <> Writer
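
The labeling step can be sketched as follows. This is a minimal sketch, assuming the category memberships of known entities and the disjointness information are already available as plain Python sets; `label_entities`, its parameters, and the toy data are hypothetical and not taken from the original system.

```python
def label_entities(list_entities, entity_categories, target_category,
                   disjoint_categories):
    """Assign distant-supervision labels to the entities on one list page.

    1 = positive (entity is in the category aligned with the list)
    0 = negative (entity is in a category disjoint with it)
    Entities matching neither condition stay unlabeled and are left
    for the trained classifier to decide.
    """
    labels = {}
    for entity in list_entities:
        categories = entity_categories.get(entity, set())
        if target_category in categories:
            labels[entity] = 1
        elif categories & disjoint_categories:
            labels[entity] = 0
    return labels


# Toy data (illustrative only): a list page aligned with the category "Writer".
entity_categories = {
    "Jane Austen": {"Writer"},
    "Pride and Prejudice": {"Book"},
    "Ann Radcliffe": set(),  # no category information available
}
list_entities = ["Jane Austen", "Pride and Prejudice", "Ann Radcliffe"]

labels = label_entities(list_entities, entity_categories, "Writer", {"Book"})
print(labels)
```

Entities without a label, such as the third one here, are exactly the cases the classifier is meant to resolve.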

YAGO uses categories for types but does not analyze them further.

6.7 Cat2Ax: Axiomatizing Wikipedia Categories

CaLiGraph example: entities co-occur in surface patterns; co-occurring entities share semantic patterns; existing entities co-occur with new entities

6.8 Beyond List Pages

Learning descriptive rules for listings, e.g.

topSection("Discography") → artist.{PageEntity}

Learning across pages to mitigate small data problems

Metrics

Support: no. of listings covered by rule antecedent

Confidence: frequency of rule consequent over all covered listings

Consistency: mean absolute deviation between the overall confidence and the per-listing confidence, i.e., does the rule work equally well across all covered listings?
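
One possible reading of these three metrics, written out as code. The sketch assumes that each covered listing is summarized by two counts: how many of its entities can be checked against the knowledge graph, and how many of those satisfy the rule consequent. This input format and the name `rule_metrics` are hypothetical, and the deviation is returned as stated above without any further normalization.

```python
def rule_metrics(covered_listings):
    """Compute (support, confidence, consistency) for one rule.

    covered_listings: list of (n_checked, n_matching) pairs, one per listing
    covered by the rule antecedent, with n_checked > 0.
    """
    support = len(covered_listings)
    if support == 0:
        return 0, 0.0, 0.0

    total_checked = sum(n for n, _ in covered_listings)
    total_matching = sum(k for _, k in covered_listings)
    confidence = total_matching / total_checked

    # Mean absolute deviation of per-listing confidence from overall confidence.
    deviation = sum(
        abs(confidence - k / n) for n, k in covered_listings
    ) / support
    return support, confidence, deviation


# Toy counts (illustrative only): three listings covered by the rule.
support, confidence, deviation = rule_metrics([(10, 9), (5, 5), (5, 2)])
print(support, round(confidence, 3), round(deviation, 3))
```

A small deviation means the rule behaves similarly on every covered listing; a large one means the overall confidence hides listings where the rule works poorly.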

Entity Disambiguation

6.9 From DBpedia to DBkWik

One (but not the only!) possible source of coverage bias

Collecting Data from a Multitude of Wikis

Differences to DBpedia

DBpedia has manually created mappings to an ontology

Wikipedia has one page per subject and global infobox conventions (more or less)

Challenges

On-the-fly ontology creation

Instance matching

Schema matching

DBpedia is a general-purpose knowledge graph project

DBkWik, by contrast, is a knowledge graph project focused on specific domains

6.10 WebIsALOD

Background: Web table interpretation

Most approaches need typing information, but DBpedia and similar knowledge graphs have too little coverage of the long tail. Wanted: an extensive type database

Main challenge

  1. Original dataset is quite noisy (<10% correct statements). Recap: coverage vs. accuracy
  2. Simple thresholding removes too much knowledge

Approach

  1. Train RandomForest model for predicting correct vs. wrong statements
  2. Using all the provenance information we have
  3. Use model to compute confidence scores

Ongoing research: Using transformers and a larger training set

Current challenges and work in progress

  1. Distinguishing instances and classes
    i.e., subclass-of vs. instance-of relations
  2. Splitting instances
    Bauhaus is a goth band
    Bauhaus is a German school
  3. Knowledge extraction from pre- and post-modifiers
    Bauhaus is a goth band → genre(Bauhaus, Goth)
    Bauhaus is a German school → location(Bauhaus, Germany)
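
To make the third item concrete, here is a toy sketch of pre-modifier extraction for the two example statements. It is rule-based with tiny hand-written lookup tables, purely for illustration; it handles only a single pre-modifier and is not the method used in WebIsALOD. The tables, the pattern, and the function name `extract_from_premodifier` are all hypothetical.

```python
import re

# Hypothetical lookup tables mapping pre-modifiers to relation values.
GENRE_MODIFIERS = {"goth": "Goth"}
LOCATION_MODIFIERS = {"german": "Germany"}

# Matches statements of the form "<subject> is a/an <modifier> <head noun>".
PATTERN = re.compile(
    r"^(?P<subject>.+?) is an? (?P<modifier>\w+) (?P<head>\w+)$",
    re.IGNORECASE,
)


def extract_from_premodifier(statement):
    """Return a (relation, subject, value) triple, or None if no rule applies."""
    match = PATTERN.match(statement.strip())
    if match is None:
        return None
    subject = match.group("subject")
    modifier = match.group("modifier").lower()
    if modifier in GENRE_MODIFIERS:
        return ("genre", subject, GENRE_MODIFIERS[modifier])
    if modifier in LOCATION_MODIFIERS:
        return ("location", subject, LOCATION_MODIFIERS[modifier])
    return None


for statement in ("Bauhaus is a goth band", "Bauhaus is a German school"):
    print(statement, "->", extract_from_premodifier(statement))
```

The same subject yields two different relations here, which is also why instance splitting (item 2) has to happen before or alongside this step.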