Knowledge Graph知识图谱—6. Public Knowledge Graphs

6. Public Knowledge Graphs

6.1 Knowledge Graph Creation

6.1.1 CyC

Lesson learned no. 1: Trading efforts against accuracy

6.1.2 Freebase

Community based

Like Wikipedia, but more structured

Lesson learned no. 2: Trading formality against number of users

6.1.3 Wikidata

Goal: centralize data from Wikipedia languages

Collaborative editing

Imports other datasets

Includes rich provenance

Lesson learned no. 3: There is not one truth (but allowing for plurality adds complexity)

6.1.4 DBpedia & YAGO

Extraction from Wikipedia using mappings & heuristics

Wikipedia as a Knowledge Graph

Lesson learned no. 4: Heuristics help increasing coverage (at the cost of accuracy)

6.1.5 NELL

Lesson learned no. 5: Quality cannot be maximized without human intervention

6.1.6 Summary of Trade-Offs

  • (Manual) effort vs. accuracy and completeness
  • User involvement (or usability) vs. degree of formality
  • Simplicity vs. support for plurality and provenance

6.2 Non-Public Knowledge Graphs

Many companies have their own private knowledge graphs.

6.3 Comparison of Knowledge Graphs

"KG1 contains less of X than KG2" can mean

-- it actually contains less instances of X

-- it contains equally many or more instances,

but they are not typed with X

6.4 Overlap of Knowledge Graphs

度量 1 的目标是评估知识图谱的重叠度,相对于一个较小的知识图谱。在这里,"重叠度"指的是两个知识图谱之间共享的类别或实体的数量。较小的知识图谱可能是一个子集或者一个相关但规模较小的知识图谱。通过测量两者之间的重叠,研究人员可以了解将新知识集成到较小知识图谱中可能会带来的潜在收益。如果两个知识图谱在一些类别上有重叠,那么将这些类别的信息整合到较小知识图谱中可能会增加其丰富性和全面性。

度量 2 的目标是评估知识图谱的重叠度,相对于两个知识图谱之间显式链接的数量。在这里,"显式链接"指的是直接连接两个实体或类别的关系。通过测量两个知识图谱之间的重叠度,并将其与显式链接的数量进行比较,研究人员可以了解到显式链接的质量和知识图谱之间潜在关联的重要性。如果两个知识图谱之间有很多显式链接,但重叠度较低,那么可能需要改进链接的质量或者考虑其他方式来提高关联性。

链接生成机制并不需要过于准确。即使不同的度量和阈值导致了不同的链接集,但通过类内相关系数的可靠性,可以接受生成机制的多样性。这意味着,为了可靠地估计链接集之间的重叠度,链接生成并不一定需要非常准确,而是需要在统计上可靠。

Summary findings:

DBpedia and YAGO cover roughly the same instances (not much surprising)

NELL is the most complementary to the others

Existing interlinks are insufficient for out-of-the-box parallel usage

6.5 Intermezzo: Knowledge Graph Creation Cost

A few metrics for evaluating KGs: size, degree, interlinking, quality, licensing, ...

Case: manual curation & automatic/heuristic creation

6.6 Entity Extraction from List Pages

Lists form (shallow) hierarchies

Idea: align with category graph

Not all entities on a list page belong to the same category

Idea: Learn classifier to tell subject entities from non-subject entities
Distant learning approach

  • Positive examples: Entities that are in the
    corresponding category
    Negative examples: Entities that are in a category which is disjoint, e.g., Book <> Writer

YAGO uses categories for types but does not analyze them further.

6.7 Cat2Ax: Axiomatizing Wikipedia Categories

CaLiGraph Example: entities co-occur in surface patterns; co-occuring entities share semantic patterns; existing entities co-occur with new entities

6.8 Beyond List Pages

Learning descriptive rules for listings, e.g.

topSection("Discography") → artist.{PageEntity}

Learning across pages to mitigate small data problems
Metrics

Support: no. of listings covered by rule antecedent

Confidence: frequency of rule consequent over all covered listings

Consistency: mean absolute deviation of overall confidence and listing confidence. i.e., does the rule work equally well across all covered listings

Entity Disambiguation

6.9 From DBpedia to DBkWik

One (but not the only!) possible source of coverage bias

Collecting Data from a Multitude of Wikis

Differences to DBpedia

DBpedia has manually created mappings to an ontology

Wikipedia has one page per subject and global infobox conventions (more or less)
Challenges

On-the-fly ontology creation

Instance matching

Schema matching

DBpedia 是一个通用性较强的知识图谱项目

DBkWik 则是专注于特定领域的知识图谱项目

6.10 WebIsALOD

Background: Web table interpretation

Most approaches need typing information, DBpedia etc. have too little coverage on the long tail. Wanted: extensive type database

Main challenge

  1. Original dataset is quite noisy (<10% correct statements) Recap: coverage vs. accuracy
  2. Simple thresholding removes too much knowledge

Approach

  1. Train RandomForest model for predicting correct vs. wrong statements
  2. Using all the provenance information we have
  3. Use model to compute confidence scores

Current ongoing research: Using transformers and a larger training set

Current challenges and works in progress

  1. Distinguishing instances and classes
    i.e.: subclass vs. instance of relations
  2. Splitting instances
    Bauhaus is a goth band
    Bauhaus is a German school
  3. Knowledge extraction from pre and post modifiers
    Bauhaus is a goth band → genre(Bauhaus, Goth)
    Bauhaus is a German school → location(Bauhaus, Germany)
相关推荐
东方不败之鸭梨的测试笔记22 分钟前
测试工程师如何利用AI大模型?
人工智能
智能化咨询26 分钟前
(68页PPT)埃森哲XX集团用户主数据治理项目汇报方案(附下载方式)
大数据·人工智能
说私域34 分钟前
分享经济应用:以“开源链动2+1模式AI智能名片S2B2C商城小程序”为例
人工智能·小程序·开源
工业机器视觉设计和实现34 分钟前
我的第三个cudnn程序(cifar10改cifar100)
人工智能·深度学习·机器学习
熊猫钓鱼>_>37 分钟前
PyTorch深度学习框架入门浅析
人工智能·pytorch·深度学习·cnn·nlp·动态规划·微分
Altair澳汰尔1 小时前
成功案例丨仿真+AI技术为快消包装行业赋能提速:基于 AI 的轻量化设计节省数十亿美元
人工智能·ai·仿真·cae·消费品·hyperworks·轻量化设计
祝余Eleanor1 小时前
Day 31 类的定义和方法
开发语言·人工智能·python·机器学习
背心2块钱包邮1 小时前
第6节——微积分基本定理(Fundamental Theorem of Calculus,FTC)
人工智能·python·机器学习·matplotlib
也许是_1 小时前
大模型应用技术之提示词高阶技巧
人工智能
ShiMetaPi1 小时前
SAM(通用图像分割基础模型)丨基于BM1684X模型部署指南
人工智能·算法·ai·开源·bm1684x·算力盒子