6. Public Knowledge Graphs
6.1 Knowledge Graph Creation
6.1.1 Cyc
Lesson learned no. 1: Trading effort against accuracy
6.1.2 Freebase
Community based
Like Wikipedia, but more structured
Lesson learned no. 2: Trading formality against number of users
6.1.3 Wikidata
Goal: centralize data from the different Wikipedia language editions
Collaborative editing
Imports other datasets
Includes rich provenance
Lesson learned no. 3: There is not one truth (but allowing for plurality adds complexity)
6.1.4 DBpedia & YAGO
Extraction from Wikipedia using mappings & heuristics
Wikipedia as a Knowledge Graph
Lesson learned no. 4: Heuristics help increase coverage (at the cost of accuracy)
6.1.5 NELL
Lesson learned no. 5: Quality cannot be maximized without human intervention
6.1.6 Summary of Trade-Offs
- (Manual) effort vs. accuracy and completeness
- User involvement (or usability) vs. degree of formality
- Simplicity vs. support for plurality and provenance
6.2 Non-Public Knowledge Graphs
Many companies have their own private knowledge graphs.
6.3 Comparison of Knowledge Graphs
"KG1 contains less of X than KG2" can mean
-- it actually contains less instances of X
-- it contains equally many or more instances,
but they are not typed with X
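A minimal Python sketch (using rdflib) to distinguish the two readings; the file names, the example class URI, and the assumption that both graphs share instance URIs are illustrative, not part of the notes:

```python
# Sketch: two ways in which "KG1 contains less City than KG2" can arise.
# File names and the class URI are placeholders.
from rdflib import Graph, RDF, URIRef

CITY = URIRef("http://dbpedia.org/ontology/City")

def typed_and_all(path):
    g = Graph()
    g.parse(path, format="nt")
    typed = set(g.subjects(RDF.type, CITY))   # instances explicitly typed as City
    subjects = set(g.subjects())              # every resource appearing as a subject
    return typed, subjects

typed1, subj1 = typed_and_all("kg1_sample.nt")
typed2, subj2 = typed_and_all("kg2_sample.nt")

# Reading 1: KG1 really has fewer City instances.
print("typed in KG1:", len(typed1), "typed in KG2:", len(typed2))
# Reading 2: the instances exist in KG1 but lack the type assertion.
print("typed in KG2, present but untyped in KG1:", len((typed2 & subj1) - typed1))
```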
6.4 Overlap of Knowledge Graphs
Metric 1 measures the overlap of two knowledge graphs relative to the size of the smaller one. Here, "overlap" means the number of classes or instances shared by both graphs. Measuring it shows the potential gain of integrating knowledge into the smaller graph: if the graphs overlap on some classes, pulling in information for those classes can increase the smaller graph's richness and coverage.
Metric 2 measures the overlap relative to the number of explicit links between the two knowledge graphs, i.e., relations (such as owl:sameAs) that directly connect entities or classes across the graphs. Comparing the measured overlap to the number of explicit links indicates the quality of the existing interlinking and how much potential overlap is not yet linked; if the overlap is large but the explicit links are few, the links need to be improved or complemented by other means.
The link generation mechanism used for this estimation does not need to be very accurate. Even though different similarity measures and thresholds produce different link sets, the intra-class correlation coefficient shows that the resulting overlap estimates are reliable. In other words, link generation only needs to be statistically reliable, not highly accurate, to estimate the overlap between knowledge graphs.
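A hedged sketch of the two metrics; the exact formulas in the underlying study may differ, and the toy values are illustrative:

```python
# Metric 1: estimated overlap relative to the size of the smaller KG.
def overlap_vs_smaller(instances_kg1, instances_kg2, matches):
    return len(matches) / min(len(instances_kg1), len(instances_kg2))

# Metric 2: estimated overlap relative to the number of explicit interlinks;
# a value far above 1 suggests the existing links are incomplete.
def overlap_vs_links(matches, explicit_links):
    return len(matches) / max(len(explicit_links), 1)

kg1 = {"A", "B", "C", "D"}                 # instance sets of the two KGs
kg2 = {"b", "c", "x"}
matches = {("B", "b"), ("C", "c")}         # links produced by some matcher
explicit = {("B", "b")}                    # links that already exist (e.g., owl:sameAs)

print(overlap_vs_smaller(kg1, kg2, matches))   # 2/3 of the smaller KG is shared
print(overlap_vs_links(matches, explicit))     # 2.0: twice as much overlap as links
```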
Summary findings:
DBpedia and YAGO cover roughly the same instances (not very surprising)
NELL is the most complementary to the others
Existing interlinks are insufficient for out-of-the-box parallel usage
6.5 Intermezzo: Knowledge Graph Creation Cost
A few metrics for evaluating KGs: size, degree, interlinking, quality, licensing, ...
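As a small illustration, size and average degree of a KG loaded with rdflib (the input file is a placeholder):

```python
from collections import Counter
from rdflib import Graph

g = Graph()
g.parse("kg_sample.nt", format="nt")

size = len(g)                      # number of triples
degree = Counter()
for s, p, o in g:                  # count in- and out-degree per node
    degree[s] += 1
    degree[o] += 1
avg_degree = sum(degree.values()) / len(degree)
print(f"{size} triples, average degree {avg_degree:.2f}")
```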
Case study: manual curation vs. automatic/heuristic creation
6.6 Entity Extraction from List Pages
Lists form (shallow) hierarchies
Idea: align with category graph
Not all entities on a list page belong to the same category
Idea: learn a classifier to tell subject entities from non-subject entities
Distant learning approach
- Positive examples: entities that are in the corresponding category
- Negative examples: entities that are in a disjoint category, e.g., Book <> Writer
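A hedged sketch of the distant labeling plus classifier idea; the toy list page, the features, and the RandomForest choice are illustrative assumptions, not necessarily the setup of the original approach:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy entities from a hypothetical list page "List of science fiction novels".
entities = [
    {"position": 1, "first_link": True,  "categories": {"Science fiction novels"}},
    {"position": 2, "first_link": True,  "categories": {"Science fiction novels"}},
    {"position": 2, "first_link": False, "categories": {"American writers"}},
]
page_category = "Science fiction novels"
disjoint = {"American writers"}            # Book <> Writer style disjointness

def features(e):
    return [e["position"], int(e["first_link"])]

def label(e):
    if page_category in e["categories"]:
        return 1                           # positive: in the corresponding category
    if e["categories"] & disjoint:
        return 0                           # negative: in a disjoint category
    return None                            # otherwise: left unlabeled

labeled = [e for e in entities if label(e) is not None]
X = [features(e) for e in labeled]
y = [label(e) for e in labeled]
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict([[3, 1]]))               # classify a new list entry
```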
YAGO uses categories for types but does not analyze them further.
6.7 Cat2Ax: Axiomatizing Wikipedia Categories
CaLiGraph example: entities co-occur in surface patterns; co-occurring entities share semantic patterns; existing entities co-occur with new entities
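A minimal sketch of the lexical-pattern part of this idea: a pattern over category names suggests candidate type and property axioms for all category members. The pattern and the derived property are assumptions for illustration; the actual approach also validates candidates against existing assertions of the category members.

```python
import re

# "<Type>s located in <Value>" suggests: member rdf:type <Type>, location = <Value>
pattern = re.compile(r"^(?P<type>\w+)s located in (?P<value>.+)$")

for category in ["Museums located in Paris", "Bridges located in Italy"]:
    m = pattern.match(category)
    if m:
        print(f"members of '{category}': rdf:type {m.group('type')}, "
              f"location = {m.group('value')}")
```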
6.8 Beyond List Pages
Learning descriptive rules for listings, e.g.
topSection("Discography") → artist.{PageEntity}
Learning across pages to mitigate small data problems
Metrics
Support: no. of listings covered by rule antecedent
Confidence: frequency of rule consequent over all covered listings
Consistency: mean absolute deviation between the overall confidence and the per-listing confidences, i.e., does the rule work equally well across all covered listings?
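A hedged sketch of how the three metrics could be computed from per-listing statistics; the exact definitions in the original work may differ (here, confidence is averaged over covered listings rather than counted over individual rows):

```python
# Per-listing confidence of one rule: fraction of entities in the listing for
# which the consequent holds; None if the antecedent does not cover the listing.
listing_confidences = [0.9, 0.8, 1.0, None, 0.7]     # toy values

covered = [c for c in listing_confidences if c is not None]
support = len(covered)                                # listings covered by the antecedent
confidence = sum(covered) / support                   # consequent frequency over covered listings
consistency = 1 - sum(abs(confidence - c) for c in covered) / support
print(support, round(confidence, 3), round(consistency, 3))
```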
Entity Disambiguation
6.9 From DBpedia to DBkWik
One (but not the only!) possible source of coverage bias
Collecting Data from a Multitude of Wikis
Differences from DBpedia
DBpedia has manually created mappings to an ontology
Wikipedia has one page per subject and global infobox conventions (more or less)
Challenges
On-the-fly ontology creation
Instance matching
Schema matching
DBpedia is a rather general-purpose knowledge graph project
DBkWik, in contrast, focuses on knowledge graphs for specific domains
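To illustrate the instance matching challenge above, a naive baseline that matches entities from two wikis by normalized label equality (the wiki data and helper are purely illustrative; real instance matching uses richer signals):

```python
def normalize(label):
    return label.lower().strip().replace("_", " ")

wiki_a = {"Luke_Skywalker": "Luke Skywalker", "Han_Solo": "Han Solo"}
wiki_b = {"luke-skywalker": "Luke Skywalker", "leia": "Leia Organa"}

matches = [(ka, kb)
           for ka, la in wiki_a.items()
           for kb, lb in wiki_b.items()
           if normalize(la) == normalize(lb)]
print(matches)   # [('Luke_Skywalker', 'luke-skywalker')]
```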
6.10 WebIsALOD
Background: Web table interpretation
Most approaches need typing information; DBpedia etc. have too little coverage on the long tail. Wanted: an extensive type database
Main challenge
- Original dataset is quite noisy (<10% correct statements); recap: coverage vs. accuracy
- Simple thresholding removes too much knowledge
Approach
- Train RandomForest model for predicting correct vs. wrong statements
- Using all the provenance information we have
- Use model to compute confidence scores
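A hedged scikit-learn sketch of this filtering step; the provenance features and the toy labels are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

# Provenance features per extracted isA statement,
# e.g., [#source pages, #distinct extraction patterns, pattern quality].
X_train = [[120, 5, 0.9], [2, 1, 0.3], [45, 3, 0.7], [1, 1, 0.2]]
y_train = [1, 0, 1, 0]            # manually labeled: correct vs. wrong statement

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Keep a confidence score per statement instead of applying a hard threshold.
confidence = model.predict_proba([[10, 2, 0.5]])[0][1]
print(f"confidence that the statement is correct: {confidence:.2f}")
```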
Current ongoing research: Using transformers and a larger training set
Current challenges and works in progress
- Distinguishing instances and classes, i.e., subclass-of vs. instance-of relations
- Splitting instances:
  Bauhaus is a goth band
  Bauhaus is a German school
- Knowledge extraction from pre- and post-modifiers:
  Bauhaus is a goth band → genre(Bauhaus, Goth)
  Bauhaus is a German school → location(Bauhaus, Germany)
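A minimal sketch of such modifier-based extraction for the two Bauhaus examples; the tiny genre and demonym lexicons are illustrative assumptions, real extraction would need much broader resources:

```python
import re

GENRES = {"goth", "punk", "jazz"}
DEMONYMS = {"german": "Germany", "french": "France"}

def extract(statement):
    m = re.match(r"(?P<subj>\w+) is an? (?P<mods>.+) (?P<cls>\w+)$", statement)
    if not m:
        return []
    facts = [("type", m.group("subj"), m.group("cls"))]
    for mod in m.group("mods").lower().split():
        if mod in GENRES:
            facts.append(("genre", m.group("subj"), mod.capitalize()))
        if mod in DEMONYMS:
            facts.append(("location", m.group("subj"), DEMONYMS[mod]))
    return facts

print(extract("Bauhaus is a goth band"))      # type + genre(Bauhaus, Goth)
print(extract("Bauhaus is a German school"))  # type + location(Bauhaus, Germany)
```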