
KGTK Data Model

The KGTK data model represents knowledge graphs (KGs) as a set of nodes and edges. The figure below shows a partial KG for the Terminator 2 movie. KGTK uses nodes to represent entities (e.g., terminator2_jd or action), strings with language tags (e.g., "Terminator 2"@en), dates (e.g., ^1992-03-30T00:00:00Z/11) and other types of literals (see full specification). A notable feature of KGTK is that edges are also nodes, depicted in the figure as orange circles. Because edges are nodes, it is possible to define edges that connect edges to other nodes, as illustrated by the blue arrows.

For example, we can represent that the Terminator 2 movie received an Academy Award for best sound editing by using an edge labeled award between terminator2_jd and academy_best_sound_editing. We can represent that the award was given on March 30, 1992 by using an edge labeled point_in_time from the award edge to ^1992-03-30T00:00:00Z/11, and we can represent that the award was given to Gary Rydstrom and Gloria Borders by using two additional edges labeled winner.


File Format

KGTK represents KGs using TSV files with four columns labeled id, node1, label and node2. The id column is a symbol that identifies an edge, corresponding to the orange circles in the diagram above. node1 represents the source of the edge, node2 represents the destination of the edge, and label represents the relation between node1 and node2. Note that an edge identifier (e.g., t4) can appear in the node1 column to represent an edge whose source is the edge with that identifier. See File Format for the full specification of the KGTK file format.

| id | node1 | label | node2 |
| --- | --- | --- | --- |
| | terminator2_jd | label | "Terminator 2"@en |
| | terminator2_jd | instance_of | film |
| | terminator2_jd | genre | science_fiction |
| | terminator2_jd | genre | action |
| t4 | terminator2_jd | cast | a_schwarzenegger |
| | t4 | role | terminator |
| t6 | terminator2_jd | cast | l_hamilton |
| | t6 | role | s_connor |
| t8 | terminator2_jd | award | academy_best_sound_editing |
| | t8 | point_in_time | ^1992-03-30T00:00:00Z/11 |
| | t8 | winner | g_rydstrom |
| | t8 | winner | g_borders |
| | l_hamilton | label | "Linda Hamilton"@en |
| | a_schwarzenegger | label | "Arnold Schwarzenegger"@en |
| | film | subclass_of | visual_artwork |
| t15 | terminator2_jd | publication_date | ^1984-10-26T00:00:00Z/11 |
| | t15 | location | united_states |
| t17 | terminator2_jd | publication_date | ^1985-02-08T00:00:00Z/11 |
| | t17 | location | sweden |
| | terminator2_jd | duration | 108minute |
| | instance_of | label | "instance of"@en |
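To make the four-column layout concrete, here is a minimal sketch that reads a few of the rows above with Python's standard csv module and collects the edges whose node1 is the identifier of another edge. It does not use the KGTK toolkit itself. The helper names read_edges and edges_about_edges are hypothetical, and leaving the id column empty for edges that are never referenced is an assumption taken from the table above.

```python
import csv
import io

# A few rows from the table above, written as tab-separated text.
# Assumption: rows whose edge is never referenced leave the id column empty.
KGTK_TSV = (
    "id\tnode1\tlabel\tnode2\n"
    "\tterminator2_jd\tinstance_of\tfilm\n"
    "t4\tterminator2_jd\tcast\ta_schwarzenegger\n"
    "\tt4\trole\tterminator\n"
    "t8\tterminator2_jd\taward\tacademy_best_sound_editing\n"
    "\tt8\tpoint_in_time\t^1992-03-30T00:00:00Z/11\n"
    "\tt8\twinner\tg_rydstrom\n"
    "\tt8\twinner\tg_borders\n"
)


def read_edges(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps double quotes in values such as "Terminator 2"@en intact.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def edges_about_edges(edges):
    """Group (label, node2) pairs by the identifier of the edge they describe."""
    edge_ids = {e["id"] for e in edges if e["id"]}
    grouped = {}
    for e in edges:
        if e["node1"] in edge_ids:
            grouped.setdefault(e["node1"], []).append((e["label"], e["node2"]))
    return grouped


edges = read_edges(KGTK_TSV)
qualifiers = edges_about_edges(edges)
for edge_id, pairs in qualifiers.items():
    print(edge_id, pairs)
```

In this sketch the award edge t8 ends up with three secondary edges (one point_in_time and two winner edges), mirroring the blue arrows described earlier.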

Relationship To Other KG Data Models

The KGTK data model is a generalization of popular data models used to represent KGs.

Relationship To Property Graphs

Property graphs are a popular data model where sets of attribute/value pairs can be attached to nodes and edges. The KGTK model is a generalization of property graphs because the attribute/value pairs are also edges: the attributes are relations and the values can be arbitrary nodes.
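As a rough illustration of this generalization, the sketch below flattens a single property-graph edge and its attribute/value pairs into KGTK-style rows. The function property_edge_to_kgtk and the tuple layout (id, node1, label, node2) are assumptions made for this example and are not part of KGTK.

```python
def property_edge_to_kgtk(edge_id, source, relation, target, properties):
    """Turn one property-graph edge into KGTK-style (id, node1, label, node2) rows.

    The edge itself becomes one row. Each attribute/value pair becomes an
    additional row whose node1 is the identifier of the edge.
    """
    rows = [(edge_id, source, relation, target)]
    for attribute, value in properties.items():
        # The attribute plays the role of the relation; the value is a node.
        rows.append(("", edge_id, attribute, value))
    return rows


rows = property_edge_to_kgtk(
    "t4", "terminator2_jd", "cast", "a_schwarzenegger", {"role": "terminator"}
)
for row in rows:
    print("\t".join(row))
```

Because the value of each pair is an ordinary node, it could itself be the source of further edges, which is what a plain attribute/value pair cannot express.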

Relationship To RDF

RDF graphs represent KGs using subject/predicate/object triples, corresponding to the node1/label/node2 columns in KGTK. To represent edges about edges, RDF requires reification, typically done using rdf:Statement, where each edge is represented by a resource described with three triples (rdf:subject, rdf:predicate and rdf:object). The KGTK representation is simpler because it does not require extra triples to represent the edges.
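The difference in size can be sketched by writing out both forms for the cast edge and its role qualifier. The code below builds plain tuples only and does not use an RDF library. The helper reify is hypothetical. It emits the rdf:subject, rdf:predicate and rdf:object triples of the standard reification vocabulary, plus the rdf:type triple that marks the resource as an rdf:Statement.

```python
def reify(statement_id, subject, predicate, obj):
    """Describe one triple as an rdf:Statement resource (standard reification)."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


# RDF: reify the cast triple, then attach the role to the statement resource.
rdf_triples = reify("t4", "terminator2_jd", "cast", "a_schwarzenegger")
rdf_triples.append(("t4", "role", "terminator"))

# KGTK: the edge identifier is already a node, so two rows are enough.
kgtk_rows = [
    ("t4", "terminator2_jd", "cast", "a_schwarzenegger"),
    ("", "t4", "role", "terminator"),
]

print(len(rdf_triples), "RDF triples versus", len(kgtk_rows), "KGTK rows")
```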

RDF also supports quads, where a fourth element is used to represent a graph. In KGTK the fourth element is an identifier for an edge (every edge has a unique identifier). The KGTK data model is significantly more flexible as it is possible to associate edges with multiple graphs by using multiple edges on edges.

Relationship To RDF*

RDF* is a generalization of RDF that allows triples to be used as the subject of other triples. In KGTK, the same effect is achieved by using the identifier of an edge as the node1 of another edge. KGTK is more flexible in that edge identifiers can also be used in the node2 position. Furthermore, in KGTK it is possible to define two edges with identical node1/label/node2 values but different identifiers, making it possible to associate different sets of secondary edges with the same subject/predicate/object triple. This is useful when the same subject/predicate/object triple has different provenance information.
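The following sketch shows two edges that share the same node1/label/node2 values but carry different provenance. The identifiers e1 and e2, the label source, and the values source_a and source_b are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Rows are (id, node1, label, node2). e1 and e2 state the same triple.
ROWS = [
    ("e1", "terminator2_jd", "publication_date", "^1984-10-26T00:00:00Z/11"),
    ("e2", "terminator2_jd", "publication_date", "^1984-10-26T00:00:00Z/11"),
    ("", "e1", "source", "source_a"),
    ("", "e2", "source", "source_b"),
]


def ids_by_triple(rows):
    """Map each (node1, label, node2) triple to the identifiers that state it."""
    ids = defaultdict(list)
    for edge_id, node1, label, node2 in rows:
        if edge_id:
            ids[(node1, label, node2)].append(edge_id)
    return dict(ids)


def secondary_edges(rows):
    """Map each edge identifier to the (label, node2) pairs attached to it."""
    edge_ids = {row[0] for row in rows if row[0]}
    attached = defaultdict(list)
    for _edge_id, node1, label, node2 in rows:
        if node1 in edge_ids:
            attached[node1].append((label, node2))
    return dict(attached)


triple = ("terminator2_jd", "publication_date", "^1984-10-26T00:00:00Z/11")
same_triple_ids = ids_by_triple(ROWS)[triple]
provenance = secondary_edges(ROWS)
print(same_triple_ids, provenance)
```

Each identifier keeps its own set of secondary edges, so the two provenance records stay separate even though the underlying triple is the same.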

Relationship To Wikidata

The KGTK data model is most similar to the Wikidata data model, where it is possible to define qualifiers and references for statements. In Wikidata it is not possible to define qualifiers and references on qualifiers, so it is not possible, for example, to represent the provenance of a qualifier. KGTK supports an arbitrary number of levels of edges on edges.
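A small sketch of what multiple levels look like in practice: the point_in_time qualifier on the award edge is given its own identifier so that a further edge can record where the qualifier came from. The identifier q1, the label stated_in and the node example_source are hypothetical, and the helper nesting_depth is not part of KGTK.

```python
# Rows are (id, node1, label, node2).
ROWS = [
    ("t8", "terminator2_jd", "award", "academy_best_sound_editing"),
    ("q1", "t8", "point_in_time", "^1992-03-30T00:00:00Z/11"),  # edge on an edge
    ("", "q1", "stated_in", "example_source"),  # edge on an edge on an edge
]

# Index the rows that have an identifier so they can be looked up as nodes.
BY_ID = {row[0]: row for row in ROWS if row[0]}


def nesting_depth(row, by_id):
    """Count how many edges lie between this row and an ordinary node.

    A row whose node1 is not an edge identifier has depth 0. The seen set
    stops the walk if the identifiers ever form a cycle.
    """
    depth = 0
    node1 = row[1]
    seen = set()
    while node1 in by_id and node1 not in seen:
        seen.add(node1)
        depth += 1
        node1 = by_id[node1][1]
    return depth


depths = [nesting_depth(row, BY_ID) for row in ROWS]
print(depths)
```

The last row sits two levels above the base statement, a shape that the paragraph above says Wikidata qualifiers and references cannot express.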