unreify-rdf-statements
Summary¶
kgtk unreify-rdf-statements simplifies data while copying a KGTK file from input to output, by removing extra nodes caused by RDF statement reification.
For example, consider the edges in the following table that result from importing an AIDA TA1 ntriples file:
Input Table:
node1 | label | node2 |
---|---|---|
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:confidence | XJAABmv8vGfJZZasjV6DAXY:g4 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:justifiedBy | XJAABmv8vGfJZZasjV6DAXY:g5 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:system | nJAABmv8vGfJZZasjV6DAXY-1: |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:object | gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:predicate | nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:subject | gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72 |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:type | rdf:Statement |
The output of kgtk unreify-rdf-statements is below. The unified table is easier to understand as it clearly signals that we have an event and we know the place where the attack occurred. The secondary edges qualify the main edge, giving us context.
Output Table:
id | node1 | label | node2 |
---|---|---|---|
XJAABmv8vGfJZZasjV6DAXY:g3 | gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72 | nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place | gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c |
XJAABmv8vGfJZZasjV6DAXY:g3-1 | XJAABmv8vGfJZZasjV6DAXY:g3 | ont:confidence | XJAABmv8vGfJZZasjV6DAXY:g4 |
XJAABmv8vGfJZZasjV6DAXY:g3-2 | XJAABmv8vGfJZZasjV6DAXY:g3 | ont:justifiedBy | XJAABmv8vGfJZZasjV6DAXY:g5 |
XJAABmv8vGfJZZasjV6DAXY:g3-3 | XJAABmv8vGfJZZasjV6DAXY:g3 | ont:system | nJAABmv8vGfJZZasjV6DAXY-1: |
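The transformation shown in the two tables above can be sketched in a few lines of Python. This is an illustration only, not KGTK's implementation: the function name unreify_group and the tuple layouts are assumptions made for this example, and it handles only a single statement with exactly one subject, one predicate, and one object.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]          # (node1, label, node2)
OutEdge = Tuple[str, str, str, str]  # (id, node1, label, node2)

ROLE_LABELS = {"rdf:subject", "rdf:predicate", "rdf:object"}
TRIGGER = ("rdf:type", "rdf:Statement")


def unreify_group(edges: List[Edge]) -> List[OutEdge]:
    """Unreify edges that share one node1 and describe one statement.

    Hypothetical helper for illustration; not part of KGTK.
    """
    statement_id = edges[0][0]
    roles: Dict[str, str] = {}
    secondary: List[Edge] = []
    for node1, label, node2 in edges:
        if label in ROLE_LABELS:
            roles[label] = node2          # subject, predicate, or object
        elif (label, node2) != TRIGGER:   # the trigger edge is dropped
            secondary.append((node1, label, node2))

    # The main edge reuses the statement node as its id.
    out: List[OutEdge] = [(
        statement_id,
        roles["rdf:subject"],
        roles["rdf:predicate"],
        roles["rdf:object"],
    )]
    # Secondary edges keep their node1 and get numbered ids.
    for n, (node1, label, node2) in enumerate(secondary, start=1):
        out.append((f"{statement_id}-{n}", node1, label, node2))
    return out


G3 = "XJAABmv8vGfJZZasjV6DAXY:g3"
reified = [
    (G3, "ont:confidence", "XJAABmv8vGfJZZasjV6DAXY:g4"),
    (G3, "ont:justifiedBy", "XJAABmv8vGfJZZasjV6DAXY:g5"),
    (G3, "ont:system", "nJAABmv8vGfJZZasjV6DAXY-1:"),
    (G3, "rdf:object", "gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c"),
    (G3, "rdf:predicate", "nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place"),
    (G3, "rdf:subject", "gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72"),
    (G3, "rdf:type", "rdf:Statement"),
]
unreified = unreify_group(reified)
for edge in unreified:
    print(" | ".join(edge))
```

Seven input edges become four: one main edge whose id is the former statement node, plus three secondary edges that point at that id through node1.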
Files¶
Input File¶
The input file is a KGTK file containing reified RDF data (among other records), such as might have been imported from an ntriples file (see kgtk import-ntriples).
node1 | label | node2 |
---|---|---|
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:type | rdf:Statement |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:object | gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:predicate | nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:subject | gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:confidence | XJAABmv8vGfJZZasjV6DAXY:g4 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:justifiedBy | XJAABmv8vGfJZZasjV6DAXY:g5 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:system | nJAABmv8vGfJZZasjV6DAXY-1: |
Output File¶
The output file contains the KGTK data from the input file, with reified RDF statements and associated edges replaced with an unreified RDF edge and secondary edges.
node1 | label | node2 | id |
---|---|---|---|
gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72 | nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place | gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c | XJAABmv8vGfJZZasjV6DAXY:g3 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:confidence | XJAABmv8vGfJZZasjV6DAXY:g4 | XJAABmv8vGfJZZasjV6DAXY:g3-1 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:justifiedBy | XJAABmv8vGfJZZasjV6DAXY:g5 | XJAABmv8vGfJZZasjV6DAXY:g3-2 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:system | nJAABmv8vGfJZZasjV6DAXY-1: | XJAABmv8vGfJZZasjV6DAXY:g3-3 |
An id column is added to the output file if it is not present in the input file. This is used to link secondary edges to the newly reconstituted unreified edge. At the present time, kgtk unreify-rdf-statements does not generate id values for other edges in the file. This feature may be added in the future.
The edges in the output file are not likely to be in the same order as they appeared in the input file. If you wish to compare the input to the output files, read the section below on Difference Comparison.
Reified File¶
This optional file will receive a copy of just the input data records that matched the reified RDF statement pattern. The records are the same as they were in the input file, e.g., an id column might not be present.
Unreified File¶
This optional file will receive a copy of just the output records that were generated by unreifying RDF statements in the input file. The records in this file will be in the output file's format, e.g., an id column will be present.
Uninvolved File¶
This optional file will receive a copy of the input data records that did not match the reified RDF statement pattern. The records are the same as they were in the input file, e.g., an id column might not be present.
Pattern Match Parameters¶
kgtk unreify-rdf-statements has a built-in set of pattern match parameters that will not change for normal operation. All pattern matches reference the usual node1, label, and node2 columns or their aliases; there are no options to override the column names.
The Difference Comparison section, below, describes one use case in which overriding the pattern match parameters can be beneficial.
--trigger-label TRIGGER_LABEL_VALUE
A value that identifies the trigger label. (default=rdf:type).
--trigger-node2 TRIGGER_NODE2_VALUE
A value that identifies the trigger node2. (default=rdf:Statement).
--node1-role RDF_SUBJECT_LABEL_VALUE
The label that identifies the edge with the node2 value that will
serve in the node1 role. (default=rdf:subject).
--label-role RDF_PREDICATE_LABEL_VALUE
The label that identifies the edge with the node2 value that will
serve in the label role. (default=rdf:predicate).
--node2-role RDF_OBJECT_LABEL_VALUE
The label that identifies the edge with the node2 value that will
serve in the node2 role. (default=rdf:object).
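To make the role of these five parameters concrete, here is a minimal Python sketch of pattern detection. It is not KGTK's code: the function name split_reified, the grouping by node1, and the rule that a group must contain the trigger edge plus all three role edges are assumptions made for this illustration. The keyword defaults mirror the option defaults listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (node1, label, node2)


def split_reified(
    edges: Iterable[Edge],
    trigger_label: str = "rdf:type",
    trigger_node2: str = "rdf:Statement",
    node1_role: str = "rdf:subject",
    label_role: str = "rdf:predicate",
    node2_role: str = "rdf:object",
) -> Tuple[List[Edge], List[Edge]]:
    """Return (reified, uninvolved) edges. Hypothetical helper, not KGTK."""
    # Assumption: edges describing one statement share the same node1.
    groups: Dict[str, List[Edge]] = defaultdict(list)
    for edge in edges:
        groups[edge[0]].append(edge)

    reified: List[Edge] = []
    uninvolved: List[Edge] = []
    for group in groups.values():
        labels = {label for _, label, _ in group}
        has_trigger = any(
            label == trigger_label and node2 == trigger_node2
            for _, label, node2 in group
        )
        has_roles = {node1_role, label_role, node2_role} <= labels
        target = reified if has_trigger and has_roles else uninvolved
        target.extend(group)
    return reified, uninvolved


# Made-up sample data: one reified statement and one ordinary edge.
sample = [
    ("s1", "rdf:type", "rdf:Statement"),
    ("s1", "rdf:subject", "a"),
    ("s1", "rdf:predicate", "p"),
    ("s1", "rdf:object", "b"),
    ("x", "knows", "y"),
]
matched, rest = split_reified(sample)
disabled_matched, disabled_rest = split_reified(sample, trigger_label="XXX")
```

With the defaults, the four s1 edges are classified as reified. With a trigger label that never occurs in the data, such as XXX, nothing matches and every edge is treated as uninvolved.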
Cartesian Crossproduct¶
kgtk unreify-rdf-statements processes multiple subjects, predicates, and/or objects in the reified input edges by generating one set of unreified edges (both the unreified data edge and any secondary edges) for each combination of (subject, predicate, object). This processing may be disabled by options to disallow multiple subjects, multiple predicates, and multiple objects.
When processing of multiple subjects, predicates, or objects has been disabled, and multiple subjects, predicates, or objects are encountered in the input stream, a warning message will be issued (if --verbose output is enabled) and the group of input data will not be unreified.
--allow-multiple-subjects [ALLOW_MULTIPLE_SUBJECTS]
When true, allow multiple subjects, resulting in a cartesian
product. (default=True).
--allow-multiple-predicates [ALLOW_MULTIPLE_PREDICATES]
When true, allow multiple predicates, resulting in a cartesian
product. (default=True).
--allow-multiple-objects [ALLOW_MULTIPLE_OBJECTS]
When true, allow multiple objects, resulting in a cartesian product.
(default=True).
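The combination rule described in this section can be sketched with itertools.product. The function combine_roles and its return convention (None when a group is skipped) are assumptions for this illustration, not KGTK's API; the order in which KGTK emits the combinations is not stated in this document.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def combine_roles(
    subjects: Sequence[str],
    predicates: Sequence[str],
    objects: Sequence[str],
    allow_multiple_subjects: bool = True,
    allow_multiple_predicates: bool = True,
    allow_multiple_objects: bool = True,
) -> Optional[List[Triple]]:
    """Return every (subject, predicate, object) combination, or None when
    a disallowed multiple is present and the group is left reified.

    Hypothetical helper for illustration; not part of KGTK.
    """
    checks = (
        (subjects, allow_multiple_subjects),
        (predicates, allow_multiple_predicates),
        (objects, allow_multiple_objects),
    )
    if any(len(values) > 1 and not allowed for values, allowed in checks):
        return None
    return list(product(subjects, predicates, objects))


# Two subjects x one predicate x two objects gives four unreified edges.
triples = combine_roles(["s1", "s2"], ["p"], ["o1", "o2"])

# With multiple subjects disallowed, the same kind of group is skipped.
skipped = combine_roles(["s1", "s2"], ["p"], ["o1"], allow_multiple_subjects=False)
```

Each resulting triple would then receive its own unreified data edge and its own copy of the secondary edges.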
Broken Edges¶
Unless a Cartesian Crossproduct is being generated, kgtk unreify-rdf-statements uses the node1 value of an input reified RDF statement ("XJAABmv8vGfJZZasjV6DAXY:g3") as the edge id of the output unreified RDF edge record, and as the node1 value of the secondary edges. If for some reason there are other edges that refer to this symbol ("XJAABmv8vGfJZZasjV6DAXY:g3") in the label or node2 columns, or in extra columns, they will retain linkage to the unreified edges.
If a Cartesian Crossproduct is being generated, then the node1 value of the input reified RDF statement is used as the base for the id and node1 values used in the generated edges:
XJAABmv8vGfJZZasjV6DAXY:g3-1
XJAABmv8vGfJZZasjV6DAXY:g3-2
...
The width of the suffix is adjusted for the number of crossproduct edges being generated, i.e., if more than 9 edges were being generated, they would use these id and node1 values:
XJAABmv8vGfJZZasjV6DAXY:g3-01
XJAABmv8vGfJZZasjV6DAXY:g3-02
...
XJAABmv8vGfJZZasjV6DAXY:g3-10
...
These generated id and node1 values are designed to keep the generated unreified edges and secondary edges in proximity when sorted by id, whether Cartesian Crossproducts are being generated or not. However, when Cartesian Crossproducts are being generated, the new id and node1 values cannot be as easily linked to external nodes referencing them.
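The suffix scheme above can be reproduced with zero-padded formatting. The sketch below assumes that the width equals the number of decimal digits in the edge count, which is consistent with the two examples shown but is not confirmed by this document; suffixed_ids is a hypothetical name.

```python
from typing import List


def suffixed_ids(base: str, count: int) -> List[str]:
    """Generate base-1 .. base-count, zero-padded to a common width.

    Assumption: width = number of digits in count. Not KGTK's code.
    """
    width = len(str(count))
    return [f"{base}-{n:0{width}d}" for n in range(1, count + 1)]


BASE = "XJAABmv8vGfJZZasjV6DAXY:g3"
two = suffixed_ids(BASE, 2)    # suffixes -1, -2
ten = suffixed_ids(BASE, 10)   # suffixes -01 .. -10

# Padding keeps a plain string sort in numeric order; without it,
# "-10" would sort between "-1" and "-2".
unpadded = sorted(f"{BASE}-{n}" for n in (1, 2, 10))
```

Because every suffix in one group has the same width, sorting the output by id keeps the edges of that group together and in generation order.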
Difference Comparison¶
kgtk unreify-rdf-statements sorts its input data as part of detecting reified RDF statements. Thus, attempting to look for changes between the input file and the output file using an ordinary difference utility is not likely to be fruitful. Instead, employ the following strategy:
- Add an ID column to the input data if it does not already have one, using kgtk add-id
  - Perhaps without generating ID values, to remove clutter: kgtk add-id --id-style=empty
- Sort the resulting data using kgtk unreify-rdf-statements with a disabled pattern match parameter: kgtk unreify-rdf-statements --trigger-label=XXX -o output1.tsv
- Apply kgtk unreify-rdf-statements a second time without disabling the pattern match: kgtk unreify-rdf-statements -o output2.tsv
- Compare the two output files.
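For the final comparison step, any line-based difference tool should work once both files have been through the same sort. As one possibility, the sketch below uses Python's standard difflib module on two small in-memory strings. The file contents are made-up stand-ins for output1.tsv and output2.tsv, not real KGTK output, and diff_kgtk is a hypothetical helper name.

```python
import difflib
from typing import List


def diff_kgtk(before: str, after: str) -> List[str]:
    """Return unified-diff lines between two TSV texts (no context lines)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="output1.tsv",
            tofile="output2.tsv",
            lineterm="",
            n=0,
        )
    )


# Hypothetical contents: output1 was produced with the pattern match
# disabled, output2 with it enabled.
output1 = (
    "node1\tlabel\tnode2\tid\n"
    "x\tknows\ty\t\n"
    "s1\trdf:type\trdf:Statement\t\n"
)
output2 = (
    "node1\tlabel\tnode2\tid\n"
    "x\tknows\ty\t\n"
    "a\tp\tb\ts1\n"
)
changes = diff_kgtk(output1, output2)
for line in changes:
    print(line)
```

Lines starting with a single minus sign are edges that unreification removed, and lines starting with a single plus sign are edges it generated; edges that were not involved do not appear in the diff.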
Usage¶
usage: kgtk unreify-rdf-statements [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
[--reified-file REIFIED_FILE]
[--unreified-file UNREIFIED_FILE]
[--uninvolved-file UNINVOLVED_FILE]
[--trigger-label TRIGGER_LABEL_VALUE]
[--trigger-node2 TRIGGER_NODE2_VALUE]
[--node1-role RDF_SUBJECT_LABEL_VALUE]
[--label-role RDF_PREDICATE_LABEL_VALUE]
[--node2-role RDF_OBJECT_LABEL_VALUE]
[--allow-multiple-subjects [ALLOW_MULTIPLE_SUBJECTS]]
[--allow-multiple-predicates [ALLOW_MULTIPLE_PREDICATES]]
[--allow-multiple-objects [ALLOW_MULTIPLE_OBJECTS]]
[-v [optional True|False]]
Read a KGTK file, such as might have been created by importing an ntriples file. Search for reified RDF statements and transform them into an unreified form.
An ID column will be added to the output file if not present in the input file.
--reified-file PATH, if specified, will get a copy of the input records that were identified as reified RDF statements.
--uninvolved-file PATH, if specified, will get a copy of the input records that were identified as not being reified RDF statements.
--unreified-file PATH, if specified, will get a copy of the unreified output records, which will still be written to the main output file.
Additional options are shown in expert help.
kgtk --expert unreify-rdf-statements --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file with the reified data. (May be
omitted or '-' for stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--reified-file REIFIED_FILE
A KGTK output file that will contain only the reified
RDF statements. (Optional, use '-' for stdout.)
--unreified-file UNREIFIED_FILE
A KGTK output file that will contain only the
unreified RDF statements. (Optional, use '-' for
stdout.)
--uninvolved-file UNINVOLVED_FILE
A KGTK output file that will contain only the
uninvolved input. (Optional, use '-' for stdout.)
--trigger-label TRIGGER_LABEL_VALUE
A value that identifies the trigger label.
(default=rdf:type).
--trigger-node2 TRIGGER_NODE2_VALUE
A value that identifies the trigger node2.
(default=rdf:Statement).
--node1-role RDF_SUBJECT_LABEL_VALUE
The label that identifies the edge with the node2
value that will serve in the node1 role.
(default=rdf:subject).
--label-role RDF_PREDICATE_LABEL_VALUE
The label that identifies the edge with the node2
value that will serve in the label role.
(default=rdf:predicate).
--node2-role RDF_OBJECT_LABEL_VALUE
The label that identifies the edge with the node2
value that will serve in the node2 role.
(default=rdf:object).
--allow-multiple-subjects [ALLOW_MULTIPLE_SUBJECTS]
When true, allow multiple subjects, resulting in a
cartesian product. (default=True).
--allow-multiple-predicates [ALLOW_MULTIPLE_PREDICATES]
When true, allow multiple predicates, resulting in a
cartesian product. (default=True).
--allow-multiple-objects [ALLOW_MULTIPLE_OBJECTS]
When true, allow multiple objects, resulting in a
cartesian product. (default=True).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
Example 1¶
kgtk cat -i examples/docs/unreify-rdf-statements-file1.tsv
node1 | label | node2 |
---|---|---|
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:confidence | XJAABmv8vGfJZZasjV6DAXY:g4 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:justifiedBy | XJAABmv8vGfJZZasjV6DAXY:g5 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:system | nJAABmv8vGfJZZasjV6DAXY-1: |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:object | gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:predicate | nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:subject | gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72 |
XJAABmv8vGfJZZasjV6DAXY:g3 | rdf:type | rdf:Statement |
kgtk unreify-rdf-statements -i examples/docs/unreify-rdf-statements-file1.tsv
node1 | label | node2 | id |
---|---|---|---|
gaia:relations/d3e1e4df-6c8c-4fd1-8b93-ee49ef238f72 | nJAABmv8vGfJZZasjV6DAXY-3:Physical.LocatedNear_Place | gaia:entities/d1dcefce-badf-4948-bfcf-5d33116fa12c | XJAABmv8vGfJZZasjV6DAXY:g3 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:confidence | XJAABmv8vGfJZZasjV6DAXY:g4 | XJAABmv8vGfJZZasjV6DAXY:g3-1 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:justifiedBy | XJAABmv8vGfJZZasjV6DAXY:g5 | XJAABmv8vGfJZZasjV6DAXY:g3-2 |
XJAABmv8vGfJZZasjV6DAXY:g3 | ont:system | nJAABmv8vGfJZZasjV6DAXY-1: | XJAABmv8vGfJZZasjV6DAXY:g3-3 |
Example 2¶
kgtk cat -i examples/docs/unreify-rdf-statements-file2.tsv
node1 | label | node2 |
---|---|---|
_:g2301 | ont:confidence | _:g2302 |
_:g2301 | ont:justifiedBy | _:g2303 |
_:g2301 | ont:system | rpi1: |
_:g2301 | rdf:object | entity:c6f32b90-6038-40c0-97e4-6d3f7fd76c03 |
_:g2301 | rdf:predicate | ldc:Movement.TransportPerson.SelfMotion_Transporter |
_:g2301 | rdf:subject | event:03a41b2b-e0ef-42f9-a192-433e0abc3a70 |
_:g2301 | rdf:type | rdf:Statement |
_:g3910 | ont:confidence | _:g3911 |
_:g3910 | ont:justifiedBy | _:g3912 |
_:g3910 | ont:system | rpi1: |
_:g3910 | rdf:object | entity:fcb78e77-4962-4fca-977b-aea84bfa3ddd |
_:g3910 | rdf:predicate | ldc:Movement.TransportPerson.SelfMotion_Destination |
_:g3910 | rdf:subject | event:03a41b2b-e0ef-42f9-a192-433e0abc3a70 |
_:g3910 | rdf:type | rdf:Statement |
kgtk unreify-rdf-statements -i examples/docs/unreify-rdf-statements-file2.tsv
node1 | label | node2 | id |
---|---|---|---|
event:03a41b2b-e0ef-42f9-a192-433e0abc3a70 | ldc:Movement.TransportPerson.SelfMotion_Transporter | entity:c6f32b90-6038-40c0-97e4-6d3f7fd76c03 | _:g2301 |
_:g2301 | ont:confidence | _:g2302 | _:g2301-1 |
_:g2301 | ont:justifiedBy | _:g2303 | _:g2301-2 |
_:g2301 | ont:system | rpi1: | _:g2301-3 |
event:03a41b2b-e0ef-42f9-a192-433e0abc3a70 | ldc:Movement.TransportPerson.SelfMotion_Destination | entity:fcb78e77-4962-4fca-977b-aea84bfa3ddd | _:g3910 |
_:g3910 | ont:confidence | _:g3911 | _:g3910-1 |
_:g3910 | ont:justifiedBy | _:g3912 | _:g3910-2 |
_:g3910 | ont:system | rpi1: | _:g3910-3 |