join
Overview¶
Join two KGTK edge files or two KGTK node files.
Summary¶
- The output file contains the union of the columns in the two input files, adjusted for predefined name aliasing.
- Specify
--left-join
to get a left outer join. - Specify
--right-join
to get a right outer join. - Specify both to get a full outer join (equivalent to
kgtk cat
). - Specify neither to get an inner join.
- By default, node files are joined on the id column, while edge files are joined on the node1 column. The label and node2 columns may be added to the edge file join criteria. Alternatively, the left and right file join columns may be listed explicitly.
- Join keys are extracted from one or both input files and stored in memory, then the data files are processed in a second pass. Performance will be poor, and execution may fail, if the files are very large.
- stdin will not work as an input file if join keys are needed from it.
Uses for Join¶
-
to assemble edges from different files, building an enriched set of relationships for selected subjects (
node1
values). -
to assemble additional columns from different files, building an enriched set of additional columns for selected edges or nodes.
-
to select a smaller set of edges (or nodes) from a larger set, to improve the performance of future processing steps by eliminating unwanted edges (or nodes).
-
to perform a logical (set) operation on two sets of edges or nodes.
- intersection (inner join, defined below)
- union (outer join, defined below; also,
kgtk cat
)
Left and Right Input Files¶
kgtk join
takes two KGTK edge or node files (called left
and right
, in a nod to relational
database terminology) and produces a new KGTK edge or node file that contains
a selection of the edges or nodes from the two input files.
The following combinations are accepted:
- both input files are KGTK edge files, or
- both input files are KGTK node files, or,
- at least one input file is a quasi-KGTK file (indicated by the expert options
--mode=NONE
,--left-mode=NONE
, and--right-mode=NONE
).
Join Types¶
Following the terminology used for relational databases:
Join Type | Indicated by | Description |
---|---|---|
inner join | The output file contains only left and right edges or nodes that satisfy the matching criteria discussed below. This is equivalend to a logical intersection. | |
left join | --left-join |
The output file contains all of the left edges or nodes, and only the right edges or nodes that satisfy the matching criteria. |
right join | --right-join |
The output file contains all of the right edges or nodes, and only the left edges or nodes that satisfy the matching criteria. |
outer join | --left-join --right-join |
The output file contains all edges or nodes from both input files. This is equivalend to a logical union, or kgtk cat |
Duplicate Edges or Nodes¶
If either input file contains duplicate edges or nodes, or the two input files contain copies of the edge or node, and these edges or nodes are selected for inclusion in the output file, then the output file will contain duplicate edges or nodes.
The command kgtk compact
can be used to remove duplicate
edges or nodes from the output file.
Joining Normalized Edges¶
A normalized KGTK edge file contains only the following columns:
node1
(or its alias)label
(or its alias)node2
(or its alias)- optionally
id
(or its alias)
The label
column typically expresses a relationship between the values
(symbols, strings, numbers, etc.) in the node1
and node2
columns.
If the two input files to kgtk join
contain only normalized edges then the output file will contain only normalized edges.
The Optional id
Column¶
The id
columns is optional in KGTK edge files. It is required in KGTK node files.
-
If either input file contains an
id
column, then the output file will contain anid
column. -
If an
id
column is missing from an input file, the corresponding edges in the output file will have emptyid
values. -
Existing
id
column values will not be checked for duplication.
Note
See kgtk add-id
if you wish to generate new id
column values or
manipulate existing id
column values.
kgtk calc
may also be useful for modifying id
column values.
Joining Edge Files on node1
Alone¶
By default, kgtk join
matches the node1
column (or its alias) in the left
input edge
file to the node1
column (or its alias) in the right
input edge file. The goal is to
gather relationships, expressed by the label
and node2
columns, pertaining to
specific node1
objects.
Left input files:
node1 | label | node2 |
---|---|---|
block1 | isa | block |
block1 | color | red |
Right input file:
node1 | label | node2 |
---|---|---|
block1 | isa | block |
block1 | size | large |
Output file after joining on node1
:
node1 | label | node2 |
---|---|---|
block1 | isa | block |
block1 | color | red |
block1 | isa | block |
block1 | size | large |
The left and right edges will remain distinct in the output file,
leading to duplicate edges (block1 isa block
).
kgtk compact
may be used to remove duplicate
edges after the join. The output would be:
node1 | label | node2 |
---|---|---|
block1 | color | red |
block1 | isa | block |
block1 | size | large |
After kgtk join
, kgtk lift
may be used to
build additional columns with the assembled information, such
as node1;color
and node1;size
. The result might an edge
file with denormalzed edges (see below).
After kgtk lift
:
node1 | label | node2 | node1;color | node1;size |
---|---|---|---|---|
block1 | isa | block | red | large |
Joining with Additional Columns (Denormalized Edges)¶
A denormalized KGTK edge file contains additional columns beyond (node1
, label
, node2
, and id
).
When one or both of the input edge file are denormalized, the output file will also be denormalized. The output file will contain the union of the additional columns from the input files, with empty values in the additional columns when an input file does not contain a corresponding nonempty value.
Left input file:
node1 | label | node2 | color |
---|---|---|---|
block1 | isa | block | |
block1 | type | cube | red |
Right input file:
node1 | label | node2 | size |
---|---|---|---|
block1 | isa | block | |
block1 | type | cube | large |
Output file after joining on node1
:
node1 | label | node2 | color | size |
---|---|---|---|---|
block1 | isa | block | ||
block1 | type | cube | red | |
block1 | isa | block | ||
block1 | type | cube | large |
kgtk compact
may be used to remove duplicate edges
edges and compact the additional columns after the join:
node1 | label | node2 | color | size |
---|---|---|---|---|
block1 | isa | block | ||
block1 | type | cube | red | large |
Prefixing Additional Columns¶
--left-prefix LEFT_PREFIX
and --right-prefix RIGHT_PREFIX
provide optional
prefixes for the additional columns in the output field.
Without prefixing, additional columns with the same name in the left and right input
files become a single column in the output file:
Left input file:
node1 | label | node2 | color |
---|---|---|---|
block1 | isa | block | |
block1 | type | cube | red |
Right input file:
node1 | label | node2 | color |
---|---|---|---|
block1 | isa | block | |
block1 | type | cube | green |
Output file after joining without prefixing:
node1 | label | node2 | color |
---|---|---|---|
block1 | isa | block | |
block1 | type | cube | red |
block1 | isa | block | |
block1 | type | cube | green |
Output file after joining on node1
with --left-prefix=l- --right-prefix=r-
:
node1 | label | node2 | l-color | r-color |
---|---|---|---|---|
block1 | isa | block | ||
block1 | type | cube | red | |
block1 | isa | block | ||
block1 | type | cube | green |
Joining Edges on node1
, label
, node2
, and optionally id
¶
Another use for kgtk join
is to assemble a set of additional columns for
edges with entries in two different files. Joining on edge identity can be
requested by specifying --join-on-label --join-on-node2
. If the id
column is significant for edge identity, --join-on-id
can also be specified.
Left input file:
node1 | label | node2 | color |
---|---|---|---|
block1 | material | wood | |
block1 | type | cube | red |
block2 | material | steel | |
block2 | type | cube | blue |
Right input file:
node1 | label | node2 | size |
---|---|---|---|
block1 | type | cube | large |
block3 | type | cube | small |
Output file after an inner join on (node1
, label
, node2
):
node1 | label | node2 | color | size |
---|---|---|---|---|
block1 | type | cube | red | |
block1 | type | cube | large |
kgtk compact
can then be used to
compact the additional columns after the join:
node1 | label | node2 | color | size |
---|---|---|---|---|
block1 | type | cube | red | large |
Joining Node Files: Set Intersection¶
Node files may be joined to perform a set intersection using the default
join type (omit --left-join
and --right-join
).
Left input file:
id |
---|
block1 |
block2 |
Right input file:
id |
---|
block1 |
block3 |
Output file after the inner join:
id |
---|
block1 |
Joining Node Files: Set Union¶
Node files may be joined to perform a set union. Use --left-join --right-join
to
perform an outer join.
Left input file:
id |
---|
block1 |
block2 |
Right input file:
id |
---|
block1 |
block3 |
Output file after the outer join:
id |
---|
block1 |
block2 |
block1 |
block3 |
After using kgtk compact
to remove duplicates:
id |
---|
block1 |
block2 |
block3 |
Joining Node Files: Merging Additional Columns¶
Node files may be joined to merge the additional columns found in the two input files.
Left input file:
id | shape |
---|---|
block1 | cube |
Right input file:
id | color |
---|---|
block1 | red |
Output file:
id | shape | color |
---|---|---|
block1 | cube | |
block1 | red |
After using kgtk compact
to compact the entries:
id | shape | color |
---|---|---|
block1 | cube | red |
Joining on Arbitrary Columns¶
--left-file-join-columns LEFT_JOIN_COLUMNS ...
and --right-file-join-columns RIGHT_JOIN_COLUMNS ...
may be used to join on arbitrary sets of left and right file columns. It is necessary to use these
options when processing quasi-KGTK files (--mode=NONE
, --left-mode=NONE
, --right-mode=NONE
).
Memory Usage and stdin Limitation¶
kgtk join
builds in-memory key sets from one or both input files. This may cause performance
problems if the join requires more memory than is available.
Join Type | Input File Key Sets |
---|---|
Inner join | Key sets are built in memory for both the left and right input files. |
Left join | A key set is built in memory for the left input file. |
Right join | A key set is built in memory for the right input file. |
Outer join | No key sets are built in memory. |
Note
kgtk join
builds in-memory key sets from one or both input files. Standard input (stdin)
from a pipe may not be used for an input file for which a key set is built in memory.
Note
kgtk ifexists
provides an experimental join mode that works with
presorted input files and uses much less memory. In the future, kgtk join
may offer support for
presorted input files.
Bending the Rules¶
To join an edge file to a node file, or to join quasi-KGTK files, use the following options (enable expert mode for usage information):
Option | Description |
---|---|
--left-mode=NONE |
Treat the left input file as a quasi-KGTK file. --left-file-join-columns must be supplied. |
--mode=NONE |
Treat both input files as quasi-KGTK files. Both --left-file-join-columns and --right-file-join-columns must be supplied. |
--right-mode=NONE |
Treat the right input file as a quasi-KGTK file. --right-file-join-columns must be supplied. |
kgtk join / compact
¶
The pipeline kgtk join / compact
is a frequent idiom to remove duplicates and
compact additional columns in the output from a join operation.
Info
In the future, kgtk join --compact
will run kgtk compact
automatically
on the output of kgtk join
. This will reduce the
number of command options that may need to be specified. It may increase
performance as well. Issue #116.
Usage¶
usage: kgtk join [-h] [--left-file LEFT_FILE] [--right-file RIGHT_FILE]
[-o OUTPUT_FILE] [--join-on-id [JOIN_ON_ID]]
[--join-on-label [JOIN_ON_LABEL]]
[--join-on-node2 [JOIN_ON_NODE2]] [--left-prefix LEFT_PREFIX]
[--left-file-join-columns LEFT_JOIN_COLUMNS [LEFT_JOIN_COLUMNS ...]]
[--left-join [LEFT_JOIN]] [--right-prefix RIGHT_PREFIX]
[--right-file-join-columns RIGHT_JOIN_COLUMNS [RIGHT_JOIN_COLUMNS ...]]
[--right-join [RIGHT_JOIN]] [-v [optional True|False]]
Join two KGTK edge files or two KGTK node files.
Join keys are extracted from one or both input files and stored in memory,
then the data files are processed in a second pass. stdin will not work as an
input file if join keys are needed from it.
The output file contains the union of the columns in the two
input files, adjusted for predefined name aliasing.
Specify --left-join to get a left outer join.
The output file will contain all records from the
left input file, along with records from the right
input file with matching join column values.
Specify --right-join to get a right outer join.
The output file will contain all records from the
right input file, along with records from the left
input file with matching join column values.
Specify both --left-join and --right-join to get a full outer
join (equivalent to cat or set union).
The output file will contain all records from both
the left input file and the right input file.
Specify neither --left-join nor --right-join to get an inner
join. If there are no columns beyond the join columns, then
this is equivalent to set intersection.
The output file will contain records from the left
input file and from the right input file for which
the join column value match.
By default, node files are joined on the id column, while edge files are joined
on the node1 column. The label and node2 columns may be added to the edge file
join criteria. Alternatively, the left and right file join columns may be
listed explicitly.
To join an edge file to a node file, or to join quasi-KGTK files, use the
following option (enable expert mode for more information):
--mode=NONE
Expert mode provides additional command arguments.
optional arguments:
-h, --help show this help message and exit
--left-file LEFT_FILE
The left-side KGTK file to join (required). (May be
omitted or '-' for stdin.)
--right-file RIGHT_FILE
The right-side KGTK file to join (required). (May be
omitted or '-' for stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--join-on-id [JOIN_ON_ID]
If both input files are edge files, include the id
column in the join (default=False).
--join-on-label [JOIN_ON_LABEL]
If both input files are edge files, include the label
column in the join (default=False).
--join-on-node2 [JOIN_ON_NODE2]
If both input files are edge files, include the node2
column in the join (default=False).
--left-prefix LEFT_PREFIX
An optional prefix applied to left file column names
in the output file (default=None).
--left-file-join-columns LEFT_JOIN_COLUMNS [LEFT_JOIN_COLUMNS ...]
Left file join columns (default=None).
--left-join [LEFT_JOIN]
Perform a left outer join (default=False).
--right-prefix RIGHT_PREFIX, --prefix RIGHT_PREFIX
An optional prefix applied to right file column names
in the output file (default=None).
--right-file-join-columns RIGHT_JOIN_COLUMNS [RIGHT_JOIN_COLUMNS ...]
Right file join columns (default=None).
--right-join [RIGHT_JOIN]
Perform a right outer join (default=False).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
Normalized Sample Data¶
Suppose that file1.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/join-file1.tsv
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
peter | zipcode | 12040 |
steve | zipcode | 45601 |
and file2.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/join-file2.tsv
node1 | label | node2 |
---|---|---|
john | position | programmer |
peter | position | engineer |
edward | position | supervisor |
Do an inner join on two KGTK normalized edge files on node1
.¶
The output will contain edges from the left and right files
got only those node1
values that appear in both files.
kgtk join --left-file examples/docs/join-file1.tsv \
--right-file examples/docs/join-file2.tsv
The result will be the following table in KGTK format:
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
peter | zipcode | 12040 |
john | position | programmer |
peter | position | engineer |
Do a left outer join on two KGTK normalized edge files on node1
.¶
The output will contain all edges from the left file,
and any edges from the right file with a node1
value that
matches at least one edge in the left file.
kgtk join --left-join \
--left-file examples/docs/join-file1.tsv \
--right-file examples/docs/join-file2.tsv
The result will be the following table in KGTK format:
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
peter | zipcode | 12040 |
steve | zipcode | 45601 |
john | position | programmer |
peter | position | engineer |
Do a right outer join on two KGTK normalized edge files on node1
.¶
The output will contain all edges from the right file,
and any edges from the left file with a node1
value that
matches at least one edge in the right file.
kgtk join --right-join \
--left-file examples/docs/join-file1.tsv \
--right-file examples/docs/join-file2.tsv
The result will be the following table in KGTK format:
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
peter | zipcode | 12040 |
john | position | programmer |
peter | position | engineer |
edward | position | supervisor |
Do a full outer join on two KGTK normalized edge files on node1
.¶
This produces the same output as the kgtk cat
command,
and is included here for completeness.
kgtk join --left-join --right-join \
--left-file examples/docs/join-file1.tsv \
--right-file examples/docs/join-file2.tsv
The result will be the following table in KGTK format:
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
peter | zipcode | 12040 |
steve | zipcode | 45601 |
john | position | programmer |
peter | position | engineer |
edward | position | supervisor |
Denormalized Sample Data¶
Suppose that file3.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/join-file3.tsv
node1 | label | node2 | location |
---|---|---|---|
john | zipcode | 12345 | home |
john | zipcode | 12346 | work |
peter | zipcode | 12040 | home |
peter | zipcode | 12040 | work |
steve | zipcode | 45601 | home |
steve | zipcode | 45601 | work |
and file4.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/join-file4.tsv
node1 | label | node2 | years |
---|---|---|---|
john | position | programmer | 3 |
peter | position | engineer | 2 |
edward | position | supervisor | 10 |
john | laptop | dell | 4 |
peter | laptop | apple | 7 |
Do an inner join on two KGTK denormalized edge files on node1
.¶
The output will contain edges from the left and right files
got only those node1
values that appear in both files.
kgtk join --left-file examples/docs/join-file3.tsv \
--right-file examples/docs/join-file4.tsv
The result will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | |
john | zipcode | 12346 | work | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | work | |
john | position | programmer | 3 | |
peter | position | engineer | 2 | |
john | laptop | dell | 4 | |
peter | laptop | apple | 7 |
Do a left outer join on two KGTK denormalized edge files on node1
.¶
The output will contain all edges from the left file,
and any edges from the right file with a node1
value that
matches at least one edge in the left file.
kgtk join --left-join \
--left-file examples/docs/join-file3.tsv \
--right-file examples/docs/join-file4.tsv
The result will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | |
john | zipcode | 12346 | work | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | work | |
steve | zipcode | 45601 | home | |
steve | zipcode | 45601 | work | |
john | position | programmer | 3 | |
peter | position | engineer | 2 | |
john | laptop | dell | 4 | |
peter | laptop | apple | 7 |
Do a right outer join on two KGTK denormalized edge files on node1
.¶
The output will contain all edges from the right file,
and any edges from the left file with a node1
value that
matches at least one edge in the right file.
kgtk join --right-join \
--left-file examples/docs/join-file3.tsv \
--right-file examples/docs/join-file4.tsv
The result will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | |
john | zipcode | 12346 | work | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | work | |
john | position | programmer | 3 | |
peter | position | engineer | 2 | |
edward | position | supervisor | 10 | |
john | laptop | dell | 4 | |
peter | laptop | apple | 7 |
Do a full outer join on two KGTK denormalized edge files on node1
.¶
This produces the same output as the kgtk cat
command,
and is included here for completeness.
kgtk join --left-join --right-join \
--left-file examples/docs/join-file3.tsv \
--right-file examples/docs/join-file4.tsv
The result will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | |
john | zipcode | 12346 | work | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | work | |
steve | zipcode | 45601 | home | |
steve | zipcode | 45601 | work | |
john | position | programmer | 3 | |
peter | position | engineer | 2 | |
edward | position | supervisor | 10 | |
john | laptop | dell | 4 | |
peter | laptop | apple | 7 |