ifexists
Overview¶
The ifexists command filters a KGTK file (the input file specified by --input-file
, which defaults to standard input),
passing through only those rows for which one or more specified columns
match records in a second KGTK file (the filter file, specified by --filter-on
).
Note
The kgtk ifnotexists
command computes the inverse output of this command.
Memory Usage Options¶
This implementation of ifexists
is written in Python. By default, it builds
an in-memory dictionary of the key values it finds in the --filter-on
file
before processing the --input-file
in a single pass. Performance will be poor, and execution may
fail, if the --filter-on
file is too large for the key dictionary to fit into main memory.
If the input file is small, the --cache-input
option can be used to tell the code to cache the --input-file
instead of the --filter-on
file.
After cacheing the --input-file
, the code will make a single pass through the --filter-on
file.
If both the --input-file
and the --filter-on
file are too large to hold in
memory, then you should presort the input and filter files on their key
columns using kgtk sort
, followed by using kgtk filter --presorted
to avoid caching
either file.
Output Record Order¶
Normally, input records are passed in order to the output file. However, when
the input file is cached (--cache-input
), the default is for the output
records to ordered by key value (alpha sort), then by input order. If you
wish the output file to retain the input file's order when cacheing the
input file, use the `--preserve-order
option.
Key Fields¶
The names of the fields used match records may be supplied by the user using
the --input-keys
and --filter-keys
option.
Each option may take a variable number of space-separated field names. If keys are not supplied,
the following defaults will be used, which depend on the KGTK file type
(edge or node) of the input and filter files.
Input File Type | Filter File Type | Key fields |
---|---|---|
edge | edge | input.node1 == filter.node1 and |
input.label == filter.label and | ||
input.node2 == filter.node2 | ||
node | node | input.id == filter.id |
edge | node | input.node1 == filter.id |
node | edge | input.id == filter.node1 |
Note
The number of input file keys must match the number of output file keys, after taking into consideration the default keys. So, if you want to match an edge file's node1 value to a nonstandard column in a node file, only the --filter-keys
option needs to be specified.
Optional Output Files¶
The --reject-file
, when specified, will receive any input records
that failed the filter test and were not written to the output file.
The --matched-filter-file
, when specified, will receive a copy of
any filter records that found a match in the input file.
The --unmatched-filter-file
, when specified, will receive a copy of
any filter records that did not find a match in the input file.
Experimental Join Facility¶
The kgtk ifexists
command contains experimental support for performing
a join. The join output file (which may be the primary output file)
will contain the union of the columns found in the --input-file
and the --filter-on
file, and may contain
records from both file. At the present time, please refer to
kgtk --expert ifexists --help
and the KGTK source code files for more
details on this facility.
Usage¶
usage: kgtk ifexists [-h] [-i INPUT_FILE] [--filter-on FILTER_FILE]
[-o OUTPUT_FILE] [--reject-file REJECT_FILE]
[--matched-filter-file MATCHED_FILTER_FILE]
[--unmatched-filter-file UNMATCHED_FILTER_FILE]
[--input-keys [INPUT_KEYS ...]]
[--filter-keys [FILTER_KEYS ...]]
[--cache-input [True|False]]
[--preserve-order [True|False]]
[--presorted [True|False]] [-v [optional True|False]]
Filter a KGTK file based on whether one or more records exist in a second KGTK file with matching values for one or more fields.
Additional options are shown in expert help.
kgtk --expert ifexists --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
--filter-on FILTER_FILE, --filter-file FILTER_FILE
The KGTK file to filter against. (May be omitted or
'-' for stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--reject-file REJECT_FILE
The KGTK file for input records that fail the filter.
(Optional, use '-' for stdout.)
--matched-filter-file MATCHED_FILTER_FILE
The KGTK file for filter records that matched at least
one input record. (Optional, use '-' for stdout.)
--unmatched-filter-file UNMATCHED_FILTER_FILE
The KGTK file for filter records that did not match
any input records. (Optional, use '-' for stdout.)
--input-keys [INPUT_KEYS ...], --left-keys [INPUT_KEYS ...]
The key columns in the file being filtered
(default=None).
--filter-keys [FILTER_KEYS ...], --right-keys [FILTER_KEYS ...]
The key columns in the filter-on file (default=None).
--cache-input [True|False]
Cache the input file instead of the filter keys
(default=False).
--preserve-order [True|False]
Preserve record order when cacheing the input file.
(default=False).
--presorted [True|False]
When True, assume that the input and filter files are
both presorted. Use a merge-style algorithm that does
not require caching either file. (default=False).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
Sample Data¶
Suppose that ifexists-file1.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file1.tsv
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
p1 | peter | title | manager | ||
p2 | peter | zipcode | 12040 | home | |
p3 | peter | zipcode | 12040 | work | 6 |
s1 | steve | title | supervisor | ||
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work | |
j1 | john | title | programmer | ||
j2 | john | zipcode | 12345 | home | 10 |
j2 | john | zipcode | 12346 | ||
k1 | kathy | title | owner | ||
k2 | kathy | zipcode | 12040 | home | |
k3 | kathy | zipcode | 12040 | work | 6 |
Note
This is a KGTK edge file.
Suppose that ifexists-file2.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file2.tsv
node1 | label | node2 |
---|---|---|
peter | zipcode | 12040 |
Note
This is a KGTK edge file.
Suppose that ifexists-file3.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file3.tsv
id |
---|
steve |
john |
Note
This is a KGTK node file.
Suppose that ifexists-file4.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file4.tsv
id |
---|
peter |
john |
Note
This is a KGTK node file.
Suppose that ifexists-file5.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file5.tsv
id |
---|
home |
Note
This is a KGTK node file.
Suppose that ifexists-file6.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file6.tsv --mode NONE
label | node2 |
---|---|
zipcode | 12040 |
zipcode | 45601 |
zipcode | 45601 |
zipcode | 52040 |
zipcode | 62040 |
zipcode | 72040 |
Note
This is not a valid KGTK file, as it does not meet the mandatory column requirements for an edge file nor a node file.
Suppose that ifexists-file7.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/ifexists-file7.tsv
id |
---|
j1 |
s1 |
Note
This is a KGTK node file.
Filter an Edge File on Another Edge File.¶
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file2.tsv
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
p2 | peter | zipcode | 12040 | home | |
p3 | peter | zipcode | 12040 | work | 6 |
Note
Since both the input file and the filter file are KGTK edge files, the default key field comparisons are:
input.node1 == filter.node1 and input.label == filter.label and input.node2 == filter.node2
The id
fields are not part of this comparison (and the
id
field isn't present in examples/docs/ifexists-file2.tsv
).
Filter an Edge File on a Node File¶
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file3.tsv
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
s1 | steve | title | supervisor | ||
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work | |
j1 | john | title | programmer | ||
j2 | john | zipcode | 12345 | home | 10 |
j2 | john | zipcode | 12346 |
Note
Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:
input.node1 == filter.id
Filter a Node File on a Node File¶
kgtk ifexists --input-file examples/docs/ifexists-file4.tsv \
--filter-on examples/docs/ifexists-file3.tsv
id |
---|
john |
Note
Since the input file and the filter files are both KGTK node files, the default key field comparison is:
input.id == filter.id
Filter an Edge File on a Node File Using an Alternate Input Column¶
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file5.tsv \
--input-keys location
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
p2 | peter | zipcode | 12040 | home | |
j2 | john | zipcode | 12345 | home | 10 |
k2 | kathy | zipcode | 12040 | home |
Note
This used the key field comparison:
input.location == filter.id
Filter an Edge File on a Nonstandard File¶
We want to filter a KGTK edge file agains a file that not
a valid KGTK file (it is almost a KGTK edge file, but it is
missing the node1
column). We can use the expert option
--filter-mode NONE
to disable the mandatory column check
on the filter file, then use --input-keys
and --filter-keys
to specify the columns that we want to compare.
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file6.tsv \
--filter-mode NONE \
--input-keys label node2 \
--filter-keys label node2
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
p2 | peter | zipcode | 12040 | home | |
p3 | peter | zipcode | 12040 | work | 6 |
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work | |
k2 | kathy | zipcode | 12040 | home | |
k3 | kathy | zipcode | 12040 | work | 6 |
Note
This used the key field comparison:
input.label == filter.label and input.node2 == filter.node2
Filter an Edge File: Filter Matches¶
We want to filter a KGTK edge file agains a file that not
a valid KGTK file (it is almost a KGTK edge file, but it is
missing the node1
column). We can use the expert option
--filter-mode NONE
to disable the mandatory column check
on the filter file, then use --input-keys
and --filter-keys
to specify the columns that we want to compare.
Furthermore, we want to see which filter records found at least one matching input record, and which filter records did not find a match.
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file6.tsv \
--filter-mode NONE \
--input-keys label node2 \
--filter-keys label node2 \
--matched-filter-file ifexists-matched-filter.tsv \
--unmatched-filter-file ifexists-unmatched-filter.tsv
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
p2 | peter | zipcode | 12040 | home | |
p3 | peter | zipcode | 12040 | work | 6 |
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work | |
k2 | kathy | zipcode | 12040 | home | |
k3 | kathy | zipcode | 12040 | work | 6 |
kgtk cat -i ifexists-matched-filter.tsv --mode NONE
label | node2 |
---|---|
zipcode | 12040 |
zipcode | 45601 |
zipcode | 45601 |
kgtk cat -i ifexists-unmatched-filter.tsv --mode NONE
label | node2 |
---|---|
zipcode | 52040 |
zipcode | 62040 |
zipcode | 72040 |
Note
Since the filter file was missing a mandatory KGTK column (node1
), the
matched and unmatched filter output files are also missing that
column. Thus, the kgtk cat
commands that disply them also need
--mode NONE
.
Filter an Edge File By id¶
This is another example of filtering an input file using an
alternate input file key column. examples/docs/ifexists-file7.tsv
contains a list of edge ids that we want to retain in the output file.
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file7.tsv \
--input-keys id
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
s1 | steve | title | supervisor | ||
j1 | john | title | programmer |
Note
This used the key field comparison:
input.id == filter.id
Filter an Edge File By id: Reject File¶
This is another example of filtering an input file using an
alternate input file key column. examples/docs/ifexists-file7.tsv
contains a list of edge ids that we want to retain in the output file.
However, we also want to obtain the records that were rejected to
check that the records that were rejected match our expectations, or
perhaps to apply different processing to them.
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file7.tsv \
--reject-file ifexists-rejects.tsv \
--input-keys id
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
s1 | steve | title | supervisor | ||
j1 | john | title | programmer |
kgtk cat -i ifexists-rejects.tsv
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
p1 | peter | title | manager | ||
p2 | peter | zipcode | 12040 | home | |
p3 | peter | zipcode | 12040 | work | 6 |
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work | |
j2 | john | zipcode | 12345 | home | 10 |
j2 | john | zipcode | 12346 | ||
k1 | kathy | title | owner | ||
k2 | kathy | zipcode | 12040 | home | |
k3 | kathy | zipcode | 12040 | work | 6 |
Note
If the intent of the filter was to separate all title records by edge
ID, then the reject file shows that id p1
was omitted from the filter file.
Filter a Small Input File on a Large Filter File¶
Although the example data files are very small, this example command shows how to filter a small input file against a large filter file:
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file3.tsv \
--cache-input
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
j1 | john | title | programmer | ||
j2 | john | zipcode | 12345 | home | 10 |
j2 | john | zipcode | 12346 | ||
s1 | steve | title | supervisor | ||
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work |
Note
Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:
input.node1 == filter.id
Because we are cacheing the input file, the output edges have been reordered by the input key, then by order.
Filter a Small Input File on a Large Filter File, Preserving Order¶
Although the example data files are very small, this example command shows how to filter a small input file against a large filter file, preserving the input file's order:
kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
--filter-on examples/docs/ifexists-file3.tsv \
--cache-input --preserve-order
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
s1 | steve | title | supervisor | ||
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work | |
j1 | john | title | programmer | ||
j2 | john | zipcode | 12345 | home | 10 |
j2 | john | zipcode | 12346 |
Note
Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:
input.node1 == filter.id
The output edges appear in the same order as the input edges.
Filter a Large Input File on a Large Filter File¶
Although the example data files are very small, this example command shows how to filter a large input file against a large filter file by sorting the two files.
We will explicitly tell the kgtk sort
command which
columns to sort on.
kgtk sort --input-file examples/docs/ifexists-file1.tsv \
--output-file ifexists-file1-sorted-by-node1.tsv \
--column node1
kgtk sort --input-file examples/docs/ifexists-file3.tsv \
--output-file ifexists-file3-sorted-by-id.tsv \
--column id
kgtk ifexists --input-file ifexists-file1-sorted-by-node1.tsv \
--filter-on ifexists-file3-sorted-by-id.tsv \
--presorted
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
j1 | john | title | programmer | ||
j2 | john | zipcode | 12345 | home | 10 |
j2 | john | zipcode | 12346 | ||
s1 | steve | title | supervisor | ||
s2 | steve | zipcode | 45601 | 3 | |
s3 | steve | zipcode | 45601 | work |
Note
Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:
input.node1 == filter.id