ifexists

Overview

The ifexists command filters a KGTK file (the input file specified by --input-file, which defaults to standard input), passing through only those rows for which one or more specified columns match records in a second KGTK file (the filter file, specified by --filter-on).

Note

The kgtk ifnotexists command computes the inverse output of this command.

Memory Usage Options

This implementation of ifexists is written in Python. By default, it builds an in-memory dictionary of the key values it finds in the --filter-on file before processing the --input-file in a single pass. Performance will be poor, and execution may fail, if the --filter-on file is too large for the key dictionary to fit into main memory.
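
As a rough illustration of the default strategy described above, the following Python sketch builds a set of key tuples from the filter rows and then streams the input rows once. It is not the KGTK source code: the function name `ifexists_default`, the representation of rows as dictionaries, and the sample rows are assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator, Sequence, Set, Tuple

Row = Dict[str, str]
Key = Tuple[str, ...]


def ifexists_default(input_rows: Iterable[Row],
                     filter_rows: Iterable[Row],
                     input_keys: Sequence[str],
                     filter_keys: Sequence[str]) -> Iterator[Row]:
    """Yield the input rows whose key values also occur in the filter rows."""
    # Pass 1: keep only the filter file's key tuples in memory.
    wanted: Set[Key] = {tuple(row[k] for k in filter_keys) for row in filter_rows}
    # Pass 2: stream the input once; matching rows keep their input order.
    for row in input_rows:
        if tuple(row[k] for k in input_keys) in wanted:
            yield row


# Hypothetical sample rows, loosely modeled on the example files used later.
edges = [
    {"id": "p1", "node1": "peter", "label": "title", "node2": "manager"},
    {"id": "s1", "node1": "steve", "label": "title", "node2": "supervisor"},
    {"id": "j1", "node1": "john", "label": "title", "node2": "programmer"},
]
nodes = [{"id": "steve"}, {"id": "john"}]

kept = list(ifexists_default(edges, nodes, ["node1"], ["id"]))
print([row["id"] for row in kept])  # ['s1', 'j1']
```

Only the key tuples of the filter rows are held in memory in this sketch, which is why the size of the filter file, rather than the input file, is the limiting factor.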

If the input file is small, the --cache-input option can be used to tell the code to cache the --input-file instead of the --filter-on file. After caching the --input-file, the code will make a single pass through the --filter-on file.

If both the --input-file and the --filter-on file are too large to hold in memory, then you should presort the input and filter files on their key columns using kgtk sort, then run kgtk ifexists --presorted to avoid caching either file.
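
A merge-style pass over two presorted files can be sketched as follows. This is an illustration of the general technique, not the KGTK implementation: the function name `ifexists_presorted` and the sample rows are hypothetical, and the sketch assumes both sequences are sorted in ascending order under Python's own tuple comparison.

```python
from typing import Dict, Iterable, Iterator, Optional, Sequence, Tuple

Row = Dict[str, str]
Key = Tuple[str, ...]


def ifexists_presorted(input_rows: Iterable[Row],
                       filter_rows: Iterable[Row],
                       input_keys: Sequence[str],
                       filter_keys: Sequence[str]) -> Iterator[Row]:
    """Yield matching input rows; both inputs must be sorted by their keys."""
    filter_iter = iter(filter_rows)
    current: Optional[Row] = next(filter_iter, None)
    for row in input_rows:
        key: Key = tuple(row[k] for k in input_keys)
        # Advance the filter side until its key is no longer behind the input key.
        while current is not None and tuple(current[k] for k in filter_keys) < key:
            current = next(filter_iter, None)
        if current is None:
            break  # The filter is exhausted, so no later input row can match.
        if tuple(current[k] for k in filter_keys) == key:
            yield row


# Hypothetical rows, already sorted on node1 and id respectively.
sorted_edges = [
    {"id": "j1", "node1": "john"},
    {"id": "j2", "node1": "john"},
    {"id": "k1", "node1": "kathy"},
    {"id": "p1", "node1": "peter"},
    {"id": "s1", "node1": "steve"},
]
sorted_nodes = [{"id": "john"}, {"id": "steve"}]

merged = list(ifexists_presorted(sorted_edges, sorted_nodes, ["node1"], ["id"]))
print([row["id"] for row in merged])  # ['j1', 'j2', 's1']
```

Each side is read once and only one filter row is held at a time, so memory use does not grow with the size of either file.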

Output Record Order

Normally, input records are passed in order to the output file. However, when the input file is cached (--cache-input), the default is for the output records to be ordered by key value (alpha sort), then by input order. If you wish the output file to retain the input file's order when caching the input file, use the --preserve-order option.
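
The effect of caching the input on record order can be sketched in Python. This is a simplified model under stated assumptions, not the KGTK source: the function name `ifexists_cache_input`, the `preserve_order` parameter, and the sample rows are hypothetical stand-ins for the behavior described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Row = Dict[str, str]
Key = Tuple[str, ...]


def ifexists_cache_input(input_rows: Iterable[Row],
                         filter_rows: Iterable[Row],
                         input_keys: Sequence[str],
                         filter_keys: Sequence[str],
                         preserve_order: bool = False) -> List[Row]:
    """Cache the input rows by key, then make one pass over the filter rows."""
    # Cache every input row under its key, remembering its input position.
    cache: Dict[Key, List[Tuple[int, Row]]] = defaultdict(list)
    for position, row in enumerate(input_rows):
        cache[tuple(row[k] for k in input_keys)].append((position, row))

    # Single pass through the filter rows, noting which cached keys matched.
    matched: Set[Key] = set()
    for row in filter_rows:
        key = tuple(row[k] for k in filter_keys)
        if key in cache:
            matched.add(key)

    # Default order: by key value, then by input order within each key.
    hits = [item for key in sorted(matched) for item in cache[key]]
    if preserve_order:
        # Restore the original input order using the remembered positions.
        hits.sort(key=lambda item: item[0])
    return [row for _, row in hits]


# Hypothetical sample rows.
edges = [
    {"id": "s1", "node1": "steve"},
    {"id": "j1", "node1": "john"},
    {"id": "s2", "node1": "steve"},
    {"id": "p1", "node1": "peter"},
]
nodes = [{"id": "steve"}, {"id": "john"}]

by_key = ifexists_cache_input(edges, nodes, ["node1"], ["id"])
in_order = ifexists_cache_input(edges, nodes, ["node1"], ["id"], preserve_order=True)
print([row["id"] for row in by_key])    # ['j1', 's1', 's2']
print([row["id"] for row in in_order])  # ['s1', 'j1', 's2']
```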

Key Fields

The names of the fields used to match records may be supplied by the user using the --input-keys and --filter-keys options. Each option may take a variable number of space-separated field names. If keys are not supplied, the following defaults will be used, which depend on the KGTK file type (edge or node) of the input and filter files.

| Input File Type | Filter File Type | Key fields |
| --- | --- | --- |
| edge | edge | input.node1 == filter.node1 and input.label == filter.label and input.node2 == filter.node2 |
| node | node | input.id == filter.id |
| edge | node | input.node1 == filter.id |
| node | edge | input.id == filter.node1 |

Note

The number of input file keys must match the number of filter file keys, after taking into consideration the default keys. So, if you want to match an edge file's node1 value to a nonstandard column in a node file, only the --filter-keys option needs to be specified.
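
The default key rules and the key-count check can be modeled with a small lookup table. The sketch below is an assumption-laden illustration rather than KGTK's actual logic: `DEFAULT_KEYS` and `resolve_keys` are hypothetical names, and the way a user-supplied key list overrides only its own side is inferred from the note above.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Default key columns, indexed by (input file type, filter file type).
DEFAULT_KEYS: Dict[Tuple[str, str], Tuple[List[str], List[str]]] = {
    ("edge", "edge"): (["node1", "label", "node2"], ["node1", "label", "node2"]),
    ("node", "node"): (["id"], ["id"]),
    ("edge", "node"): (["node1"], ["id"]),
    ("node", "edge"): (["id"], ["node1"]),
}


def resolve_keys(input_type: str,
                 filter_type: str,
                 input_keys: Optional[Sequence[str]] = None,
                 filter_keys: Optional[Sequence[str]] = None
                 ) -> Tuple[List[str], List[str]]:
    """Fill in default key columns, then check that both sides have equal length."""
    default_input, default_filter = DEFAULT_KEYS[(input_type, filter_type)]
    resolved_input = list(input_keys) if input_keys else list(default_input)
    resolved_filter = list(filter_keys) if filter_keys else list(default_filter)
    if len(resolved_input) != len(resolved_filter):
        raise ValueError(
            f"{len(resolved_input)} input keys but {len(resolved_filter)} filter keys"
        )
    return resolved_input, resolved_filter


# An edge file matched to a nonstandard column of a node file:
# only the filter side is overridden ("name" is a made-up column).
print(resolve_keys("edge", "node", filter_keys=["name"]))  # (['node1'], ['name'])
```

Under this model, supplying a single input key for an edge-to-edge comparison would raise an error, because the filter side would still default to three keys.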

Optional Output Files

The --reject-file, when specified, will receive any input records that failed the filter test and were not written to the output file.

The --matched-filter-file, when specified, will receive a copy of any filter records that found a match in the input file.

The --unmatched-filter-file, when specified, will receive a copy of any filter records that did not find a match in the input file.
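
The relationship between the four possible outputs can be sketched as a single partitioning step. This is a hedged illustration, not the KGTK implementation: the function name `partition` and the sample rows are hypothetical, and for simplicity the sketch holds all filter rows in memory.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Row = Dict[str, str]
Key = Tuple[str, ...]


def partition(input_rows: Iterable[Row],
              filter_rows: Iterable[Row],
              input_keys: Sequence[str],
              filter_keys: Sequence[str]
              ) -> Tuple[List[Row], List[Row], List[Row], List[Row]]:
    """Return (output, rejects, matched filter rows, unmatched filter rows)."""
    cached_filter = list(filter_rows)
    wanted: Set[Key] = {tuple(r[k] for k in filter_keys) for r in cached_filter}

    output: List[Row] = []
    rejects: List[Row] = []
    seen: Set[Key] = set()  # Filter keys that matched at least one input row.
    for row in input_rows:
        key = tuple(row[k] for k in input_keys)
        if key in wanted:
            output.append(row)
            seen.add(key)
        else:
            rejects.append(row)

    # Every filter row is copied to exactly one of the two filter outputs.
    matched = [r for r in cached_filter if tuple(r[k] for k in filter_keys) in seen]
    unmatched = [r for r in cached_filter if tuple(r[k] for k in filter_keys) not in seen]
    return output, rejects, matched, unmatched


# Hypothetical sample rows.
edges = [
    {"id": "p2", "label": "zipcode", "node2": "12040"},
    {"id": "s2", "label": "zipcode", "node2": "45601"},
    {"id": "j2", "label": "zipcode", "node2": "12345"},
]
filters = [
    {"label": "zipcode", "node2": "12040"},
    {"label": "zipcode", "node2": "45601"},
    {"label": "zipcode", "node2": "45601"},
    {"label": "zipcode", "node2": "52040"},
]

keys = ["label", "node2"]
output, rejects, matched, unmatched = partition(edges, filters, keys, keys)
print(len(output), len(rejects), len(matched), len(unmatched))  # 2 1 3 1
```

Note that a duplicated filter row appears once per copy in the matched output, since each filter record is classified individually.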

Experimental Join Facility

The kgtk ifexists command contains experimental support for performing a join. The join output file (which may be the primary output file) will contain the union of the columns found in the --input-file and the --filter-on file, and may contain records from both files. At the present time, please refer to kgtk --expert ifexists --help and the KGTK source code files for more details on this facility.

Usage

usage: kgtk ifexists [-h] [-i INPUT_FILE] [--filter-on FILTER_FILE]
                     [-o OUTPUT_FILE] [--reject-file REJECT_FILE]
                     [--matched-filter-file MATCHED_FILTER_FILE]
                     [--unmatched-filter-file UNMATCHED_FILTER_FILE]
                     [--input-keys [INPUT_KEYS ...]]
                     [--filter-keys [FILTER_KEYS ...]]
                     [--cache-input [True|False]]
                     [--preserve-order [True|False]]
                     [--presorted [True|False]] [-v [optional True|False]]

Filter a KGTK file based on whether one or more records exist in a second KGTK file with matching values for one or more fields.

Additional options are shown in expert help.
kgtk --expert ifexists --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  --filter-on FILTER_FILE, --filter-file FILTER_FILE
                        The KGTK file to filter against. (May be omitted or
                        '-' for stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  --reject-file REJECT_FILE
                        The KGTK file for input records that fail the filter.
                        (Optional, use '-' for stdout.)
  --matched-filter-file MATCHED_FILTER_FILE
                        The KGTK file for filter records that matched at least
                        one input record. (Optional, use '-' for stdout.)
  --unmatched-filter-file UNMATCHED_FILTER_FILE
                        The KGTK file for filter records that did not match
                        any input records. (Optional, use '-' for stdout.)
  --input-keys [INPUT_KEYS ...], --left-keys [INPUT_KEYS ...]
                        The key columns in the file being filtered
                        (default=None).
  --filter-keys [FILTER_KEYS ...], --right-keys [FILTER_KEYS ...]
                        The key columns in the filter-on file (default=None).
  --cache-input [True|False]
                        Cache the input file instead of the filter keys
                        (default=False).
  --preserve-order [True|False]
                        Preserve record order when cacheing the input file.
                        (default=False).
  --presorted [True|False]
                        When True, assume that the input and filter files are
                        both presorted. Use a merge-style algorithm that does
                        not require caching either file. (default=False).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

Sample Data

Suppose that ifexists-file1.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file1.tsv
id node1 label node2 location years
p1 peter title manager
p2 peter zipcode 12040 home
p3 peter zipcode 12040 work 6
s1 steve title supervisor
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work
j1 john title programmer
j2 john zipcode 12345 home 10
j2 john zipcode 12346
k1 kathy title owner
k2 kathy zipcode 12040 home
k3 kathy zipcode 12040 work 6

Note

This is a KGTK edge file.

Suppose that ifexists-file2.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file2.tsv
node1 label node2
peter zipcode 12040

Note

This is a KGTK edge file.

Suppose that ifexists-file3.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file3.tsv
id
steve
john

Note

This is a KGTK node file.

Suppose that ifexists-file4.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file4.tsv
id
peter
john

Note

This is a KGTK node file.

Suppose that ifexists-file5.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file5.tsv
id
home

Note

This is a KGTK node file.

Suppose that ifexists-file6.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file6.tsv --mode NONE
label node2
zipcode 12040
zipcode 45601
zipcode 45601
zipcode 52040
zipcode 62040
zipcode 72040

Note

This is not a valid KGTK file, as it does not meet the mandatory column requirements for an edge file nor a node file.

Suppose that ifexists-file7.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/ifexists-file7.tsv
id
j1
s1

Note

This is a KGTK node file.

Filter an Edge File on Another Edge File

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file2.tsv
id node1 label node2 location years
p2 peter zipcode 12040 home
p3 peter zipcode 12040 work 6

Note

Since both the input file and the filter file are KGTK edge files, the default key field comparisons are:

input.node1 == filter.node1 and input.label == filter.label and input.node2 == filter.node2

The id fields are not part of this comparison (and the id field isn't present in examples/docs/ifexists-file2.tsv).

Filter an Edge File on a Node File

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file3.tsv
id node1 label node2 location years
s1 steve title supervisor
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work
j1 john title programmer
j2 john zipcode 12345 home 10
j2 john zipcode 12346

Note

Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:

input.node1 == filter.id

Filter a Node File on a Node File

kgtk ifexists --input-file examples/docs/ifexists-file4.tsv \
              --filter-on examples/docs/ifexists-file3.tsv
id
john

Note

Since the input file and the filter file are both KGTK node files, the default key field comparison is:

input.id == filter.id

Filter an Edge File on a Node File Using an Alternate Input Column

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file5.tsv \
              --input-keys location
id node1 label node2 location years
p2 peter zipcode 12040 home
j2 john zipcode 12345 home 10
k2 kathy zipcode 12040 home

Note

This used the key field comparison:

input.location == filter.id

Filter an Edge File on a Nonstandard File

We want to filter a KGTK edge file against a file that is not a valid KGTK file (it is almost a KGTK edge file, but it is missing the node1 column). We can use the expert option --filter-mode NONE to disable the mandatory column check on the filter file, then use --input-keys and --filter-keys to specify the columns that we want to compare.

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file6.tsv \
              --filter-mode NONE \
              --input-keys label node2 \
              --filter-keys label node2
id node1 label node2 location years
p2 peter zipcode 12040 home
p3 peter zipcode 12040 work 6
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work
k2 kathy zipcode 12040 home
k3 kathy zipcode 12040 work 6

Note

This used the key field comparison:

input.label == filter.label and input.node2 == filter.node2

Filter an Edge File: Filter Matches

We want to filter a KGTK edge file against a file that is not a valid KGTK file (it is almost a KGTK edge file, but it is missing the node1 column). We can use the expert option --filter-mode NONE to disable the mandatory column check on the filter file, then use --input-keys and --filter-keys to specify the columns that we want to compare.

Furthermore, we want to see which filter records found at least one matching input record, and which filter records did not find a match.

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file6.tsv \
              --filter-mode NONE \
              --input-keys label node2 \
              --filter-keys label node2 \
              --matched-filter-file ifexists-matched-filter.tsv \
              --unmatched-filter-file ifexists-unmatched-filter.tsv
id node1 label node2 location years
p2 peter zipcode 12040 home
p3 peter zipcode 12040 work 6
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work
k2 kathy zipcode 12040 home
k3 kathy zipcode 12040 work 6
kgtk cat -i ifexists-matched-filter.tsv --mode NONE
label node2
zipcode 12040
zipcode 45601
zipcode 45601
kgtk cat -i ifexists-unmatched-filter.tsv --mode NONE
label node2
zipcode 52040
zipcode 62040
zipcode 72040

Note

Since the filter file was missing a mandatory KGTK column (node1), the matched and unmatched filter output files are also missing that column. Thus, the kgtk cat commands that display them also need --mode NONE.

Filter an Edge File By id

This is another example of filtering an input file using an alternate input file key column. examples/docs/ifexists-file7.tsv contains a list of edge ids that we want to retain in the output file.

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file7.tsv \
              --input-keys id
id node1 label node2 location years
s1 steve title supervisor
j1 john title programmer

Note

This used the key field comparison:

input.id == filter.id

Filter an Edge File By id: Reject File

This is another example of filtering an input file using an alternate input file key column. examples/docs/ifexists-file7.tsv contains a list of edge ids that we want to retain in the output file. This time, we also write the rejected records to a separate file, either to check that they match our expectations or to apply different processing to them.

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file7.tsv \
              --reject-file ifexists-rejects.tsv \
              --input-keys id
id node1 label node2 location years
s1 steve title supervisor
j1 john title programmer
kgtk cat -i ifexists-rejects.tsv
id node1 label node2 location years
p1 peter title manager
p2 peter zipcode 12040 home
p3 peter zipcode 12040 work 6
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work
j2 john zipcode 12345 home 10
j2 john zipcode 12346
k1 kathy title owner
k2 kathy zipcode 12040 home
k3 kathy zipcode 12040 work 6

Note

If the intent of the filter was to separate all title records by edge id, then the reject file shows that ids p1 and k1 were omitted from the filter file.

Filter a Small Input File on a Large Filter File

Although the example data files are very small, this example command shows how to filter a small input file against a large filter file:

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file3.tsv \
              --cache-input
id node1 label node2 location years
j1 john title programmer
j2 john zipcode 12345 home 10
j2 john zipcode 12346
s1 steve title supervisor
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work

Note

Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:

input.node1 == filter.id

Because we are caching the input file, the output edges have been reordered by the input key, then by input order.

Filter a Small Input File on a Large Filter File, Preserving Order

Although the example data files are very small, this example command shows how to filter a small input file against a large filter file, preserving the input file's order:

kgtk ifexists --input-file examples/docs/ifexists-file1.tsv \
              --filter-on examples/docs/ifexists-file3.tsv \
              --cache-input --preserve-order
id node1 label node2 location years
s1 steve title supervisor
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work
j1 john title programmer
j2 john zipcode 12345 home 10
j2 john zipcode 12346

Note

Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:

input.node1 == filter.id

The output edges appear in the same order as the input edges.

Filter a Large Input File on a Large Filter File

Although the example data files are very small, this example command shows how to filter a large input file against a large filter file by sorting the two files.

We will explicitly tell the kgtk sort command which columns to sort on.

kgtk sort --input-file examples/docs/ifexists-file1.tsv \
          --output-file ifexists-file1-sorted-by-node1.tsv \
          --column node1
kgtk sort --input-file examples/docs/ifexists-file3.tsv \
          --output-file ifexists-file3-sorted-by-id.tsv \
          --column id
kgtk ifexists --input-file ifexists-file1-sorted-by-node1.tsv \
              --filter-on ifexists-file3-sorted-by-id.tsv \
              --presorted
id node1 label node2 location years
j1 john title programmer
j2 john zipcode 12345 home 10
j2 john zipcode 12346
s1 steve title supervisor
s2 steve zipcode 45601 3
s3 steve zipcode 45601 work

Note

Since the input file is a KGTK edge file and the filter file is a KGTK node file, the default key field comparison is:

input.node1 == filter.id