Skip to content

filter

Overview

kgtk filter selects edges from an edge file. The current implementation uses a simple pattern language as a filter, and ignores reification.

Filters and Patterns

Filters are composed of three patterns separated by semicolons:

node1-pattern ; label-pattern ; node2-pattern

Pattern Description
node1-pattern This pattern applies to the node1 column (or its alias), unless a different column is selected with the --node1 SUBJ_COL option.
label-pattern This pattern applies to the label column (or its alias), unless a different column is selected with the --label PRED_COL option.
node2-pattern This pattern applies to the node2 column (or its alias), unless a different column is selected with the --node2 OBJ_COL option.

Each of the patterns in a filter can consist of a list of symbols (words) separated using commas, a number (when --numeric is specified), or a regular expression (when --regex is specified).

A complete filter requires two semicolons (;;) with one or more nonempty patterns. By default, all nonempty patterns in a filter must match an input edge for the input edge to match the filter; however, the --or option may be specified to allow an input edge to match when any nonempty pattern matches. The --invert option may be used to invert the sense of the filter, causing matching input edges to be written to the reject file, and non-matching edges to be written to the output file.

Note

If semicolon (;) is part of what you want to match, you may use --pattern-separator SEPARATOR to supply a separator other then semicolon.

Note

If comma (,) is part of what you want to match, you may use --word-separator SEPARATOR to supply a separator other then comma.

Numeric Patterns

--numeric (short for --numeric True or --numeric=True) indicates that the patterns in a filter are numbers instead of comma-separated lists of symbols. Note that comma-separated lists of numbers are not supported at present. When using numeric patterns, --match-type MATCH_TYPE determines the type of numeric comparison that takes place.

Match Type Description
eq The edge field's value must be equal to the pattern value.
ne The edge field's value must not be equal to the pattern value.
gt The edge field's value must be greater than the pattern value.
ge The edge field's value must be greater than or equal to the pattern value.
lt The edge field's value must be less than the pattern value.
le The edge field's value must be less than or equal to the pattern value.

If the edge field has an empty value, the default action is to fail the comparison. However, when --pass-empty-value (short for --pass-empty-value True or --pass-empty-value=True) is specified, empty values in the edge field will pass the comparison.

Note

At the present time, if an edge filed is non-empty but not a valid numeric value, kgtk filter will report an error and exit. In the future, there may be options to control the action taken when a non-numeric edge field value is encountered during processing.

Regular Expression Patterns

--regex (short for --regex True or --regex=True) indicates that the patterns in a filter are regular expressions instead of comma-separated lists. When using regular expressions as patterns, --match-type MATCH_TYPE determines the type of regular expression match that takes place.

Match Type Description
fullmatch The full field must match the regular expression.
It is not necessary to start the regular expression with ^ nor end it with $.
match The regular expression must match the beginning of the field. It is not necessary for it to match the entire field.
It is not necessary to start the regular expression with ^.
This is the default match type.
search The regular expression must match somewhere in the field.

Fancy Patterns

--fancy (short for --fancy True or --fancy=True) indicates that a filter may contain symbols, numbers, or regular expressions in a comma-separated list. Prefix strings are used to determine how a particular filter element is interpreted.

Prefix Interpretation
: string test: does the edge value match the symbol after the prefix?
= numeric test: is the edge value equal to the comparison value after the prefix?
!= numeric test: is the edge value not equal to the comparison value after the prefix?
> numeric test: is the edge value greater than the comparison value after the prefix?
>= numeric test: is the edge value greater than or equal to the comparison value after the prefix?
< numeric test: is the edge value less than the comparison value after the prefix?
<= numeric test: is the edge value less than or equal to the comparison value after the prefix?
~ regular expression test: does the edge value satisfy the pattern after the prefix?

The regular expression test uses the match type determined by --match-type MATCH_TYPE, as described in the Regular Expression patterns section, above.

Note

At the present time, fancy patterns will execute much slower than ordinary patterns due to insufficient optimization.

Multiple Filters

kgtk filter reads a single input file. It will write one or more output files and/or a reject file. When there are multiple output files, each output file must have its own filter. Output files and filters are paired by order. We recommend listing each filter and output file as a pair on the command line, as shown in one of the examples, below. Input edges that do not match any filter may be written to a reject file (--reject-file REJECT_FILE).

When there are multiple output files, --first-match-only determines whether input edges are copied to the first matching output file (when True) or to all matching output files (when False, the default). When True, it can also trigger the use of an optimized code path, which may produce substantial savings when the total number of alternatives is large.

Caveats

Note

At the present time, the --first-match-only, --invert, --match-type, --node2, --or, --label, --numeric, --regex, --node1, --label, and --node2 options apply to all filters and patterns in the kgtk filter invocation. In particular, there is no support for mixing non-regex patterns with regex patterns, other than converting the non-regex pattern to a regex pattern by hand. Similarly, numeric patterns cannot be mixed with non-numeric patterns.

Usage

usage: kgtk filter [-h] [-i INPUT_FILE] [-o OUTPUT_FILE [OUTPUT_FILE ...]]
                   [--reject-file REJECT_FILE] -p PATTERNS [PATTERNS ...]
                   [--node1 SUBJ_COL] [--label PRED_COL] [--node2 OBJ_COL]
                   [--or [True|False]] [--invert [True|False]]
                   [--regex [True|False]] [--numeric [True|False]]
                   [--fancy [True|False]]
                   [--match-type {fullmatch,match,search,eq,ne,gt,ge,lt,le}]
                   [--first-match-only [True|False]]
                   [--pass-empty-value [True|False]]
                   [--pattern-separator PATTERN_SEPARATOR]
                   [--word-separator WORD_SEPARATOR]
                   [--show-version [True/False]] [-v [optional True|False]]

Filter KGTK file based on values in the node1 (subject), label (predicate), and node2 (object) fields.  Optionally filter based on regular expressions.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE [OUTPUT_FILE ...], --output-file OUTPUT_FILE [OUTPUT_FILE ...]
                        The KGTK output file for records that pass the filter.
                        Multiple output file may be specified, each with their
                        own pattern. (May be omitted or '-' for stdout.)
  --reject-file REJECT_FILE
                        The KGTK reject file for records that fail the filter.
                        (Optional, use '-' for stdout.)
  -p PATTERNS [PATTERNS ...], --pattern PATTERNS [PATTERNS ...]
                        Pattern to filter on, for instance, " ; P154 ; ".
                        Multiple patterns may be specified when there are
                        mutiple output files.
  --node1 SUBJ_COL, --subj SUBJ_COL
                        The subject column, default is node1 or its alias.
  --label PRED_COL, --pred PRED_COL
                        The predicate column, default is label or its alias.
  --node2 OBJ_COL, --obj OBJ_COL
                        The object column, default is node2 or its alias.
  --or [True|False]     'Or' the clauses of the pattern. (default=False).
  --invert [True|False]
                        Invert the result of applying the pattern.
                        (default=False).
  --regex [True|False]  When True, treat the filter clauses as regular
                        expressions. (default=False).
  --numeric [True|False]
                        When True, treat the filter clauses as numeric values
                        for comparison. (default=False).
  --fancy [True|False]  When True, treat the filter clauses as strings,
                        numbers, or regular expressions. (default=False).
  --match-type {fullmatch,match,search,eq,ne,gt,ge,lt,le}
                        Which type of regular expression match: fullmatch,
                        match, search, eq, ne, gt, ge, lt, le.
                        (default=match).
  --first-match-only [True|False]
                        If true, write only to the file with the first
                        matching pattern. If false, write to all files with
                        matching patterns. (default=False).
  --pass-empty-value [True|False]
                        If true, empty data values will pass a numeric
                        pattern. If false, write to all files with matching
                        patterns. (default=False).
  --pattern-separator PATTERN_SEPARATOR
                        The separator between the pattern components.
                        (default=;.
  --word-separator WORD_SEPARATOR
                        The separator between the words in a pattern
                        component. (default=,.
  --show-version [True/False]
                        Print the version of this program. (default=False).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

Sample Data

Let us assume we have a KGTK file with movie data, such as the following (also available for download here):

kgtk cat -i examples/docs/movies_reduced.tsv
id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t3 terminator genre action
t4 terminator genre science_fiction
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Let us use this file (or a close derivative) in the following examples.

Selecting Edges with a Matching node1 (Subject)

Select all edges that have the subject terminator (in the node1 column or its alias):

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p " terminator; ; "
Result:

id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t3 terminator genre action
t4 terminator genre science_fiction
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t12 terminator cast michael_biehn
t14 terminator cast linda_hamilton
t16 terminator duration 108
t17 terminator award national_film_registry

Selecting Edges without a Matching node1 (Subject)

Select all edges that do not have the subject terminator (in the node1 column or its alias):

kgtk filter -i examples/docs/movies_reduced.tsv \
            --invert -p " terminator; ; "
Result:

id node1 label node2
t6 t5 location united_states
t8 t7 location sweden
t11 t10 role terminator
t13 t12 role kyle_reese
t15 t14 role sarah_connor
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Selecting Edges with Matching label (Predicate)

Select all edges that have property genre (in the label column or its alias):

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p " ; genre ; "

Info

examples/docs/movies_reduced.tsv should be replaced by the path to your .tsv file.

Result:

id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction

Selecting Edges by Matching an Alternate Predicate Column

By default, KGTK will assume there is a label column for the predicate pattern However, you can specify any other column to filter. For example, if we had a column called genre in the input file:

kgtk filter -i examples/docs/movies_reduced_with_genre_column.tsv \
            --label genre  -p " ;action ; "

Results:

id node1 label node2 genre
t1 terminator label 'The Terminator'@en action

Selecting Edges with Multiple Possible Predicate Matches

Select all edges that have properties genre or cast:

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p " ; genre, cast ; "

Result:

id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction
t10 terminator cast arnold_schwarzenegger
t12 terminator cast michael_biehn
t14 terminator cast linda_hamilton

Selecting Edges with Multiple Possible Predicate Matches and Custom Separators

Select all edges that have properties genre or cast, using : to separate the component patterns in the filter and using '|' to separate the alternative words:

kgtk filter -i examples/docs/movies_reduced.tsv \
            --pattern-separator : \
        --word-separator '|' \
            -p " : genre|cast : "

Result:

id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction
t10 terminator cast arnold_schwarzenegger
t12 terminator cast michael_biehn
t14 terminator cast linda_hamilton

Selecting Edges with a Matching node2 (Object)

Select all edges that have arnold_schwarzenegger as the object (in the node2 column or its alias):

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p " ; ; arnold_schwarzenegger"

Result:

id node1 label node2
t10 terminator cast arnold_schwarzenegger

Selecting Edges with Both a label and node2 Match

Select all edges that have predicate values role or cast (in the label column or its alias) and object terminator (in the node2 column or its alias):

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p " ; role, cast ; terminator "

Result:

id node1 label node2
t11 t10 role terminator

Selecting Edges with a label or node2 Match

Select all edges that have predicate values role or cast (in the label column or its alias), or object sweden (in the node2 column or its alias):

kgtk filter -i examples/docs/movies_reduced.tsv \
            --or -p " ; role, cast ; sweden "

Result:

id node1 label node2
t8 t7 location sweden
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor

Sending Different Edges to Different Files

Send edges with property cast to one file, edges with property genre to another file, and the remaining edges to a third file:

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p "; cast ;" -o cast.tsv \
            -p "; genre ;" -o genre.tsv \
            --reject-file others.tsv
(No standard output)

Result:

kgtk cat -i cast.tsv
id node1 label node2
t10 terminator cast arnold_schwarzenegger
t12 terminator cast michael_biehn
t14 terminator cast linda_hamilton
kgtk cat -i genre.tsv
id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction
kgtk cat -i others.tsv
id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t11 t10 role terminator
t13 t12 role kyle_reese
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Sending Different Edges to Different Files Without First Match

Send edges with label property genre to one file, edges with node2 object action to another file, and ignore other edges.

kgtk filter -i examples/docs/movies_reduced.tsv \
            -p "; genre ;" -o genre.tsv \
            -p "; ; action" -o action.tsv
(No standard output)

Result:

kgtk cat -i genre.tsv
id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction
kgtk cat -i action.tsv
id node1 label node2
t3 terminator genre action

Note

The edge terminator/genre/action appears in both the genre and action output files.

Sending Different Edges to Different Files with First Match

Send edges with property genre to one file, edges with object action to another file, ignoring other edges. Specify --first-match-only to ensure that a given edge will be sent to at most one output file.

kgtk filter -i examples/docs/movies_reduced.tsv \
            --first-match-only \
            -p "; genre ;" -o genre.tsv \
            -p "; ; action" -o action.tsv
(No standard output)

Result:

kgtk cat -i genre.tsv
id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction
kgtk cat -i action.tsv
id node1 label node2

Note

The edge terminator/genre/action appears in only the genre output file.

Sending Different Edges to Diferent Files with Unselected Edges to Standard Output

Send edges with property genre to one file, edges with object action to another file, and pass the remaining edges to standard output. Specify --first-match-only to ensure that a given edge will be sent to at most one output file.

kgtk filter -i examples/docs/movies_reduced.tsv \
            --first-match-only \
            -p "; genre ;" -o genre.tsv \
            -p "; ; action" -o action.tsv \
            --reject-file -

Result:

id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9
kgtk cat -i genre.tsv
id node1 label node2
t3 terminator genre action
t4 terminator genre science_fiction
kgtk cat -i action.tsv
id node1 label node2

Selecting Edges with Numeric Comparisons

Consider the following input file:

kgtk cat -i examples/docs/movies_durations.tsv
id node1 label node2
t16 terminator duration 108
s18 terminator2_jd duration 137
x1 terminator_dark_fate_trailer duration 3

Select movies that are 108 minutes long:

kgtk filter -i examples/docs/movies_durations.tsv \
            --numeric --match-type eq \
            -p ";;108"
Result:

id node1 label node2
t16 terminator duration 108

Select movies that are not 108 minutes long:

kgtk filter -i examples/docs/movies_durations.tsv \
            --numeric --match-type ne \
            -p ";;108"
Result:

id node1 label node2
s18 terminator2_jd duration 137
x1 terminator_dark_fate_trailer duration 3

Select movies that are greater than 108 minutes long:

kgtk filter -i examples/docs/movies_durations.tsv \
            --numeric --match-type gt \
            -p ";;108"
Result:

id node1 label node2
s18 terminator2_jd duration 137

Select movies that are greater than or equal to 108 minutes long:

kgtk filter -i examples/docs/movies_durations.tsv \
            --numeric --match-type ge \
            -p ";;108"
Result:

id node1 label node2
t16 terminator duration 108
s18 terminator2_jd duration 137

Select movies that are less than 108 minutes long:

kgtk filter -i examples/docs/movies_durations.tsv \
            --numeric --match-type lt \
            -p ";;108"
Result:

id node1 label node2
x1 terminator_dark_fate_trailer duration 3

Select movies that are less than or equal to 108 minutes long:

kgtk filter -i examples/docs/movies_durations.tsv \
            --numeric --match-type le \
            -p ";;108"
Result:

id node1 label node2
t16 terminator duration 108
x1 terminator_dark_fate_trailer duration 3

Selecting Edges where the Subject Starts with t1

Select all edges with a subject value that starts with the letters t1 (with unnecessary spaces trimmed out of the filter):

kgtk filter -i examples/docs/movies_reduced.tsv \
            --regex --match-type match \
            -p "t1;;"
Result:

id node1 label node2
t11 t10 role terminator
t13 t12 role kyle_reese
t15 t14 role sarah_connor
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Selecting Edges where the Object Starts with a Digit

Select all edges with an object value that starts with a Digit:

kgtk filter -i examples/docs/movies_reduced.tsv \
            --regex --match-type fullmatch \
            -p ';;[0-9].+'

Result:

id node1 label node2
t16 terminator duration 108

Fancy Patterns: Selecting Edges with a String and a Numeric comparison

Identify all movies with a duration of at least 90 minutes:

kgtk filter -i examples/docs/movies_reduced.tsv \
            --fancy \
            -p ';:duration;>=90'

Result:

id node1 label node2
t16 terminator duration 108

Identify all movies with a duration greater than two hours (120 minutes):

kgtk filter -i examples/docs/movies_reduced.tsv \
            --fancy \
            -p ';:duration;>120'

Result:

id node1 label node2