unique

Overview¶

The kgtk unique command reads a KGTK file and constructs a second KGTK file containing the unique values found in one or more columns of the input file. Each unique value may be accompanied by an occurrence count, depending on the output format selected.

In the default output format, the output file is a KGTK edge file. The node1 column contains the unique values, the label column value is count, and the node2 column contains the unique count.
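As a rough illustration of the default edge output, the following Python sketch counts the distinct values in one column and emits (node1, label, node2) rows. It is not KGTK's implementation: the `unique_edges` helper, the in-memory `rows` list, and the sorted output order are assumptions made for this example.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a KGTK edge file: (node1, label, node2) rows.
rows = [
    ("eric", "zipcode", "12040"),
    ("john", "zipcode", "12345"),
    ("john", "zipcode", "12346"),
    ("peter", "zipcode", "12040"),
]

def unique_edges(rows, column_index=2, label="count"):
    """Count distinct values in one column and return edge-style rows."""
    counts = Counter(row[column_index] for row in rows)
    # Assumption: values are emitted in sorted order, as in the examples on this page.
    return [(value, label, str(n)) for value, n in sorted(counts.items())]

edges = unique_edges(rows)
for edge in edges:
    print("\t".join(edge))
```

Each output row carries the unique value in the first position, the label in the second, and the count in the third.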

Naming the Column(s) to Count¶

The --columns COLUMN_NAME ... option (which may be abbreviated --column COLUMN_NAME or -c COLUMN_NAME) names the column(s) whose unique values are counted. If it is not specified, the default is the node2 column or its alias.
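When several columns are named, the examples later on this page show the values from all of them pooled into a single set of counts. A minimal Python sketch of that pooling, assuming the sample data used in the Examples section and a hypothetical `count_columns` helper:

```python
from collections import Counter

# The sample table from the Examples section, held in memory for illustration.
HEADER = ["node1", "label", "node2", "location", "years"]
ROWS = [
    ["eric", "zipcode", "12040", "work", "5"],
    ["john", "zipcode", "12345", "home", "10"],
    ["john", "zipcode", "12346", "", ""],
    ["john", "zipcode", "12347", "", ""],
    ["peter", "zipcode", "12040", "home", ""],
    ["peter", "zipcode", "12040", "work", "6"],
    ["steve", "zipcode", "45601", "", "3"],
    ["steve", "zipcode", "45601", "work", ""],
]

def count_columns(header, rows, column_names):
    """Pool the non-empty values of the named columns into one counter."""
    indexes = [header.index(name) for name in column_names]
    counts = Counter()
    for row in rows:
        for i in indexes:
            if row[i] != "":  # empty values are skipped, as in the default behavior
                counts[row[i]] += 1
    return counts

counts = count_columns(HEADER, ROWS, ["node1", "label", "node2", "location"])
```

In this sketch a value such as `zipcode` is counted once per row in which it appears in any of the named columns.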

Processing Empty Values¶

Since KGTK edge files cannot have an empty node1 column, the --empty EMPTY_VALUE option provides a substitute value (e.g. NONE) that is used in the output KGTK file to represent empty values in the input KGTK file. When the substitute value is itself empty (the default), empty values in the input file are not included in the output file.
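The substitution rule can be sketched in a few lines of Python. The `count_with_empty` function is hypothetical and only mirrors the behavior described above; the `locations` list is the `location` column of the sample file used in the Examples section.

```python
from collections import Counter

def count_with_empty(values, empty_value=""):
    """Count values, replacing empty ones with empty_value or skipping them."""
    counts = Counter()
    for value in values:
        if value == "":
            if empty_value == "":
                continue  # default: empty input values are left out of the output
            value = empty_value
        counts[value] += 1
    return dict(counts)

locations = ["work", "home", "", "", "home", "work", "", "work"]
default_counts = count_with_empty(locations)
none_counts = count_with_empty(locations, empty_value="NONE")
```

With the default, only `home` and `work` are counted; with `NONE` as the substitute, the three empty cells are counted under `NONE`.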

What to Call the Count in the Output¶

The value used in the label column of the output file, normally count, may be changed with the --label LABEL_VALUE option.

The --prefix PREFIX option supplies a prefix to the unique values in the output file.

Output Format Options¶

The --format option selects one of four output formats:

Format	Description
`--format edge`	This format creates a KGTK edge file. The `node1` column contains the unique values, the `label` column value is `count` (which may be changed with `--label LABEL_VALUE`), and the `node2` column contains the unique count. This is the default output format.
`--format node`	This format creates a KGTK node file with a single data row. The name of the counted column appears in the `id` column of the output file, and a new column (prefixed if requested) is created for each unique value found in the specified column in the input file; each of these columns holds the count for its value.
`--format node-counts`	This format creates a KGTK node file with two columns. The `id` column will contain the (optionally prefixed) unique values, while the second column, named `count`, unless changed by `--label LABEL_VALUE`, will contain the count.
`--format node-only`	This creates a KGTK node file with a single column, the `id` column, containing the unique values. The counts are computed but not written.
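To make the four layouts concrete, here is a hedged Python sketch that renders the same counts in each format. The `render` function and its parameters are hypothetical; in particular, applying the prefix to the values in the `edge`, `node-counts`, and `node-only` formats is an assumption based on the descriptions above, while the `node` layout follows the examples later on this page.

```python
def render(counts, fmt="edge", label="count", prefix="", column_name="location"):
    """Return a header row plus data rows for one of the four output formats."""
    items = sorted(counts.items())
    if fmt == "edge":
        return [["node1", "label", "node2"]] + [[prefix + v, label, str(n)] for v, n in items]
    if fmt == "node":
        # One data row: the counted column's name, then one count per unique value.
        header = ["id"] + [prefix + v for v, _ in items]
        return [header, [column_name] + [str(n) for _, n in items]]
    if fmt == "node-counts":
        return [["id", label]] + [[prefix + v, str(n)] for v, n in items]
    if fmt == "node-only":
        # Counts are computed by the caller but not written.
        return [["id"]] + [[prefix + v] for v, _ in items]
    raise ValueError(f"unknown format: {fmt}")

counts = {"NONE": 3, "home": 2, "work": 3}
node_table = render(counts, fmt="node", prefix="location;")
node_counts_table = render(counts, fmt="node-counts")
node_only_table = render(counts, fmt="node-only")
```

The `node` result here has the same shape as the prefixed `--format node` example in the Examples section.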

Quick Input Edge Filtering¶

Using the --where WHERE_COLUMN_NAME and --in WHERE_VALUES... options, you can restrict the count to edges where the value in a specified column matches a list of specified values. More sophisticated filtering can be obtained by running kgtk filter to provide the input to kgtk unique.
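The record selection test amounts to a membership check applied before counting. A small Python sketch, with a hypothetical `count_where` helper and a reduced copy of the sample data:

```python
from collections import Counter

# Reduced sample rows, keyed by column name, for illustration only.
rows = [
    {"node1": "eric", "location": "work"},
    {"node1": "john", "location": "home"},
    {"node1": "peter", "location": "home"},
    {"node1": "peter", "location": "work"},
    {"node1": "steve", "location": ""},
]

def count_where(rows, column, where_column=None, where_values=None):
    """Count values of `column`, keeping only rows that pass the selection test."""
    counts = Counter()
    for row in rows:
        if where_column is not None and row[where_column] not in where_values:
            continue  # the row fails the --where/--in style test
        if row[column] != "":
            counts[row[column]] += 1
    return dict(counts)

peter_counts = count_where(rows, "location", where_column="node1", where_values=["peter"])
all_counts = count_where(rows, "location")
```

Only exact matches against the listed values are kept in this sketch, which is why more elaborate conditions are delegated to kgtk filter.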

Input Value Filtering¶

Using the `--value-filter VALUE_FILTER_RE` and `--value-match-type` options, you can restrict the set of values to those that match a regular expression. The match type is one of `fullmatch`, `match`, or `search`; the default is `match`.
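The three match types share their names with functions in Python's `re` module. Assuming they behave the same way (an assumption, not something this page states), the difference can be sketched as follows; `make_filter` is a hypothetical helper.

```python
import re

def make_filter(pattern, match_type="match"):
    """Build a predicate that applies the chosen kind of regular expression match."""
    regex = re.compile(pattern)
    matcher = {
        "fullmatch": regex.fullmatch,  # the whole value must match
        "match": regex.match,          # the match must start at the beginning
        "search": regex.search,        # the match may occur anywhere
    }[match_type]
    return lambda value: matcher(value) is not None

values = ["john", "12040", "zip12", "12zip"]
results = {
    match_type: [v for v in values if make_filter("[A-Za-z]+", match_type)(v)]
    for match_type in ("fullmatch", "match", "search")
}
```

Under this reading, the pattern `[A-Za-z]+` with the default `match` type rejects purely numeric values such as `12040`, which is consistent with the all-alpha example later on this page.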

Processing Large Files¶

kgtk unique normally builds an in-memory dictionary of the unique values and counts. If there is a very large number of unique values, main memory may be exhausted: performance will be poor and execution may fail. In that case, presort the input file with kgtk sort and use kgtk unique --presorted to avoid building the in-memory dictionary.

Note

This optimization is available only when a single input column is being processed for unique values.
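The reason presorting helps can be shown with a short Python sketch: when equal values are adjacent, each run can be counted and emitted before the next one is read, so no dictionary of all values is needed. This uses `itertools.groupby` as an illustration and is not a description of how KGTK implements `--presorted`; `count_presorted` is a hypothetical name.

```python
from itertools import groupby

def count_presorted(sorted_values):
    """Yield (value, count) pairs from values that are already sorted."""
    # groupby groups consecutive equal items, so only one run is held at a time.
    for value, group in groupby(sorted_values):
        yield value, sum(1 for _ in group)

node1_sorted = ["eric", "john", "john", "john", "peter", "peter", "steve", "steve"]
result = list(count_presorted(node1_sorted))
```

If the input were not sorted, the same value could appear in several separate runs and would be reported more than once, which is why the sort step comes first.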

Usage¶

usage: kgtk unique [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
                   [-c [COLUMN_NAME ...]] [--empty EMPTY_VALUE]
                   [--label LABEL_VALUE]
                   [--format {edge,node,node-counts,node-only}]
                   [--prefix PREFIX] [--where WHERE_COLUMN_NAME]
                   [--in WHERE_VALUES [WHERE_VALUES ...]]
                   [--value-filter VALUE_FILTER_RE]
                   [--value-match-type {fullmatch,match,search}]
                   [--presorted [True|False]] [--min-count MIN_COUNT]
                   [--max-count MAX_COUNT] [-v [optional True|False]]

Count the unique values in a column in a KGTK file. Write the unique values and counts as a new KGTK file.

Additional options are shown in expert help.
kgtk --expert unique --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  -c [COLUMN_NAME ...], --column [COLUMN_NAME ...], --columns [COLUMN_NAME ...]
                        The column(s) to count unique values (default=node2 or
                        its alias).
  --empty EMPTY_VALUE   A value to substitute for empty values (default=).
  --label LABEL_VALUE   The output file label column value (default=count).
  --format {edge,node,node-counts,node-only}
                        The output file format and mode (default=edge).
  --prefix PREFIX       The value prefix (default=).
  --where WHERE_COLUMN_NAME
                        The name of a column for a record selection test.
                        (default=None).
  --in WHERE_VALUES [WHERE_VALUES ...]
                        The list of values for a record selection test.
                        (default=None).
  --value-filter VALUE_FILTER_RE
                        A regular expression filter on the extracted values.
                        (default=).
  --value-match-type {fullmatch,match,search}
                        Which type of regular expression value match:
                        fullmatch, match, search. (default=match).
  --presorted [True|False]
                        When True, the input file is presorted.
                        (default=False).
  --min-count MIN_COUNT
                        The minimum count output filter (does not apply to
                        'node' format). The minimum and maximum count filters
                        are conjunctive. (default=0).
  --max-count MAX_COUNT
                        The minimum count output filter (does not apply to
                        'node' format). The minimum and maximum count filters
                        are conjunctive. (default=999999999999999).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples¶

Suppose that examples/docs/unique-file1.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/unique-file1.tsv

node1	label	node2	location	years
eric	zipcode	12040	work	5
john	zipcode	12345	home	10
john	zipcode	12346
john	zipcode	12347
peter	zipcode	12040	home
peter	zipcode	12040	work	6
steve	zipcode	45601		3
steve	zipcode	45601	work

Count the unique values in the `location` column.¶

kgtk unique -i examples/docs/unique-file1.tsv --column location

node1	label	node2
home	count	2
work	count	3

Count the unique values in the `location` column, using the value `NONE` for empty values.¶

kgtk unique -i examples/docs/unique-file1.tsv --column location \
            --empty NONE

node1	label	node2
NONE	count	3
home	count	2
work	count	3

Count the unique values in the `location` column, using the value `NONE` for empty values, but use the `node` format for the output file.¶

kgtk unique -i examples/docs/unique-file1.tsv --column location \
            --empty NONE --format node

id	NONE	home	work
location	3	2	3

Give each column name a prefix.¶

kgtk unique -i examples/docs/unique-file1.tsv --column location \
            --empty NONE --format node --prefix 'location;'

id	location;NONE	location;home	location;work
location	3	2	3

Filter the input file and create an edge-type output file.¶

kgtk unique -i examples/docs/unique-file1.tsv --column location \
            --where node1 --in peter

node1	label	node2
home	count	1
work	count	1

Count the unique values in the `node1`, `label`, `node2`, and `location` columns.¶

kgtk unique -i examples/docs/unique-file1.tsv \
            --columns node1 label node2 location

node1	label	node2
12040	count	3
12345	count	1
12346	count	1
12347	count	1
45601	count	2
eric	count	1
home	count	2
john	count	3
peter	count	2
steve	count	2
work	count	3
zipcode	count	8

Count the unique all-alpha values in the `node1`, `label`, `node2`, and `location` columns.¶

kgtk unique -i examples/docs/unique-file1.tsv \
            --columns node1 label node2 location \
            --value-filter '[A-Za-z]+'

node1	label	node2
eric	count	1
home	count	2
john	count	3
peter	count	2
steve	count	2
work	count	3
zipcode	count	8

Filter using the Minimum Count¶

Without filtering:

kgtk unique -i examples/docs/unique-file1.tsv --column node1

node1	label	node2
eric	count	1
john	count	3
peter	count	2
steve	count	2

Requiring a minimum count of 2:

kgtk unique -i examples/docs/unique-file1.tsv --column node1 --min-count 2

node1	label	node2
john	count	3
peter	count	2
steve	count	2

Filter using the Maximum Count¶

Without filtering:

kgtk unique -i examples/docs/unique-file1.tsv --column node1

node1	label	node2
eric	count	1
john	count	3
peter	count	2
steve	count	2

Requiring a maximum count of 1:

kgtk unique -i examples/docs/unique-file1.tsv --column node1 --max-count 1

node1	label	node2
eric	count	1

Filter using the Minimum and Maximum Counts¶

Without filtering:

kgtk unique -i examples/docs/unique-file1.tsv --column node1

node1	label	node2
eric	count	1
john	count	3
peter	count	2
steve	count	2

Filter with the 'and' of the minimum and maximum counts:

kgtk unique -i examples/docs/unique-file1.tsv --column node1 \
            --min-count 2 --max-count 2

node1	label	node2
peter	count	2
steve	count	2
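The three filtering examples above are consistent with an inclusive range test on each count. A minimal Python sketch, assuming a hypothetical `filter_counts` helper whose defaults copy the ones shown in the usage text:

```python
def filter_counts(counts, min_count=0, max_count=999999999999999):
    """Keep values whose count lies within [min_count, max_count], inclusive."""
    return {value: n for value, n in counts.items() if min_count <= n <= max_count}

node1_counts = {"eric": 1, "john": 3, "peter": 2, "steve": 2}
at_least_two = filter_counts(node1_counts, min_count=2)
at_most_one = filter_counts(node1_counts, max_count=1)
exactly_two = filter_counts(node1_counts, min_count=2, max_count=2)
```

Because both bounds must hold at once, setting them to the same number selects the values with exactly that count.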
