Skip to content

compact

Overview

kgtk compact

The compact command copies its input file to its output file, compacting repeated items into multi-valued edges (| lists). Compact is intended to operate on KGTK node files or on the additional columns of KGTK denormalized edge files. It should not be used to compact the node2 column of a KGTK edge file.

kgtk deduplicate

kgtk deduplicate is an alias for kgtk compact --deduplicate. In this mode, duplicate edges are removed without compacting any columns into multi-valued edges (| lists).

All columns will be selected as key columns, except for columns that are included in the --keep-first first. if --compact-id is specified, than the ID column (or its alias) will be included in the --keep-first list. Columns included in --columns will be first, in the order specified, then the remaining columns (except for columns that are included in the --keep-first list) in the order that they appear in the file's header record.

Creating Multi-value Edges

Suppose you have a KGTK edge file such as:

node1 label node2 genre
terminator2_jd isa movie science_fiction
terminator2_jd isa movie action

The compacted result would be:

node1 label node2 genre
terminator2_jd isa movie action|science_fiction

Note

The key columns (see below) in this example are (node1, label, node2).

Key Columns

Compaction occurs by grouping records on a set of key columns, then compacting the records into a single output record.

When --deduplicate=TRUE, all columns will be used as key columns, other than --keep-first columns.

For KGTK node files, the default key is (id). The --columns KEY_COLUMN_NAMES ... option may be used to add additional columns to this list.

For KGTK edge files without an id column, the default key is (node1, label, node2). The --columns KEY_COLUMN_NAMES ... option may be used to add additional columns to this list.

For KGTK edge files with an id column, the default key is (node1, label, node2, id). The --columns KEY_COLUMN_NAMES ... option may be used to add additional columns to this list. The --compact-id option may be used to remove the id column from this list.

When --mode=NONE is specified, there is no default key. The --columns KEY_COLUMN_NAMES ... option MUST be used to add additional columns to this list.

Note

The key column order with an id column is not the same as is used in some other KGTK commands. It may change in the future.

id Generation

kgtk compact may be used to generate id column values. The expert option --id-style may be used to select the style of the id. See the kgtk add-id command for adidtional details on --id-style and related options.

Processing Large Files

By default, the input file is sorted in memory to achieve the grouping necessary for the compaction algorithm. This may cause memory usage issues for large input files. This may be solved by sorting the input file using kgtk sort, then using kgtk compact --presorted.

Compacting node2 Is Discouraged

If you have a KGTK edge file with normalized edges (no additional columns), you might want to compact the node2 column using (node1, label) as the key.

For example, using movies as the topic:

node1 label node2
terminator2_jd genre science_fiction
terminator2_jd genre action

You intend to create:

node1 label node2
terminator2_jd genre action|science_fiction

This would result in an invalid KGTK file, as the node2 column is not allowed to contain multi-value edges (| lists) according to the KGTK File Specification.

Note

If you insist on compacting the node2 column, you can do so using:

kgtk compact --mode=NONE --columns node1 label

Reporting or Filtering Output Rows with Lists

kgtk compact --report-lists causes output rows containing one or more lists to be reported to the error file.

kgtk compact --exclude-lists causes output rows containing one or more lists to be excluded from the output file.

kgtk compact --output-only-lists will write only output rows containing one or more lists to the output file.

Note

--exclude-lists and --output-only-lists may not be used together.

Usage

usage: kgtk compact [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
                    [--list-output-file LIST_OUTPUT_FILE]
                    [--columns KEY_COLUMN_NAMES [KEY_COLUMN_NAMES ...]]
                    [--compact-id [True|False]] [--deduplicate [True|False]]
                    [--lists-in-input [LISTS_IN_INPUT]]
                    [--keep-first KEEP_FIRST_NAMES [KEEP_FIRST_NAMES ...]]
                    [--presorted [True|False]] [--verify-sort [True|False]]
                    [--report-lists [REPORT_LISTS]]
                    [--exclude-lists [EXCLUDE_LISTS]]
                    [--output-only-lists [OUTPUT_ONLY_LISTS]]
                    [--build-id [True|False]]
                    [--overwrite-id [optional true|false]]
                    [--verify-id-unique [optional true|false]]
                    [--value-hash-width VALUE_HASH_WIDTH]
                    [--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
                    [--claim-id-column-name CLAIM_ID_COLUMN_NAME]
                    [--id-separator ID_SEPARATOR] [-v [optional True|False]]

Copy a KGTK file, compacting multiple records into | lists. 

By default, the input file is sorted in memory to achieve the grouping necessary for the compaction algorithm. This may cause  memory usage issues for large input files. If the input file has already been sorted (or at least grouped), the `--presorted` option may be used.

Additional options are shown in expert help.
kgtk --expert compact --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  --list-output-file LIST_OUTPUT_FILE
                        A KGTK output file that will contain only the rows
                        containing lists. This file will have the same columns
                        as the primary output file. (Optional, use '-' for
                        stdout.)
  --columns KEY_COLUMN_NAMES [KEY_COLUMN_NAMES ...]
                        The key columns to identify records for compaction.
                        (default=id for node files, (node1, label, node2, id)
                        for edge files).
  --compact-id [True|False]
                        Indicate that the ID column in KGTK edge files should
                        be compacted. Normally, if the ID column exists, it is
                        not compacted, as there are use cases that need to
                        maintain distinct lists of secondary edges for each ID
                        value. (default=False).
  --deduplicate [True|False]
                        Treat all columns as key columns, overriding --columns
                        and --compact-id. This will remove completely
                        duplicate records without compacting any new lists.
                        (default=False).
  --lists-in-input [LISTS_IN_INPUT]
                        Assume that the input file may contain lists (disable
                        when certain it does not). (default=True).
  --keep-first KEEP_FIRST_NAMES [KEEP_FIRST_NAMES ...]
                        If compaction results in a list of values for any
                        column on this list, keep only the first value after
                        sorting. (default=none).
  --presorted [True|False]
                        Indicate that the input has been presorted (or at
                        least pregrouped) (default=False).
  --verify-sort [True|False]
                        If the input has been presorted, verify its
                        consistency (disable if only pregrouped).
                        (default=True).
  --report-lists [REPORT_LISTS]
                        When True, report records with lists to the error
                        output. (default=False).
  --exclude-lists [EXCLUDE_LISTS]
                        When True, exclude records with lists from the output.
                        (default=False).
  --output-only-lists [OUTPUT_ONLY_LISTS]
                        When True, only records containing lists will be
                        written to the primary output file. (default=False).
  --build-id [True|False]
                        Build id values in an id column. (default=False).
  --overwrite-id [optional true|false]
                        When true, replace existing ID values. When false,
                        copy existing ID values. When --overwrite-id is
                        omitted, it defaults to False. When --overwrite-id is
                        supplied without an argument, it is True.
  --verify-id-unique [optional true|false]
                        When true, verify ID uniqueness using an in-memory set
                        of IDs. When --verify-id-unique is omitted, it
                        defaults to False. When --verify-id-unique is supplied
                        without an argument, it is True.
  --value-hash-width VALUE_HASH_WIDTH
                        How many characters should be used in a value hash?
                        (default=6)
  --claim-id-hash-width CLAIM_ID_HASH_WIDTH
                        How many characters should be used to hash the claim
                        ID? 0 means do not hash the claim ID. (default=8)
  --claim-id-column-name CLAIM_ID_COLUMN_NAME
                        The name of the claim_id column. (default=claim_id)
  --id-separator ID_SEPARATOR
                        The separator user between ID subfields. (default=-)

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

Compact with Builtin Sorting

Suppose that file2.tsv, which is not presorted, contains the following table in KGTK format:

node1 label node2 location years
steve zipcode 45601 cabin
john zipcode 12345 home 10
steve zipcode 45601 4
john zipcode 12346
peter zipcode 12040 home
steve zipcode 45601 home 1
peter zipcode 12040 work 5
peter zipcode 12040 6
steve zipcode 45601 3
peter zipcode 12040 cabin
steve zipcode 45601 5
steve zipcode 45601 work 2

Compacting with built-in sorting:

kgtk compact -i examples/docs/compact-file2.tsv

The output will be the following table in KGTK format:

node1 label node2 location years
john zipcode 12345 home 10
john zipcode 12346
peter zipcode 12040 cabin|home|work 5|6
steve zipcode 45601 cabin|home|work 1|2|3|4|5

Compact with Improperly Sorted Input

This example demonstrates that feeding a non-presorted file to kgtk compact --presorted generates an error.

kgtk compact -i examples/docs/compact-file2.tsv --presorted

The output will begin with the following on stdout:

node1 label node2 location years
steve zipcode 45601 cabin

The output will end with the following error message on stderr:

Line 3 sort violation going down: prev='john|zipcode|12345' curr='steve|zipcode|45601'

Compact with Presorted Input

Suppose that file1.tsv contains the following table in KGTK format: (Note: The years column means years employed, not age.)

kgtk cat -i examples/docs/compact-file1.tsv
node1 label node2 location years
john zipcode 12345 home 10
john zipcode 12346
peter zipcode 12040 home
peter zipcode 12040 cabin
peter zipcode 12040 work 5
peter zipcode 12040 6
steve zipcode 45601 3
steve zipcode 45601 4
steve zipcode 45601 5
steve zipcode 45601 home 1
steve zipcode 45601 work 2
steve zipcode 45601 cabin
kgtk compact -i examples/docs/compact-file1.tsv --presorted

The output will be the following table in KGTK format:

node1 label node2 location years
john zipcode 12345 home 10
john zipcode 12346
peter zipcode 12040 cabin|home|work 5|6
steve zipcode 45601 cabin|home|work 1|2|3|4|5

Compact with External Sorting, No id

This example demonstrates a pipeline that sorts edges without an id field, using kgtk sort, before kgtk compact:

kgtk sort -i examples/docs/compact-file2.tsv \
   / compact --presorted

The output will be the following table in KGTK format:

node1 label node2 location years
john zipcode 12345 home 10
john zipcode 12346
peter zipcode 12040 cabin|home|work 5|6
steve zipcode 45601 cabin|home|work 1|2|3|4|5

Note

Normally, additional options would be passed to kgtk sort to control the amount of memory used, the maximum number of threads, and the location of the temporary files.

Compact with External Sorting, with id

Suppose that file5.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/compact-file5.tsv
id node1 label node2 location years
E01 steve zipcode 45601 cabin
E02 john zipcode 12345 home 10
E03 steve zipcode 45601 4
E04 john zipcode 12346
E05 peter zipcode 12040 home
E06 steve zipcode 45601 home 1
E07 peter zipcode 12040 work 5
E08 peter zipcode 12040 6
E09 steve zipcode 45601 3
E10 peter zipcode 12040 cabin
E11 steve zipcode 45601 5
E12 steve zipcode 45601 work 2

This example demonstrates a pipeline that sorts edges with an id field, using kgtk sort, before kgtk compact:

kgtk sort -i examples/docs/compact-file5.tsv \
   / compact --presorted \
             --columns id node1 label node2

The output will be the following table in KGTK format:

id node1 label node2 location years
E01 steve zipcode 45601 cabin
E02 john zipcode 12345 home 10
E03 steve zipcode 45601 4
E04 john zipcode 12346
E05 peter zipcode 12040 home
E06 steve zipcode 45601 home 1
E07 peter zipcode 12040 work 5
E08 peter zipcode 12040 6
E09 steve zipcode 45601 3
E10 peter zipcode 12040 cabin
E11 steve zipcode 45601 5
E12 steve zipcode 45601 work 2

Note

kgtk compact and kgtk sort use different default key column orders for KGTK edge files with an id column, so it is necessary to specify --columns for one or both of the commands. This behavior may change in the future.

Note

Normally, additional options would be passed to kgtk sort to control the amount of memory used, the maximum number of threads, and the location of the temporary files.

Compact with Default Keys

Suppose that file3.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/compact-file3.tsv
node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin
peter zipcode 12040 4 work 5
peter zipcode 12040 4 6
steve zipcode 45601 5 3
steve zipcode 45601 5 4
steve zipcode 45601 5 5
steve zipcode 45601 6 home 1
steve zipcode 45601 6 work 2
steve zipcode 45601 6 cabin

Compacting with the tuple (node1, label, node2, id) (the default for a KGTK edge file) as the key:

kgtk compact -i examples/docs/compact-file3.tsv

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

Note

The default key is (node1, label, node2, id).

Compact with Default Keys and --compact-id

Compacting with the tuple (node1, label, node2) and --compact-id.

kgtk compact -i examples/docs/compact-file3.tsv \
             --compact-id

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3|4 cabin|home|work 5|6
steve zipcode 45601 5|6 cabin|home|work 1|2|3|4|5

Note

The default key is (node1, label, node2, id), byt --compact-id removes id from the default key.

Compact with Default Keys and --keep-first

Compacting with the tuple (node1, label, node2, id) (the default for a KGTK edge file) as the key, and --keep-first location years.

kgtk compact -i examples/docs/compact-file3.tsv \
             --keep-first location years

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin 5
steve zipcode 45601 5 3
steve zipcode 45601 6 home 1

Note

The default key is (node1, label, node2, id).

Compacting on the ID Column

Since the id values are not duplicated between (node1, label, node2) tuples in the previous example, compacting on just the id column yields the same results.

kgtk compact -i examples/docs/compact-file3.tsv --columns id

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

Compacting on (node1, label, node2)

Compacting with the tuple (node1, label, node2) as the key (removing the id column from the default for a KGTK edge file):

kgtk compact -i examples/docs/compact-file3.tsv --compact-id

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3|4 cabin|home|work 5|6
steve zipcode 45601 5|6 cabin|home|work 1|2|3|4|5

Deduplication with Builtin Sorting

Suppose that file4.tsv contains the following table in KGTK format, which is not presorted and which contains some duplicate lines:

kgtk cat -i examples/docs/compact-file4.tsv
node1 label node2 location years
john zipcode 12345 home 10
steve zipcode 45601 work 2
john zipcode 12346
peter zipcode 12040 home
peter zipcode 12040 cabin
john zipcode 12345 home 10
peter zipcode 12040 work 5
peter zipcode 12040 6
steve zipcode 45601 3
steve zipcode 45601 3
peter zipcode 12040 cabin
steve zipcode 45601 4
steve zipcode 45601 5
steve zipcode 45601 home 1
steve zipcode 45601 work 2
steve zipcode 45601 cabin

Deduplicating with built-in sorting:

kgtk deduplicate -i examples/docs/compact-file4.tsv

The output will be the following table in KGTK format:

node1 label node2 location years
john zipcode 12345 home 10
john zipcode 12346
peter zipcode 12040 6
peter zipcode 12040 cabin
peter zipcode 12040 home
peter zipcode 12040 work 5
steve zipcode 45601 3
steve zipcode 45601 4
steve zipcode 45601 5
steve zipcode 45601 cabin
steve zipcode 45601 home 1
steve zipcode 45601 work 2

The output is sorted and duplicate lines have been removed, without creating any new multi-valued edges (| lists).

Deduplication with --keep-first

kgtk deduplicate -i examples/docs/compact-file4.tsv \
                 --keep-first location years

The output will be the following table in KGTK format:

node1 label node2 location years
john zipcode 12345 home 10
john zipcode 12346
peter zipcode 12040 home 5
steve zipcode 45601 work 2

Deduplication with --keep-first and an id Column

kgtk deduplicate -i examples/docs/compact-file3.tsv \
                 --keep-first location years

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin 5
steve zipcode 45601 5 3
steve zipcode 45601 6 home 1

Deduplication with --compact-id and --keep-first

kgtk deduplicate -i examples/docs/compact-file3.tsv \
                 --keep-first location years \
                 --compact-id

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home 5
steve zipcode 45601 5 home 3

Reporting Rows with Lists

kgtk compact -i examples/docs/compact-file3.tsv --report-lists

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

The following records will be reported to standard error:

'peter\tzipcode\t12040\t4\tcabin|work\t5|6'
'steve\tzipcode\t45601\t5\t\t3|4|5'
'steve\tzipcode\t45601\t6\tcabin|home|work\t1|2'

Excluding Rows with Lists from the Primary Output File

kgtk compact -i examples/docs/compact-file3.tsv --exclude-lists

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home

Sending Only Rows with Lists to the Primary Output File

kgtk compact -i examples/docs/compact-file3.tsv --output-only-lists

The output will be the following table in KGTK format:

node1 label node2 id location years
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

Sending Rows with Lists to the List Output File

kgtk compact -i examples/docs/compact-file3.tsv \
             --list-output-file compact-list-output.tsv

The standard output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

The list output file will contain the following table in KGTK format:

kgtk cat -i compact-list-output.tsv
node1 label node2 id location years
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

Sending Only Rows without Lists to the Primary Output File, and Rows with Lists to the List Output File

kgtk compact -i examples/docs/compact-file3.tsv \
     --output-file compact-output.tsv \
     --exclude-lists \
     --list-output-file compact-list-output.tsv

The primary output file will contain the following table in KGTK format:

kgtk cat -i compact-output.tsv
node1 label node2 id location years
john zipcode 12345 1 home 10
john zipcode 12346 2
peter zipcode 12040 3 home

The list output file will contain the following table in KGTK format:

kgtk cat -i compact-list-output.tsv
node1 label node2 id location years
peter zipcode 12040 4 cabin|work 5|6
steve zipcode 45601 5 3|4|5
steve zipcode 45601 6 cabin|home|work 1|2

Building New, Unique IDs for the Compacted Edges.

kgtk compact -i examples/docs/compact-file3.tsv \
             --build-id --overwrite-id

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 E1 home 10
john zipcode 12346 E2
peter zipcode 12040 E3 home
peter zipcode 12040 E4 cabin|work 5|6
steve zipcode 45601 E5 3|4|5
steve zipcode 45601 E6 cabin|home|work 1|2

Building New, Unique IDs for the Compacted Edges, Compacting the ID column, Too.

kgtk compact -i examples/docs/compact-file3.tsv \
             --build-id --overwrite-id --compact-id

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 E1 home 10
john zipcode 12346 E2
peter zipcode 12040 E3 cabin|home|work 5|6
steve zipcode 45601 E4 cabin|home|work 1|2|3|4|5

Expert Example: Using --id-style=node1-label-node2

Using the expert option --id-style=node1-label-node2, you can generate IDs that concatenate (node1, label, node2).

kgtk compact -i examples/docs/compact-file3.tsv \
             --build-id --overwrite-id --compact-id \
         --id-style=node1-label-node2

The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 john-zipcode-12345 home 10
john zipcode 12346 john-zipcode-12346
peter zipcode 12040 peter-zipcode-12040 cabin|home|work 5|6
steve zipcode 45601 steve-zipcode-45601 cabin|home|work 1|2|3|4|5

Expert Example: Using --id-style=node1-label-node2-num

kgtk compact -i examples/docs/compact-file3.tsv \
             --build-id --overwrite-id --compact-id \
         --id-style=node1-label-node2-num
The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 john-zipcode-12345-0000 home 10
john zipcode 12346 john-zipcode-12346-0000
peter zipcode 12040 peter-zipcode-12040-0000 cabin|home|work 5|6
steve zipcode 45601 steve-zipcode-45601-0000 cabin|home|work 1|2|3|4|5

Expert Example: Using --id-style=node1-label-node2-id

kgtk compact -i examples/docs/compact-file3.tsv \
             --build-id --overwrite-id --compact-id \
         --id-style=node1-label-node2-id
The output will be the following table in KGTK format:

node1 label node2 id location years
john zipcode 12345 john-zipcode-12345-1 home 10
john zipcode 12346 john-zipcode-12346-2
peter zipcode 12040 peter-zipcode-12040-3|4 cabin|home|work 5|6
steve zipcode 45601 steve-zipcode-45601-5|6 cabin|home|work 1|2|3|4|5

Expert Example: Compacting on (node1, label)

Compacting with the tuple (node1, label) as the key (removing the id and node2 columns from the default for a KGTK edge file) may produce an invalid KGTK file. Nonetheless, there may be occasions when this is what you want to do:

kgtk compact -i examples/docs/compact-file3.tsv \
             --mode=NONE --columns node1 label

The output will be the following table in quasi-KGTK format:

node1 label node2 id location years
john zipcode 12345|12346 1|2 home 10
peter zipcode 12040 3|4 cabin|home|work 5|6
steve zipcode 45601 5|6 cabin|home|work 1|2|3|4|5