compact
Overview¶
kgtk compact
¶
The compact command copies its input file to its output file, compacting
repeated items into multi-valued edges (| lists).
Compact is intended to operate on KGTK node files or on the additional
columns of KGTK denormalized edge files.
It should not be used to compact the node2 column of a KGTK edge file.
kgtk deduplicate
¶
kgtk deduplicate is an alias for kgtk compact --deduplicate. In this mode,
duplicate edges are removed without compacting any columns into
multi-valued edges (| lists).
All columns will be selected as key columns, except for columns that are
included in the --keep-first list. If --compact-id is specified, then
the ID column (or its alias) will be included in the --keep-first list.
Columns included in --columns will be first, in the order specified,
then the remaining columns (except for columns that are included in the
--keep-first list) in the order that they appear in the file's header
record.
However, unless --columns is specified, the standard key columns (id for
KGTK node files; node1, label, node2, and optionally id for KGTK edge
files) may not be used as --keep-first columns. This command-line parsing
constraint may be removed in the future.
Creating Multi-value Edges¶
Suppose you have a KGTK edge file such as:
node1 | label | node2 | genre |
---|---|---|---|
terminator2_jd | isa | movie | science_fiction |
terminator2_jd | isa | movie | action |
The compacted result would be:
node1 | label | node2 | genre |
---|---|---|---|
terminator2_jd | isa | movie | action|science_fiction |
Note
The key columns (see below) in this example are (node1
, label
, node2
).
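The grouping step behind this example can be sketched in a few lines of Python. This is a minimal illustration, not KGTK's implementation: the function name `compact`, the dict-per-row representation, and the plain string sort of list values are all assumptions made for the sketch.

```python
from collections import defaultdict


def compact(rows, key_columns):
    """Group rows (dicts) on key_columns and merge the other columns into | lists.

    Illustrative only: values are de-duplicated and sorted as plain strings,
    which is an assumption rather than documented KGTK behavior.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in key_columns)].append(row)

    result = []
    for key in sorted(groups):
        members = groups[key]
        merged = dict(zip(key_columns, key))
        for column in members[0]:
            if column in key_columns:
                continue
            # Split any existing lists, drop empty values, de-duplicate, sort.
            values = {v for m in members for v in m[column].split("|") if v}
            merged[column] = "|".join(sorted(values))
        result.append(merged)
    return result


rows = [
    {"node1": "terminator2_jd", "label": "isa", "node2": "movie", "genre": "science_fiction"},
    {"node1": "terminator2_jd", "label": "isa", "node2": "movie", "genre": "action"},
]
print(compact(rows, ["node1", "label", "node2"]))
```

With the two input rows from the table above, the single output row carries `action|science_fiction` in the `genre` column.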
Key Columns¶
Compaction occurs by grouping records on a set of key columns, then compacting the records into a single output record.
When --deduplicate=TRUE
, all columns will be used as key columns, other than --keep-first
columns.
For KGTK node files, the default key is (id
).
The --columns KEY_COLUMN_NAMES ...
option may be used to add additional columns to this list.
For KGTK edge files without an id
column, the default key is (node1
, label
, node2
).
The --columns KEY_COLUMN_NAMES ...
option may be used to add additional columns to this list.
For KGTK edge files with an id
column, the default key is (node1
, label
, node2
, id
).
The --columns KEY_COLUMN_NAMES ...
option may be used to add additional columns to this list.
The --compact-id
option may be used to remove the id
column from this list.
When --mode=NONE
is specified, there is no default key.
The --columns KEY_COLUMN_NAMES ...
option MUST be used to add additional columns to this list.
Note
The key column order with an id
column is not the same as is used in some
other KGTK commands. It may change in the future.
id
Generation¶
kgtk compact
may be used to generate id
column values.
The expert option --id-style
may be used to select the style of the id.
See the kgtk add-id command for additional details on --id-style
and related options.
Processing Large Files¶
By default, the input file is sorted in memory to achieve the
grouping necessary for the compaction algorithm. This may cause
memory usage issues for large input files. This may be solved by
sorting the input file using kgtk sort
,
then using kgtk compact --presorted
.
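To see why presorted input avoids the memory cost, consider the following Python sketch. It is an illustration under assumptions, not KGTK code: `compact_presorted` and its `verify_sort` parameter are hypothetical names that loosely mirror the `--presorted` and `--verify-sort` options, and the error text is invented for the example. Because equal keys are adjacent, only one group of rows has to be held in memory at a time.

```python
from itertools import groupby


def compact_presorted(rows, key_columns, verify_sort=True):
    """Yield compacted rows from input that is already grouped on key_columns."""

    def key_of(row):
        return tuple(row[c] for c in key_columns)

    previous = None
    for key, members in groupby(rows, key=key_of):
        # A key smaller than its predecessor means the input was not sorted.
        if verify_sort and previous is not None and key < previous:
            raise ValueError(f"sort violation: prev={previous!r} curr={key!r}")
        previous = key
        members = list(members)  # only the current group is materialized
        merged = dict(zip(key_columns, key))
        for column in members[0]:
            if column not in key_columns:
                values = {v for m in members for v in m[column].split("|") if v}
                merged[column] = "|".join(sorted(values))
        yield merged


presorted = [
    {"node1": "john", "label": "zipcode", "node2": "12345", "location": "home", "years": "10"},
    {"node1": "peter", "label": "zipcode", "node2": "12040", "location": "home", "years": ""},
    {"node1": "peter", "label": "zipcode", "node2": "12040", "location": "cabin", "years": ""},
    {"node1": "peter", "label": "zipcode", "node2": "12040", "location": "work", "years": "5"},
    {"node1": "peter", "label": "zipcode", "node2": "12040", "location": "", "years": "6"},
]
for row in compact_presorted(presorted, ["node1", "label", "node2"]):
    print(row)
```

Passing `verify_sort=False` corresponds to input that is grouped but not fully sorted, where the ordering check would raise spurious errors.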
Compacting node2
Is Discouraged¶
If you have a KGTK edge file with normalized edges (no additional columns),
you might want to compact the node2
column using (node1
, label
) as the
key.
For example, using movies as the topic:
node1 | label | node2 |
---|---|---|
terminator2_jd | genre | science_fiction |
terminator2_jd | genre | action |
You intend to create:
node1 | label | node2 |
---|---|---|
terminator2_jd | genre | action|science_fiction |
This would result in an invalid KGTK file, as the node2
column is
not allowed to contain multi-value edges (|
lists) according to the
KGTK File Specification.
Note
If you insist on compacting the node2
column, you can do so using:
kgtk compact --mode=NONE --columns node1 label
Reporting or Filtering Output Rows with Lists¶
kgtk compact --report-lists
causes output rows containing one or more
lists to be reported to the error file.
kgtk compact --exclude-lists
causes output rows containing one or more
lists to be excluded from the output file.
kgtk compact --output-only-lists
will write only output rows containing one or more
lists to the output file.
Note
--exclude-lists
and --output-only-lists
may not be used together.
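The interaction of these options can be sketched as a small routing function in Python. This is a hedged illustration: `route_rows` is a hypothetical helper, rows are modeled as dicts, and a row is treated as containing a list when any value contains a `|` character (escaped pipes inside string values are ignored for simplicity).

```python
def route_rows(rows, exclude_lists=False, output_only_lists=False):
    """Split compacted rows into (primary output, rows containing lists)."""
    if exclude_lists and output_only_lists:
        raise ValueError("exclude_lists and output_only_lists may not be used together")

    primary, with_lists = [], []
    for row in rows:
        has_list = any("|" in value for value in row.values())
        if has_list:
            # These are the rows that would be reported or sent to a list output file.
            with_lists.append(row)
        if (has_list and exclude_lists) or (not has_list and output_only_lists):
            continue
        primary.append(row)
    return primary, with_lists


rows = [
    {"id": "3", "location": "home", "years": ""},
    {"id": "4", "location": "cabin|work", "years": "5|6"},
]
primary, with_lists = route_rows(rows, exclude_lists=True)
print(primary)
print(with_lists)
```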
Usage¶
usage: kgtk compact [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
[--list-output-file LIST_OUTPUT_FILE]
[--columns KEY_COLUMN_NAMES [KEY_COLUMN_NAMES ...]]
[--compact-id [True|False]] [--deduplicate [True|False]]
[--lists-in-input [LISTS_IN_INPUT]]
[--keep-first KEEP_FIRST_NAMES [KEEP_FIRST_NAMES ...]]
[--presorted [True|False]] [--verify-sort [True|False]]
[--report-lists [REPORT_LISTS]]
[--exclude-lists [EXCLUDE_LISTS]]
[--output-only-lists [OUTPUT_ONLY_LISTS]]
[--build-id [True|False]]
[--overwrite-id [optional true|false]]
[--verify-id-unique [optional true|false]]
[--value-hash-width VALUE_HASH_WIDTH]
[--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
[--claim-id-column-name CLAIM_ID_COLUMN_NAME]
[--id-separator ID_SEPARATOR] [-v [optional True|False]]
Copy a KGTK file, compacting multiple records into | lists.
By default, the input file is sorted in memory to achieve the grouping necessary for the compaction algorithm. This may cause memory usage issues for large input files. If the input file has already been sorted (or at least grouped), the `--presorted` option may be used.
Additional options are shown in expert help.
kgtk --expert compact --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--list-output-file LIST_OUTPUT_FILE
A KGTK output file that will contain only the rows
containing lists. This file will have the same columns
as the primary output file. (Optional, use '-' for
stdout.)
--columns KEY_COLUMN_NAMES [KEY_COLUMN_NAMES ...]
The key columns to identify records for compaction.
(default=id for node files, (node1, label, node2, id)
for edge files).
--compact-id [True|False]
Indicate that the ID column in KGTK edge files should
be compacted. Normally, if the ID column exists, it is
not compacted, as there are use cases that need to
maintain distinct lists of secondary edges for each ID
value. (default=False).
--deduplicate [True|False]
Treat all columns as key columns, overriding --columns
and --compact-id. This will remove completely
duplicate records without compacting any new lists.
(default=False).
--lists-in-input [LISTS_IN_INPUT]
Assume that the input file may contain lists (disable
when certain it does not). (default=True).
--keep-first KEEP_FIRST_NAMES [KEEP_FIRST_NAMES ...]
If compaction results in a list of values for any
column on this list, keep only the first value after
sorting. (default=none).
--presorted [True|False]
Indicate that the input has been presorted (or at
least pregrouped) (default=False).
--verify-sort [True|False]
If the input has been presorted, verify its
consistency (disable if only pregrouped).
(default=True).
--report-lists [REPORT_LISTS]
When True, report records with lists to the error
output. (default=False).
--exclude-lists [EXCLUDE_LISTS]
When True, exclude records with lists from the output.
(default=False).
--output-only-lists [OUTPUT_ONLY_LISTS]
When True, only records containing lists will be
written to the primary output file. (default=False).
--build-id [True|False]
Build id values in an id column. (default=False).
--overwrite-id [optional true|false]
When true, replace existing ID values. When false,
copy existing ID values. When --overwrite-id is
omitted, it defaults to False. When --overwrite-id is
supplied without an argument, it is True.
--verify-id-unique [optional true|false]
When true, verify ID uniqueness using an in-memory set
of IDs. When --verify-id-unique is omitted, it
defaults to False. When --verify-id-unique is supplied
without an argument, it is True.
--value-hash-width VALUE_HASH_WIDTH
How many characters should be used in a value hash?
(default=6)
--claim-id-hash-width CLAIM_ID_HASH_WIDTH
How many characters should be used to hash the claim
ID? 0 means do not hash the claim ID. (default=8)
--claim-id-column-name CLAIM_ID_COLUMN_NAME
The name of the claim_id column. (default=claim_id)
--id-separator ID_SEPARATOR
The separator user between ID subfields. (default=-)
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
Compact with Builtin Sorting¶
Suppose that file2.tsv
, which is not presorted,
contains the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
steve | zipcode | 45601 | cabin | |
john | zipcode | 12345 | home | 10 |
steve | zipcode | 45601 | | 4 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | home | |
steve | zipcode | 45601 | home | 1 |
peter | zipcode | 12040 | work | 5 |
peter | zipcode | 12040 | | 6 |
steve | zipcode | 45601 | | 3 |
peter | zipcode | 12040 | cabin | |
steve | zipcode | 45601 | | 5 |
steve | zipcode | 45601 | work | 2 |
Compacting with built-in sorting:
kgtk compact -i examples/docs/compact-file2.tsv
The output will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | cabin|home|work | 1|2|3|4|5 |
Compact with Improperly Sorted Input¶
This example demonstrates that feeding a non-presorted
file to kgtk compact --presorted
generates an error.
kgtk compact -i examples/docs/compact-file2.tsv --presorted
The output will begin with the following on stdout:
node1 | label | node2 | location | years |
---|---|---|---|---|
steve | zipcode | 45601 | cabin | |
The output will end with the following error message on stderr:
Line 3 sort violation going down: prev='john|zipcode|12345' curr='steve|zipcode|45601'
Compact with Presorted Input¶
Suppose that file1.tsv
contains the following table in KGTK format:
(Note: The years
column means years employed, not age.)
kgtk cat -i examples/docs/compact-file1.tsv
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | cabin | |
peter | zipcode | 12040 | work | 5 |
peter | zipcode | 12040 | | 6 |
steve | zipcode | 45601 | | 3 |
steve | zipcode | 45601 | | 4 |
steve | zipcode | 45601 | | 5 |
steve | zipcode | 45601 | home | 1 |
steve | zipcode | 45601 | work | 2 |
steve | zipcode | 45601 | cabin | |
kgtk compact -i examples/docs/compact-file1.tsv --presorted
The output will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | cabin|home|work | 1|2|3|4|5 |
Compact with External Sorting, No id
¶
This example demonstrates a pipeline that sorts edges without an id
field,
using kgtk sort
, before kgtk compact
:
kgtk sort -i examples/docs/compact-file2.tsv \
/ compact --presorted
The output will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | cabin|home|work | 1|2|3|4|5 |
Note
Normally, additional options would be passed to kgtk sort
to
control the amount of memory used, the maximum number of threads,
and the location of the temporary files.
Compact with External Sorting, with id
¶
Suppose that compact-file5.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/compact-file5.tsv
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
E01 | steve | zipcode | 45601 | cabin | |
E02 | john | zipcode | 12345 | home | 10 |
E03 | steve | zipcode | 45601 | | 4 |
E04 | john | zipcode | 12346 | | |
E05 | peter | zipcode | 12040 | home | |
E06 | steve | zipcode | 45601 | home | 1 |
E07 | peter | zipcode | 12040 | work | 5 |
E08 | peter | zipcode | 12040 | | 6 |
E09 | steve | zipcode | 45601 | | 3 |
E10 | peter | zipcode | 12040 | cabin | |
E11 | steve | zipcode | 45601 | | 5 |
E12 | steve | zipcode | 45601 | work | 2 |
This example demonstrates a pipeline that sorts edges with an id
field,
using kgtk sort
, before kgtk compact
:
kgtk sort -i examples/docs/compact-file5.tsv \
/ compact --presorted \
--columns id node1 label node2
The output will be the following table in KGTK format:
id | node1 | label | node2 | location | years |
---|---|---|---|---|---|
E01 | steve | zipcode | 45601 | cabin | |
E02 | john | zipcode | 12345 | home | 10 |
E03 | steve | zipcode | 45601 | | 4 |
E04 | john | zipcode | 12346 | | |
E05 | peter | zipcode | 12040 | home | |
E06 | steve | zipcode | 45601 | home | 1 |
E07 | peter | zipcode | 12040 | work | 5 |
E08 | peter | zipcode | 12040 | | 6 |
E09 | steve | zipcode | 45601 | | 3 |
E10 | peter | zipcode | 12040 | cabin | |
E11 | steve | zipcode | 45601 | | 5 |
E12 | steve | zipcode | 45601 | work | 2 |
Note
kgtk compact
and kgtk sort
use different default key
column orders for KGTK edge files with an id
column, so it
is necessary to specify --columns
for one or both of the
commands. This behavior may change in the future.
Note
Normally, additional options would be passed to kgtk sort
to
control the amount of memory used, the maximum number of threads,
and the location of the temporary files.
Compact with Default Keys¶
Suppose that file3.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/compact-file3.tsv
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin | |
peter | zipcode | 12040 | 4 | work | 5 |
peter | zipcode | 12040 | 4 | | 6 |
steve | zipcode | 45601 | 5 | | 3 |
steve | zipcode | 45601 | 5 | | 4 |
steve | zipcode | 45601 | 5 | | 5 |
steve | zipcode | 45601 | 6 | home | 1 |
steve | zipcode | 45601 | 6 | work | 2 |
steve | zipcode | 45601 | 6 | cabin | |
Compacting with the tuple (node1
, label
, node2
, id
) (the default
for a KGTK edge file) as the key:
kgtk compact -i examples/docs/compact-file3.tsv
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
Note
The default key is (node1
, label
, node2
, id
).
Compact with Default Keys and --compact-id
¶
Compacting with the tuple (node1
, label
, node2
) and --compact-id
.
kgtk compact -i examples/docs/compact-file3.tsv \
--compact-id
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3|4 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | 5|6 | cabin|home|work | 1|2|3|4|5 |
Note
The default key is (node1
, label
, node2
, id
),
but --compact-id
removes id
from the default key.
Compact with Default Keys and --keep-first
¶
Compacting with the tuple (node1
, label
, node2
, id
) (the default
for a KGTK edge file) as the key, and --keep-first location years
.
kgtk compact -i examples/docs/compact-file3.tsv \
--keep-first location years
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin | 5 |
steve | zipcode | 45601 | 5 | | 3 |
steve | zipcode | 45601 | 6 | home | 1 |
Note
The default key is (node1
, label
, node2
, id
).
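The output above suggests that, for a --keep-first column, the value kept is the first non-empty value encountered among the grouped rows (for id 6 the result is home, not the alphabetically first cabin). The Python sketch below models that reading. It is inferred from this page's example output rather than from the KGTK source, and `compact_group` is a hypothetical helper operating on one group of rows that already share the same key.

```python
def compact_group(members, key_columns, keep_first_columns):
    """Merge one group of rows; keep_first columns retain a single value."""
    merged = {c: members[0][c] for c in key_columns}
    for column in members[0]:
        if column in key_columns:
            continue
        # Collect non-empty values in row order, splitting existing lists.
        values = [v for m in members for v in m[column].split("|") if v]
        if column in keep_first_columns:
            # Assumption: keep the first non-empty value seen in row order.
            merged[column] = values[0] if values else ""
        else:
            merged[column] = "|".join(sorted(set(values)))
    return merged


group = [
    {"node1": "steve", "label": "zipcode", "node2": "45601", "id": "6", "location": "home", "years": "1"},
    {"node1": "steve", "label": "zipcode", "node2": "45601", "id": "6", "location": "work", "years": "2"},
    {"node1": "steve", "label": "zipcode", "node2": "45601", "id": "6", "location": "cabin", "years": ""},
]
key = ["node1", "label", "node2", "id"]
print(compact_group(group, key, ["location", "years"]))
```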
Compacting on the ID Column¶
Since the id
values are not duplicated between (node1
, label
, node2
)
tuples in the previous example, compacting on just the id
column yields the same results.
kgtk compact -i examples/docs/compact-file3.tsv --columns id
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
Compacting on (node1
, label
, node2
)¶
Compacting with the tuple (node1
, label
, node2
) as the key (removing
the id
column from the default for a KGTK edge file):
kgtk compact -i examples/docs/compact-file3.tsv --compact-id
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3|4 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | 5|6 | cabin|home|work | 1|2|3|4|5 |
Compacting the node2
Column with --keep-first
¶
Normally, the node2 column should not be compacted, because the KGTK File
Specification prohibits lists in that column. However, you can use
--keep-first to keep just the first node2 value for a (node1, label)
combination in a KGTK edge file, since the implied list being built is
never actually written to the output KGTK file.
Suppose that compact-file6.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/compact-file6.tsv
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
steve | zipcode | 45601 |
john | zipcode | 12346 |
peter | zipcode | 12040 |
peter | zipcode | 12040 |
john | zipcode | 12345 |
peter | zipcode | 12040 |
peter | zipcode | 12040 |
steve | zipcode | 45601 |
steve | zipcode | 45601 |
peter | zipcode | 12040 |
steve | zipcode | 45601 |
steve | zipcode | 45601 |
steve | zipcode | 45601 |
steve | zipcode | 45601 |
steve | zipcode | 45601 |
Compact the node2
column with:
kgtk compact -i examples/docs/compact-file6.tsv --columns node1 label --keep-first node2
node1 | label | node2 |
---|---|---|
john | zipcode | 12345 |
peter | zipcode | 12040 |
steve | zipcode | 45601 |
Deduplication with Builtin Sorting¶
Suppose that file4.tsv
contains the following table in KGTK format,
which is not presorted and which contains some duplicate lines:
kgtk cat -i examples/docs/compact-file4.tsv
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
steve | zipcode | 45601 | work | 2 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | cabin | |
john | zipcode | 12345 | home | 10 |
peter | zipcode | 12040 | work | 5 |
peter | zipcode | 12040 | | 6 |
steve | zipcode | 45601 | | 3 |
steve | zipcode | 45601 | | 3 |
peter | zipcode | 12040 | cabin | |
steve | zipcode | 45601 | | 4 |
steve | zipcode | 45601 | | 5 |
steve | zipcode | 45601 | home | 1 |
steve | zipcode | 45601 | work | 2 |
steve | zipcode | 45601 | cabin | |
Deduplicating with built-in sorting:
kgtk deduplicate -i examples/docs/compact-file4.tsv
The output will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | | 6 |
peter | zipcode | 12040 | cabin | |
peter | zipcode | 12040 | home | |
peter | zipcode | 12040 | work | 5 |
steve | zipcode | 45601 | | 3 |
steve | zipcode | 45601 | | 4 |
steve | zipcode | 45601 | | 5 |
steve | zipcode | 45601 | cabin | |
steve | zipcode | 45601 | home | 1 |
steve | zipcode | 45601 | work | 2 |
The output is sorted and duplicate lines have been removed, without creating any new
multi-valued edges (|
lists).
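In effect, plain deduplication treats the whole row as the key. A minimal Python sketch of that behavior follows, assuming rows are modeled as tuples of strings and that the output order is a plain string sort over all columns (which matches the example output above, but is an assumption about the general case). The name `deduplicate` here is just an illustrative function, not KGTK's API.

```python
def deduplicate(rows):
    """Remove exact duplicate rows and return the survivors in sorted order."""
    return sorted(set(rows))


rows = [
    ("john", "zipcode", "12345", "home", "10"),
    ("steve", "zipcode", "45601", "work", "2"),
    ("peter", "zipcode", "12040", "cabin", ""),
    ("john", "zipcode", "12345", "home", "10"),
    ("peter", "zipcode", "12040", "", "6"),
    ("peter", "zipcode", "12040", "cabin", ""),
]
for row in deduplicate(rows):
    print("\t".join(row))
```

No column is merged into a list here; rows that differ in any column are all preserved.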
Deduplication with --keep-first
¶
kgtk deduplicate -i examples/docs/compact-file4.tsv \
--keep-first location years
The output will be the following table in KGTK format:
node1 | label | node2 | location | years |
---|---|---|---|---|
john | zipcode | 12345 | home | 10 |
john | zipcode | 12346 | | |
peter | zipcode | 12040 | home | 5 |
steve | zipcode | 45601 | work | 2 |
Deduplication with --keep-first
and an id
Column¶
kgtk deduplicate -i examples/docs/compact-file3.tsv \
--keep-first location years
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin | 5 |
steve | zipcode | 45601 | 5 | | 3 |
steve | zipcode | 45601 | 6 | home | 1 |
Deduplication with --compact-id
and --keep-first
¶
kgtk deduplicate -i examples/docs/compact-file3.tsv \
--keep-first location years \
--compact-id
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | 5 |
steve | zipcode | 45601 | 5 | home | 3 |
Reporting Rows with Lists¶
kgtk compact -i examples/docs/compact-file3.tsv --report-lists
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
The following records will be reported to standard error:
'peter\tzipcode\t12040\t4\tcabin|work\t5|6'
'steve\tzipcode\t45601\t5\t\t3|4|5'
'steve\tzipcode\t45601\t6\tcabin|home|work\t1|2'
Excluding Rows with Lists from the Primary Output File¶
kgtk compact -i examples/docs/compact-file3.tsv --exclude-lists
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
Sending Only Rows with Lists to the Primary Output File¶
kgtk compact -i examples/docs/compact-file3.tsv --output-only-lists
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
Sending Rows with Lists to the List Output File¶
kgtk compact -i examples/docs/compact-file3.tsv \
--list-output-file compact-list-output.tsv
The standard output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
The list output file will contain the following table in KGTK format:
kgtk cat -i compact-list-output.tsv
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
Sending Only Rows without Lists to the Primary Output File, and Rows with Lists to the List Output File¶
kgtk compact -i examples/docs/compact-file3.tsv \
--output-file compact-output.tsv \
--exclude-lists \
--list-output-file compact-list-output.tsv
The primary output file will contain the following table in KGTK format:
kgtk cat -i compact-output.tsv
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | 1 | home | 10 |
john | zipcode | 12346 | 2 | | |
peter | zipcode | 12040 | 3 | home | |
The list output file will contain the following table in KGTK format:
kgtk cat -i compact-list-output.tsv
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
peter | zipcode | 12040 | 4 | cabin|work | 5|6 |
steve | zipcode | 45601 | 5 | | 3|4|5 |
steve | zipcode | 45601 | 6 | cabin|home|work | 1|2 |
Building New, Unique IDs for the Compacted Edges.¶
kgtk compact -i examples/docs/compact-file3.tsv \
--build-id --overwrite-id
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | E1 | home | 10 |
john | zipcode | 12346 | E2 | | |
peter | zipcode | 12040 | E3 | home | |
peter | zipcode | 12040 | E4 | cabin|work | 5|6 |
steve | zipcode | 45601 | E5 | | 3|4|5 |
steve | zipcode | 45601 | E6 | cabin|home|work | 1|2 |
Building New, Unique IDs for the Compacted Edges, Compacting the ID column, Too.¶
kgtk compact -i examples/docs/compact-file3.tsv \
--build-id --overwrite-id --compact-id
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | E1 | home | 10 |
john | zipcode | 12346 | E2 | | |
peter | zipcode | 12040 | E3 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | E4 | cabin|home|work | 1|2|3|4|5 |
Expert Example: Using --id-style=node1-label-node2
¶
Using the expert option --id-style=node1-label-node2
, you can generate IDs
that concatenate (node1, label, node2).
kgtk compact -i examples/docs/compact-file3.tsv \
--build-id --overwrite-id --compact-id \
--id-style=node1-label-node2
The output will be the following table in KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | john-zipcode-12345 | home | 10 |
john | zipcode | 12346 | john-zipcode-12346 | | |
peter | zipcode | 12040 | peter-zipcode-12040 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | steve-zipcode-45601 | cabin|home|work | 1|2|3|4|5 |
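The ID values in this table can be reproduced with a one-line join. The Python sketch below is only an illustration of the concatenation: `build_id` is a hypothetical helper, and its `separator` parameter mirrors the default of the --id-separator option shown in the usage text.

```python
def build_id(row, separator="-"):
    """Concatenate node1, label, and node2 into an ID value."""
    return separator.join((row["node1"], row["label"], row["node2"]))


row = {"node1": "john", "label": "zipcode", "node2": "12345"}
print(build_id(row))
```

IDs built this way are only unique when each (node1, label, node2) tuple appears once in the output, which is why the example combines this style with --compact-id.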
Expert Example: Using --id-style=node1-label-node2-num
¶
kgtk compact -i examples/docs/compact-file3.tsv \
--build-id --overwrite-id --compact-id \
--id-style=node1-label-node2-num
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | john-zipcode-12345-0000 | home | 10 |
john | zipcode | 12346 | john-zipcode-12346-0000 | | |
peter | zipcode | 12040 | peter-zipcode-12040-0000 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | steve-zipcode-45601-0000 | cabin|home|work | 1|2|3|4|5 |
Expert Example: Using --id-style=node1-label-node2-id
¶
kgtk compact -i examples/docs/compact-file3.tsv \
--build-id --overwrite-id --compact-id \
--id-style=node1-label-node2-id
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345 | john-zipcode-12345-1 | home | 10 |
john | zipcode | 12346 | john-zipcode-12346-2 | | |
peter | zipcode | 12040 | peter-zipcode-12040-3|4 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | steve-zipcode-45601-5|6 | cabin|home|work | 1|2|3|4|5 |
Expert Example: Compacting on (node1
, label
)¶
Compacting with the tuple (node1
, label
) as the key (removing
the id
and node2
columns from the default for a KGTK edge file)
may produce an invalid KGTK file. Nonetheless, there may be occasions
when this is what you want to do:
kgtk compact -i examples/docs/compact-file3.tsv \
--mode=NONE --columns node1 label
The output will be the following table in quasi-KGTK format:
node1 | label | node2 | id | location | years |
---|---|---|---|---|---|
john | zipcode | 12345|12346 | 1|2 | home | 10 |
peter | zipcode | 12040 | 3|4 | cabin|home|work | 5|6 |
steve | zipcode | 45601 | 5|6 | cabin|home|work | 1|2|3|4|5 |