normalize-edges
Overview¶
kgtk normalize¶
kgtk normalize removes additional columns from a KGTK edge file.
It implements two column removal patterns:
- It reverses kgtk lift, then
- it converts the remaining additional columns to normalized secondary edges.
kgtk lower¶
This alias for kgtk normalize removes additional columns from a KGTK edge file,
reversing kgtk lift. It does not convert other additional columns to secondary edges.
kgtk normalize-edges¶
This alias for kgtk normalize converts all (or selected) additional columns
in a KGTK edge file to secondary edges.
kgtk normalize-nodes¶
kgtk normalize-nodes converts KGTK node files to normalized
KGTK edge files.
Note
kgtk normalize-nodes is currently implemented as a separate
command. In the future, kgtk normalize may provide the same functionality when the
input file is a KGTK node file, with kgtk normalize-nodes as an alias.
Reversing Default kgtk lift¶
By default, kgtk lift creates the following lifted columns:
- node1;label
- label;label
- node2;label
The following input file:
| node1 | label | node2 | node1;label | label;label | node2;label |
|---|---|---|---|---|---|
| Q1 | P1 | Q2 | item1 | isa | group1 |
Would be transformed by kgtk lower into:
| node1 | label | node2 |
|---|---|---|
| Q1 | P1 | Q2 |
| Q1 | label | item1 |
| Q2 | label | group1 |
| P1 | label | isa |
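The transformation above can be sketched in a few lines of Python. This is an illustrative model only, not KGTK's implementation: the function name `lower_lifted` and the dictionary row representation are assumptions made for this example, and the order of the generated edges may differ from what kgtk lower emits.

```python
# Illustrative sketch only: models the "base;label" lift pattern with plain dicts.
KEY_COLUMNS = ("node1", "label", "node2")


def lower_lifted(row):
    """Return the core edge followed by one new edge per non-empty lifted column."""
    edges = [tuple(row[column] for column in KEY_COLUMNS)]
    for column, value in row.items():
        if ";" not in column or not value:
            continue  # not a lifted column, or nothing to lower
        base, new_label = column.split(";", 1)
        if base in KEY_COLUMNS:
            # The value in the base column becomes node1 of the new edge.
            edges.append((row[base], new_label, value))
    return edges


row = {
    "node1": "Q1", "label": "P1", "node2": "Q2",
    "node1;label": "item1", "label;label": "isa", "node2;label": "group1",
}
for edge in lower_lifted(row):
    print("\t".join(edge))
```

Each lifted column name is split at the semicolon: the part before it names the column that supplies node1 for the new edge, and the part after it becomes the new edge's label.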
Conversion to Secondary Edges Requires id¶
If converting additional columns to secondary edges is requested via kgtk normalize-edges
or kgtk normalize --normalize (the default), the input KGTK edge file
must contain an id column.
Note
In the future, there may be an option to generate id values on input edges as needed.
Until then, use the kgtk add-id command to generate
id field values prior to kgtk normalize-edges.
Converting Additional Columns to Normalized Secondary Edges¶
Additional columns that aren't lowered may be converted to normalized secondary edges.
The following input file:
| id | node1 | label | node2 | confidence | reference |
|---|---|---|---|---|---|
| E1 | Q1 | P1 | Q2 | 0.9 | Wikidata |
Would be transformed by kgtk normalize-edges to:
| id | node1 | label | node2 |
|---|---|---|---|
| E1 | Q1 | P1 | Q2 |
| | E1 | confidence | 0.9 |
| | E1 | reference | Wikidata |
Note
The newly generated secondary edges do not themselves have id fields unless the --add-id
option is specified.
You may also use the kgtk add-id command to generate
id field values after kgtk normalize-edges.
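As a rough model of this conversion, the sketch below turns one input row into a primary edge plus secondary edges whose node1 is the primary edge's id. It is a hypothetical illustration (the name `normalize_edge` and the dict-based rows are assumptions), not the code kgtk normalize runs.

```python
# Illustrative sketch only: secondary edges hang off the id of the primary edge.
KEY_COLUMNS = ("id", "node1", "label", "node2")


def normalize_edge(row):
    """Split a row into its primary edge and one secondary edge per extra column."""
    if not row.get("id"):
        # Without an id there is nothing for the secondary edges to point at.
        raise ValueError("an id value is required to build secondary edges")
    primary = {column: row[column] for column in KEY_COLUMNS}
    secondary = [
        # New edges get an empty id here, mirroring the default (no --add-id).
        {"id": "", "node1": row["id"], "label": column, "node2": value}
        for column, value in row.items()
        if column not in KEY_COLUMNS and value
    ]
    return [primary] + secondary


row = {"id": "E1", "node1": "Q1", "label": "P1", "node2": "Q2",
       "confidence": "0.9", "reference": "Wikidata"}
for edge in normalize_edge(row):
    print("\t".join(edge[column] for column in KEY_COLUMNS))
```

The sketch also shows why the id column is required: the id value is the only handle a secondary edge has on the edge it describes.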
Selecting the Additional Columns to Normalize¶
The --columns option may be used to select the columns to normalize.
This option has the aliases --columns-to-lower and --columns-to-remove,
which may be used to increase the legibility of scripts that use the
kgtk normalize command or its aliases.
Additional columns that are not selected for normalization are passed through to the output file.
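The pass-through behaviour can be pictured with a small Python sketch. The function `normalize_selected` and its dict-based rows are assumptions made for illustration; they only model the effect of selecting columns, not how KGTK implements it.

```python
# Illustrative sketch only: selected columns become new edges, the rest pass through.
def normalize_selected(row, selected):
    """Return (row without the selected columns, new edges for the selected columns)."""
    kept = {column: value for column, value in row.items() if column not in selected}
    new_edges = [(row["id"], column, row[column]) for column in selected if row.get(column)]
    return kept, new_edges


row = {"id": "E1", "node1": "Q1", "label": "P1", "node2": "Q2",
       "confidence": "0.9", "reference": "Wikidata"}
kept, new_edges = normalize_selected(row, ["confidence"])
print(kept)
print(new_edges)
```

Here only confidence is selected, so reference stays on the original row while confidence is moved to a new edge.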
Sending New Edges to a Separate File¶
By default, newly created edges are sent to the primary output file, along with edges from the input file.
--new-edges-file NEW_EDGES_FILE may be used to route newly created edges
to a separate file.
Deduplicating New Edges¶
By default, newly created edges are deduplicated. The first instance generated is written to the appropriate output file (either the standard output file or the new edges output file).
--deduplicate-new-edges False disables new edge deduplication.
Deduplication Memory Usage¶
Deduplication uses an in-memory dictionary. It is not suitable for
processing large files that generate large numbers of unique new edges.
In this case, disable deduplication and use kgtk sort and kgtk compact to deduplicate the new
edges as additional processing steps.
kgtk normalize --deduplicate-new-edges False ... / sort / compact
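The memory cost is easy to see in a minimal sketch of first-instance deduplication. This is an assumed model (a Python set stands in for the in-memory dictionary, and `deduplicate` is a hypothetical name), not KGTK's code.

```python
# Illustrative sketch only: every unique new edge must be remembered until the end.
def deduplicate(edges):
    """Yield each edge the first time it is seen, preserving input order."""
    seen = set()  # grows by one entry per unique edge
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            yield edge


new_edges = [
    ("Q1", "label", '"Elmo"'),
    ("P1", "label", '"instance of"'),
    ("Q1", "label", '"Elmo"'),
]
print(list(deduplicate(new_edges)))
```

Because `seen` holds every unique edge for the whole run, its size is proportional to the number of distinct new edges. A sort followed by a compaction pass can instead work from sorted data, where duplicates are adjacent.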
Expanding Lists¶
Lists of values in additional columns being lowered/normalized are expanded into separate output edges, one for each nonempty element of the list.
Note
This is required by the KGTK File Format v2.0,
which prohibits lists in the node2 column.
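List expansion can be sketched as follows. The sketch assumes that list elements are separated by a vertical bar and that a vertical bar inside a value is written as `\|`; the function name `expand_list` is hypothetical, and the handling of a backslash that is itself escaped is deliberately left out.

```python
import re

# Illustrative sketch only: split on "|" unless it is preceded by a backslash.
_UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def expand_list(node1, label, value):
    """Return one (node1, label, element) edge per non-empty list element."""
    return [(node1, label, item) for item in _UNESCAPED_PIPE.split(value) if item]


for edge in expand_list("P2", "label", '"amigo"|"friend"'):
    print("\t".join(edge))
```

Empty elements are skipped, so a value with two adjacent separators still produces only well-formed edges.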
Generating ID Values¶
When --add-id is specified, kgtk normalize generates ID values for output
edges that were created as a result of normalization. This code
is somewhat experimental and may be revised in the future. Alternatively,
the output from kgtk normalize may be piped to kgtk add-id.
Usage¶
usage: kgtk normalize [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
[--new-edges-file NEW_EDGES_FILE]
[--columns COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...]]
[--add-id [True|False]] [--lower [True|False]]
[--normalize [True|False]]
[--deduplicate-new-edges [True|False]]
[--overwrite-id [optional true|false]]
[--verify-id-unique [optional true|false]]
[--value-hash-width VALUE_HASH_WIDTH]
[--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
[--claim-id-column-name CLAIM_ID_COLUMN_NAME]
[--id-separator ID_SEPARATOR] [-v [optional True|False]]
Normalize a KGTK edge file by removing columns that match a "lift" pattern and converting remaining additional columns to new edges.
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--new-edges-file NEW_EDGES_FILE
An optional output file for new edges (normalized
and/or lowered). If omitted, new edges will go in the
main output file. (Optional, use '-' for stdout.)
--columns COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...], --columns-to-lower COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...], --columns-to-remove COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...]
Columns to lower and remove as a space-separated list.
(default=all columns other than key columns)
--add-id [True|False]
When True, add an id column to the output (if not
already present). (default=False)
--lower [True|False] When True, lower columns that match a lift pattern.
(default=True)
--normalize [True|False]
When True, normalize columns that do not match a lift
pattern. (default=True)
--deduplicate-new-edges [True|False]
When True, deduplicate new edges. Not suitable for
large files. (default=True).
--overwrite-id [optional true|false]
When true, replace existing ID values. When false,
copy existing ID values. When --overwrite-id is
omitted, it defaults to False. When --overwrite-id is
supplied without an argument, it is True.
--verify-id-unique [optional true|false]
When true, verify ID uniqueness using an in-memory set
of IDs. When --verify-id-unique is omitted, it
defaults to False. When --verify-id-unique is supplied
without an argument, it is True.
--value-hash-width VALUE_HASH_WIDTH
How many characters should be used in a value hash?
(default=6)
--claim-id-hash-width CLAIM_ID_HASH_WIDTH
How many characters should be used to hash the claim
ID? 0 means do not hash the claim ID. (default=8)
--claim-id-column-name CLAIM_ID_COLUMN_NAME
The name of the claim_id column. (default=claim_id)
--id-separator ID_SEPARATOR
The separator user between ID subfields. (default=-)
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
Sample Data¶
Suppose file1.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/normalize-file1.tsv
| node1 | label | node2 | node1;label | label;label | node2;label |
|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" |
Note
This file was generated using kgtk lift:
kgtk lift --input-file examples/docs/lift-file4.tsv -o examples/docs/normalize-file1.tsv
Reversing kgtk lift with kgtk lower¶
kgtk lower -i examples/docs/normalize-file1.tsv
| node1 | label | node2 |
|---|---|---|
| Q1 | P1 | Q5 |
| Q1 | label | "Elmo" |
| P1 | label | "instance of" |
| Q5 | label | "homo sapiens" |
| Q5 | label | "human" |
| Q1 | P2 | Q6 |
| P2 | label | "amigo" |
| P2 | label | "friend" |
| Q6 | label | "Fred" |
| Q6 | P1 | Q5 |
Note
The node1;label, label;label, and node2;label columns were
recognized as lift-pattern additional columns. They were removed
from the output file and their contents generated as label records.
Note
The list "amigo"|"friend" in the input file generated two output records.
Reversing kgtk lift with kgtk lower and Without Deduplication¶
kgtk lower -i examples/docs/normalize-file1.tsv \
--deduplicate-new-edges False
| node1 | label | node2 |
|---|---|---|
| Q1 | P1 | Q5 |
| Q1 | label | "Elmo" |
| P1 | label | "instance of" |
| Q5 | label | "homo sapiens" |
| Q5 | label | "human" |
| Q1 | P2 | Q6 |
| Q1 | label | "Elmo" |
| P2 | label | "amigo" |
| P2 | label | "friend" |
| Q6 | label | "Fred" |
| Q6 | P1 | Q5 |
| Q6 | label | "Fred" |
| P1 | label | "instance of" |
| Q5 | label | "homo sapiens" |
| Q5 | label | "human" |
Reversing kgtk lift with kgtk lower / sort / compact¶
kgtk lower -i examples/docs/normalize-file1.tsv \
--deduplicate-new-edges False \
/ sort \
/ compact
| node1 | label | node2 |
|---|---|---|
| P1 | label | "instance of" |
| P2 | label | "amigo" |
| P2 | label | "friend" |
| Q1 | P1 | Q5 |
| Q1 | P2 | Q6 |
| Q1 | label | "Elmo" |
| Q5 | label | "homo sapiens" |
| Q5 | label | "human" |
| Q6 | P1 | Q5 |
| Q6 | label | "Fred" |
Note
This processing pipeline is suitable for use with larger files.
Typically, additional parameters will be passed to kgtk sort
to control the number of threads used, the amount of main memory
used, and the filesystem location for temporary files. These
details have been omitted for clarity.
Reversing Just kgtk lift with kgtk normalize¶
kgtk normalize -i examples/docs/normalize-file1.tsv \
--normalize False
| node1 | label | node2 |
|---|---|---|
| Q1 | P1 | Q5 |
| Q1 | label | "Elmo" |
| P1 | label | "instance of" |
| Q5 | label | "homo sapiens" |
| Q5 | label | "human" |
| Q1 | P2 | Q6 |
| P2 | label | "amigo" |
| P2 | label | "friend" |
| Q6 | label | "Fred" |
| Q6 | P1 | Q5 |
Note
--normalize False is required because the input file does not have an id column.
Directing New Edges to a Separate File¶
kgtk lower -i examples/docs/normalize-file1.tsv \
--new-edges-file new.tsv
| node1 | label | node2 |
|---|---|---|
| Q1 | P1 | Q5 |
| Q1 | P2 | Q6 |
| Q6 | P1 | Q5 |
kgtk cat -i new.tsv
| node1 | label | node2 |
|---|---|---|
| Q1 | label | "Elmo" |
| P1 | label | "instance of" |
| Q5 | label | "homo sapiens" |
| Q5 | label | "human" |
| P2 | label | "amigo" |
| P2 | label | "friend" |
| Q6 | label | "Fred" |
Sample Data with id and a Non-lift Additional Column¶
Suppose file2.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/normalize-file2.tsv
| node1 | label | node2 | node1;label | label;label | node2;label | id | confidence |
|---|---|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" | E1 | 0.3 |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" | E2 | 0.9 |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" | E3 | 0.8 |
Normalizing Both Lift and Non-lift Additional Columns¶
kgtk normalize -i examples/docs/normalize-file2.tsv
| node1 | label | node2 | id |
|---|---|---|---|
| Q1 | P1 | Q5 | E1 |
| Q1 | label | "Elmo" | |
| P1 | label | "instance of" | |
| Q5 | label | "homo sapiens" | |
| Q5 | label | "human" | |
| E1 | confidence | 0.3 | |
| Q1 | P2 | Q6 | E2 |
| P2 | label | "amigo" | |
| P2 | label | "friend" | |
| Q6 | label | "Fred" | |
| E2 | confidence | 0.9 | |
| Q6 | P1 | Q5 | E3 |
| E3 | confidence | 0.8 | |
Note
The additional columns have been removed from the output file. Their contents appear as a mixture of label edges and secondary edges.
Normalizing Just a Non-lift Additional Column¶
Let's normalize just the non-lift additional column:
kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
--columns-to-remove confidence
| node1 | label | node2 | node1;label | label;label | node2;label | id |
|---|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" | E1 |
| E1 | confidence | 0.3 | | | | |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" | E2 |
| E2 | confidence | 0.9 | | | | |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" | E3 |
| E3 | confidence | 0.8 | | | | |
Note
The confidence column has been removed from the output file.
Its contents appear as new secondary edges.
Normalizing a Non-lift Additional Column and Adding IDs Externally¶
Let's normalize just the non-lift additional column.
To avoid generating the same ID values as existing IDs,
the newly generated edge IDs are generated with the prefix N.
kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
--columns-to-remove confidence \
/ add-id --id-prefix N
| node1 | label | node2 | node1;label | label;label | node2;label | id |
|---|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" | E1 |
| E1 | confidence | 0.3 | | | | N1 |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" | E2 |
| E2 | confidence | 0.9 | | | | N2 |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" | E3 |
| E3 | confidence | 0.8 | | | | N3 |
Normalizing a Non-lift Additional Column and Adding IDs Externally with an Initial ID¶
Let's normalize just the non-lift additional column.
To avoid generating the same ID values as existing IDs,
the newly generated edge IDs are generated with the initial value E100.
kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
--columns-to-remove confidence \
/ add-id --initial-id 100
| node1 | label | node2 | node1;label | label;label | node2;label | id |
|---|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" | E1 |
| E1 | confidence | 0.3 | | | | E100 |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" | E2 |
| E2 | confidence | 0.9 | | | | E101 |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" | E3 |
| E3 | confidence | 0.8 | | | | E102 |
Normalizing a Non-lift Additional Column and Adding IDs Externally with node1-label-node2-num¶
Let's normalize just the non-lift additional column.
To avoid generating the same ID values as existing IDs,
the newly generated edge IDs are built from the node1, label, and node2 values of each new edge, followed by a sequence number.
kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
--columns-to-remove confidence \
/ add-id --id-style node1-label-node2-num
| node1 | label | node2 | node1;label | label;label | node2;label | id |
|---|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" | E1 |
| E1 | confidence | 0.3 | | | | E1-confidence-0.3-0000 |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" | E2 |
| E2 | confidence | 0.9 | | | | E2-confidence-0.9-0000 |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" | E3 |
| E3 | confidence | 0.8 | | | | E3-confidence-0.8-0000 |
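The shape of these IDs can be reproduced with a short Python sketch. It is not the kgtk add-id implementation: the function `add_ids` is hypothetical, and the sketch assumes a four-digit counter kept per (node1, label, node2) triple, which is consistent with every ID in the table above ending in 0000.

```python
from collections import defaultdict


# Illustrative sketch only: assumes one zero-padded counter per (node1, label, node2).
def add_ids(edges, separator="-", width=4):
    """Fill in missing id values as node1-label-node2-NNNN; keep existing ids."""
    counters = defaultdict(int)
    result = []
    for edge in edges:
        if edge.get("id"):
            result.append(dict(edge))  # existing IDs are copied unchanged
            continue
        key = (edge["node1"], edge["label"], edge["node2"])
        number = counters[key]
        counters[key] += 1
        new_id = separator.join(key + (f"{number:0{width}d}",))
        result.append({**edge, "id": new_id})
    return result


edges = [
    {"node1": "Q1", "label": "P1", "node2": "Q5", "id": "E1"},
    {"node1": "E1", "label": "confidence", "node2": "0.3", "id": ""},
]
for edge in add_ids(edges):
    print(edge["id"])
```

Because the ID embeds the new edge's own node1, which is the ID of the primary edge, it cannot collide with the short E1, E2, E3 values already present.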
Normalizing a Non-lift Additional Column and Adding IDs Internally¶
Let's normalize just the non-lift additional column, using the internal option to add IDs with its default settings. This avoids the need to use a KGTK pipe.
kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
--columns-to-remove confidence \
--add-id
| node1 | label | node2 | node1;label | label;label | node2;label | id |
|---|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" | E1 |
| E1 | confidence | 0.3 | | | | E1-confidence-0.3-0000 |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" | E2 |
| E2 | confidence | 0.9 | | | | E2-confidence-0.9-0000 |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" | E3 |
| E3 | confidence | 0.8 | | | | E3-confidence-0.8-0000 |
Note
The sequence number generated by the internal ID generation code
may be different from the sequence number generated by an external kgtk add-id
pipe. That potential difference is not illustrated here.
Reversing kgtk lift with Other Labels¶
Suppose file3.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/normalize-file3.tsv
| node1 | label | node2 | node1;name | label;relationship | node2;name |
|---|---|---|---|---|---|
| Q1 | P1 | Q5 | "Elmo" | "instance of" | "homo sapiens"|"human" |
| Q1 | P2 | Q6 | "Elmo" | "amigo"|"friend" | "Fred" |
| Q6 | P1 | Q5 | "Fred" | "instance of" | "homo sapiens"|"human" |
Lowering the additional columns with default settings:
kgtk lower -i examples/docs/normalize-file3.tsv
| node1 | label | node2 |
|---|---|---|
| Q1 | P1 | Q5 |
| Q1 | name | "Elmo" |
| P1 | relationship | "instance of" |
| Q5 | name | "homo sapiens" |
| Q5 | name | "human" |
| Q1 | P2 | Q6 |
| P2 | relationship | "amigo" |
| P2 | relationship | "friend" |
| Q6 | name | "Fred" |
| Q6 | P1 | Q5 |
Expert Example: Lowering with Base Columns¶
Suppose file4.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/normalize-file4.tsv
| node1 | label | node2 | color | material | size |
|---|---|---|---|---|---|
| block1 | isa | cube | red | wood | large |
| block2 | isa | pyramid | blue | steel | small |
In this case, the additional columns color, material, and size are
all attributes of the entity in node1, but without the node1; prefix.
These columns can be lowered by supplying a base column for each
column to be lowered using the expert option --base-columns BASE_COLUMNS ....
The columns to be lowered must be specified with --columns COLUMNS_TO_LOWER
(or an alias to this option), and there must be one base column specified for each
column to lower.
kgtk lower -i examples/docs/normalize-file4.tsv \
--columns color material size \
--base-columns node1 node1 node1
| node1 | label | node2 |
|---|---|---|
| block1 | isa | cube |
| block1 | color | red |
| block1 | material | wood |
| block1 | size | large |
| block2 | isa | pyramid |
| block2 | color | blue |
| block2 | material | steel |
| block2 | size | small |
Note
Another approach would be to rename the columns on input to names
such as node1;color.
See kgtk cat for an example of renaming columns
on input.
Expert Example: Lowering with Base Columns and Label Values¶
Suppose file4.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/normalize-file4.tsv
| node1 | label | node2 | color | material | size |
|---|---|---|---|---|---|
| block1 | isa | cube | red | wood | large |
| block2 | isa | pyramid | blue | steel | small |
In this case, the additional columns color, material, and size are
all attributes of the entity in node1, but without the node1; prefix.
These columns can be lowered by supplying a base column for each
column to be lowered using the expert option --base-columns BASE_COLUMNS ....
The columns to be lowered must be specified with --columns COLUMNS_TO_LOWER
(or an alias to this option), and there must be one base column specified for each
column to lower.
Furthermore, suppose that the relationships in the label edges must be all capital
letters. The expert option --label-values LABEL_VALUES ... can be
used to supply the label values to use. There must be one label value specified
for each column to lower.
kgtk lower -i examples/docs/normalize-file4.tsv \
--columns color material size \
--base-columns node1 node1 node1 \
--label-values COLOR MATERIAL SIZE
| node1 | label | node2 |
|---|---|---|
| block1 | isa | cube |
| block1 | COLOR | red |
| block1 | MATERIAL | wood |
| block1 | SIZE | large |
| block2 | isa | pyramid |
| block2 | COLOR | blue |
| block2 | MATERIAL | steel |
| block2 | SIZE | small |
Note
Another approach would be to rename the columns on input to names
such as node1;COLOR.
See kgtk cat for an example of renaming columns
on input.
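The pairing of columns, base columns, and label values in the two expert examples can be modelled with a short Python sketch. The function `lower_with_bases` and its arguments are assumptions made for illustration, not part of KGTK.

```python
# Illustrative sketch only: pairs each column with a base column and a label value.
def lower_with_bases(row, columns, base_columns, label_values=None):
    """Return the core edge plus one edge per non-empty column to lower."""
    if len(columns) != len(base_columns):
        raise ValueError("one base column is needed for each column to lower")
    # Without explicit label values, the column name itself is used as the label.
    labels = list(columns) if label_values is None else list(label_values)
    if len(labels) != len(columns):
        raise ValueError("one label value is needed for each column to lower")
    edges = [(row["node1"], row["label"], row["node2"])]
    for column, base, label in zip(columns, base_columns, labels):
        if row.get(column):
            edges.append((row[base], label, row[column]))
    return edges


row = {"node1": "block1", "label": "isa", "node2": "cube",
       "color": "red", "material": "wood", "size": "large"}
columns = ["color", "material", "size"]
bases = ["node1", "node1", "node1"]
for edge in lower_with_bases(row, columns, bases, ["COLOR", "MATERIAL", "SIZE"]):
    print("\t".join(edge))
```

The length checks reflect the requirement that each column to lower has exactly one base column and, when label values are supplied, exactly one label value.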