normalize-edges

Overview

kgtk normalize

kgtk normalize removes additional columns from a KGTK edge file. It implements two column removal patterns:

  • It reverses kgtk lift, then
  • it converts the remaining additional columns to normalized secondary edges.

kgtk lower

This alias for kgtk normalize removes additional columns from a KGTK edge file, reversing kgtk lift. It does not convert other additional columns to secondary edges.

kgtk normalize-edges

This alias for kgtk normalize converts all (or selected) additional columns in a KGTK edge file to secondary edges.

kgtk normalize-nodes

kgtk normalize-nodes converts KGTK node files to normalized KGTK edge files.

Note

kgtk normalize-nodes is currently implemented as a separate command. In the future, kgtk normalize may provide the same functionality when the input file is a KGTK node file, with kgtk normalize-nodes as an alias.

Reversing Default kgtk lift

By default, kgtk lift creates the following lifted columns:

  • node1;label
  • label;label
  • node2;label

The following input file:

node1 label node2 node1;label label;label node2;label
Q1 P1 Q2 item1 isa group1

Would be transformed by kgtk lower into:

node1 label node2
Q1 P1 Q2
Q1 label item1
Q2 label group1
P1 label isa
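
The lowering step above can be sketched in Python. This is an illustrative sketch only, not KGTK's implementation: the function name lower_row and the dictionary-based row are assumptions made for the example, and the order of the generated edges may differ from what kgtk lower emits.

```python
# Illustrative sketch only; lower_row is a hypothetical helper, not a KGTK API.
KEY_COLUMNS = ("node1", "label", "node2")

def lower_row(row):
    """Return the base edge plus one new edge per nonempty lifted column."""
    edges = [(row["node1"], row["label"], row["node2"])]
    for column, value in row.items():
        if ";" not in column or not value:
            continue
        base, new_label = column.split(";", 1)
        if base in KEY_COLUMNS:
            # A column named "base;label" describes the symbol in the base column.
            edges.append((row[base], new_label, value))
    return edges

row = {
    "node1": "Q1", "label": "P1", "node2": "Q2",
    "node1;label": "item1", "label;label": "isa", "node2;label": "group1",
}
for edge in lower_row(row):
    print("\t".join(edge))
```

Each lifted column name is split at the first semicolon: the left part names the column whose value becomes node1 of the new edge, and the right part becomes the new edge's label.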

Conversion to Secondary Edges Requires id

If converting additional columns to secondary edges is requested via kgtk normalize-edges or kgtk normalize --normalize (the default), the input KGTK edge file must contain an id column.

Note

In the future, there may be an option to generate id values on input edges as needed. Until then, use the kgtk add-id command to generate id field values prior to kgtk normalize-edges.

Converting Additional Columns to Normalized Secondary Edges

Additional columns that aren't lowered may be converted to normalized secondary edges.

The following input file:

id node1 label node2 confidence reference
E1 Q1 P1 Q2 0.9 Wikidata

Would be transformed by kgtk normalize-edges to:

id node1 label node2
E1 Q1 P1 Q2
E1 confidence 0.9
E1 reference Wikidata
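
The conversion above can be sketched in Python to show why the id column is required: the id of the input edge becomes the node1 of each secondary edge. This is a sketch under stated assumptions, not KGTK's code; normalize_row and the dictionary-based row are hypothetical names chosen for the example.

```python
# Illustrative sketch only; normalize_row is a hypothetical helper, not a KGTK API.
KEY_COLUMNS = ("id", "node1", "label", "node2")

def normalize_row(row):
    """Split one edge row into its primary edge and its secondary edges."""
    if not row.get("id"):
        raise ValueError("secondary edges need an id value on the input edge")
    primary = {name: row[name] for name in KEY_COLUMNS}
    secondary = [
        # The new edge has no id of its own; it points back at the primary edge.
        {"id": "", "node1": row["id"], "label": column, "node2": value}
        for column, value in row.items()
        if column not in KEY_COLUMNS and value
    ]
    return [primary] + secondary

row = {"id": "E1", "node1": "Q1", "label": "P1", "node2": "Q2",
       "confidence": "0.9", "reference": "Wikidata"}
for edge in normalize_row(row):
    print("\t".join(edge[name] for name in KEY_COLUMNS))
```

The secondary edges are printed with an empty id field, which matches the behavior described in the note that follows.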

Note

The newly generated secondary edges do not themselves have id fields unless the --add-id option is specified. You may also use the kgtk add-id command to generate id field values after kgtk normalize-edges.

Selecting the Additional Columns to Normalize

The --columns option may be used to select the columns to normalize. This option has the aliases --columns-to-lower and --columns-to-remove, which may be used to increase the legibility of scripts that use the kgtk normalize command or its aliases.

Additional columns that are not selected for normalization are passed through to the output file.
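
As a rough Python sketch of this selection behavior (normalize_selected is a hypothetical helper, not part of KGTK), selected columns are removed from the row and turned into new edges, while unselected additional columns stay on the row:

```python
# Illustrative sketch only; normalize_selected is a hypothetical helper.
def normalize_selected(row, columns):
    """Return the row without the selected columns, plus the new edges."""
    kept = {name: value for name, value in row.items() if name not in columns}
    new_edges = [
        (row["id"], name, row[name])
        for name in columns
        if row.get(name)  # empty cells produce no edge
    ]
    return kept, new_edges

row = {"node1": "Q1", "label": "P1", "node2": "Q5",
       "node1;label": '"Elmo"', "id": "E1", "confidence": "0.3"}
kept, new_edges = normalize_selected(row, ["confidence"])
print(kept)
print(new_edges)
```

Here the node1;label column is not selected, so it is passed through unchanged, while confidence becomes a secondary edge on E1.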

Sending New Edges to a Separate File

By default, newly created edges are sent to the primary output file, along with edges from the input file.

--new-edges-file NEW_EDGES_FILE may be used to route newly created edges to a separate file.

Deduplicating New Edges

By default, newly created edges are deduplicated. The first instance generated is written to the appropriate output file (either the standard output file or the new edges output file).

--deduplicate-new-edges False disables new edge deduplication.

Deduplication Memory Usage

Deduplication uses an in-memory dictionary, so its memory use grows with the number of unique newly generated edges. It is not suitable for processing large files that produce many unique new edges. In that case, disable deduplication and use kgtk sort and kgtk compact to deduplicate the new edges as additional processing steps.

kgtk normalize --deduplicate-new-edges False ... / sort / compact
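
The memory tradeoff can be seen in a minimal Python sketch of first-instance deduplication. This is an assumption-laden illustration, not KGTK's code: it uses a set where the text describes a dictionary, and deduplicate is a hypothetical name.

```python
# Illustrative sketch only; deduplicate is a hypothetical helper.
def deduplicate(edges):
    """Yield each distinct edge once, keeping the first instance seen."""
    seen = set()  # grows by one entry per unique edge, hence the memory cost
    for edge in edges:
        if edge in seen:
            continue
        seen.add(edge)
        yield edge

new_edges = [
    ("Q1", "label", '"Elmo"'),
    ("P1", "label", '"instance of"'),
    ("Q1", "label", '"Elmo"'),  # duplicate, dropped
]
for edge in deduplicate(new_edges):
    print("\t".join(edge))
```

Every unique edge must stay in memory until the input is exhausted, which is why an external sort followed by compaction scales better for large inputs.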

Expanding Lists

Lists of values in additional columns being lowered or normalized are expanded into separate output edges, one for each nonempty element of the list.

Note

This is required by the KGTK File Format v2.0, which prohibits lists in the node2 column.
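
List expansion can be sketched in Python as follows. The sketch assumes the list separator is the vertical bar shown in the examples on this page and that values contain no escaped vertical bars; expand_list and lower_list are hypothetical names, not KGTK functions.

```python
# Illustrative sketch only. Assumes "|" separates list items and that no
# item contains an escaped "|"; real KGTK values may need escape handling.
def expand_list(value):
    """Split a list cell into its nonempty elements."""
    return [item for item in value.split("|") if item]

def lower_list(node1, label, value):
    """Create one output edge per nonempty list element."""
    return [(node1, label, item) for item in expand_list(value)]

for edge in lower_list("P2", "label", '"amigo"|"friend"'):
    print("\t".join(edge))
```

A cell holding a single value yields one edge, and an empty cell yields none.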

Generating ID Values

When --add-id is specified, kgtk normalize generates ID values for the output edges created by normalization. This code is somewhat experimental and may be revised in the future. Alternatively, the output from kgtk normalize may be piped to kgtk add-id.

Usage

usage: kgtk normalize [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
                      [--new-edges-file NEW_EDGES_FILE]
                      [--columns COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...]]
                      [--add-id [True|False]] [--lower [True|False]]
                      [--normalize [True|False]]
                      [--deduplicate-new-edges [True|False]]
                      [--overwrite-id [optional true|false]]
                      [--verify-id-unique [optional true|false]]
                      [--value-hash-width VALUE_HASH_WIDTH]
                      [--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
                      [--claim-id-column-name CLAIM_ID_COLUMN_NAME]
                      [--id-separator ID_SEPARATOR] [-v [optional True|False]]

Normalize a KGTK edge file by removing columns that match a "lift" pattern and converting remaining additional columns to new edges.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  --new-edges-file NEW_EDGES_FILE
                        An optional output file for new edges (normalized
                        and/or lowered). If omitted, new edges will go in the
                        main output file. (Optional, use '-' for stdout.)
  --columns COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...], --columns-to-lower COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...], --columns-to-remove COLUMNS_TO_LOWER [COLUMNS_TO_LOWER ...]
                        Columns to lower and remove as a space-separated list.
                        (default=all columns other than key columns)
  --add-id [True|False]
                        When True, add an id column to the output (if not
                        already present). (default=False)
  --lower [True|False]  When True, lower columns that match a lift pattern.
                        (default=True)
  --normalize [True|False]
                        When True, normalize columns that do not match a lift
                        pattern. (default=True)
  --deduplicate-new-edges [True|False]
                        When True, deduplicate new edges. Not suitable for
                        large files. (default=True).
  --overwrite-id [optional true|false]
                        When true, replace existing ID values. When false,
                        copy existing ID values. When --overwrite-id is
                        omitted, it defaults to False. When --overwrite-id is
                        supplied without an argument, it is True.
  --verify-id-unique [optional true|false]
                        When true, verify ID uniqueness using an in-memory set
                        of IDs. When --verify-id-unique is omitted, it
                        defaults to False. When --verify-id-unique is supplied
                        without an argument, it is True.
  --value-hash-width VALUE_HASH_WIDTH
                        How many characters should be used in a value hash?
                        (default=6)
  --claim-id-hash-width CLAIM_ID_HASH_WIDTH
                        How many characters should be used to hash the claim
                        ID? 0 means do not hash the claim ID. (default=8)
  --claim-id-column-name CLAIM_ID_COLUMN_NAME
                        The name of the claim_id column. (default=claim_id)
  --id-separator ID_SEPARATOR
                        The separator user between ID subfields. (default=-)

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

Sample Data

Suppose file1.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/normalize-file1.tsv
node1 label node2 node1;label label;label node2;label
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human"
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred"
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human"

Note

This file was generated using kgtk lift:

kgtk lift --input-file examples/docs/lift-file4.tsv -o examples/docs/normalize-file1.tsv

Reversing kgtk lift with kgtk lower

kgtk lower -i examples/docs/normalize-file1.tsv
node1 label node2
Q1 P1 Q5
Q1 label "Elmo"
P1 label "instance of"
Q5 label "homo sapiens"
Q5 label "human"
Q1 P2 Q6
P2 label "amigo"
P2 label "friend"
Q6 label "Fred"
Q6 P1 Q5

Note

The node1;label, label;label, and node2;label columns were recognized as lift-pattern additional columns. They were removed from the output file and their contents generated as label records.

Note

The list "amigo"|"friend" in the input file generated two output records.

Reversing kgtk lift with kgtk lower Without Deduplication

kgtk lower -i examples/docs/normalize-file1.tsv \
           --deduplicate-new-edges False
node1 label node2
Q1 P1 Q5
Q1 label "Elmo"
P1 label "instance of"
Q5 label "homo sapiens"
Q5 label "human"
Q1 P2 Q6
Q1 label "Elmo"
P2 label "amigo"
P2 label "friend"
Q6 label "Fred"
Q6 P1 Q5
Q6 label "Fred"
P1 label "instance of"
Q5 label "homo sapiens"
Q5 label "human"

Reversing kgtk lift with kgtk lower / sort / compact

kgtk lower -i examples/docs/normalize-file1.tsv \
           --deduplicate-new-edges False \
   / sort \
   / compact
node1 label node2
P1 label "instance of"
P2 label "amigo"
P2 label "friend"
Q1 P1 Q5
Q1 P2 Q6
Q1 label "Elmo"
Q5 label "homo sapiens"
Q5 label "human"
Q6 P1 Q5
Q6 label "Fred"

Note

This processing pipeline is suitable for use with larger files. Typically, additional parameters will be passed to kgtk sort to control the number of threads used, the amount of main memory used, and the filesystem location for temporary files. These details have been omitted for clarity.

Reversing Just kgtk lift with kgtk normalize

kgtk normalize -i examples/docs/normalize-file1.tsv \
               --normalize False
node1 label node2
Q1 P1 Q5
Q1 label "Elmo"
P1 label "instance of"
Q5 label "homo sapiens"
Q5 label "human"
Q1 P2 Q6
P2 label "amigo"
P2 label "friend"
Q6 label "Fred"
Q6 P1 Q5

Note

--normalize False is required because the input file does not have an id column.

Directing New Edges to a Separate File

kgtk lower -i examples/docs/normalize-file1.tsv \
           --new-edges-file new.tsv
node1 label node2
Q1 P1 Q5
Q1 P2 Q6
Q6 P1 Q5
kgtk cat -i new.tsv
node1 label node2
Q1 label "Elmo"
P1 label "instance of"
Q5 label "homo sapiens"
Q5 label "human"
P2 label "amigo"
P2 label "friend"
Q6 label "Fred"

Sample Data with id and a Non-lift Additional Column

Suppose file2.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/normalize-file2.tsv
node1 label node2 node1;label label;label node2;label id confidence
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human" E1 0.3
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred" E2 0.9
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human" E3 0.8

Normalizing Both Lift and Non-lift Additional Columns

kgtk normalize -i examples/docs/normalize-file2.tsv
node1 label node2 id
Q1 P1 Q5 E1
Q1 label "Elmo"
P1 label "instance of"
Q5 label "homo sapiens"
Q5 label "human"
E1 confidence 0.3
Q1 P2 Q6 E2
P2 label "amigo"
P2 label "friend"
Q6 label "Fred"
E2 confidence 0.9
Q6 P1 Q5 E3
E3 confidence 0.8

Note

The additional columns have been removed from the output file. Their contents appear as a mixture of label edges and secondary edges.

Normalizing Just a Non-lift Additional Column

Let's normalize just the non-lift additional column:

kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
                     --columns-to-remove confidence
node1 label node2 node1;label label;label node2;label id
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human" E1
E1 confidence 0.3
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred" E2
E2 confidence 0.9
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human" E3
E3 confidence 0.8

Note

The confidence column has been removed from the output file. Its contents appear as new secondary edges.

Normalizing a Non-lift Additional Column and Adding IDs Externally

Let's normalize just the non-lift additional column. To avoid generating the same ID values as existing IDs, the newly generated edge IDs are given the prefix N.

kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
                     --columns-to-remove confidence \
  / add-id --id-prefix N
node1 label node2 node1;label label;label node2;label id
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human" E1
E1 confidence 0.3 N1
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred" E2
E2 confidence 0.9 N2
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human" E3
E3 confidence 0.8 N3

Normalizing a Non-lift Additional Column and Adding IDs Externally with an Initial ID

Let's normalize just the non-lift additional column. To avoid generating the same ID values as existing IDs, the newly generated edge IDs are generated with the initial value E100.

kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
                     --columns-to-remove confidence \
  / add-id --initial-id 100
node1 label node2 node1;label label;label node2;label id
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human" E1
E1 confidence 0.3 E100
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred" E2
E2 confidence 0.9 E101
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human" E3
E3 confidence 0.8 E102
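
The prefix and initial-value behavior seen in the two examples above can be approximated with a short Python sketch. It assumes that only rows with an empty id receive a new value and that the counter advances once per assigned ID, which is consistent with the outputs shown but is not confirmed here; add_ids is a hypothetical helper, not the kgtk add-id implementation.

```python
# Illustrative sketch only; add_ids is a hypothetical helper that mimics the
# outputs shown above, not the kgtk add-id implementation.
def add_ids(rows, prefix="E", initial_id=1):
    """Fill in missing id values with prefix + counter, leaving others alone."""
    next_id = initial_id
    result = []
    for row in rows:
        row = dict(row)  # do not modify the caller's rows
        if not row.get("id"):
            row["id"] = f"{prefix}{next_id}"
            next_id += 1
        result.append(row)
    return result

rows = [
    {"node1": "Q1", "label": "P1", "node2": "Q5", "id": "E1"},
    {"node1": "E1", "label": "confidence", "node2": "0.3", "id": ""},
    {"node1": "Q1", "label": "P2", "node2": "Q6", "id": "E2"},
    {"node1": "E2", "label": "confidence", "node2": "0.9", "id": ""},
]
for row in add_ids(rows, initial_id=100):
    print(row["node1"], row["label"], row["node2"], row["id"])
```

Choosing either a distinct prefix or a sufficiently large initial value keeps the new IDs from colliding with the existing E1, E2, and E3.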

Normalizing a Non-lift Additional Column and Adding IDs Externally with node1-label-node2-num

Let's normalize just the non-lift additional column. To avoid generating the same ID values as existing IDs, the newly generated edge IDs are built with the node1-label-node2-num ID style.

kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
                     --columns-to-remove confidence \
  / add-id --id-style node1-label-node2-num
node1 label node2 node1;label label;label node2;label id
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human" E1
E1 confidence 0.3 E1-confidence-0.3-0000
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred" E2
E2 confidence 0.9 E2-confidence-0.9-0000
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human" E3
E3 confidence 0.8 E3-confidence-0.8-0000
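
The ID values in this output can be reproduced by a small Python sketch. It assumes the trailing number is a zero-padded, four-digit counter that distinguishes repeated (node1, label, node2) triples, and that the separator is a hyphen; both are inferred from the output above rather than stated in the text, and triple_ids is a hypothetical helper.

```python
# Illustrative sketch only. The per-triple counter and its width are
# assumptions inferred from the sample output, not documented behavior.
from collections import Counter

def triple_ids(edges, separator="-", width=4):
    """Build an ID from node1, label, node2 and a per-triple sequence number."""
    counts = Counter()
    ids = []
    for edge in edges:
        number = f"{counts[edge]:0{width}d}"
        ids.append(separator.join(edge) + separator + number)
        counts[edge] += 1
    return ids

edges = [("E1", "confidence", "0.3"), ("E2", "confidence", "0.9")]
for new_id in triple_ids(edges):
    print(new_id)
```

Because the ID embeds the edge's own node1 value, which here is an existing edge ID, it cannot collide with the short IDs E1, E2, and E3.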

Normalizing a Non-lift Additional Column and Adding IDs Internally

Let's normalize just the non-lift additional column, using the internal option to add IDs with its default settings. This avoids the need to use a KGTK pipe.

kgtk normalize-edges -i examples/docs/normalize-file2.tsv \
                     --columns-to-remove confidence \
             --add-id
node1 label node2 node1;label label;label node2;label id
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human" E1
E1 confidence 0.3 E1-confidence-0.3-0000
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred" E2
E2 confidence 0.9 E2-confidence-0.9-0000
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human" E3
E3 confidence 0.8 E3-confidence-0.8-0000

Note

The sequence number generated by the internal ID generation code may be different from the sequence number generated by an external kgtk add-id pipe. That potential difference is not illustrated here.

Reversing kgtk lift with Other Labels

Suppose file3.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/normalize-file3.tsv
node1 label node2 node1;name label;relationship node2;name
Q1 P1 Q5 "Elmo" "instance of" "homo sapiens"|"human"
Q1 P2 Q6 "Elmo" "amigo"|"friend" "Fred"
Q6 P1 Q5 "Fred" "instance of" "homo sapiens"|"human"

Lowering the additional columns with default settings:

kgtk lower -i examples/docs/normalize-file3.tsv
node1 label node2
Q1 P1 Q5
Q1 name "Elmo"
P1 relationship "instance of"
Q5 name "homo sapiens"
Q5 name "human"
Q1 P2 Q6
P2 relationship "amigo"
P2 relationship "friend"
Q6 name "Fred"
Q6 P1 Q5

Expert Example: Lowering with Base Columns

Suppose file4.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/normalize-file4.tsv
node1 label node2 color material size
block1 isa cube red wood large
block2 isa pyramid blue steel small

In this case, the additional columns color, material, and size are all attributes of the entity in node1, but their names lack the node1; prefix.

These columns can be lowered by supplying a base column for each column to be lowered using the expert option --base-columns BASE_COLUMNS .... The columns to be lowered must be specified with --columns COLUMNS_TO_LOWER (or an alias to this option), and there must be one base column specified for each column to lower.

kgtk lower -i examples/docs/normalize-file4.tsv \
           --columns      color material size \
       --base-columns node1  node1   node1
node1 label node2
block1 isa cube
block1 color red
block1 material wood
block1 size large
block2 isa pyramid
block2 color blue
block2 material steel
block2 size small

Note

Another approach would be to rename the columns on input to names such as node1;color.

See kgtk cat for an example of renaming columns on input.

Expert Example: Lowering with Base Columns and Label Values

Suppose file4.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/normalize-file4.tsv
node1 label node2 color material size
block1 isa cube red wood large
block2 isa pyramid blue steel small

In this case, the additional columns color, material, and size are all attributes of the entity in node1, but their names lack the node1; prefix.

These columns can be lowered by supplying a base column for each column to be lowered using the expert option --base-columns BASE_COLUMNS .... The columns to be lowered must be specified with --columns COLUMNS_TO_LOWER (or an alias to this option), and there must be one base column specified for each column to lower.

Furthermore, suppose that the relationships in the new edges must be in all capital letters. The expert option --label-values LABEL_VALUES ... can be used to supply the label values to use. There must be one label value specified for each column to lower.

kgtk lower -i examples/docs/normalize-file4.tsv \
           --columns      color material size \
       --base-columns node1  node1   node1 \
       --label-values COLOR MATERIAL SIZE
node1 label node2
block1 isa cube
block1 COLOR red
block1 MATERIAL wood
block1 SIZE large
block2 isa pyramid
block2 COLOR blue
block2 MATERIAL steel
block2 SIZE small
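
The pairing of columns, base columns, and label values can be sketched in Python. This is an illustration under stated assumptions, not KGTK's implementation: lower_with_base_columns is a hypothetical helper, and the label values default to the column names when none are given.

```python
# Illustrative sketch only; lower_with_base_columns is a hypothetical helper.
def lower_with_base_columns(row, columns, base_columns, label_values=None):
    """Lower each column using an explicit base column and label value."""
    if label_values is None:
        label_values = columns  # fall back to the column names as labels
    if not (len(columns) == len(base_columns) == len(label_values)):
        raise ValueError("need one base column and one label value per column")
    edges = [(row["node1"], row["label"], row["node2"])]
    for column, base, label in zip(columns, base_columns, label_values):
        if row.get(column):
            edges.append((row[base], label, row[column]))
    return edges

row = {"node1": "block1", "label": "isa", "node2": "cube",
       "color": "red", "material": "wood", "size": "large"}
edges = lower_with_base_columns(
    row,
    columns=["color", "material", "size"],
    base_columns=["node1", "node1", "node1"],
    label_values=["COLOR", "MATERIAL", "SIZE"],
)
for edge in edges:
    print("\t".join(edge))
```

The three lists are matched by position, which is why the command line requires exactly one base column and one label value for each column to lower.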

Note

Another approach would be to rename the columns on input to names such as node1;COLOR.

See kgtk cat for an example of renaming columns on input.