Skip to content

cat

41;370;0c>## Overview

The cat command combines (concatenates) one or more KGTK files, optionally decompressing input files and compressing the output file, while managing the KGTK column headers appropriately. The input file(s) are read in the order specified and edges are copied to the output file without deduplication.

Merging Column Headers

Each column in the input file(s) becomes a column in the output file. Input columns with the same name in different files are merged into a single column. Column names are case sensitive.

Input columns with one of KGTK's required column names will also be merged into a single column even if their names do not match exactly, so long as their names are matching KGTK aliases. The first name or alias seen takes priority. For example, if the first input file has a "node1" column and the second input file has a "from" column, the two columns will be combined as the "node1" column in the output file.

Canonical Name Alias Names
id ID
label predicate relation relationship
node1 from subject
node2 to object

KGTK File Modes

Normally, the files being combined must be either all KGTK edge files or all KGTK node files. kgtk cat will complain if an input file is not a KGTK edge or node file, or if kgtk cat is given a mixture of KGTK edge and node files. These constraints can be overridden with the expert option --mode=NONE.

Input File Format

Although KGTK commands use the KGTK File Format as their primary file format, input files can be read in another supported file format using the expert option --input-format INPUT_FORMAT, where INPUT_FORMAT is one of the format names shown in the table below:

Format Extension Description
kgtk .kgtk or .tsv KGTK tab separated values file format.
csv .csv A simple comma separated value file with doubled quoting and column headers.

When the --input-format option has not been specified, the default is to use kgtk format for input files unless the filename extension (suffix) is .csv (optionally followed by one of the compressed file extensions, see below.)

Note

The expert option --input-format INPUT_FORMAT applies to all input files in the kgtk cat command, so it not possible at present to use kgtk cat to combine a file in KGTK format with a file in CSV format. It is necessary to convert all input files to a common input format before using kgtk cat to combine them (although their compression format may vary, as described below).

Note

CSV input file conversion is very simple at the moment. It may be extended in the future to accomodate KGTK datatypes such as date/time.

Input File Decompression

Input files may be decompressed using an algorithm selected by the filename extension. The following compression algorithms are supported:

Extension Algorithm
.bz2 bzip2
.gz gzip
.lz4 LZ4
.xz XZ Utils, based on LZMA

When used, compression filename extensions must appear after any other filename extensions, e.g. .kgtk.gz, .csv.gz.

Decompression may also be selected using the --compression-type COMPRESSION_TYPE option. This is an expert option which does not appear in the normal usage message (shown below). The COMPRESSION_TYPE value is one of the extension values shown in the table above, with or without the leading period. This option may be used to specify decompression of standard input (-).

If --compression-type is not specified and the the filename extension is not a recognized compression filename extension, the input file will not be decompressed.

Note

When the --compression type expert option is specified, all input files will be decompressed using the specified compression type, ignoring their file extensions.

Note

At the present time, decompression is not supported for file descriptor input files (filenames that begin with <, followed by a file descriptor number).

Output File Format

Although KGTK commands use the KGTK File Format as their primary file format, the output file can be written in a selection of formats other than KGTK format by using the --output-format FORMAT option, where FORMAT is one of the values in the table shown below.

Format Extension Description
kgtk .kgtk or .tsv KGTK tab separated values file format.
csv .csv A simple comma separated value file with doubled quoting and column headers.
md .md GitHub markdown tables.
json .json JSON list of lists of strings with column header line.
json-map (none) JSON list of maps from column names to string values.
json-map-compact (none) JSON list of maps from column names to string values with empty values suppressed.
jsonl .jsonl JSON lines of lists of strings with column header line.
jsonl-map (none) JSON lines of maps from column names to string values.
jsonl-map-compact (none) JSON lines of maps from column names to string values with empty values suppressed.
tsv (none) Tab separated values. Dates have their sigils removed, and strings have the backslash escape removed before pipes.
tsv-csvlike (none) Tab separated values. Dates have their sigils removed, and strings are transformed into CSV-like double quoted strings, losing the language code if present.
tsv-unquoted (none) Tab separated values. Dates have their sigils removed, and strings have their content exposed without quotes and without escapes before pipes.
tsv-unquoted-ep (none) Tab separated values. Dates have their sigils removed, and strings have their content exposed without quotes ; pipes retain their preceeding escapes.

Output formats may also be selected by the filename extension on the output file if --output-format has not been specified. For example, writing an output file with the extension .csv will automatically generate an output file in CSV format. Any unrecognized extensions default to kgtk format unless overridden by the --output-format option.

Note

The csv and json* formats use very primitive conversions at the present time, which do not provide proper treatment for different data types: booleans, numbers, strings.

Output File Compression

Output files may be compressed using an algorithm selected by the file extension. The following compression algorithms are supported:

Extension Algorithm
.bz2 bzip2
.gz gzip
.lz4 LZ4
.xz XZ Utils, based on LZMA

When specified, compression format extensions must appear after output format selection extensions, e.g. .kgtk.gz, .csv.gz, .json.bz2.

Note

At the present time, the --compression-type COMPRESSION_TYPE option does not affect output files. Standard output (-) and file descriptor output files (filesnames that begin with >, followed by a file descriptor number) will not be compressed. This behavior may change at a later date.

Fast Copies

When certain conditions are met, kgtk cat will use Unix system utilities to perform decompression. concatenation, and compression. The major constraints are:

  • The input files must have the same column header names (allowing for aliases) and order.
  • The input files must come from the filesystem, not standard input or a file descriptor number.
  • The input files must meet a minimum total size.
  • Various checking options must not be turned on.
  • The files must contain column names headers that are not overidden bu command line options.

Usage

usage: kgtk cat [-h] [-i INPUT_FILE [INPUT_FILE ...]] [-o OUTPUT_FILE]
                [--output-format {csv,html,html-compact,json,json-map,json-map-compact,jsonl,jsonl-map,jsonl-map-compact,kgtk,md,table,tsv,tsv-csvlike,tsv-unquoted,tsv-unquoted-ep}]
                [--pure-python [True|False]]
                [--fast-copy-min-size FAST_COPY_MIN_SIZE]
                [-v [optional True|False]]

Concatenate two or more KGTK files, merging the columns appropriately. All files must be KGTK edge files or all files must be KGTK node files (unless overridden with --mode=NONE). 

Additional options are shown in expert help.
kgtk --expert cat --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]
                        KGTK input files (May be omitted or '-' for stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  --output-format {csv,html,html-compact,json,json-map,json-map-compact,jsonl,jsonl-map,jsonl-map-compact,kgtk,md,table,tsv,tsv-csvlike,tsv-unquoted,tsv-unquoted-ep}
                        The file format (default=kgtk)
  --pure-python [True|False]
                        When True, use Python code. (default=False)
  --fast-copy-min-size FAST_COPY_MIN_SIZE
                        The minium number of bytes before using OS tools for
                        fast copy (default=10000).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

Sample Data

Suppose that movies_reduced.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/movies_reduced.tsv
id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t3 terminator genre action
t4 terminator genre science_fiction
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Suppose that tutorial_people_full.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/tutorial_people_full.tsv
id node1 label node2
h1 james_cameron label "James Cameron"
h2 james_cameron instance_of human
h3 james_cameron birth_date ^1954-08-16T00:00:00Z/11
h4 james_cameron country Canada
h5 arnold_schwarzenegger label "Arnold Schwarzenegger"
h6 arnold_schwarzenegger instance_of human
h7 arnold_schwarzenegger birth_date ^1947-07-30T00:00:00Z/11
h8 arnold_schwarzenegger country "Austria"
h9 michael_biehn label "Michael Biehn"
h10 michael_biehn instance_of human
h11 michael_biehn birth_date ^1956-07-31T00:00:00Z/11
h12 michael_biehn country "United States of America"
h13 linda_hamilton label "Linda Hamilton"
h14 linda_hamilton instance_of human
h15 linda_hamilton birth_date ^1956-09-26T00:00:00Z/11
h16 linda_hamilton country "United States of America"
h17 edward_furlong label "Edward Furlong"
h18 edward_furlong instance_of human
h19 edward_furlong birth_date ^1977-08-02T00:00:00Z/11
h20 edward_furlong country "United States of America"
h21 robert_patrick label "Robert Patrick"
h22 robert_patrick instance_of human
h23 robert_patrick birth_date ^1958-11-05T00:00:00Z/11
h24 robert_patrick country "United States of America"

Combine two KGTK files, sending the output to standard output.

These two files have only he 4 basic KGTK fields.

kgtk cat -i examples/docs/movies_reduced.tsv examples/docs/tutorial_people_full.tsv

The result will be the following file in KGTK format:

id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t3 terminator genre action
t4 terminator genre science_fiction
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9
h1 james_cameron label "James Cameron"
h2 james_cameron instance_of human
h3 james_cameron birth_date ^1954-08-16T00:00:00Z/11
h4 james_cameron country Canada
h5 arnold_schwarzenegger label "Arnold Schwarzenegger"
h6 arnold_schwarzenegger instance_of human
h7 arnold_schwarzenegger birth_date ^1947-07-30T00:00:00Z/11
h8 arnold_schwarzenegger country "Austria"
h9 michael_biehn label "Michael Biehn"
h10 michael_biehn instance_of human
h11 michael_biehn birth_date ^1956-07-31T00:00:00Z/11
h12 michael_biehn country "United States of America"
h13 linda_hamilton label "Linda Hamilton"
h14 linda_hamilton instance_of human
h15 linda_hamilton birth_date ^1956-09-26T00:00:00Z/11
h16 linda_hamilton country "United States of America"
h17 edward_furlong label "Edward Furlong"
h18 edward_furlong instance_of human
h19 edward_furlong birth_date ^1977-08-02T00:00:00Z/11
h20 edward_furlong country "United States of America"
h21 robert_patrick label "Robert Patrick"
h22 robert_patrick instance_of human
h23 robert_patrick birth_date ^1958-11-05T00:00:00Z/11
h24 robert_patrick country "United States of America"

Combine two gzipped KGTK files, sending the output to a bzip2 file.

kgtk cat -i examples/docs/movies_reduced.tsv.gz examples/docs/tutorial_people_full.tsv.gz -o ofile.tsv.bz2

Expert Topic: Processing Files Not in KGTK Format

Suppose that not-kgtk.tsv contains the following data not in KGTK format (--mode=NONE has been added to allow the file to be processed by kgtk cat):

kgtk cat -i examples/docs/not-kgtk.tsv --mode=NONE
a b c d
h21 robert_patrick label "Robert Patrick"
h22 robert_patrick instance_of human
h23 robert_patrick birth_date ^1958-11-05T00:00:00Z/11
h24 robert_patrick country "United States of America"

Trying to run the command without --mode=NONE:

kgtk cat -i examples/docs/not-kgtk.tsv

will result in an error message:

In input 1 header 'a    b   c   d': Missing required column: id | ID
Exit requested

We can force the kgtk cat command to process the file by using the --mode NONE option, as shown above.

Note

--mode NONE is implemented by KgtkReader. It can be used by many KGTK commands.

Read a CSV file

Here's an example of reading a CSV file, using the filename suffix to establish the file format:

kgtk cat -i examples/docs/cat-csv-file.csv --mode=NONE
AtomicNumber Element Symbol AtomicMass NumberofNeutrons NumberofProtons NumberofElectrons Period Group Phase Radioactive Natural Metal Nonmetal Metalloid Type AtomicRadius Electronegativity FirstIonization Density MeltingPoint BoilingPoint NumberOfIsotopes Discoverer Year SpecificHeat NumberofShells NumberofValence
1 Hydrogen H 1.007 0 1 1 1 1 gas yes yes Nonmetal 0.79 2.2 13.5984 8.99E-05 14.175 20.28 3 Cavendish 1766 14.304 1 1
2 Helium He 4.002 2 2 2 1 18 gas yes yes NobleGas 0.49 24.5874 1.79E-04 4.22 5 Janssen 1868 5.193 1
3 Lithium Li 6.941 4 3 3 2 1 solid yes yes AlkaliMetal 2.1 0.98 5.3917 5.34E-01 453.85 1615 5 Arfvedson 1817 3.582 2 1
4 Beryllium Be 9.012 5 4 4 2 2 solid yes yes AlkalineEarthMetal 1.4 1.57 9.3227 1.85E+00 1560.15 2742 6 Vaulquelin 1798 1.825 2 2
5 Boron B 10.811 6 5 5 2 13 solid yes yes Metalloid 1.2 2.04 8.298 2.34E+00 2573.15 4200 6 Gay-Lussac 1808 1.026 2 3
6 Carbon C 12.011 6 6 6 2 14 solid yes yes Nonmetal 0.91 2.55 11.2603 2.27E+00 3948.15 4300 7 Prehistoric 0.709 2 4
7 Nitrogen N 14.007 7 7 7 2 15 gas yes yes Nonmetal 0.75 3.04 14.5341 1.25E-03 63.29 77.36 8 Rutherford 1772 1.04 2 5
8 Oxygen O 15.999 8 8 8 2 16 gas yes yes Nonmetal 0.65 3.44 13.6181 1.43E-03 50.5 90.2 8 Priestley|Scheele 1774 0.918 2 6
9 Fluorine F 18.998 10 9 9 2 17 gas yes yes Halogen 0.57 3.98 17.4228 1.70E-03 53.63 85.03 6 Moissan 1886 0.824 2 7
10 Neon Ne 20.18 10 10 10 2 18 gas yes yes Noble Gas 0.51 21.5645 9.00E-04 24.703 27.07 8 Ramsay_and_Travers 1898 1.03 2 8
11 Sodium Na 22.99 12 11 11 3 1 solid yes yes AlkaliMetal 2.2 0.93 5.1391 9.71E-01 371.15 1156 7 Davy 1807 1.228 3 1
12 Magnesium Mg 24.305 12 12 12 3 2 solid yes yes AlkalineEarthMetal 1.7 1.31 7.6462 1.74E+00 923.15 1363 8 Black 1755 1.023 3 2
13 Aluminum Al 26.982 14 13 13 3 13 solid yes yes Metal 1.8 1.61 5.9858 2.70E+00 933.4 2792 8 Wshler 1827 0.897 3 3
14 Silicon Si 28.086 14 14 14 3 14 solid yes yes Metalloid 1.5 1.9 8.1517 2.33E+00 1683.15 3538 8 Berzelius 1824 0.705 3 4
15 Phosphorus P 30.974 16 15 15 3 15 solid yes yes Nonmetal 1.2 2.19 10.4867 1.82E+00 317.25 553 7 BranBrand 1669 0.769 3 5
16 Sulfur S 32.065 16 16 16 3 16 solid yes yes Nonmetal 1.1 2.58 10.36 2.07E+00 388.51 717.8 10 Prehistoric 0.71 3 6
17 Chlorine Cl 35.453 18 17 17 3 17 gas yes yes Halogen 0.97 3.16 12.9676 3.21E-03 172.31 239.11 11 Scheele 1774 0.479 3 7
18 Argon Ar 39.948 22 18 18 3 18 gas yes yes NobleGas 0.88 15.7596 1.78E-03 83.96 87.3 8 Rayleigh_and_Ramsay 1894 0.52 3 8

Expert Topic: Adding Column Names

Suppose that you have a TSV (tab-separated values) data file that looks like a KGTK data file but without the header line. You can supply a header line with the expert option --force-column-names. You can also use this option when concatenating several data files, so long as they are all missing header lines and they should all have the same header line.

Consider the following input file:

kgtk cat -i examples/docs/no-header.tsv --mode=NONE
a b c d
h21 robert_patrick label "Robert Patrick"
h22 robert_patrick instance_of human
h23 robert_patrick birth_date ^1958-11-05T00:00:00Z/11
h24 robert_patrick country "United States of America"

We can supply a valid header line as follows:

kgtk cat -i examples/docs/no-header.tsv \
         --force-column-names id node1 label node2

The result will be the following file in KGTK format:

id node1 label node2
h21 robert_patrick label "Robert Patrick"
h22 robert_patrick instance_of human
h23 robert_patrick birth_date ^1958-11-05T00:00:00Z/11
h24 robert_patrick country "United States of America"

Note

---force-column-names takes place before the input file is checked to see if it is a valid KGTK edge or node file. Since we supplied valid KGTK edge column names in the example above, --mode=NONE is no longer needed.

Expert Topic: Renaming Column Names on Input

There is a special KGTK command, kgtk rename-columns, for renaming columns. However, you may want to rename columns while also using other features of the kgtk cat command, such as combining multiple input files or sampling data lines.

You have two main choices: override the column names on input, or rename the column names on output.

Overriding the column names on input can be done by skipping the existing header record and supplying a replacement list of column names.

kgtk cat -i examples/docs/not-kgtk.tsv \
     --force-column-names id node1 label node2

The result will be the following file in KGTK format:

id node1 label node2
h21 robert_patrick label "Robert Patrick"
h22 robert_patrick instance_of human
h23 robert_patrick birth_date ^1958-11-05T00:00:00Z/11
h24 robert_patrick country "United States of America"

Note

When you rename columns on input, the change applies to all input files: they all must have the same column layout, for which you will provide a new set of column names.

Expert Topic: Renaming All Column Names on Output

There is a special KGTK command, kgtk rename_columns, for renaming columns. However, you may want to rename columns while also using other features of the kgtk cat command, such as combining multiple input files or sampling data lines.

You have two main choices: override the column names on input, or rename the column names on output.

For example, suppose your input file contained the following table in almost KGTK format:

kgtk cat -i examples/docs/movies_origin_destination.tsv --mode=NONE
origin label destination years
terminator label 'The Terminator'@en 4
terminator instance_of film 3

Renaming the column names on output can by done two ways. First, you can name all of the new column names using --output-columns.

kgtk cat -i examples/docs/movies_origin_destination.tsv --mode=NONE \
         --output-columns node1 label node2 years

The result will be the following table in KGTK format:

node1 label node2 years
terminator label 'The Terminator'@en 4
terminator instance_of film 3

Expert Topic: Renaming Selected Column Names on Output

Second, you can rename individual columns using --old-columns and --new-columns.

You want to rename the origin column to node1, and the destination column to node2, leaving the other column names alone.

kgtk cat -i examples/docs/movies_origin_destination.tsv --mode=NONE \
         --old-columns origin destination \
     --new-columns node1 node2

The result will be the following table in KGTK format:

node1 label node2 years
terminator label 'The Terminator'@en 4
terminator instance_of film 3

Note

Renaming column names on output can be done when you combine a disparate set of KGTK files. The rename applies to the merged set of column names computed by kgtk cat.

Expert Topic: Data Sampling: head

Limit the number of records read (like head).

kgtk cat -i examples/docs/movies_reduced.tsv --record-limit 4

The result will be the following table in KGTK format:

id node1 label node2
t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t3 terminator genre action
t4 terminator genre science_fiction

Expert Topic: Data Sampling: skip

Skip some number of initial records, then begin processing.

kgtk cat -i examples/docs/movies_reduced.tsv --initial-skip-count 4

The result will be the following table in KGTK format:

id node1 label node2
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Expert Topic: Data Sampling: last 5

Process the last n records relative to the end (like tail). You must know the number of data records in the file (the number of lines in the file minus the header line).

kgtk cat -i examples/docs/movies_reduced.tsv --record-limit 15 --tail-count 3

The result will be the following table in KGTK format:

id node1 label node2
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor

Note

If both --initial-skip-count # and --record-limit # --tail-count # are specified, the number of records skipped will be the maximum of the initial skip count and (record limit minus tail count).

Expert Topic: Data Sampling: every n

Process every nth record (after skipping, but calculated relative to the count of data lines read before skipping). The following example will process every second line.

kgtk cat -i examples/docs/movies_reduced.tsv --every-nth-record 2

The result will be the following table in KGTK format:

id node1 label node2
t2 terminator instance_of film
t4 terminator genre science_fiction
t6 t5 location united_states
t8 t7 location sweden
t10 terminator cast arnold_schwarzenegger
t12 terminator cast michael_biehn
t14 terminator cast linda_hamilton
t16 terminator duration 108
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Expert Topic: Converting a CSV File to a quasi-KGTK File

The expert option --input-format csv may be used to read an input file in CSV (comma-separated values) format. The expert option --mode=NONE will also be needed if the input file does not have the required columns of a KGTK edge or node file.

kgtk cat -i examples/docs/periodic_table_of_elements_1-18.csv \
         --input-format csv --mode=NONE

The result will be the following table in quasi-KGTK format:

AtomicNumber Element Symbol AtomicMass NumberofNeutrons NumberofProtons NumberofElectrons Period Group Phase Radioactive Natural Metal Nonmetal Metalloid Type AtomicRadius Electronegativity FirstIonization Density MeltingPoint BoilingPoint NumberOfIsotopes Discoverer Year SpecificHeat NumberofShells NumberofValence
1 Hydrogen H 1.007 0 1 1 1 1 gas yes yes Nonmetal 0.79 2.2 13.5984 8.99E-05 14.175 20.28 3 Cavendish 1766 14.304 1 1
2 Helium He 4.002 2 2 2 1 18 gas yes yes NobleGas 0.49 24.5874 1.79E-04 4.22 5 Janssen 1868 5.193 1
3 Lithium Li 6.941 4 3 3 2 1 solid yes yes AlkaliMetal 2.1 0.98 5.3917 5.34E-01 453.85 1615 5 Arfvedson 1817 3.582 2 1
4 Beryllium Be 9.012 5 4 4 2 2 solid yes yes AlkalineEarthMetal 1.4 1.57 9.3227 1.85E+00 1560.15 2742 6 Vaulquelin 1798 1.825 2 2
5 Boron B 10.811 6 5 5 2 13 solid yes yes Metalloid 1.2 2.04 8.298 2.34E+00 2573.15 4200 6 Gay-Lussac 1808 1.026 2 3
6 Carbon C 12.011 6 6 6 2 14 solid yes yes Nonmetal 0.91 2.55 11.2603 2.27E+00 3948.15 4300 7 Prehistoric 0.709 2 4
7 Nitrogen N 14.007 7 7 7 2 15 gas yes yes Nonmetal 0.75 3.04 14.5341 1.25E-03 63.29 77.36 8 Rutherford 1772 1.04 2 5
8 Oxygen O 15.999 8 8 8 2 16 gas yes yes Nonmetal 0.65 3.44 13.6181 1.43E-03 50.5 90.2 8 Priestley|Scheele 1774 0.918 2 6
9 Fluorine F 18.998 10 9 9 2 17 gas yes yes Halogen 0.57 3.98 17.4228 1.70E-03 53.63 85.03 6 Moissan 1886 0.824 2 7
10 Neon Ne 20.18 10 10 10 2 18 gas yes yes Noble Gas 0.51 21.5645 9.00E-04 24.703 27.07 8 Ramsay_and_Travers 1898 1.03 2 8
11 Sodium Na 22.99 12 11 11 3 1 solid yes yes AlkaliMetal 2.2 0.93 5.1391 9.71E-01 371.15 1156 7 Davy 1807 1.228 3 1
12 Magnesium Mg 24.305 12 12 12 3 2 solid yes yes AlkalineEarthMetal 1.7 1.31 7.6462 1.74E+00 923.15 1363 8 Black 1755 1.023 3 2
13 Aluminum Al 26.982 14 13 13 3 13 solid yes yes Metal 1.8 1.61 5.9858 2.70E+00 933.4 2792 8 Wshler 1827 0.897 3 3
14 Silicon Si 28.086 14 14 14 3 14 solid yes yes Metalloid 1.5 1.9 8.1517 2.33E+00 1683.15 3538 8 Berzelius 1824 0.705 3 4
15 Phosphorus P 30.974 16 15 15 3 15 solid yes yes Nonmetal 1.2 2.19 10.4867 1.82E+00 317.25 553 7 BranBrand 1669 0.769 3 5
16 Sulfur S 32.065 16 16 16 3 16 solid yes yes Nonmetal 1.1 2.58 10.36 2.07E+00 388.51 717.8 10 Prehistoric 0.71 3 6
17 Chlorine Cl 35.453 18 17 17 3 17 gas yes yes Halogen 0.97 3.16 12.9676 3.21E-03 172.31 239.11 11 Scheele 1774 0.479 3 7
18 Argon Ar 39.948 22 18 18 3 18 gas yes yes NobleGas 0.88 15.7596 1.78E-03 83.96 87.3 8 Rayleigh_and_Ramsay 1894 0.52 3 8

Expert Topic: Converting a CSV File to a KGTK File

The expert option --input-format csv may be used to read an input file in CSV (comma-separated values) format. The expert option --mode=NONE will also be needed if the input file does not have the required columns of a KGTK edge or node file.

If we want the output file to be a KGTK file instead of a quasi-KGTK file, and one of the columns is suitable to use as an id column, we can rename that column on input or output. In this example, we rename the specific column on output.

kgtk cat -i examples/docs/periodic_table_of_elements_1-18.csv \
         --input-format csv --mode=NONE \
         --old-column AtomicNumber \
         --new-column id     

The result will be the following table in KGTK format:

id Element Symbol AtomicMass NumberofNeutrons NumberofProtons NumberofElectrons Period Group Phase Radioactive Natural Metal Nonmetal Metalloid Type AtomicRadius Electronegativity FirstIonization Density MeltingPoint BoilingPoint NumberOfIsotopes Discoverer Year SpecificHeat NumberofShells NumberofValence
1 Hydrogen H 1.007 0 1 1 1 1 gas yes yes Nonmetal 0.79 2.2 13.5984 8.99E-05 14.175 20.28 3 Cavendish 1766 14.304 1 1
2 Helium He 4.002 2 2 2 1 18 gas yes yes NobleGas 0.49 24.5874 1.79E-04 4.22 5 Janssen 1868 5.193 1
3 Lithium Li 6.941 4 3 3 2 1 solid yes yes AlkaliMetal 2.1 0.98 5.3917 5.34E-01 453.85 1615 5 Arfvedson 1817 3.582 2 1
4 Beryllium Be 9.012 5 4 4 2 2 solid yes yes AlkalineEarthMetal 1.4 1.57 9.3227 1.85E+00 1560.15 2742 6 Vaulquelin 1798 1.825 2 2
5 Boron B 10.811 6 5 5 2 13 solid yes yes Metalloid 1.2 2.04 8.298 2.34E+00 2573.15 4200 6 Gay-Lussac 1808 1.026 2 3
6 Carbon C 12.011 6 6 6 2 14 solid yes yes Nonmetal 0.91 2.55 11.2603 2.27E+00 3948.15 4300 7 Prehistoric 0.709 2 4
7 Nitrogen N 14.007 7 7 7 2 15 gas yes yes Nonmetal 0.75 3.04 14.5341 1.25E-03 63.29 77.36 8 Rutherford 1772 1.04 2 5
8 Oxygen O 15.999 8 8 8 2 16 gas yes yes Nonmetal 0.65 3.44 13.6181 1.43E-03 50.5 90.2 8 Priestley|Scheele 1774 0.918 2 6
9 Fluorine F 18.998 10 9 9 2 17 gas yes yes Halogen 0.57 3.98 17.4228 1.70E-03 53.63 85.03 6 Moissan 1886 0.824 2 7
10 Neon Ne 20.18 10 10 10 2 18 gas yes yes Noble Gas 0.51 21.5645 9.00E-04 24.703 27.07 8 Ramsay_and_Travers 1898 1.03 2 8
11 Sodium Na 22.99 12 11 11 3 1 solid yes yes AlkaliMetal 2.2 0.93 5.1391 9.71E-01 371.15 1156 7 Davy 1807 1.228 3 1
12 Magnesium Mg 24.305 12 12 12 3 2 solid yes yes AlkalineEarthMetal 1.7 1.31 7.6462 1.74E+00 923.15 1363 8 Black 1755 1.023 3 2
13 Aluminum Al 26.982 14 13 13 3 13 solid yes yes Metal 1.8 1.61 5.9858 2.70E+00 933.4 2792 8 Wshler 1827 0.897 3 3
14 Silicon Si 28.086 14 14 14 3 14 solid yes yes Metalloid 1.5 1.9 8.1517 2.33E+00 1683.15 3538 8 Berzelius 1824 0.705 3 4
15 Phosphorus P 30.974 16 15 15 3 15 solid yes yes Nonmetal 1.2 2.19 10.4867 1.82E+00 317.25 553 7 BranBrand 1669 0.769 3 5
16 Sulfur S 32.065 16 16 16 3 16 solid yes yes Nonmetal 1.1 2.58 10.36 2.07E+00 388.51 717.8 10 Prehistoric 0.71 3 6
17 Chlorine Cl 35.453 18 17 17 3 17 gas yes yes Halogen 0.97 3.16 12.9676 3.21E-03 172.31 239.11 11 Scheele 1774 0.479 3 7
18 Argon Ar 39.948 22 18 18 3 18 gas yes yes NobleGas 0.88 15.7596 1.78E-03 83.96 87.3 8 Rayleigh_and_Ramsay 1894 0.52 3 8

Note

See kgtk add-id for an example of converting a CSV file without an id column to a KGTK node file by adding an id column.

Note

See kgtk normalize-nodes for an example of converting a CSV file without an id column to a KGTK edge file.

Expert Topic: Implying a Label Column

It is not uncommon to encounter two-column files (TSV or CSV) which represent an edge with an implied label column value (predicate). The --implied-label VALUE option may be used to convert the input data into a three-column format.

Consider the following file, which lists some cities in the State of Massachusettes and the year that they were founded. Since this is neither a KGTK edge file nor a KGTK node file, we need to specify --mode=NONE to bypass certain validity checks:

kgtk cat --mode=NONE -i examples/docs/cat-two-columns.tsv
node1 node2
Boston 1630
Concord 1635
Scituate 1636
Springfield 1636
Cambridge 1638
Lexington 1642
Worcester 1673

We can convert this file into a KGTK edge file on input by specifying an implied label column and value:

kgtk cat --implied-label=founded -i examples/docs/cat-two-columns.tsv
node1 node2 label
Boston 1630 founded
Concord 1635 founded
Scituate 1636 founded
Springfield 1636 founded
Cambridge 1638 founded
Lexington 1642 founded
Worcester 1673 founded

Note

The --implied-label=VALUE option is implemented by KgtkReader, and can be used with most KGTK subcommands.

Expert Topic: Supressing the Output Header

Sometimes it is desired to produce a TSV file without an output header.

kgtk cat -i examples/docs/movies_reduced.tsv --no-output-header

The result will be the following file in KGTK format except for missing the header line.

t1 terminator label 'The Terminator'@en
t2 terminator instance_of film
t3 terminator genre action
t4 terminator genre science_fiction
t5 terminator publication_date ^1984-10-26T00:00:00Z/11
t6 t5 location united_states
t7 terminator publication_date ^1985-02-08T00:00:00Z/11
t8 t7 location sweden
t9 terminator director james_cameron
t10 terminator cast arnold_schwarzenegger
t11 t10 role terminator
t12 terminator cast michael_biehn
t13 t12 role kyle_reese
t14 terminator cast linda_hamilton
t15 t14 role sarah_connor
t16 terminator duration 108
t17 terminator award national_film_registry
t18 t17 point_in_time ^2008-01-01T00:00:00Z/9

Expert Topic: Reading Files without Header Records: Supply Column Names

Sometimes you may wish to read a TSV file that does not contain a header record.

kgtk cat -i examples/docs/cat-file-without-header.tsv --mode=NONE
john woke ^2020-05-02T00:00
john woke ^2020-05-00T00:00
john slept ^2020-05-02T24:00
lionheart born ^1157-09-08T00:00
year0001 starts ^0001-01-01T00:00
year9999 ends ^9999-12-31T11:59:59

Copy the file, supplying column names:

kgtk cat -i examples/docs/cat-file-without-header.tsv \
         --input-column-names node1 label node2

The result will be the following file in KGTK format:

node1 label node2
john woke ^2020-05-00T00:00
john slept ^2020-05-02T24:00
lionheart born ^1157-09-08T00:00
year0001 starts ^0001-01-01T00:00
year9999 ends ^9999-12-31T11:59:59

Expert Topic: Reading Files without Header Records: Automatic Column Names

Another approach to reading a file without a header record is to have KGTK assign column names, which it will do beginning with COL1, COL2, etc. It is necessary to tell KGTK how many columns are in the file. It is also necessary to say --mode=NONE, since the generated column names do not match the definition of a KGTK edge file or node file.

kgtk cat -i examples/docs/cat-file-without-header.tsv \
         --no-input-header \
         --supply-missing-column-names \
     --number-of-columns 3 \
     --mode=NONE
COL1 COL2 COL3
john woke ^2020-05-02T00:00
john woke ^2020-05-00T00:00
john slept ^2020-05-02T24:00
lionheart born ^1157-09-08T00:00
year0001 starts ^0001-01-01T00:00
year9999 ends ^9999-12-31T11:59:59

Expert Topic: Reading Files without Header Records: Empty Column Names

Assume that you have a TSV file with the right number of columns in the header record, but one or more missing column names.

kgtk cat -i examples/docs/cat-file-with-empty-column-names.tsv
In input 1 header 'node1    label   node2       ': Column 3 has an empty name in the file header
Exit requested

You can ask the system to replace the empty column names with COLn:

kgtk cat -i examples/docs/cat-file-with-empty-column-names.tsv \
         --supply-missing-column-names
node1 label node2 COL4 COL5
john observed ^2020-05-02T00:00 fever cough
john observed ^2020-05-00T00:00 normal normal
john observed ^2020-05-02T24:00 normal cough

Expert Topic: Requiring Certain Columns

Sometimes you may wish to require that an input file contains certain named columns that are essential to your analysis.

kgtk cat -i examples/docs/cat-edges-with-totals.tsv
node1 label node2 node1;total
P10 p585-count 73 3879
P1000 p585-count 16 266
P101 p585-count 5 157519
P1018 p585-count 2 177
P102 p585-count 295 414726
P1025 p585-count 26 693
P1026 p585-count 40 6930
P1027 p585-count 14 10008
P1028 p585-count 1131 4035
P1029 p585-count 4 2643
P1035 p585-count 4 366
P1037 p585-count 60 9317
P1040 p585-count 1 45073
P1050 p585-count 246 226380

Supposw you require that the node1;total column be present:

kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
         --require-column-names  'node1;total'

This will succeed:

node1 label node2 node1;total
P10 p585-count 73 3879
P1000 p585-count 16 266
P101 p585-count 5 157519
P1018 p585-count 2 177
P102 p585-count 295 414726
P1025 p585-count 26 693
P1026 p585-count 40 6930
P1027 p585-count 14 10008
P1028 p585-count 1131 4035
P1029 p585-count 4 2643
P1035 p585-count 4 366
P1037 p585-count 60 9317
P1040 p585-count 1 45073
P1050 p585-count 246 226380

Suppose you also require that an 'average' column be present, but it is missing:

kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
         --require-column-names  'node1;total' average

This will result in an error message:

In input 1 header 'node1    label   node2   node1;total': The following required columns were missing: ['average']
Exit requested

Expert Topic: Prohibiting Additional Columns

Sometimes you may want to prohibit additional columns. There are several special cases to consider:

  • Prohibiting additional columns for a standard KGTK node file.
  • Prohibiting additional columns for a standard KGTK edge file.
  • Requiring certain additional columns and prohibiting others.

Suppose that you have a standard KGTK edge file and you wish to prohibit any additional columns.

Consider a standard KGTK node file without additional columns:

kgtk cat -i examples/docs/cat-nodes.tsv \
         --no-additional-columns

This command will succeed:

id
P10
P100
P1000

A KGTK node file with unexpected additional columns will fail:

kgtk cat -i examples/docs/cat-nodes-and-titles.tsv \
         --no-additional-columns
In input 1 header 'id   titel': The following additional columns are unexpected: ['titel']
Exit requested

Consider a standard KGTK edge file without additional columns:

kgtk cat -i examples/docs/cat-edges.tsv \
         --no-additional-columns

This command will succeed:

node1 label node2
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246

Note: The node1, label, and node2 columns (or their aliases) are allowed. The id column is also allowed, although it is not required.

Consider a KGTK edge file with an undesired additional column:

kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
         --no-additional-columns

This will fail with the following error message:

In input 1 header 'node1    label   node2   node1;total': The following additional columns are unexpected: ['node1;total']
Exit requested

If we want to accept the node1;total additional column, but prohibit others, we can do so by explicitly listing the required columns. All required columns (e.g., node1, label,and node2`) must be listed:

kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
         --require-column-names node1 label node2 'node1;total' \
         --no-additional-columns

This will succeed:

node1 label node2 node1;total
P10 p585-count 73 3879
P1000 p585-count 16 266
P101 p585-count 5 157519
P1018 p585-count 2 177
P102 p585-count 295 414726
P1025 p585-count 26 693
P1026 p585-count 40 6930
P1027 p585-count 14 10008
P1028 p585-count 1131 4035
P1029 p585-count 4 2643
P1035 p585-count 4 366
P1037 p585-count 60 9317
P1040 p585-count 1 45073
P1050 p585-count 246 226380

An unexpected additional column will fail:

kgtk cat -i examples/docs/cat-edges-with-totals-and-averages.tsv \
         --require-column-names node1 label node2 'node1;total' \
         --no-additional-columns
In input 1 header 'node1    label   node2   node1;total average': The following additional columns are unexpected: ['average']
Exit requested

We can add the average column to the list of required column names and accept that file:

kgtk cat -i examples/docs/cat-edges-with-totals-and-averages.tsv \
         --require-column-names node1 label node2 'node1;total' average \
         --no-additional-columns
node1 label node2 node1;total average
P10 p585-count 73 3879 53.136986301369866
P1000 p585-count 16 266 16.625
P101 p585-count 5 157519 31503.8
P1018 p585-count 2 177 88.5
P102 p585-count 295 414726 1405.8508474576272
P1025 p585-count 26 693 26.653846153846153
P1026 p585-count 40 6930 173.25
P1027 p585-count 14 10008 714.8571428571429
P1028 p585-count 1131 4035 3.5676392572944295
P1029 p585-count 4 2643 660.75
P1035 p585-count 4 366 91.5
P1037 p585-count 60 9317 155.28333333333333
P1040 p585-count 1 45073 45073.0
P1050 p585-count 246 226380 920.2439024390244

Note

At the present time there is no option to list optional additional columns.

Expert Topic: Requiring a Certain Number of Columns

Another way to ensure that a KGTK edge file has only [ node1, label, node2] columns is to require that the file have 3 columns. More generally, you can require that a file have a certain number of columns without having to name all the columns individually.

kgtk cat -i examples/docs/cat-edges.tsv \
         --number-of-columns 3
node1 label node2
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
         --number-of-columns 3
In input 1 header 'node1    label   node2   node1;total': Expected 3 columns, got 4 in the header
Exit requested
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
         --number-of-columns 4
node1 label node2 node1;total
P10 p585-count 73 3879
P1000 p585-count 16 266
P101 p585-count 5 157519
P1018 p585-count 2 177
P102 p585-count 295 414726
P1025 p585-count 26 693
P1026 p585-count 40 6930
P1027 p585-count 14 10008
P1028 p585-count 1131 4035
P1029 p585-count 4 2643
P1035 p585-count 4 366
P1037 p585-count 60 9317
P1040 p585-count 1 45073
P1050 p585-count 246 226380

Expert Topic: Pure Python Copies

The fast copy option can be disabled by specifying --pure-python.

kgtk cat -i examples/docs/cat-edges.tsv \
         -i examples/docs/cat-edges.tsv \
         --pure-python
node1 label node2
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246

Expert Topic: Changing the Fast Copy Minimum Size Throshold

Normally, kgtk cat will use the fast copy path with system commands only when the total sizes of the input files pass a threshhold. This is because that are overheads on starting the system utilities as subprocesses, and for very small files it may be faster to perform all processing directly in Python.

The threshold may be changed. For example, if you wanted the code to use the fast copy path regardless of the size of the input files, use:

kgtk cat -i examples/docs/cat-edges.tsv \
         -i examples/docs/cat-edges.tsv \
         --fast-copy-min-size 0
node1 label node2
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246

Expert Topic: Overriding System Commands

The names of the system commands used by the fast copy path may be overridden on the command line.

kgtk cat -i examples/docs/cat-edges.tsv \
         -i examples/docs/cat-edges.tsv \
         --bash-command /usr/bin/bash \
     --bzip2-command /usr/bin/bzip2 \
     --cat-command /usr/bin/cat \
     --gzip-command /usr/bin/gzip \
     --tail-command /usr/bin/tail \
     --xz-command /usr/bin/xz
node1 label node2
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246
P10 p585-count 73
P1000 p585-count 16
P101 p585-count 5
P1018 p585-count 2
P102 p585-count 295
P1025 p585-count 26
P1026 p585-count 40
P1027 p585-count 14
P1028 p585-count 1131
P1029 p585-count 4
P1035 p585-count 4
P1037 p585-count 60
P1040 p585-count 1
P1050 p585-count 246