cat
## Overview
The cat command combines (concatenates) one or more KGTK files, optionally decompressing input files and compressing the output file, while managing the KGTK column headers appropriately. The input file(s) are read in the order specified and edges are copied to the output file without deduplication.
Merging Column Headers¶
Each column in the input file(s) becomes a column in the output file. Input columns with the same name in different files are merged into a single column. Column names are case sensitive.
Input columns with one of KGTK's required column names will also be merged into a single column even if their names do not match exactly, so long as their names are matching KGTK aliases. The first name or alias seen takes priority. For example, if the first input file has a "node1" column and the second input file has a "from" column, the two columns will be combined as the "node1" column in the output file.
Canonical Name | Alias Names |
---|---|
id | ID |
label | predicate relation relationship |
node1 | from subject |
node2 | to object |
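The merging rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not KGTK's implementation: the function names `merge_headers` and `concatenate` are hypothetical, and filling missing columns with empty values is an assumption.

```python
# Illustrative sketch only; not KGTK's actual code.
ALIASES = {
    "id": ["ID"],
    "label": ["predicate", "relation", "relationship"],
    "node1": ["from", "subject"],
    "node2": ["to", "object"],
}

# Map every canonical name and every alias to its canonical name.
CANONICAL = {
    name: canon for canon, names in ALIASES.items() for name in [canon, *names]
}


def merge_headers(headers):
    """Return the merged output columns and one index map per input header."""
    output = []        # output column names; the first name or alias seen wins
    slot_by_key = {}   # canonical key -> index in the output columns
    maps = []
    for header in headers:
        mapping = []
        for name in header:
            key = CANONICAL.get(name, name)  # lookups are case sensitive
            if key not in slot_by_key:
                slot_by_key[key] = len(output)
                output.append(name)
            mapping.append(slot_by_key[key])
        maps.append(mapping)
    return output, maps


def concatenate(files):
    """files is a list of (header, rows); rows are copied without deduplication."""
    output, maps = merge_headers([header for header, _ in files])
    merged = []
    for (_, rows), mapping in zip(files, maps):
        for row in rows:
            out = [""] * len(output)  # assumption: missing columns are left empty
            for value, index in zip(row, mapping):
                out[index] = value
            merged.append(out)
    return output, merged


first = (["node1", "label", "node2"], [["terminator", "genre", "action"]])
second = (["from", "label", "to", "years"], [["terminator", "instance_of", "film", "3"]])
columns, rows = concatenate([first, second])
print(columns)
print(rows)
```

Because the first file uses "node1" and "node2", those names are kept in the output even though the second file uses the aliases "from" and "to".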
KGTK File Modes¶
Normally, the files being combined must be either all KGTK edge files or all KGTK node files. kgtk cat will complain if an input file is not a KGTK edge or node file, or if kgtk cat is given a mixture of KGTK edge and node files.
These constraints can be overridden with the expert option --mode=NONE.
Input File Format¶
Although KGTK commands use the KGTK File Format as their primary file format, input files can be read in another supported file format using the expert option --input-format INPUT_FORMAT, where INPUT_FORMAT is one of the format names shown in the table below:
Format | Extension | Description |
---|---|---|
kgtk | .kgtk or .tsv | KGTK tab separated values file format. |
csv | .csv | A simple comma separated value file with doubled quoting and column headers. |
When the --input-format option has not been specified, the default is to use kgtk format for input files unless the filename extension (suffix) is .csv (optionally followed by one of the compressed file extensions; see below).
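The default rule can be sketched as a small Python function. This is an illustration under stated assumptions, not KGTK's code: the name `guess_input_format` is hypothetical, and the compression extensions are the ones listed in the decompression table later in this document.

```python
# Illustrative sketch only; not KGTK's actual code.
COMPRESSION_EXTENSIONS = (".bz2", ".gz", ".lz4", ".xz")


def guess_input_format(filename, input_format=None):
    """Pick the input format the way the paragraph above describes."""
    if input_format is not None:
        # An explicit --input-format value always wins.
        return input_format
    name = filename
    for extension in COMPRESSION_EXTENSIONS:
        if name.endswith(extension):
            # Ignore one trailing compression extension, e.g. ".csv.gz".
            name = name[: -len(extension)]
            break
    return "csv" if name.endswith(".csv") else "kgtk"


print(guess_input_format("people.tsv"))
print(guess_input_format("elements.csv.gz"))
print(guess_input_format("elements.csv", input_format="kgtk"))
```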
Note
The expert option --input-format INPUT_FORMAT applies to all input files in the kgtk cat command, so it is not possible at present to use kgtk cat to combine a file in KGTK format with a file in CSV format. It is necessary to convert all input files to a common input format before using kgtk cat to combine them (although their compression format may vary, as described below).
Note
CSV input file conversion is very simple at the moment. It may be extended in the future to accommodate KGTK datatypes such as date/time.
Input File Decompression¶
Input files may be decompressed using an algorithm selected by the filename extension. The following compression algorithms are supported:
Extension | Algorithm |
---|---|
.bz2 | bzip2 |
.gz | gzip |
.lz4 | LZ4 |
.xz | XZ Utils, based on LZMA |
When used, compression filename extensions must appear after any other filename extensions, e.g. .kgtk.gz, .csv.gz.
Decompression may also be selected using the --compression-type COMPRESSION_TYPE option. This is an expert option which does not appear in the normal usage message (shown below). The COMPRESSION_TYPE value is one of the extension values shown in the table above, with or without the leading period. This option may be used to specify decompression of standard input (-).
If --compression-type is not specified and the filename extension is not a recognized compression filename extension, the input file will not be decompressed.
Note
When the --compression-type expert option is specified, all input files will be decompressed using the specified compression type, ignoring their file extensions.
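As a rough illustration of extension-based decompression, the following Python sketch opens a text file with the standard-library `gzip`, `bz2`, or `lzma` modules. It is not KGTK's implementation: the function name `open_input` is hypothetical, and LZ4 is left out because it has no standard-library module.

```python
# Illustrative sketch only; not KGTK's actual code.
import bz2
import gzip
import lzma
import os
import tempfile

OPENERS = {"bz2": bz2.open, "gz": gzip.open, "xz": lzma.open}
KNOWN = ("bz2", "gz", "lz4", "xz")


def open_input(filename, compression_type=None):
    """Open a text file, decompressing by extension or by an explicit type."""
    if compression_type is None:
        key = os.path.splitext(filename)[1].lstrip(".")
        if key not in KNOWN:
            # Unrecognized extension: read the file without decompression.
            return open(filename, "rt", encoding="utf-8")
    else:
        # An explicit type overrides the extension; the leading period is optional.
        key = compression_type.lstrip(".")
    if key not in OPENERS:
        raise ValueError(f"compression type not handled in this sketch: {key}")
    return OPENERS[key](filename, "rt", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("node1\tlabel\tnode2\n")
    with open_input(path) as f:
        header = f.readline().rstrip("\n").split("\t")

print(header)
```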
Note
At the present time, decompression is not supported for file descriptor input files (filenames that begin with <, followed by a file descriptor number).
Output File Format¶
Although KGTK commands use the KGTK File Format as their primary file format, the output file can be written in a selection of formats other than KGTK format by using the --output-format FORMAT option, where FORMAT is one of the values in the table shown below.
Format | Extension | Description |
---|---|---|
kgtk | .kgtk or .tsv | KGTK tab separated values file format. |
csv | .csv | A simple comma separated value file with doubled quoting and column headers. |
md | .md | GitHub markdown tables. |
json | .json | JSON list of lists of strings with column header line. |
json-map | (none) | JSON list of maps from column names to string values. |
json-map-compact | (none) | JSON list of maps from column names to string values with empty values suppressed. |
jsonl | .jsonl | JSON lines of lists of strings with column header line. |
jsonl-map | (none) | JSON lines of maps from column names to string values. |
jsonl-map-compact | (none) | JSON lines of maps from column names to string values with empty values suppressed. |
tsv | (none) | Tab separated values. Dates have their sigils removed, and strings have the backslash escape removed before pipes. |
tsv-csvlike | (none) | Tab separated values. Dates have their sigils removed, and strings are transformed into CSV-like double quoted strings, losing the language code if present. |
tsv-unquoted | (none) | Tab separated values. Dates have their sigils removed, and strings have their content exposed without quotes and without escapes before pipes. |
tsv-unquoted-ep | (none) | Tab separated values. Dates have their sigils removed, and strings have their content exposed without quotes; pipes retain their preceding escapes. |
Output formats may also be selected by the filename extension on the output file if --output-format has not been specified. For example, writing an output file with the extension .csv will automatically generate an output file in CSV format. Any unrecognized extensions default to kgtk format unless overridden by the --output-format option.
Note
The csv and json* formats use very primitive conversions at the present time, which do not provide proper treatment for different data types: booleans, numbers, strings.
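To make the difference between the JSON-based formats concrete, here is a small Python sketch built from the descriptions in the table above. The rows are hypothetical sample data, the function names are made up for illustration, and the exact whitespace and layout that KGTK writes may differ.

```python
# Illustrative sketch only; the layout KGTK actually writes may differ.
import json

columns = ["node1", "label", "node2", "years"]
rows = [
    ["terminator", "instance_of", "film", "3"],
    ["terminator", "genre", "action", ""],  # hypothetical row with an empty value
]


def to_json(columns, rows):
    """json: a list of lists of strings, with the column header first."""
    return json.dumps([columns] + rows)


def to_json_map(columns, rows, compact=False):
    """json-map: a list of maps; json-map-compact drops empty values."""
    maps = []
    for row in rows:
        item = dict(zip(columns, row))
        if compact:
            item = {name: value for name, value in item.items() if value != ""}
        maps.append(item)
    return json.dumps(maps)


def to_jsonl(columns, rows):
    """jsonl: one JSON list per line, with the column header first."""
    return "\n".join(json.dumps(line) for line in [columns] + rows)


print(to_json(columns, rows))
print(to_json_map(columns, rows))
print(to_json_map(columns, rows, compact=True))
print(to_jsonl(columns, rows))
```

All values stay strings in this sketch, which mirrors the note above about the conversions not distinguishing booleans, numbers, and strings.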
Output File Compression¶
Output files may be compressed using an algorithm selected by the file extension. The following compression algorithms are supported:
Extension | Algorithm |
---|---|
.bz2 | bzip2 |
.gz | gzip |
.lz4 | LZ4 |
.xz | XZ Utils, based on LZMA |
When specified, compression format extensions must appear after output format selection extensions, e.g. .kgtk.gz, .csv.gz, .json.bz2.
Note
At the present time, the --compression-type COMPRESSION_TYPE option does not affect output files. Standard output (-) and file descriptor output files (filenames that begin with >, followed by a file descriptor number) will not be compressed. This behavior may change at a later date.
Fast Copies¶
When certain conditions are met, kgtk cat will use Unix system utilities to perform decompression, concatenation, and compression. The major constraints are:
- The input files must have the same column header names (allowing for aliases) and order.
- The input files must come from the filesystem, not standard input or a file descriptor number.
- The input files must meet a minimum total size.
- Various checking options must not be turned on.
- The files must contain column name headers that are not overridden by command line options.
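The constraints listed above can be read as a simple eligibility check. The Python sketch below is an interpretation of that list, not KGTK's actual logic: the function `can_fast_copy` and its parameters are hypothetical, and the 10000-byte default is taken from the --fast-copy-min-size description in the usage message.

```python
# Illustrative sketch only; not KGTK's actual code.
CANONICAL = {
    "ID": "id",
    "predicate": "label", "relation": "label", "relationship": "label",
    "from": "node1", "subject": "node1",
    "to": "node2", "object": "node2",
}


def normalize(header):
    """Replace aliases with canonical names so headers can be compared."""
    return [CANONICAL.get(name, name) for name in header]


def can_fast_copy(inputs, min_size=10000, pure_python=False,
                  checks_enabled=False, names_overridden=False):
    """inputs is a list of (path, header, size_in_bytes) tuples."""
    if pure_python or checks_enabled or names_overridden or not inputs:
        return False
    # Inputs must come from the filesystem, not stdin or a file descriptor.
    if any(path == "-" or path.startswith("<") for path, _, _ in inputs):
        return False
    # Same column names (allowing for aliases) in the same order.
    first = normalize(inputs[0][1])
    if any(normalize(header) != first for _, header, _ in inputs[1:]):
        return False
    # Assumption: the minimum applies to the total size of all inputs.
    return sum(size for _, _, size in inputs) >= min_size


inputs = [
    ("a.tsv", ["node1", "label", "node2"], 6000),
    ("b.tsv", ["from", "label", "to"], 5000),
]
print(can_fast_copy(inputs))
print(can_fast_copy(inputs, min_size=20000))
```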
Usage¶
usage: kgtk cat [-h] [-i INPUT_FILE [INPUT_FILE ...]] [-o OUTPUT_FILE]
[--output-format {csv,html,html-compact,json,json-map,json-map-compact,jsonl,jsonl-map,jsonl-map-compact,kgtk,md,table,tsv,tsv-csvlike,tsv-unquoted,tsv-unquoted-ep}]
[--pure-python [True|False]]
[--fast-copy-min-size FAST_COPY_MIN_SIZE]
[-v [optional True|False]]
Concatenate two or more KGTK files, merging the columns appropriately. All files must be KGTK edge files or all files must be KGTK node files (unless overridden with --mode=NONE).
Additional options are shown in expert help.
kgtk --expert cat --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]
KGTK input files (May be omitted or '-' for stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--output-format {csv,html,html-compact,json,json-map,json-map-compact,jsonl,jsonl-map,jsonl-map-compact,kgtk,md,table,tsv,tsv-csvlike,tsv-unquoted,tsv-unquoted-ep}
The file format (default=kgtk)
--pure-python [True|False]
When True, use Python code. (default=False)
--fast-copy-min-size FAST_COPY_MIN_SIZE
The minium number of bytes before using OS tools for
fast copy (default=10000).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
Sample Data¶
Suppose that movies_reduced.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/movies_reduced.tsv
id | node1 | label | node2 |
---|---|---|---|
t1 | terminator | label | 'The Terminator'@en |
t2 | terminator | instance_of | film |
t3 | terminator | genre | action |
t4 | terminator | genre | science_fiction |
t5 | terminator | publication_date | ^1984-10-26T00:00:00Z/11 |
t6 | t5 | location | united_states |
t7 | terminator | publication_date | ^1985-02-08T00:00:00Z/11 |
t8 | t7 | location | sweden |
t9 | terminator | director | james_cameron |
t10 | terminator | cast | arnold_schwarzenegger |
t11 | t10 | role | terminator |
t12 | terminator | cast | michael_biehn |
t13 | t12 | role | kyle_reese |
t14 | terminator | cast | linda_hamilton |
t15 | t14 | role | sarah_connor |
t16 | terminator | duration | 108 |
t17 | terminator | award | national_film_registry |
t18 | t17 | point_in_time | ^2008-01-01T00:00:00Z/9 |
Suppose that tutorial_people_full.tsv contains the following table in KGTK format:
kgtk cat -i examples/docs/tutorial_people_full.tsv
id | node1 | label | node2 |
---|---|---|---|
h1 | james_cameron | label | "James Cameron" |
h2 | james_cameron | instance_of | human |
h3 | james_cameron | birth_date | ^1954-08-16T00:00:00Z/11 |
h4 | james_cameron | country | Canada |
h5 | arnold_schwarzenegger | label | "Arnold Schwarzenegger" |
h6 | arnold_schwarzenegger | instance_of | human |
h7 | arnold_schwarzenegger | birth_date | ^1947-07-30T00:00:00Z/11 |
h8 | arnold_schwarzenegger | country | "Austria" |
h9 | michael_biehn | label | "Michael Biehn" |
h10 | michael_biehn | instance_of | human |
h11 | michael_biehn | birth_date | ^1956-07-31T00:00:00Z/11 |
h12 | michael_biehn | country | "United States of America" |
h13 | linda_hamilton | label | "Linda Hamilton" |
h14 | linda_hamilton | instance_of | human |
h15 | linda_hamilton | birth_date | ^1956-09-26T00:00:00Z/11 |
h16 | linda_hamilton | country | "United States of America" |
h17 | edward_furlong | label | "Edward Furlong" |
h18 | edward_furlong | instance_of | human |
h19 | edward_furlong | birth_date | ^1977-08-02T00:00:00Z/11 |
h20 | edward_furlong | country | "United States of America" |
h21 | robert_patrick | label | "Robert Patrick" |
h22 | robert_patrick | instance_of | human |
h23 | robert_patrick | birth_date | ^1958-11-05T00:00:00Z/11 |
h24 | robert_patrick | country | "United States of America" |
Combine two KGTK files, sending the output to standard output.¶
These two files have only the 4 basic KGTK fields.
kgtk cat -i examples/docs/movies_reduced.tsv examples/docs/tutorial_people_full.tsv
The result will be the following file in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
t1 | terminator | label | 'The Terminator'@en |
t2 | terminator | instance_of | film |
t3 | terminator | genre | action |
t4 | terminator | genre | science_fiction |
t5 | terminator | publication_date | ^1984-10-26T00:00:00Z/11 |
t6 | t5 | location | united_states |
t7 | terminator | publication_date | ^1985-02-08T00:00:00Z/11 |
t8 | t7 | location | sweden |
t9 | terminator | director | james_cameron |
t10 | terminator | cast | arnold_schwarzenegger |
t11 | t10 | role | terminator |
t12 | terminator | cast | michael_biehn |
t13 | t12 | role | kyle_reese |
t14 | terminator | cast | linda_hamilton |
t15 | t14 | role | sarah_connor |
t16 | terminator | duration | 108 |
t17 | terminator | award | national_film_registry |
t18 | t17 | point_in_time | ^2008-01-01T00:00:00Z/9 |
h1 | james_cameron | label | "James Cameron" |
h2 | james_cameron | instance_of | human |
h3 | james_cameron | birth_date | ^1954-08-16T00:00:00Z/11 |
h4 | james_cameron | country | Canada |
h5 | arnold_schwarzenegger | label | "Arnold Schwarzenegger" |
h6 | arnold_schwarzenegger | instance_of | human |
h7 | arnold_schwarzenegger | birth_date | ^1947-07-30T00:00:00Z/11 |
h8 | arnold_schwarzenegger | country | "Austria" |
h9 | michael_biehn | label | "Michael Biehn" |
h10 | michael_biehn | instance_of | human |
h11 | michael_biehn | birth_date | ^1956-07-31T00:00:00Z/11 |
h12 | michael_biehn | country | "United States of America" |
h13 | linda_hamilton | label | "Linda Hamilton" |
h14 | linda_hamilton | instance_of | human |
h15 | linda_hamilton | birth_date | ^1956-09-26T00:00:00Z/11 |
h16 | linda_hamilton | country | "United States of America" |
h17 | edward_furlong | label | "Edward Furlong" |
h18 | edward_furlong | instance_of | human |
h19 | edward_furlong | birth_date | ^1977-08-02T00:00:00Z/11 |
h20 | edward_furlong | country | "United States of America" |
h21 | robert_patrick | label | "Robert Patrick" |
h22 | robert_patrick | instance_of | human |
h23 | robert_patrick | birth_date | ^1958-11-05T00:00:00Z/11 |
h24 | robert_patrick | country | "United States of America" |
Combine two gzipped KGTK files, sending the output to a bzip2 file.¶
kgtk cat -i examples/docs/movies_reduced.tsv.gz examples/docs/tutorial_people_full.tsv.gz -o ofile.tsv.bz2
Expert Topic: Processing Files Not in KGTK Format¶
Suppose that not-kgtk.tsv contains the following data not in KGTK format (--mode=NONE has been added to allow the file to be processed by kgtk cat):
kgtk cat -i examples/docs/not-kgtk.tsv --mode=NONE
a | b | c | d |
---|---|---|---|
h21 | robert_patrick | label | "Robert Patrick" |
h22 | robert_patrick | instance_of | human |
h23 | robert_patrick | birth_date | ^1958-11-05T00:00:00Z/11 |
h24 | robert_patrick | country | "United States of America" |
Trying to run the command without --mode=NONE:
kgtk cat -i examples/docs/not-kgtk.tsv
will result in an error message:
In input 1 header 'a b c d': Missing required column: id | ID
Exit requested
We can force the kgtk cat command to process the file by using the --mode NONE option, as shown above.
Note
--mode NONE is implemented by KgtkReader. It can be used by many KGTK commands.
Read a CSV file¶
Here's an example of reading a CSV file, using the filename suffix to establish the file format:
kgtk cat -i examples/docs/cat-csv-file.csv --mode=NONE
AtomicNumber | Element | Symbol | AtomicMass | NumberofNeutrons | NumberofProtons | NumberofElectrons | Period | Group | Phase | Radioactive | Natural | Metal | Nonmetal | Metalloid | Type | AtomicRadius | Electronegativity | FirstIonization | Density | MeltingPoint | BoilingPoint | NumberOfIsotopes | Discoverer | Year | SpecificHeat | NumberofShells | NumberofValence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Hydrogen | H | 1.007 | 0 | 1 | 1 | 1 | 1 | gas | yes | yes | Nonmetal | 0.79 | 2.2 | 13.5984 | 8.99E-05 | 14.175 | 20.28 | 3 | Cavendish | 1766 | 14.304 | 1 | 1 | |||
2 | Helium | He | 4.002 | 2 | 2 | 2 | 1 | 18 | gas | yes | yes | NobleGas | 0.49 | 24.5874 | 1.79E-04 | 4.22 | 5 | Janssen | 1868 | 5.193 | 1 | ||||||
3 | Lithium | Li | 6.941 | 4 | 3 | 3 | 2 | 1 | solid | yes | yes | AlkaliMetal | 2.1 | 0.98 | 5.3917 | 5.34E-01 | 453.85 | 1615 | 5 | Arfvedson | 1817 | 3.582 | 2 | 1 | |||
4 | Beryllium | Be | 9.012 | 5 | 4 | 4 | 2 | 2 | solid | yes | yes | AlkalineEarthMetal | 1.4 | 1.57 | 9.3227 | 1.85E+00 | 1560.15 | 2742 | 6 | Vaulquelin | 1798 | 1.825 | 2 | 2 | |||
5 | Boron | B | 10.811 | 6 | 5 | 5 | 2 | 13 | solid | yes | yes | Metalloid | 1.2 | 2.04 | 8.298 | 2.34E+00 | 2573.15 | 4200 | 6 | Gay-Lussac | 1808 | 1.026 | 2 | 3 | |||
6 | Carbon | C | 12.011 | 6 | 6 | 6 | 2 | 14 | solid | yes | yes | Nonmetal | 0.91 | 2.55 | 11.2603 | 2.27E+00 | 3948.15 | 4300 | 7 | Prehistoric | 0.709 | 2 | 4 | ||||
7 | Nitrogen | N | 14.007 | 7 | 7 | 7 | 2 | 15 | gas | yes | yes | Nonmetal | 0.75 | 3.04 | 14.5341 | 1.25E-03 | 63.29 | 77.36 | 8 | Rutherford | 1772 | 1.04 | 2 | 5 | |||
8 | Oxygen | O | 15.999 | 8 | 8 | 8 | 2 | 16 | gas | yes | yes | Nonmetal | 0.65 | 3.44 | 13.6181 | 1.43E-03 | 50.5 | 90.2 | 8 | Priestley|Scheele | 1774 | 0.918 | 2 | 6 | |||
9 | Fluorine | F | 18.998 | 10 | 9 | 9 | 2 | 17 | gas | yes | yes | Halogen | 0.57 | 3.98 | 17.4228 | 1.70E-03 | 53.63 | 85.03 | 6 | Moissan | 1886 | 0.824 | 2 | 7 | |||
10 | Neon | Ne | 20.18 | 10 | 10 | 10 | 2 | 18 | gas | yes | yes | Noble Gas | 0.51 | 21.5645 | 9.00E-04 | 24.703 | 27.07 | 8 | Ramsay_and_Travers | 1898 | 1.03 | 2 | 8 | ||||
11 | Sodium | Na | 22.99 | 12 | 11 | 11 | 3 | 1 | solid | yes | yes | AlkaliMetal | 2.2 | 0.93 | 5.1391 | 9.71E-01 | 371.15 | 1156 | 7 | Davy | 1807 | 1.228 | 3 | 1 | |||
12 | Magnesium | Mg | 24.305 | 12 | 12 | 12 | 3 | 2 | solid | yes | yes | AlkalineEarthMetal | 1.7 | 1.31 | 7.6462 | 1.74E+00 | 923.15 | 1363 | 8 | Black | 1755 | 1.023 | 3 | 2 | |||
13 | Aluminum | Al | 26.982 | 14 | 13 | 13 | 3 | 13 | solid | yes | yes | Metal | 1.8 | 1.61 | 5.9858 | 2.70E+00 | 933.4 | 2792 | 8 | Wshler | 1827 | 0.897 | 3 | 3 | |||
14 | Silicon | Si | 28.086 | 14 | 14 | 14 | 3 | 14 | solid | yes | yes | Metalloid | 1.5 | 1.9 | 8.1517 | 2.33E+00 | 1683.15 | 3538 | 8 | Berzelius | 1824 | 0.705 | 3 | 4 | |||
15 | Phosphorus | P | 30.974 | 16 | 15 | 15 | 3 | 15 | solid | yes | yes | Nonmetal | 1.2 | 2.19 | 10.4867 | 1.82E+00 | 317.25 | 553 | 7 | BranBrand | 1669 | 0.769 | 3 | 5 | |||
16 | Sulfur | S | 32.065 | 16 | 16 | 16 | 3 | 16 | solid | yes | yes | Nonmetal | 1.1 | 2.58 | 10.36 | 2.07E+00 | 388.51 | 717.8 | 10 | Prehistoric | 0.71 | 3 | 6 | ||||
17 | Chlorine | Cl | 35.453 | 18 | 17 | 17 | 3 | 17 | gas | yes | yes | Halogen | 0.97 | 3.16 | 12.9676 | 3.21E-03 | 172.31 | 239.11 | 11 | Scheele | 1774 | 0.479 | 3 | 7 | |||
18 | Argon | Ar | 39.948 | 22 | 18 | 18 | 3 | 18 | gas | yes | yes | NobleGas | 0.88 | 15.7596 | 1.78E-03 | 83.96 | 87.3 | 8 | Rayleigh_and_Ramsay | 1894 | 0.52 | 3 | 8 |
Expert Topic: Adding Column Names¶
Suppose that you have a TSV (tab-separated values) data file that looks like a KGTK data file but without the header line. You can supply a header line with the expert option --force-column-names.
You can also use this option when concatenating several data files, so long as they are all missing header lines and should all have the same header line.
Consider the following input file:
kgtk cat -i examples/docs/no-header.tsv --mode=NONE
a | b | c | d |
---|---|---|---|
h21 | robert_patrick | label | "Robert Patrick" |
h22 | robert_patrick | instance_of | human |
h23 | robert_patrick | birth_date | ^1958-11-05T00:00:00Z/11 |
h24 | robert_patrick | country | "United States of America" |
We can supply a valid header line as follows:
kgtk cat -i examples/docs/no-header.tsv \
--force-column-names id node1 label node2
The result will be the following file in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
h21 | robert_patrick | label | "Robert Patrick" |
h22 | robert_patrick | instance_of | human |
h23 | robert_patrick | birth_date | ^1958-11-05T00:00:00Z/11 |
h24 | robert_patrick | country | "United States of America" |
Note
--force-column-names takes place before the input file is checked to see if it is a valid KGTK edge or node file. Since we supplied valid KGTK edge column names in the example above, --mode=NONE is no longer needed.
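The idea of supplying column names for a headerless file can be sketched in Python as follows. This is a minimal illustration, not KGTK's reader: the function `read_headerless` is hypothetical, and the in-memory sample stands in for a tab-separated file with no header line.

```python
# Illustrative sketch only; not KGTK's actual code.
import io


def read_headerless(stream, column_names):
    """Read tab-separated lines, treating every line as data."""
    records = []
    for line in stream:
        values = line.rstrip("\n").split("\t")
        if len(values) != len(column_names):
            raise ValueError(f"expected {len(column_names)} columns: {line!r}")
        records.append(dict(zip(column_names, values)))
    return records


data = io.StringIO(
    'h21\trobert_patrick\tlabel\t"Robert Patrick"\n'
    "h22\trobert_patrick\tinstance_of\thuman\n"
)
records = read_headerless(data, ["id", "node1", "label", "node2"])
print(records[0])
```

Because the names are attached before any validation would happen, the resulting records already carry the id, node1, label, and node2 columns.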
Expert Topic: Renaming Column Names on Input¶
There is a special KGTK command, kgtk rename-columns, for renaming columns. However, you may want to rename columns while also using other features of the kgtk cat command, such as combining multiple input files or sampling data lines.
You have two main choices: override the column names on input, or rename the column names on output.
Overriding the column names on input can be done by skipping the existing header record and supplying a replacement list of column names.
kgtk cat -i examples/docs/not-kgtk.tsv \
--force-column-names id node1 label node2
The result will be the following file in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
h21 | robert_patrick | label | "Robert Patrick" |
h22 | robert_patrick | instance_of | human |
h23 | robert_patrick | birth_date | ^1958-11-05T00:00:00Z/11 |
h24 | robert_patrick | country | "United States of America" |
Note
When you rename columns on input, the change applies to all input files: they all must have the same column layout, for which you will provide a new set of column names.
Expert Topic: Renaming All Column Names on Output¶
There is a special KGTK command, kgtk rename-columns, for renaming columns. However, you may want to rename columns while also using other features of the kgtk cat command, such as combining multiple input files or sampling data lines.
You have two main choices: override the column names on input, or rename the column names on output.
For example, suppose your input file contained the following table in almost KGTK format:
kgtk cat -i examples/docs/movies_origin_destination.tsv --mode=NONE
origin | label | destination | years |
---|---|---|---|
terminator | label | 'The Terminator'@en | 4 |
terminator | instance_of | film | 3 |
Renaming the column names on output can be done in two ways. First, you can name all of the new column names using --output-columns.
kgtk cat -i examples/docs/movies_origin_destination.tsv --mode=NONE \
--output-columns node1 label node2 years
The result will be the following table in KGTK format:
node1 | label | node2 | years |
---|---|---|---|
terminator | label | 'The Terminator'@en | 4 |
terminator | instance_of | film | 3 |
Expert Topic: Renaming Selected Column Names on Output¶
Second, you can rename individual columns using --old-columns and --new-columns.
You want to rename the origin column to node1, and the destination column to node2, leaving the other column names alone.
kgtk cat -i examples/docs/movies_origin_destination.tsv --mode=NONE \
--old-columns origin destination \
--new-columns node1 node2
The result will be the following table in KGTK format:
node1 | label | node2 | years |
---|---|---|---|
terminator | label | 'The Terminator'@en | 4 |
terminator | instance_of | film | 3 |
Note
Renaming column names on output can be done when you combine a disparate set of KGTK files. The rename applies to the merged set of column names computed by kgtk cat.
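The selective rename shown above pairs each old name with a new name by position. The following Python sketch illustrates that pairing; the function `rename_columns` is hypothetical and is not KGTK's implementation, and the error handling is an assumption.

```python
# Illustrative sketch only; not KGTK's actual code.
def rename_columns(columns, old_names, new_names):
    """Rename selected columns, leaving the other names alone."""
    if len(old_names) != len(new_names):
        raise ValueError("old and new column lists must have the same length")
    missing = [name for name in old_names if name not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    renames = dict(zip(old_names, new_names))
    return [renames.get(name, name) for name in columns]


renamed = rename_columns(
    ["origin", "label", "destination", "years"],
    ["origin", "destination"],
    ["node1", "node2"],
)
print(renamed)
```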
Expert Topic: Data Sampling: head¶
Limit the number of records read (like head).
kgtk cat -i examples/docs/movies_reduced.tsv --record-limit 4
The result will be the following table in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
t1 | terminator | label | 'The Terminator'@en |
t2 | terminator | instance_of | film |
t3 | terminator | genre | action |
t4 | terminator | genre | science_fiction |
Expert Topic: Data Sampling: skip¶
Skip some number of initial records, then begin processing.
kgtk cat -i examples/docs/movies_reduced.tsv --initial-skip-count 4
The result will be the following table in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
t5 | terminator | publication_date | ^1984-10-26T00:00:00Z/11 |
t6 | t5 | location | united_states |
t7 | terminator | publication_date | ^1985-02-08T00:00:00Z/11 |
t8 | t7 | location | sweden |
t9 | terminator | director | james_cameron |
t10 | terminator | cast | arnold_schwarzenegger |
t11 | t10 | role | terminator |
t12 | terminator | cast | michael_biehn |
t13 | t12 | role | kyle_reese |
t14 | terminator | cast | linda_hamilton |
t15 | t14 | role | sarah_connor |
t16 | terminator | duration | 108 |
t17 | terminator | award | national_film_registry |
t18 | t17 | point_in_time | ^2008-01-01T00:00:00Z/9 |
Expert Topic: Data Sampling: tail¶
Process the last n records relative to the end (like tail). You must know the number of data records in the file (the number of lines in the file minus the header line).
kgtk cat -i examples/docs/movies_reduced.tsv --record-limit 15 --tail-count 3
The result will be the following table in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
t13 | t12 | role | kyle_reese |
t14 | terminator | cast | linda_hamilton |
t15 | t14 | role | sarah_connor |
Note
If both --initial-skip-count # and --record-limit # --tail-count # are specified, the number of records skipped will be the maximum of the initial skip count and (record limit minus tail count).
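The skip arithmetic in the note can be worked through with a short Python sketch. It is an interpretation of the examples in this document, not KGTK's code: the function `select_records` is hypothetical, and it assumes the record limit counts records read, including skipped ones.

```python
# Illustrative sketch only; not KGTK's actual code.
def select_records(records, initial_skip_count=0, record_limit=None, tail_count=None):
    """Apply the head, skip, and tail sampling options described above."""
    if record_limit is not None:
        # Assumption: the limit counts records read, including skipped ones.
        records = records[:record_limit]
    skip = initial_skip_count
    if record_limit is not None and tail_count is not None:
        # Skip the maximum of the initial skip count and (limit minus tail count).
        skip = max(skip, record_limit - tail_count)
    return records[skip:]


records = [f"t{i}" for i in range(1, 19)]  # ids t1 .. t18, as in movies_reduced.tsv

print(select_records(records, record_limit=4))
print(select_records(records, initial_skip_count=4)[:2])
print(select_records(records, record_limit=15, tail_count=3))
```

With a record limit of 15 and a tail count of 3, the skip is max(0, 15 - 3) = 12, so records 13 through 15 are kept, which matches the example output above.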
Expert Topic: Data Sampling: every n¶
Process every nth record (after skipping, but calculated relative to the count of data lines read before skipping). The following example will process every second line.
kgtk cat -i examples/docs/movies_reduced.tsv --every-nth-record 2
The result will be the following table in KGTK format:
id | node1 | label | node2 |
---|---|---|---|
t2 | terminator | instance_of | film |
t4 | terminator | genre | science_fiction |
t6 | t5 | location | united_states |
t8 | t7 | location | sweden |
t10 | terminator | cast | arnold_schwarzenegger |
t12 | terminator | cast | michael_biehn |
t14 | terminator | cast | linda_hamilton |
t16 | terminator | duration | 108 |
t18 | t17 | point_in_time | ^2008-01-01T00:00:00Z/9 |
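The selection rule can be sketched in Python as follows. This is one reading of the description above, not KGTK's implementation: the function `every_nth` is hypothetical, and the count is 1-based and includes skipped records.

```python
# Illustrative sketch only; not KGTK's actual code.
def every_nth(records, n, initial_skip_count=0):
    """Keep records whose 1-based position is a multiple of n."""
    selected = []
    for count, record in enumerate(records, start=1):
        if count <= initial_skip_count:
            continue  # skipped records still advance the count
        if count % n == 0:
            selected.append(record)
    return selected


records = [f"t{i}" for i in range(1, 19)]  # ids t1 .. t18, as in movies_reduced.tsv
print(every_nth(records, 2))
print(every_nth(records, 2, initial_skip_count=4))
```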
Expert Topic: Converting a CSV File to a quasi-KGTK File¶
The expert option --input-format csv may be used to read an input file in CSV (comma-separated values) format. The expert option --mode=NONE will also be needed if the input file does not have the required columns of a KGTK edge or node file.
kgtk cat -i examples/docs/periodic_table_of_elements_1-18.csv \
--input-format csv --mode=NONE
The result will be the following table in quasi-KGTK format:
AtomicNumber | Element | Symbol | AtomicMass | NumberofNeutrons | NumberofProtons | NumberofElectrons | Period | Group | Phase | Radioactive | Natural | Metal | Nonmetal | Metalloid | Type | AtomicRadius | Electronegativity | FirstIonization | Density | MeltingPoint | BoilingPoint | NumberOfIsotopes | Discoverer | Year | SpecificHeat | NumberofShells | NumberofValence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Hydrogen | H | 1.007 | 0 | 1 | 1 | 1 | 1 | gas | yes | yes | Nonmetal | 0.79 | 2.2 | 13.5984 | 8.99E-05 | 14.175 | 20.28 | 3 | Cavendish | 1766 | 14.304 | 1 | 1 | |||
2 | Helium | He | 4.002 | 2 | 2 | 2 | 1 | 18 | gas | yes | yes | NobleGas | 0.49 | 24.5874 | 1.79E-04 | 4.22 | 5 | Janssen | 1868 | 5.193 | 1 | ||||||
3 | Lithium | Li | 6.941 | 4 | 3 | 3 | 2 | 1 | solid | yes | yes | AlkaliMetal | 2.1 | 0.98 | 5.3917 | 5.34E-01 | 453.85 | 1615 | 5 | Arfvedson | 1817 | 3.582 | 2 | 1 | |||
4 | Beryllium | Be | 9.012 | 5 | 4 | 4 | 2 | 2 | solid | yes | yes | AlkalineEarthMetal | 1.4 | 1.57 | 9.3227 | 1.85E+00 | 1560.15 | 2742 | 6 | Vaulquelin | 1798 | 1.825 | 2 | 2 | |||
5 | Boron | B | 10.811 | 6 | 5 | 5 | 2 | 13 | solid | yes | yes | Metalloid | 1.2 | 2.04 | 8.298 | 2.34E+00 | 2573.15 | 4200 | 6 | Gay-Lussac | 1808 | 1.026 | 2 | 3 | |||
6 | Carbon | C | 12.011 | 6 | 6 | 6 | 2 | 14 | solid | yes | yes | Nonmetal | 0.91 | 2.55 | 11.2603 | 2.27E+00 | 3948.15 | 4300 | 7 | Prehistoric | 0.709 | 2 | 4 | ||||
7 | Nitrogen | N | 14.007 | 7 | 7 | 7 | 2 | 15 | gas | yes | yes | Nonmetal | 0.75 | 3.04 | 14.5341 | 1.25E-03 | 63.29 | 77.36 | 8 | Rutherford | 1772 | 1.04 | 2 | 5 | |||
8 | Oxygen | O | 15.999 | 8 | 8 | 8 | 2 | 16 | gas | yes | yes | Nonmetal | 0.65 | 3.44 | 13.6181 | 1.43E-03 | 50.5 | 90.2 | 8 | Priestley|Scheele | 1774 | 0.918 | 2 | 6 | |||
9 | Fluorine | F | 18.998 | 10 | 9 | 9 | 2 | 17 | gas | yes | yes | Halogen | 0.57 | 3.98 | 17.4228 | 1.70E-03 | 53.63 | 85.03 | 6 | Moissan | 1886 | 0.824 | 2 | 7 | |||
10 | Neon | Ne | 20.18 | 10 | 10 | 10 | 2 | 18 | gas | yes | yes | Noble Gas | 0.51 | 21.5645 | 9.00E-04 | 24.703 | 27.07 | 8 | Ramsay_and_Travers | 1898 | 1.03 | 2 | 8 | ||||
11 | Sodium | Na | 22.99 | 12 | 11 | 11 | 3 | 1 | solid | yes | yes | AlkaliMetal | 2.2 | 0.93 | 5.1391 | 9.71E-01 | 371.15 | 1156 | 7 | Davy | 1807 | 1.228 | 3 | 1 | |||
12 | Magnesium | Mg | 24.305 | 12 | 12 | 12 | 3 | 2 | solid | yes | yes | AlkalineEarthMetal | 1.7 | 1.31 | 7.6462 | 1.74E+00 | 923.15 | 1363 | 8 | Black | 1755 | 1.023 | 3 | 2 | |||
13 | Aluminum | Al | 26.982 | 14 | 13 | 13 | 3 | 13 | solid | yes | yes | Metal | 1.8 | 1.61 | 5.9858 | 2.70E+00 | 933.4 | 2792 | 8 | Wshler | 1827 | 0.897 | 3 | 3 | |||
14 | Silicon | Si | 28.086 | 14 | 14 | 14 | 3 | 14 | solid | yes | yes | Metalloid | 1.5 | 1.9 | 8.1517 | 2.33E+00 | 1683.15 | 3538 | 8 | Berzelius | 1824 | 0.705 | 3 | 4 | |||
15 | Phosphorus | P | 30.974 | 16 | 15 | 15 | 3 | 15 | solid | yes | yes | Nonmetal | 1.2 | 2.19 | 10.4867 | 1.82E+00 | 317.25 | 553 | 7 | BranBrand | 1669 | 0.769 | 3 | 5 | |||
16 | Sulfur | S | 32.065 | 16 | 16 | 16 | 3 | 16 | solid | yes | yes | Nonmetal | 1.1 | 2.58 | 10.36 | 2.07E+00 | 388.51 | 717.8 | 10 | Prehistoric | 0.71 | 3 | 6 | ||||
17 | Chlorine | Cl | 35.453 | 18 | 17 | 17 | 3 | 17 | gas | yes | yes | Halogen | 0.97 | 3.16 | 12.9676 | 3.21E-03 | 172.31 | 239.11 | 11 | Scheele | 1774 | 0.479 | 3 | 7 | |||
18 | Argon | Ar | 39.948 | 22 | 18 | 18 | 3 | 18 | gas | yes | yes | NobleGas | 0.88 | 15.7596 | 1.78E-03 | 83.96 | 87.3 | 8 | Rayleigh_and_Ramsay | 1894 | 0.52 | 3 | 8 |
Expert Topic: Converting a CSV File to a KGTK File¶
The expert option --input-format csv may be used to read an input file in CSV (comma-separated values) format. The expert option --mode=NONE will also be needed if the input file does not have the required columns of a KGTK edge or node file.
If we want the output file to be a KGTK file instead of a quasi-KGTK file, and one of the columns is suitable to use as an id column, we can rename that column on input or output. In this example, we rename the specific column on output.
kgtk cat -i examples/docs/periodic_table_of_elements_1-18.csv \
--input-format csv --mode=NONE \
--old-column AtomicNumber \
--new-column id
The result will be the following table in KGTK format:
id | Element | Symbol | AtomicMass | NumberofNeutrons | NumberofProtons | NumberofElectrons | Period | Group | Phase | Radioactive | Natural | Metal | Nonmetal | Metalloid | Type | AtomicRadius | Electronegativity | FirstIonization | Density | MeltingPoint | BoilingPoint | NumberOfIsotopes | Discoverer | Year | SpecificHeat | NumberofShells | NumberofValence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Hydrogen | H | 1.007 | 0 | 1 | 1 | 1 | 1 | gas | yes | yes | Nonmetal | 0.79 | 2.2 | 13.5984 | 8.99E-05 | 14.175 | 20.28 | 3 | Cavendish | 1766 | 14.304 | 1 | 1 | |||
2 | Helium | He | 4.002 | 2 | 2 | 2 | 1 | 18 | gas | yes | yes | NobleGas | 0.49 | 24.5874 | 1.79E-04 | 4.22 | 5 | Janssen | 1868 | 5.193 | 1 | ||||||
3 | Lithium | Li | 6.941 | 4 | 3 | 3 | 2 | 1 | solid | yes | yes | AlkaliMetal | 2.1 | 0.98 | 5.3917 | 5.34E-01 | 453.85 | 1615 | 5 | Arfvedson | 1817 | 3.582 | 2 | 1 | |||
4 | Beryllium | Be | 9.012 | 5 | 4 | 4 | 2 | 2 | solid | yes | yes | AlkalineEarthMetal | 1.4 | 1.57 | 9.3227 | 1.85E+00 | 1560.15 | 2742 | 6 | Vaulquelin | 1798 | 1.825 | 2 | 2 | |||
5 | Boron | B | 10.811 | 6 | 5 | 5 | 2 | 13 | solid | yes | yes | Metalloid | 1.2 | 2.04 | 8.298 | 2.34E+00 | 2573.15 | 4200 | 6 | Gay-Lussac | 1808 | 1.026 | 2 | 3 | |||
6 | Carbon | C | 12.011 | 6 | 6 | 6 | 2 | 14 | solid | yes | yes | Nonmetal | 0.91 | 2.55 | 11.2603 | 2.27E+00 | 3948.15 | 4300 | 7 | Prehistoric | 0.709 | 2 | 4 | ||||
7 | Nitrogen | N | 14.007 | 7 | 7 | 7 | 2 | 15 | gas | yes | yes | Nonmetal | 0.75 | 3.04 | 14.5341 | 1.25E-03 | 63.29 | 77.36 | 8 | Rutherford | 1772 | 1.04 | 2 | 5 | |||
8 | Oxygen | O | 15.999 | 8 | 8 | 8 | 2 | 16 | gas | yes | yes | Nonmetal | 0.65 | 3.44 | 13.6181 | 1.43E-03 | 50.5 | 90.2 | 8 | Priestley|Scheele | 1774 | 0.918 | 2 | 6 | |||
9 | Fluorine | F | 18.998 | 10 | 9 | 9 | 2 | 17 | gas | yes | yes | Halogen | 0.57 | 3.98 | 17.4228 | 1.70E-03 | 53.63 | 85.03 | 6 | Moissan | 1886 | 0.824 | 2 | 7 | |||
10 | Neon | Ne | 20.18 | 10 | 10 | 10 | 2 | 18 | gas | yes | yes | Noble Gas | 0.51 | 21.5645 | 9.00E-04 | 24.703 | 27.07 | 8 | Ramsay_and_Travers | 1898 | 1.03 | 2 | 8 | ||||
11 | Sodium | Na | 22.99 | 12 | 11 | 11 | 3 | 1 | solid | yes | yes | AlkaliMetal | 2.2 | 0.93 | 5.1391 | 9.71E-01 | 371.15 | 1156 | 7 | Davy | 1807 | 1.228 | 3 | 1 | |||
12 | Magnesium | Mg | 24.305 | 12 | 12 | 12 | 3 | 2 | solid | yes | yes | AlkalineEarthMetal | 1.7 | 1.31 | 7.6462 | 1.74E+00 | 923.15 | 1363 | 8 | Black | 1755 | 1.023 | 3 | 2 | |||
13 | Aluminum | Al | 26.982 | 14 | 13 | 13 | 3 | 13 | solid | yes | yes | Metal | 1.8 | 1.61 | 5.9858 | 2.70E+00 | 933.4 | 2792 | 8 | Wshler | 1827 | 0.897 | 3 | 3 | |||
14 | Silicon | Si | 28.086 | 14 | 14 | 14 | 3 | 14 | solid | yes | yes | Metalloid | 1.5 | 1.9 | 8.1517 | 2.33E+00 | 1683.15 | 3538 | 8 | Berzelius | 1824 | 0.705 | 3 | 4 | |||
15 | Phosphorus | P | 30.974 | 16 | 15 | 15 | 3 | 15 | solid | yes | yes | Nonmetal | 1.2 | 2.19 | 10.4867 | 1.82E+00 | 317.25 | 553 | 7 | BranBrand | 1669 | 0.769 | 3 | 5 | |||
16 | Sulfur | S | 32.065 | 16 | 16 | 16 | 3 | 16 | solid | yes | yes | Nonmetal | 1.1 | 2.58 | 10.36 | 2.07E+00 | 388.51 | 717.8 | 10 | Prehistoric | 0.71 | 3 | 6 | ||||
17 | Chlorine | Cl | 35.453 | 18 | 17 | 17 | 3 | 17 | gas | yes | yes | Halogen | 0.97 | 3.16 | 12.9676 | 3.21E-03 | 172.31 | 239.11 | 11 | Scheele | 1774 | 0.479 | 3 | 7 | |||
18 | Argon | Ar | 39.948 | 22 | 18 | 18 | 3 | 18 | gas | yes | yes | NobleGas | 0.88 | 15.7596 | 1.78E-03 | 83.96 | 87.3 | 8 | Rayleigh_and_Ramsay | 1894 | 0.52 | 3 | 8 |
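The conversion above (read a CSV file, rename one header column, write tab-separated output) can be approximated in plain Python. This is a minimal sketch using only the standard library, not KGTK's own implementation; the function name `csv_to_tsv_with_rename` is hypothetical, and the sketch assumes that no value contains a tab or newline.

```python
import csv
import io


def csv_to_tsv_with_rename(csv_text, old_name, new_name):
    """Convert CSV text to tab-separated text, renaming one header column.

    Illustration only: values are copied unchanged and are assumed
    not to contain tab or newline characters.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Rename the matching column; every other column name is kept as-is.
    header = [new_name if name == old_name else name for name in header]
    lines = ["\t".join(header)]
    lines.extend("\t".join(row) for row in reader)
    return "\n".join(lines) + "\n"


sample = "AtomicNumber,Element,Symbol\n1,Hydrogen,H\n2,Helium,He\n"
print(csv_to_tsv_with_rename(sample, "AtomicNumber", "id"), end="")
```

With the three-column sample, the header becomes `id`, `Element`, `Symbol` and the data rows are passed through unchanged.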
Note
See kgtk add-id
for an example of
converting a CSV file without an id
column to a KGTK node file by adding an id
column.
Note
See kgtk normalize-nodes
for an example of
converting a CSV file without an id
column to a KGTK edge file.
Expert Topic: Implying a Label Column¶
It is not uncommon to encounter two-column files (TSV or CSV) which represent an
edge with an implied label
column value (predicate
). The --implied-label VALUE
option may be used to convert the input data into a three-column format.
Consider the following file, which lists some cities in the State of Massachusetts
and the year that they were founded. Since this is neither a KGTK edge file
nor a KGTK node file, we need to specify --mode=NONE
to bypass certain
validity checks:
kgtk cat --mode=NONE -i examples/docs/cat-two-columns.tsv
node1 | node2 |
---|---|
Boston | 1630 |
Concord | 1635 |
Scituate | 1636 |
Springfield | 1636 |
Cambridge | 1638 |
Lexington | 1642 |
Worcester | 1673 |
We can convert this file into a KGTK edge file on input by
specifying an implied label
column and value:
kgtk cat --implied-label=founded -i examples/docs/cat-two-columns.tsv
node1 | node2 | label |
---|---|---|
Boston | 1630 | founded |
Concord | 1635 | founded |
Scituate | 1636 | founded |
Springfield | 1636 | founded |
Cambridge | 1638 | founded |
Lexington | 1642 | founded |
Worcester | 1673 | founded |
Note
The --implied-label=VALUE
option is implemented by KgtkReader, and
can be used with most KGTK subcommands.
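The effect of an implied label can be illustrated with a few lines of plain Python. This is a sketch of the idea only (append a `label` column and fill it with one constant value), not the KgtkReader code; the function name `add_implied_label` is hypothetical.

```python
def add_implied_label(lines, label_value):
    """Append a 'label' column holding label_value to tab-separated lines.

    `lines` is a list of strings without trailing newlines; the first
    entry is the header. Illustration only, not KGTK's implementation.
    """
    header, *rows = [line.split("\t") for line in lines]
    if "label" in header:
        raise ValueError("the input already has a label column")
    result = ["\t".join(header + ["label"])]
    # Every data row receives the same constant label value.
    result.extend("\t".join(row + [label_value]) for row in rows)
    return result


two_columns = ["node1\tnode2", "Boston\t1630", "Concord\t1635"]
for line in add_implied_label(two_columns, "founded"):
    print(line)
```

The new column is appended after the existing ones, which matches the `node1`, `node2`, `label` column order in the example output above.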
Expert Topic: Suppressing the Output Header¶
Sometimes it is desired to produce a TSV file without an output header.
kgtk cat -i examples/docs/movies_reduced.tsv --no-output-header
The result will be the following file, which is in KGTK format except that the header line is omitted.
t1 | terminator | label | 'The Terminator'@en |
t2 | terminator | instance_of | film |
t3 | terminator | genre | action |
t4 | terminator | genre | science_fiction |
t5 | terminator | publication_date | ^1984-10-26T00:00:00Z/11 |
t6 | t5 | location | united_states |
t7 | terminator | publication_date | ^1985-02-08T00:00:00Z/11 |
t8 | t7 | location | sweden |
t9 | terminator | director | james_cameron |
t10 | terminator | cast | arnold_schwarzenegger |
t11 | t10 | role | terminator |
t12 | terminator | cast | michael_biehn |
t13 | t12 | role | kyle_reese |
t14 | terminator | cast | linda_hamilton |
t15 | t14 | role | sarah_connor |
t16 | terminator | duration | 108 |
t17 | terminator | award | national_film_registry |
t18 | t17 | point_in_time | ^2008-01-01T00:00:00Z/9 |
Expert Topic: Reading Files without Header Records: Supply Column Names¶
Sometimes you may wish to read a TSV file that does not contain a header record. If such a file is read without any header options, its first data line is treated as the header:
kgtk cat -i examples/docs/cat-file-without-header.tsv --mode=NONE
john | woke | ^2020-05-02T00:00 |
---|---|---|
john | woke | ^2020-05-00T00:00 |
john | slept | ^2020-05-02T24:00 |
lionheart | born | ^1157-09-08T00:00 |
year0001 | starts | ^0001-01-01T00:00 |
year9999 | ends | ^9999-12-31T11:59:59 |
Copy the file, supplying column names:
kgtk cat -i examples/docs/cat-file-without-header.tsv \
--input-column-names node1 label node2
The result will be the following file in KGTK format:
node1 | label | node2 |
---|---|---|
john | woke | ^2020-05-00T00:00 |
john | slept | ^2020-05-02T24:00 |
lionheart | born | ^1157-09-08T00:00 |
year0001 | starts | ^0001-01-01T00:00 |
year9999 | ends | ^9999-12-31T11:59:59 |
Expert Topic: Reading Files without Header Records: Automatic Column Names¶
Another approach to reading a file without a header record is to have KGTK
assign column names, which it will do beginning with COL1, COL2, etc. It is
necessary to tell KGTK how many columns are in the file. It is also necessary
to say --mode=NONE
, since the generated column names do not match the definition
of a KGTK edge file or node file.
kgtk cat -i examples/docs/cat-file-without-header.tsv \
--no-input-header \
--supply-missing-column-names \
--number-of-columns 3 \
--mode=NONE
COL1 | COL2 | COL3 |
---|---|---|
john | woke | ^2020-05-02T00:00 |
john | woke | ^2020-05-00T00:00 |
john | slept | ^2020-05-02T24:00 |
lionheart | born | ^1157-09-08T00:00 |
year0001 | starts | ^0001-01-01T00:00 |
year9999 | ends | ^9999-12-31T11:59:59 |
Expert Topic: Reading Files without Header Records: Empty Column Names¶
Assume that you have a TSV file with the right number of columns in the header record, but one or more missing column names.
kgtk cat -i examples/docs/cat-file-with-empty-column-names.tsv
In input 1 header 'node1 label node2 ': Column 3 has an empty name in the file header
Exit requested
You can ask the system to replace the empty column names with COLn:
kgtk cat -i examples/docs/cat-file-with-empty-column-names.tsv \
--supply-missing-column-names
node1 | label | node2 | COL4 | COL5 |
---|---|---|---|---|
john | observed | ^2020-05-02T00:00 | fever | cough |
john | observed | ^2020-05-00T00:00 | normal | normal |
john | observed | ^2020-05-02T24:00 | normal | cough |
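The naming rule visible in this output (an empty name in column position n becomes COLn, counting from 1) can be sketched in plain Python. This is an illustration inferred from the example output, not KGTK's implementation; the function name `supply_missing_column_names` is hypothetical.

```python
def supply_missing_column_names(header, prefix="COL"):
    """Replace each empty header name with prefix + its 1-based position."""
    return [
        name if name else f"{prefix}{position}"
        for position, name in enumerate(header, start=1)
    ]


# A five-column header whose last two names are empty.
header_line = "node1\tlabel\tnode2\t\t"
print(supply_missing_column_names(header_line.split("\t")))
# prints ['node1', 'label', 'node2', 'COL4', 'COL5']
```

Applying the same function to a list of empty names, such as `[""] * 3`, produces `COL1`, `COL2`, `COL3`, which mirrors the names generated for a file with no header record at all.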
Expert Topic: Requiring Certain Columns¶
Sometimes you may wish to require that an input file contains certain named columns that are essential to your analysis.
kgtk cat -i examples/docs/cat-edges-with-totals.tsv
node1 | label | node2 | node1;total |
---|---|---|---|
P10 | p585-count | 73 | 3879 |
P1000 | p585-count | 16 | 266 |
P101 | p585-count | 5 | 157519 |
P1018 | p585-count | 2 | 177 |
P102 | p585-count | 295 | 414726 |
P1025 | p585-count | 26 | 693 |
P1026 | p585-count | 40 | 6930 |
P1027 | p585-count | 14 | 10008 |
P1028 | p585-count | 1131 | 4035 |
P1029 | p585-count | 4 | 2643 |
P1035 | p585-count | 4 | 366 |
P1037 | p585-count | 60 | 9317 |
P1040 | p585-count | 1 | 45073 |
P1050 | p585-count | 246 | 226380 |
Suppose you require that the node1;total
column be present:
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
--require-column-names 'node1;total'
This will succeed:
node1 | label | node2 | node1;total |
---|---|---|---|
P10 | p585-count | 73 | 3879 |
P1000 | p585-count | 16 | 266 |
P101 | p585-count | 5 | 157519 |
P1018 | p585-count | 2 | 177 |
P102 | p585-count | 295 | 414726 |
P1025 | p585-count | 26 | 693 |
P1026 | p585-count | 40 | 6930 |
P1027 | p585-count | 14 | 10008 |
P1028 | p585-count | 1131 | 4035 |
P1029 | p585-count | 4 | 2643 |
P1035 | p585-count | 4 | 366 |
P1037 | p585-count | 60 | 9317 |
P1040 | p585-count | 1 | 45073 |
P1050 | p585-count | 246 | 226380 |
Suppose you also require that an 'average' column be present, but it is missing:
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
--require-column-names 'node1;total' average
This will result in an error message:
In input 1 header 'node1 label node2 node1;total': The following required columns were missing: ['average']
Exit requested
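The required-column check amounts to comparing the header against a list of names. The following is a minimal sketch in plain Python, not KGTK's code; the function name `missing_required_columns` is hypothetical, and the printed message only imitates the wording of the error shown above.

```python
def missing_required_columns(header, required):
    """Return the required names that do not appear in the header,
    in the order they were requested. Comparison is case sensitive."""
    return [name for name in required if name not in header]


header = ["node1", "label", "node2", "node1;total"]
missing = missing_required_columns(header, ["node1;total", "average"])
if missing:
    print(f"The following required columns were missing: {missing}")
```

An empty result means every required column is present and the copy can proceed.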
Expert Topic: Prohibiting Additional Columns¶
Sometimes you may want to prohibit additional columns. There are several special cases to consider:
- Prohibiting additional columns for a standard KGTK node file.
- Prohibiting additional columns for a standard KGTK edge file.
- Requiring certain additional columns and prohibiting others.
Suppose that you have a standard KGTK node or edge file and you wish to prohibit any additional columns.
Consider a standard KGTK node file without additional columns:
kgtk cat -i examples/docs/cat-nodes.tsv \
--no-additional-columns
This command will succeed:
id |
---|
P10 |
P100 |
P1000 |
A KGTK node file with unexpected additional columns will fail:
kgtk cat -i examples/docs/cat-nodes-and-titles.tsv \
--no-additional-columns
In input 1 header 'id titel': The following additional columns are unexpected: ['titel']
Exit requested
Consider a standard KGTK edge file without additional columns:
kgtk cat -i examples/docs/cat-edges.tsv \
--no-additional-columns
This command will succeed:
node1 | label | node2 |
---|---|---|
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
Note: The node1
, label
, and node2
columns (or their aliases) are allowed.
The id
column is also allowed, although it is not required.
Consider a KGTK edge file with an undesired additional column:
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
--no-additional-columns
This will fail with the following error message:
In input 1 header 'node1 label node2 node1;total': The following additional columns are unexpected: ['node1;total']
Exit requested
If we want to accept the node1;total
additional column, but prohibit
others, we can do so by explicitly listing the required columns. All
required columns (e.g., node1
, label
, and node2
) must be listed:
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
--require-column-names node1 label node2 'node1;total' \
--no-additional-columns
This will succeed:
node1 | label | node2 | node1;total |
---|---|---|---|
P10 | p585-count | 73 | 3879 |
P1000 | p585-count | 16 | 266 |
P101 | p585-count | 5 | 157519 |
P1018 | p585-count | 2 | 177 |
P102 | p585-count | 295 | 414726 |
P1025 | p585-count | 26 | 693 |
P1026 | p585-count | 40 | 6930 |
P1027 | p585-count | 14 | 10008 |
P1028 | p585-count | 1131 | 4035 |
P1029 | p585-count | 4 | 2643 |
P1035 | p585-count | 4 | 366 |
P1037 | p585-count | 60 | 9317 |
P1040 | p585-count | 1 | 45073 |
P1050 | p585-count | 246 | 226380 |
An unexpected additional column will fail:
kgtk cat -i examples/docs/cat-edges-with-totals-and-averages.tsv \
--require-column-names node1 label node2 'node1;total' \
--no-additional-columns
In input 1 header 'node1 label node2 node1;total average': The following additional columns are unexpected: ['average']
Exit requested
We can add the average
column to the list of required column names
and accept that file:
kgtk cat -i examples/docs/cat-edges-with-totals-and-averages.tsv \
--require-column-names node1 label node2 'node1;total' average \
--no-additional-columns
node1 | label | node2 | node1;total | average |
---|---|---|---|---|
P10 | p585-count | 73 | 3879 | 53.136986301369866 |
P1000 | p585-count | 16 | 266 | 16.625 |
P101 | p585-count | 5 | 157519 | 31503.8 |
P1018 | p585-count | 2 | 177 | 88.5 |
P102 | p585-count | 295 | 414726 | 1405.8508474576272 |
P1025 | p585-count | 26 | 693 | 26.653846153846153 |
P1026 | p585-count | 40 | 6930 | 173.25 |
P1027 | p585-count | 14 | 10008 | 714.8571428571429 |
P1028 | p585-count | 1131 | 4035 | 3.5676392572944295 |
P1029 | p585-count | 4 | 2643 | 660.75 |
P1035 | p585-count | 4 | 366 | 91.5 |
P1037 | p585-count | 60 | 9317 | 155.28333333333333 |
P1040 | p585-count | 1 | 45073 | 45073.0 |
P1050 | p585-count | 246 | 226380 | 920.2439024390244 |
Note
At the present time there is no option to list optional additional columns.
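The behavior shown in the edge file examples of this topic can be summarized as a set comparison. The sketch below is plain Python written from those examples, not from the KGTK source. It assumes that when required column names are listed they form the complete set of allowed names, and that otherwise the allowed names are `node1`, `label`, `node2`, and `id`. Aliases and node files are left out, and the function name `unexpected_columns` is hypothetical.

```python
# Columns an edge file may carry by default, per the note earlier in
# this topic. Aliases such as "from" or "to" are omitted for brevity.
DEFAULT_EDGE_COLUMNS = ("node1", "label", "node2", "id")


def unexpected_columns(header, required=None):
    """Return header names outside the allowed set, in header order.

    Assumption: an explicit `required` list replaces the default set.
    """
    allowed = set(required) if required else set(DEFAULT_EDGE_COLUMNS)
    return [name for name in header if name not in allowed]


header = ["node1", "label", "node2", "node1;total", "average"]
print(unexpected_columns(header, ["node1", "label", "node2", "node1;total"]))
# prints ['average']
```

Adding `average` to the required list makes the result empty, which corresponds to the last example above being accepted.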
Expert Topic: Requiring a Certain Number of Columns¶
Another way to ensure that a KGTK edge file has only
[ node1
, label
, node2
] columns is to require that the
file have 3 columns. More generally, you can require that
a file have a certain number of columns without having to name
all the columns individually.
kgtk cat -i examples/docs/cat-edges.tsv \
--number-of-columns 3
node1 | label | node2 |
---|---|---|
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
--number-of-columns 3
In input 1 header 'node1 label node2 node1;total': Expected 3 columns, got 4 in the header
Exit requested
kgtk cat -i examples/docs/cat-edges-with-totals.tsv \
--number-of-columns 4
node1 | label | node2 | node1;total |
---|---|---|---|
P10 | p585-count | 73 | 3879 |
P1000 | p585-count | 16 | 266 |
P101 | p585-count | 5 | 157519 |
P1018 | p585-count | 2 | 177 |
P102 | p585-count | 295 | 414726 |
P1025 | p585-count | 26 | 693 |
P1026 | p585-count | 40 | 6930 |
P1027 | p585-count | 14 | 10008 |
P1028 | p585-count | 1131 | 4035 |
P1029 | p585-count | 4 | 2643 |
P1035 | p585-count | 4 | 366 |
P1037 | p585-count | 60 | 9317 |
P1040 | p585-count | 1 | 45073 |
P1050 | p585-count | 246 | 226380 |
Expert Topic: Pure Python Copies¶
The fast copy option can be disabled by specifying --pure-python
.
kgtk cat -i examples/docs/cat-edges.tsv \
-i examples/docs/cat-edges.tsv \
--pure-python
node1 | label | node2 |
---|---|---|
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
Expert Topic: Changing the Fast Copy Minimum Size Threshold¶
Normally, kgtk cat
will use the fast copy path with system commands only
when the total sizes of the input files pass a threshold. This is because
there are overheads on starting the system utilities as subprocesses, and for
very small files it may be faster to perform all processing directly in
Python.
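The decision described above can be pictured as a simple size comparison. The sketch below is an illustration in plain Python, not KGTK's code. The function name `should_use_fast_copy` is hypothetical, and the use of `>=` is an assumption chosen so that a minimum size of 0 always selects the fast path.

```python
def should_use_fast_copy(total_input_bytes, min_size_bytes, pure_python=False):
    """Decide between the subprocess-based fast path and pure Python.

    Assumption: the fast path is chosen when the combined size of the
    input files is at least min_size_bytes and pure Python processing
    has not been requested.
    """
    if pure_python:
        return False
    return total_input_bytes >= min_size_bytes


# A 2 KB input against two hypothetical thresholds.
print(should_use_fast_copy(2048, min_size_bytes=0))
print(should_use_fast_copy(2048, min_size_bytes=1_000_000))
```

In a real script the total could be obtained with `sum(os.path.getsize(path) for path in paths)`.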
The threshold may be changed. For example, to use the fast copy path regardless of the size of the input files, specify a minimum size of 0:
kgtk cat -i examples/docs/cat-edges.tsv \
-i examples/docs/cat-edges.tsv \
--fast-copy-min-size 0
node1 | label | node2 |
---|---|---|
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
Expert Topic: Overriding System Commands¶
The names of the system commands used by the fast copy path may be overridden on the command line.
kgtk cat -i examples/docs/cat-edges.tsv \
-i examples/docs/cat-edges.tsv \
--bash-command /usr/bin/bash \
--bzip2-command /usr/bin/bzip2 \
--cat-command /usr/bin/cat \
--gzip-command /usr/bin/gzip \
--tail-command /usr/bin/tail \
--xz-command /usr/bin/xz
node1 | label | node2 |
---|---|---|
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |
P10 | p585-count | 73 |
P1000 | p585-count | 16 |
P101 | p585-count | 5 |
P1018 | p585-count | 2 |
P102 | p585-count | 295 |
P1025 | p585-count | 26 |
P1026 | p585-count | 40 |
P1027 | p585-count | 14 |
P1028 | p585-count | 1131 |
P1029 | p585-count | 4 |
P1035 | p585-count | 4 |
P1037 | p585-count | 60 |
P1040 | p585-count | 1 |
P1050 | p585-count | 246 |