Skip to content

validate

Overview

This tool validates that KGTK files meet the rules in the KGTK File Format v2. Error messages will be generated when rule violations are detected. By default, most error messages are written to standard output so they may be easily captured in a log file.

One or more KGTK files may be processed at a time. Input files will be decompressed automatically in certain conditions.

Note

All of the validations shown here are done by KgtkReader. They may be enabled in any KGTK tool that uses KgtkReader to read its input files. kgtk validate enables line and data value validation by default, while other KGTK tools disable these processing steps by default.

Input File Decompression

Input files may be (de)compressed using a algorithm selected by the file extension: .bz2 .gz .lz4 .xy

The expert option --compression-type may be used to override the decompression selection algorithm; this is useful when reading from piped input.

Default Rules

By default, the following rules apply:

  • errors that occur while processing a KGTK file's column header line cause an immediate exit:
    • An empty column name
    • A duplicate column name
    • A missing required column name for an edge or node file
    • An ambiguous required column name (e.g., id and ID are both present)
  • empty data lines are silently ignored and not passed through.
  • data lines containing only whitespace are silently ignored and not passed through.
  • data lines with empty required fields (node1 and node2 for KGTK edge files, id for KGTK node files) are silently ignored.
  • data lines that have too few fields cause a complaint to be issued.
  • data lines that have too many fields cause a complaint to be issued.
  • lines with data value validation errors cause a complaint to be issued.

These defaults may be changed through expert options.

Action Codes

The action codes are used to control what happens when kgtk validate discovers a rule violation.

Action keyword Action when condition detected
PASS Silently allow the data line to pass through.
REPORT Report the data line and let it pass through.
EXCLUDE Silently exclude (ignore) the data line.
COMPLAIN Report the data line and exclude (ignore) it.
ERROR Raise a ValueError. This may be useful when you wish to interrupt processing of a large file.
EXIT sys.exit(1) This may be useful when you wish to interrupt processing of a large file.

These codes apply to the following kgtk validate comand line options:

Option Default
--blank-required-field-line-action EXCLUDE
--comment-line-action EXCLUDE
--empty-line-action EXCLUDE
--invalid-value-action EXCLUDE
--long-line-action COMPLAIN
--prohibited-list-action COMPLAIN
--short-line-action COMPLAIN
--whitespace-line-action EXCLUDE

--header-error-action

The action to take if a header error is detected, such as:

  • An empty column name
  • A duplicate column name
  • A missing required column name for an edge or node file
  • An ambiguous required column name (e.g., id and ID are both present)

Only ERROR and EXIT actions are implemented for header errors.

--unsafe-column-name

The action to take if a header column name contains one of the following:

  • Leading white space
  • Trailing white space
  • Internal white space except in strings or language-qualified strings
  • Commas
  • Vertical bars

--error-limit

Execution will stop if the error limit is exceeded. The default value is 1000 errors. To avoid stopping at 1000 errors, either raise the error limit or set it to zero:

--error-limit=0

KGTK File Mode

Mode Meaning
NONE Do not require node1, node1, or id columns
EDGE Treat the input file as a KGTK edge file and require the presence of node1 and node2 columns or their allowable aliases.
NODE Treat the input file as a KGTK node file and require the presence of an id column or its allowable alias (ID).
AUTO Automatically determine if an input file is an edge file or a node file. If a node1 (or allowable alias) column is present, assume that the file is a KGTK edge file. Otherwise, assume that it is a KGTK node file

Special Column Names and Aliases

Canonical Name Allowed Aliases Comments
id ID This is a required column in Node files, an optional one in Edge files (but may cause behavior changes if present).
node1 from, subject This is a required column in Edge files. It may not contain empty values.
label predicate, relation, relationship This is a required columns in Edge files. It may contain empty values.
node2 to, object This is a required column in Edge files. It may not contain empty values.

Escapes in Strings and Language Qualified Strings

KGTK strings ("...") and language-qualified strings ('...'@lan) may contain the following escape sequences.

Sequence Description Comments
\a alarm (bell) - ASCII <BEL>
\b backspace - ASCII <BS>
\f formfeed - ASCII <FF>
\n newline (linefeed) - ASCII <LF>
\r carriage return - ASCII <CR>
\t horizontal tab - ASCII <TAB>
\v vertical tab - ASCII <VT>
\\ backslash - (\)
\' single quote - (') The KGTK sigil for language qualified strings.
\" double quote - (") The KGTK sigil for strings.
\| vertical bar - (|) The KGTK multi-valued list separator.

Info

A sigil is a symbol attached to (usually prefixing) a variable name, usually expressing the variable's datatype or scope (see Wikipedia). Here, it means the introductory character that determines the datatype of a KGTK value.

Usage

usage: kgtk validate [-h] [-i INPUT_FILE [INPUT_FILE ...]]
                     [--header-only [HEADER_ONLY]]
                     [--summary [REPORT_SUMMARY]] [-v [optional True|False]]

Validate one or more KGTK files. Empty lines, whitespace lines, comment lines, and lines with empty required fields are silently skipped. Header errors cause an immediate exception. Data value errors are reported. 

To validate data and pass clean data to an output file or pipe, use the kgtk clean_data command.

Additional options are shown in expert help.
kgtk --expert validate --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]
                        The KGTK file(s) to validate. (May be omitted or '-'
                        for stdin.)
  --header-only [HEADER_ONLY]
                        Process the only the header of the input file
                        (default=False).
  --summary [REPORT_SUMMARY]
                        Report a summary on the lines processed.
                        (default=True).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Expert Usage

usage: kgtk validate [-h] [-i INPUT_FILE [INPUT_FILE ...]]
                     [--header-only [HEADER_ONLY]]
                     [--summary [REPORT_SUMMARY]]
                     [--errors-to-stdout [optional True|False] |
                     --errors-to-stderr [optional True|False]]
                     [--show-options [optional True|False]]
                     [-v [optional True|False]]
                     [--very-verbose [optional True|False]]
                     [--column-separator COLUMN_SEPARATOR]
                     [--input-format INPUT_FORMAT]
                     [--compression-type COMPRESSION_TYPE]
                     [--error-limit ERROR_LIMIT]
                     [--use-mgzip [optional True|False]]
                     [--mgzip-threads MGZIP_THREADS]
                     [--gzip-in-parallel [optional True|False]]
                     [--gzip-queue-size GZIP_QUEUE_SIZE]
                     [--implied-label IMPLIED_LABEL]
                     [--use-graph-cache-envar [optional True|False]]
                     [--ignore-stale-graph-cache [optional True|False]]
                     [--graph-cache GRAPH_CACHE]
                     [--graph-cache-fetchmany-size GRAPH_CACHE_FETCHMANY_SIZE]
                     [--graph-cache-filter-batch-size GRAPH_CACHE_FILTER_BATCH_SIZE]
                     [--mode {NONE,EDGE,NODE,AUTO}]
                     [--input-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]]
                     [--no-input-header [optional True|False]]
                     [--supply-missing-column-names [optional True|False]]
                     [--number-of-columns COUNT]
                     [--require-column-names REQUIRE_COLUMN_NAMES [REQUIRE_COLUMN_NAMES ...]]
                     [--no-additional-columns [optional True|False]]
                     [--unquote-csv-column-names [optional True|False]]
                     [--header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--prohibit-whitespace-in-column-names [optional True|False]]
                     [--initial-skip-count INITIAL_SKIP_COUNT]
                     [--every-nth-record EVERY_NTH_RECORD]
                     [--record-limit RECORD_LIMIT] [--tail-count TAIL_COUNT]
                     [--repair-and-validate-lines [optional True|False]]
                     [--repair-and-validate-values [optional True|False]]
                     [--blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--fill-short-lines [optional True|False]]
                     [--invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--truncate-long-lines [TRUNCATE_LONG_LINES]]
                     [--whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
                     [--additional-language-codes [ADDITIONAL_LANGUAGE_CODES ...]]
                     [--allow-lax-qnodes [ALLOW_LAX_QNODES]]
                     [--allow-language-suffixes [ALLOW_LANGUAGE_SUFFIXES]]
                     [--allow-lax-strings [ALLOW_LAX_STRINGS]]
                     [--allow-lax-lq-strings [ALLOW_LAX_LQ_STRINGS]]
                     [--allow-wikidata-lq-strings [ALLOW_WIKIDATA_LQ_STRINGS]]
                     [--require-iso8601-extended [REQUIRE_ISO8601_EXTENDED]]
                     [--force-iso8601-extended [FORCE_ISO8601_EXTENDED]]
                     [--allow-month-or-day-zero [ALLOW_MONTH_OR_DAY_ZERO]]
                     [--repair-month-or-day-zero [REPAIR_MONTH_OR_DAY_ZERO]]
                     [--allow-end-of-day [ALLOW_END_OF_DAY]]
                     [--minimum-valid-year MINIMUM_VALID_YEAR]
                     [--clamp-minimum-year [CLAMP_MINIMUM_YEAR]]
                     [--ignore-minimum-year [IGNORE_MINIMUM_YEAR]]
                     [--maximum-valid-year MAXIMUM_VALID_YEAR]
                     [--clamp-maximum-year [CLAMP_MAXIMUM_YEAR]]
                     [--ignore-maximum-year [IGNORE_MAXIMUM_YEAR]]
                     [--validate-fromisoformat [VALIDATE_FROMISOFORMAT]]
                     [--allow-lax-coordinates [ALLOW_LAX_COORDINATES]]
                     [--repair-lax-coordinates [REPAIR_LAX_COORDINATES]]
                     [--allow-out-of-range-coordinates [ALLOW_OUT_OF_RANGE_COORDINATES]]
                     [--minimum-valid-lat MINIMUM_VALID_LAT]
                     [--clamp-minimum-lat [CLAMP_MINIMUM_LAT]]
                     [--maximum-valid-lat MAXIMUM_VALID_LAT]
                     [--clamp-maximum-lat [CLAMP_MAXIMUM_LAT]]
                     [--minimum-valid-lon MINIMUM_VALID_LON]
                     [--clamp-minimum-lon [CLAMP_MINIMUM_LON]]
                     [--maximum-valid-lon MAXIMUM_VALID_LON]
                     [--clamp-maximum-lon [CLAMP_MAXIMUM_LON]]
                     [--modulo-repair-lon [MODULO_REPAIR_LON]]
                     [--escape-list-separators [ESCAPE_LIST_SEPARATORS]]

Validate one or more KGTK files. Empty lines, whitespace lines, comment lines, and lines with empty required fields are silently skipped. Header errors cause an immediate exception. Data value errors are reported. 

To validate data and pass clean data to an output file or pipe, use the kgtk clean_data command.

Additional options are shown in expert help.
kgtk --expert validate --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]
                        The KGTK file(s) to validate. (May be omitted or '-'
                        for stdin.)
  --header-only [HEADER_ONLY]
                        Process the only the header of the input file
                        (default=False).
  --summary [REPORT_SUMMARY]
                        Report a summary on the lines processed.
                        (default=True).

Error and feedback messages:
  Send error messages and feedback to stderr or stdout, control the amount of feedback and debugging messages.

  --errors-to-stdout [optional True|False]
                        Send errors to stdout instead of stderr.
                        (default=False).
  --errors-to-stderr [optional True|False]
                        Send errors to stderr instead of stdout.
                        (default=False).
  --show-options [optional True|False]
                        Print the options selected (default=False).
  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).
  --very-verbose [optional True|False]
                        Print additional progress messages (default=False).

File options:
  Options affecting processing.

  --column-separator COLUMN_SEPARATOR
                        Column separator (default=<TAB>).
  --input-format INPUT_FORMAT
                        Specify the input format (default=None).
  --compression-type COMPRESSION_TYPE
                        Specify the compression type (default=None).
  --error-limit ERROR_LIMIT
                        The maximum number of errors to report before failing
                        (default=1000)
  --use-mgzip [optional True|False]
                        Execute multithreaded gzip. (default=False).
  --mgzip-threads MGZIP_THREADS
                        Multithreaded gzip thread count. (default=3).
  --gzip-in-parallel [optional True|False]
                        Execute gzip in parallel. (default=False).
  --gzip-queue-size GZIP_QUEUE_SIZE
                        Queue size for parallel gzip. (default=1000).
  --implied-label IMPLIED_LABEL
                        When specified, imply a label colum with the specified
                        value (default=None).
  --use-graph-cache-envar [optional True|False]
                        use KGTK_GRAPH_CACHE if --graph-cache is not
                        specified. (default=True).
  --ignore-stale-graph-cache [optional True|False]
                        Ignore the graph cache if the file exists with a
                        differen size or modificatin time. (default=True).
  --graph-cache GRAPH_CACHE
                        When specified, look for input files in a graph cache.
                        (default=None).
  --graph-cache-fetchmany-size GRAPH_CACHE_FETCHMANY_SIZE
                        Graph cache transfer buffer size. (default=1000).
  --graph-cache-filter-batch-size GRAPH_CACHE_FILTER_BATCH_SIZE
                        Graph cache filter batch size. (default=1000).
  --mode {NONE,EDGE,NODE,AUTO}
                        Determine the KGTK file mode
                        (default=KgtkReaderMode.AUTO).

Header parsing:
  Options affecting header parsing.

  --input-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...], --force-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]
                        Supply input column names when the input file does not
                        have a header record (--no-input-header=True), or
                        forcibly override the column names when a header row
                        exists (--no-input-header=False) (default=None).
  --no-input-header [optional True|False]
                        When the input file does not have a header record,
                        specify --no-input-header=True and --input-column-
                        names. When the input file does have a header record
                        that you want to forcibly override, specify --input-
                        column-names and --no-input-header=False. --no-input-
                        header has no effect when --input-column-names has not
                        been specified. (default=False).
  --supply-missing-column-names [optional True|False]
                        Supply column names that are missing. (default=False).
  --number-of-columns COUNT
                        The expected number of columns in the header.
                        (default=None).
  --require-column-names REQUIRE_COLUMN_NAMES [REQUIRE_COLUMN_NAMES ...]
                        The list of column names required in the input file.
                        (default=None).
  --no-additional-columns [optional True|False]
                        When True, do not allow any column names other than
                        the required column names. When --require-column-names
                        is not specified, then disallow columns other than
                        [node1, label, node2, id] (or aliases) for an edge
                        file, and [id] for a node file. (default=False).
  --unquote-csv-column-names [optional True|False]
                        Remove double quotes from the outside of column names.
                        (default=True).
  --header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a header error is detected.
                        Only ERROR or EXIT are supported
                        (default=ValidationAction.EXIT).
  --unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a column name is unsafe
                        (default=ValidationAction.REPORT).
  --prohibit-whitespace-in-column-names [optional True|False]
                        Prohibit whitespace in column names. (default=False).

Pre-validation sampling:
  Options affecting pre-validation data line sampling.

  --initial-skip-count INITIAL_SKIP_COUNT
                        The number of data records to skip initially
                        (default=do not skip).
  --every-nth-record EVERY_NTH_RECORD
                        Pass every nth record (default=pass all records).
  --record-limit RECORD_LIMIT
                        Limit the number of records read (default=no limit).
  --tail-count TAIL_COUNT
                        Pass this number of records (default=no tail
                        processing).

Line parsing:
  Options affecting data line parsing.

  --repair-and-validate-lines [optional True|False]
                        Repair and validate lines (default=True).
  --repair-and-validate-values [optional True|False]
                        Repair and validate values (default=True).
  --blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a line with a blank node1,
                        node2, or id field (per mode) is detected
                        (default=ValidationAction.EXCLUDE).
  --comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a comment line is detected
                        (default=ValidationAction.EXCLUDE).
  --empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when an empty line is detected
                        (default=ValidationAction.EXCLUDE).
  --fill-short-lines [optional True|False]
                        Fill missing trailing columns in short lines with
                        empty values (default=False).
  --invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a data cell value is invalid
                        (default=ValidationAction.COMPLAIN).
  --long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a long line is detected
                        (default=ValidationAction.COMPLAIN).
  --prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a data cell contains a
                        prohibited list (default=ValidationAction.COMPLAIN).
  --short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a short line is detected
                        (default=ValidationAction.COMPLAIN).
  --truncate-long-lines [TRUNCATE_LONG_LINES]
                        Remove excess trailing columns in long lines
                        (default=False).
  --whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
                        The action to take when a whitespace line is detected
                        (default=ValidationAction.EXCLUDE).

Data value parsing:
  Options controlling the parsing and processing of KGTK data values.

  --additional-language-codes [ADDITIONAL_LANGUAGE_CODES ...]
                        Additional language codes. (default=use internal
                        list).
  --allow-lax-qnodes [ALLOW_LAX_QNODES]
                        Allow qnode suffixes in quantities to include alphas
                        and dash as well as digits. (default=False).
  --allow-language-suffixes [ALLOW_LANGUAGE_SUFFIXES]
                        Allow language identifier suffixes starting with a
                        dash. (default=False).
  --allow-lax-strings [ALLOW_LAX_STRINGS]
                        Do not check if double quotes are backslashed inside
                        strings. (default=False).
  --allow-lax-lq-strings [ALLOW_LAX_LQ_STRINGS]
                        Do not check if single quotes are backslashed inside
                        language qualified strings. (default=False).
  --allow-wikidata-lq-strings [ALLOW_WIKIDATA_LQ_STRINGS]
                        Allow Wikidata language qualifiers. (default=False).
  --require-iso8601-extended [REQUIRE_ISO8601_EXTENDED]
                        Require colon(:) and hyphen(-) in dates and times.
                        (default=False).
  --force-iso8601-extended [FORCE_ISO8601_EXTENDED]
                        Force colon (:) and hyphen(-) in dates and times.
                        (default=False).
  --allow-month-or-day-zero [ALLOW_MONTH_OR_DAY_ZERO]
                        Allow month or day zero in dates. (default=False).
  --repair-month-or-day-zero [REPAIR_MONTH_OR_DAY_ZERO]
                        Repair month or day zero in dates. (default=False).
  --allow-end-of-day [ALLOW_END_OF_DAY]
                        Allow 24:00:00 to represent the end of the day.
                        (default=True).
  --minimum-valid-year MINIMUM_VALID_YEAR
                        The minimum valid year in dates. (default=1583).
  --clamp-minimum-year [CLAMP_MINIMUM_YEAR]
                        Clamp years at the minimum value. (default=False).
  --ignore-minimum-year [IGNORE_MINIMUM_YEAR]
                        Ignore the minimum year constraint. (default=False).
  --maximum-valid-year MAXIMUM_VALID_YEAR
                        The maximum valid year in dates. (default=2100).
  --clamp-maximum-year [CLAMP_MAXIMUM_YEAR]
                        Clamp years at the maximum value. (default=False).
  --ignore-maximum-year [IGNORE_MAXIMUM_YEAR]
                        Ignore the maximum year constraint. (default=False).
  --validate-fromisoformat [VALIDATE_FROMISOFORMAT]
                        Validate that datetime.fromisoformat(...) can parse
                        this date and time. This checks that the
                        year/month/day combination is valid. The year must be
                        in the range 1..9999, inclusive. (default=False).
  --allow-lax-coordinates [ALLOW_LAX_COORDINATES]
                        Allow coordinates using scientific notation.
                        (default=False).
  --repair-lax-coordinates [REPAIR_LAX_COORDINATES]
                        Allow coordinates using scientific notation.
                        (default=False).
  --allow-out-of-range-coordinates [ALLOW_OUT_OF_RANGE_COORDINATES]
                        Allow coordinates that don't make sense.
                        (default=False).
  --minimum-valid-lat MINIMUM_VALID_LAT
                        The minimum valid latitude. (default=-90.000000).
  --clamp-minimum-lat [CLAMP_MINIMUM_LAT]
                        Clamp latitudes at the minimum value. (default=False).
  --maximum-valid-lat MAXIMUM_VALID_LAT
                        The maximum valid latitude. (default=90.000000).
  --clamp-maximum-lat [CLAMP_MAXIMUM_LAT]
                        Clamp latitudes at the maximum value. (default=False).
  --minimum-valid-lon MINIMUM_VALID_LON
                        The minimum valid longitude. (default=-180.000000).
  --clamp-minimum-lon [CLAMP_MINIMUM_LON]
                        Clamp longitudes at the minimum value.
                        (default=False).
  --maximum-valid-lon MAXIMUM_VALID_LON
                        The maximum valid longitude. (default=180.000000).
  --clamp-maximum-lon [CLAMP_MAXIMUM_LON]
                        Clamp longitudes at the maximum value.
                        (default=False).
  --modulo-repair-lon [MODULO_REPAIR_LON]
                        Wrap longitude to (-180.0,180.0]. (default=False).
  --escape-list-separators [ESCAPE_LIST_SEPARATORS]
                        Escape all list separators instead of splitting on
                        them. (default=False).

Examples

Sample Data: a Date Containing Day 00

Suppose that examples/docs/validate-date-with-day-zero.tsv contains the following table in KGTK format:

kgtk cat -i examples/docs/validate-date-with-day-zero.tsv
node1 label node2
john woke ^2020-05-00T00:00
john woke ^2020-00-00T00:00

Validate using Default Options

kgtk validate -i examples/docs/validate-date-with-day-zero.tsv

The following complaint and summary will be issued:

====================================================
Data lines read: 2
Data lines passed: 0
Data lines excluded due to invalid values: 2
Data errors reported: 2

The first data line was flagged because it contained "00" in the day field, which violates the ISO 8601 specification.

The following error message is sent to stderr. The return status is 1.

Errors detected

Validate with Verbose Feedback

Sometimes you may wish to get more feedback about what kgtk validate is doing.

kgtk validate -i examples/docs/validate-date-with-day-zero.tsv \
              --verbose

This results in the following output:

====================================================
Validating 'examples/docs/validate-date-with-day-zero.tsv'
input format: kgtk
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file examples/docs/validate-date-with-day-zero.tsv
header: node1   label   node2
column names: ['node1', 'label', 'node2']
node1 column found, this is a KGTK edge file
KgtkReader: is_edge_file=True is_node_file=False
KgtkReader: Special columns: node1=0 label=1 node2=2 id=-1
KgtkReader: Reading an edge file.
Validated 0 data lines

====================================================
Data lines read: 2
Data lines passed: 0
Data lines excluded due to invalid values: 2
Data errors reported: 2

The following error message is sent to stderr. The return status is 1.

Errors detected

Validate Only the Header

Validate only the header record, ignoring data records:

kgtk validate -i examples/docs/validate-date-with-day-zero.tsv \
              --header-only
====================================================
Data lines read: 0
Data lines passed: 0

Header Error: No Header Line in File (Empty File)

Validate an empty input file:

kgtk validate -i examples/docs/validate-empty-file.tsv

This generates the following message on standard output:

Error: No header line in file

The following error message is sent to stderr. The return status is 1.

Exiting due to error

Supply a Missing Header Line

Validate an empty input file, supplying a header line:

kgtk validate -i examples/docs/validate-empty-file.tsv \
              --force-column-names node1 label node2 \
          --no-input-header

This generates the following message on standard output:

====================================================
Data lines read: 0
Data lines passed: 0

Header Error: No Header Line to Skip

Validate an empty input file, skipping a nonexistant header line.

kgtk validate -i examples/docs/validate-empty-file.tsv \
              --force-column-names node1 label node2

This generates the following message on standard output:

Error: No header line to skip

The following error message is sent to stderr. The return status is 1.

Exiting due to error

Header Error: Column Name Is Empty

Validate an input file with an empty column name:

cat examples/docs/validate-empty-column-name.tsv
label node2
kgtk validate -i examples/docs/validate-empty-column-name.tsv

The following error is reported on standard output:

In input header '   label   node2': Column 0 has an empty name in the file header

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: See All Header Errors

Validate an input file with an empty column name. This will generate an error message, and normally an immediate exit. If you want to see all header error messages, use --header-error-action COMPLAIN to continue processing.

cat examples/docs/validate-empty-column-name.tsv
label node2
kgtk validate -i examples/docs/validate-empty-column-name.tsv \
              --header-error-action COMPLAIN

The following error is reported on standard output:

In input header '   label   node2': Column 0 has an empty name in the file header
In input header '   label   node2': Missing required column: id | ID

====================================================
Data lines read: 0
Data lines passed: 0

Processing continues without exiting.

Note

No error message is sent to stderr and the return status is 0.

Header Error: Column Name Starts with White Space

Validate an input file where the intended node1, label, and node2 column names have initial whitespace.

cat examples/docs/validate-column-names-initial-whitespace.tsv
id node1 label node2
kgtk validate -i examples/docs/validate-column-names-initial-whitespace.tsv

The following message is reported on standard output:

In input header 'id  node1   label   node2': 
Column name ' node1' starts with leading white space
Column name ' label' starts with leading white space
Column name ' node2' starts with leading white space

====================================================
Data lines read: 0
Data lines passed: 0

By default, this message is reported but processing continues without exiting. An error return status is not generated.

Header Error: Column Name Starts with White Space: Exit Requested

Validate an input file where the intended node1, label, and node2 column names have initial whitespace, exiting with an error if initial whitespace is detected.

cat examples/docs/validate-column-names-initial-whitespace.tsv
id node1 label node2
kgtk validate -i examples/docs/validate-column-names-initial-whitespace.tsv \
              --unsafe-column-name-action EXIT

The following message is reported on standard output:

In input header 'id  node1   label   node2': 
Column name ' node1' starts with leading white space
Column name ' label' starts with leading white space
Column name ' node2' starts with leading white space

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Column Name Ends with White Space

Validate an input file where the intended node1, label, and node2 column names have trailing whitespace.

cat examples/docs/validate-column-names-trailing-whitespace.tsv
id node1 label node2
kgtk validate -i examples/docs/validate-column-names-trailing-whitespace.tsv

The following message is reported on standard output:

In input header 'id node1   label   node2 ': 
Column name 'node1 ' ends with trailing white space
Column name 'label ' ends with trailing white space
Column name 'node2 ' ends with trailing white space

====================================================
Data lines read: 0
Data lines passed: 0

By default, this message is reported but processing continues without exiting. An error return status is not generated.

Header Error: Column Name Ends with White Space: Exit Requested

Validate an input file where the intended node1, label, and node2 column names have trailing whitespace, exiting with an error if trailing whitespace is detected.

cat examples/docs/validate-column-names-trailing-whitespace.tsv
id node1 label node2
kgtk validate -i examples/docs/validate-column-names-trailing-whitespace.tsv \
              --unsafe-column-name-action EXIT

The following error is reported on standard output:

In input header 'id node1   label   node2 ': 
Column name 'node1 ' ends with trailing white space
Column name 'label ' ends with trailing white space
Column name 'node2 ' ends with trailing white space

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Column Name Contains Internal White Space

Validate an input file where the intended node1 and node2 column names have internal whitespace. By default, this is allowed, but it may be prohibited on request.

cat examples/docs/validate-column-names-internal-whitespace.tsv
id node 1 label node 2
kgtk validate -i examples/docs/validate-column-names-internal-whitespace.tsv \
              --prohibit-whitespace-in-column-names

The following error is reported on standard output:

In input header 'id node 1  label   node 2': 
Column name 'node 1' contains internal white space
Column name 'node 2' contains internal white space

====================================================
Data lines read: 0
Data lines passed: 0

By default, this message is reported but processing continues without exiting. An error return status is not generated.

Header Error: Column Name Contains Internal White Space: Exit Requested

Validate an input file where the intended node1 and node2 column names have internal whitespace, exiting with an error if trailing whitespace is detected.

cat examples/docs/validate-column-names-internal-whitespace.tsv
id node 1 label node 2
kgtk validate -i examples/docs/validate-column-names-internal-whitespace.tsv \
              --prohibit-whitespace-in-column-names \
              --unsafe-column-name-action EXIT

The following error is reported on standard output:

In input header 'id node 1  label   node 2': 
Column name 'node 1' contains internal white space
Column name 'node 2' contains internal white space

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Column Name Contains a Comma (,)

Validate an input file where the intended node1, label, and node2 column names have a comma (,) at the end.

cat examples/docs/validate-column-names-with-comma.tsv
node1, label, node2, id
kgtk validate -i examples/docs/validate-column-names-with-comma.tsv

The following error is reported on standard output:

In input header 'node1, label,  node2,  id': 
Warning: Column name 'node1,' contains a comma (,)
Warning: Column name 'label,' contains a comma (,)
Warning: Column name 'node2,' contains a comma (,)

====================================================
Data lines read: 0
Data lines passed: 0

By default, this message is reported but processing continues without exiting. An error return status is not generated.

Header Error: Column Name Contains a Comma (,), Exit Requested

Validate an input file where the intended node1, label, and node2 column names have a comma (,) at the end, exiting with an error if trailing whitespace is detected.

cat examples/docs/validate-column-names-with-comma.tsv
node1, label, node2, id
kgtk validate -i examples/docs/validate-column-names-with-comma.tsv \
              --unsafe-column-name-action EXIT

The following error is reported on standard output:

In input header 'node1, label,  node2,  id': 
Warning: Column name 'node1,' contains a comma (,)
Warning: Column name 'label,' contains a comma (,)
Warning: Column name 'node2,' contains a comma (,)

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Column Name Contains a Vertical Bar (|)

Validate an input file where the intended node1, label, and node2 column names have a vertical bar (`|) at the end.

kgtk validate -i examples/docs/validate-column-names-with-vertical-bar.tsv

The following warnings is reported on standard output:

In input header 'node1| label|  node2|  id': 
Warning: Column name 'node1|' contains a vertical bar (|)
Warning: Column name 'label|' contains a vertical bar (|)
Warning: Column name 'node2|' contains a vertical bar (|)

====================================================
Data lines read: 0
Data lines passed: 0

By default, this message is reported but processing continues without exiting. An error return status is not generated.

Header Error: Column Name Contains a Vertical Bar (|): Exit Requested

Validate an input file where the intended node1, label, and node2 column names have a vertical bar (`|) at the end, exiting with an error if trailing whitespace is detected.

kgtk validate -i examples/docs/validate-column-names-with-vertical-bar.tsv \
              --unsafe-column-name-action EXIT

The following warnings is reported on standard output:

In input header 'node1| label|  node2|  id': 
Warning: Column name 'node1|' contains a vertical bar (|)
Warning: Column name 'label|' contains a vertical bar (|)
Warning: Column name 'node2|' contains a vertical bar (|)

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Column Name Is a Duplicate

Validate an input file with two node1 columns instead of node1 and node2 columns.

cat examples/docs/validate-column-names-with-duplicates.tsv
node1 label id
kgtk validate -i examples/docs/validate-column-names-with-duplicates.tsv

The following error is reported on standard output:

In input header 'node1  label   node1   id': Column 2 (node1) is a duplicate of column 0

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Missing Required Column in a Node File

Validate an input file as a KGTK Node file when the input file does not have the required column (id) for a Node file. We force the file to be treated as a Node file by specifying --mode=NODE.

cat examples/docs/validate-column-names-without-required-columns.tsv
col1 col2 col3
kgtk validate -i examples/docs/validate-column-names-without-required-columns.tsv \
              --mode=NODE

The following error is reported on standard output:

In input header 'col1   col2    col3': Missing required column: id | ID

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Missing Required Columns in an Edge File

Validate an input file as a KGTK Edge file when the input file does not have the required columns (node1, label, node2) for a Edge file. We force the file to be treated as a Edge file by specifying --mode=EDGE.

cat examples/docs/validate-column-names-without-required-columns.tsv
col1 col2 col3
kgtk validate -i examples/docs/validate-column-names-without-required-columns.tsv \
              --mode=EDGE

The following error is reported on standard output:

In input header 'col1   col2    col3': Missing required column: node1 | from | subject

The following error message is sent to stderr. The return status is 1.

Exit requested

Header Error: Missing Required Column with --mode=AUTO

Validate an input file when the input file does not have the required columns for as Edge or Node file, and we force auto-mode sensing with --mode=AUTO.

cat examples/docs/validate-column-names-without-required-columns.tsv
col1 col2 col3
kgtk validate -i examples/docs/validate-column-names-without-required-columns.tsv \
              --mode=AUTO

The following error is reported on standard output:

In input header 'col1   col2    col3': Missing required column: id | ID

The following error message is sent to stderr. The return status is 1.

Exit requested

Note: No Columns are Required with --mode=NONE

Validate an input file with required column validtion disabled with --mode=NONE

cat examples/docs/validate-column-names-without-required-columns.tsv
col1 col2 col3
kgtk validate -i examples/docs/validate-column-names-without-required-columns.tsv \
              --mode=NONE

The following is reported on standard output:

====================================================
Data lines read: 0
Data lines passed: 0

Header Error: Ambiguous Required Columns

Validate an input file with a node1 column abd its alias from.

cat examples/docs/validate-column-names-with-ambiguities.tsv
node1 label node2 id
kgtk validate -i examples/docs/validate-column-names-with-ambiguities.tsv

The following error is reported on standard output:

In input header 'node1  label   node2   id  from': Ambiguous required column names node1 and from

The following error message is sent to stderr. The return status is 1.

Exit requested

Note

When there are multiple ambiguous column names, only the first pair of ambiguous names is reported. This behavior may change in the future to report all ambiguous column names sets.

Line Check: Empty Lines

Empty lines are silently ignored from input files during validation when --empty-line-action=EXCLUDE (the default).

cat examples/docs/validate-empty-lines.tsv
node1 label node2
line1 isa line
line3 isa line
kgtk validate -i examples/docs/validate-empty-lines.tsv
====================================================
Data lines read: 3
Data lines passed: 2
Data lines ignored: 1

Note

See the table of Action Codes for a discussion of other --empty-line-action values.

Line Check: Comment Lines

Comment lines (lines that begin with hash (#)) are silently ignored in input files during validation when --comment-line-action=EXCLUDE (the default).

kgtk validate -i examples/docs/validate-comment-lines.tsv
====================================================
Data lines read: 3
Data lines passed: 2
Data lines ignored: 1

Note

At the present time the input file cannot be shown in this document for this example.

Note

See the table of Action Codes for a discussion of other --comment-line-action values.

Line Check: Whitespace Lines

Whitespace lines are silently ignored in input files during validation whe --whitespace-line-action=EXCLUDE (the default).

cat examples/docs/validate-whitespace-lines.tsv
node1 label node2
line1 isa line
line3 isa line
kgtk validate -i examples/docs/validate-whitespace-lines.tsv
====================================================
Data lines read: 3
Data lines passed: 2
Data lines ignored: 1

Note

See the table of Action Codes for a discussion of other --whitespace-line-action values.

Line Check: Short Lines

Short lines, lines with too few columns, are silently ignored input files during validation if fill-short-lines=False (the default) and --short-line-action=COMPLAIN (the default)

cat examples/docs/validate-short-lines.tsv
node1 label node2
line1 isa line
line2 isashortline
line3 isa line
kgtk validate -i examples/docs/validate-short-lines.tsv

The following is reported on standard output:

====================================================
Data lines read: 3
Data lines passed: 2
Data lines excluded due to too few columns: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Note

See the table of Action Codes for a discussion of other --short-line-action settings.

Line Check: Fill Missing Trailing Columns

Short lines, lines with too few columns, are padded on input if --fill-short-lines=True is specified. --short-line-action will not be triggered.

cat examples/docs/validate-short-lines.tsv
node1 label node2
line1 isa line
line2 isashortline
line3 isa line
kgtk validate -i examples/docs/validate-short-lines.tsv \
              --fill-short-lines
====================================================
Data lines read: 3
Data lines passed: 2
Data lines filled: 1
Data lines excluded due to blank fields: 1

Line Check: Long Lines

Long lines, lines with extra columns, are silently ignored input files during validation if truncate-long-lines=True (the default) and --long-line-action=COMPLAIN (the default).

cat examples/docs/validate-long-lines.tsv
node1 label node2
line1 isa line
line2 isa long
line3 isa line
kgtk validate -i examples/docs/validate-long-lines.tsv
====================================================
Data lines read: 3
Data lines passed: 2
Data lines excluded due to too many columns: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Note

See the table of Action Codes for a discussion of other --long-line-action values.

Line Check: Remove Extra Trailing Columns

Long lines, lines with extra columns, are truncated on input if --truncate-longt-lines=True is specified. --long-line-action will not be triggered.

cat examples/docs/validate-long-lines.tsv
node1 label node2
line1 isa line
line2 isa long
line3 isa line
kgtk validate -i examples/docs/validate-long-lines.tsv \
              --truncate-long-lines
====================================================
Data lines read: 3
Data lines passed: 3
Data lines truncated: 1

Line Check: Prohibited Lists in the node1 Column of Edge Files

Multivalue lists (|) are prohibited by the KGTK File Specification v2 in the node1, label, and node2 columns of a KGTK edge file. This constraint is applied when --prohibited-list-action==COMPLAIN (the default).

cat examples/docs/validate-node1-list.tsv
node1 label node2 id
line1|line3 isa line id1
kgtk validate -i examples/docs/validate-node1-list.tsv
====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to prohibited lists: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Note

This constraint does not apply to KGTK node files or to quasi-KGTK (--mode=NONE) files.

Note

See the table of Action Codes for a discussion of other --prohibited-list-action values.

Line Check: Prohibited Lists in the label Column of Edge Files

Multivalue lists (|) are prohibited by the KGTK File Specification v2 in the node1, label, and node2 columns of a KGTK edge file. This constraint is applied when --prohibited-list-action==COMPLAIN (the default).

cat examples/docs/validate-label-list.tsv
node1 label node2 id
line1 isa|equals line id1
kgtk validate -i examples/docs/validate-label-list.tsv
====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to prohibited lists: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Note

This constraint does not apply to KGTK node files or to quasi-KGTK (--mode=NONE) files.

Note

See the table of Action Codes for a discussion of other --prohibited-list-action values.

Line Check: Prohibited Lists in the node2 Column of Edge Files

Multivalue lists (|) are prohibited by the KGTK File Specification v2 in the node1, label, and node2 columns of a KGTK edge file. This constraint is applied when --prohibited-list-action==COMPLAIN (the default).

cat examples/docs/validate-node2-list.tsv
node1 label node2 id
line1 isa line|record id1
kgtk validate -i examples/docs/validate-node2-list.tsv
====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to prohibited lists: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Note

This constraint does not apply to KGTK node files or to quasi-KGTK (--mode=NONE) files.

Note

See the table of Action Codes for a discussion of other --prohibited-list-action values.

Line Check: Allow Multivalue Lists in the node1, label, and node2 Columns of Edge Files

Multivalue lists (|) are prohibited by the KGTK File Specification v2 in the node1, label, and node2 columns of a KGTK edge file. This constraint is applied when --prohibited-list-action==COMPLAIN (the default). The constraint can be removed by specifying --prohibited-list-action=PASS or --prohibited-list-action=REPORT.

cat examples/docs/validate-node2-list.tsv
node1 label node2 id
line1 isa line|record id1
kgtk validate -i examples/docs/validate-node2-list.tsv \
              --prohibited-list-action=PASS
====================================================
Data lines read: 1
Data lines passed: 1

The REPORT option will allow lines with prohibited multivalue lists to pass, but will report them to the output file (normally standard output for kgtk validate).

kgtk validate -i examples/docs/validate-node2-list.tsv \
              --prohibited-list-action=REPORT
====================================================
Data lines read: 1
Data lines passed: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Note

This constraint does not apply to KGTK node files or to quasi-KGTK (--mode=NONE) files, so setting --mode=NONE is another way to remove this constraint, although it also removes many other constraints.

Line Check: node1 May Not Be Blank in an Edge File

The node1 field may not be blank in a KGTK edge file.

cat examples/docs/validate-node1-blank-edge.tsv
node1 label node2 id
isa line id1
kgtk validate -i examples/docs/validate-node1-blank-edge.tsv
====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to blank fields: 1

Line Check: node1 May Be Blank in a Node File

The node1 field may be blank in a KGTK node file.

cat examples/docs/validate-node1-blank-node.tsv
id size color node1
id1 large red
kgtk validate -i examples/docs/validate-node1-blank-node.tsv \
              --mode=NODE
====================================================
Data lines read: 1
Data lines passed: 1

Note

In this example it was necessary to specify --mode=NODE to prevent the input file from being treated as an edge file.

Line Check: label May Be Blank in an Edge File

The label field may be blank in a KGTK edge file.

cat examples/docs/validate-label-blank-edge.tsv
node1 label node2 id
line1 line id1
kgtk validate -i examples/docs/validate-label-blank-edge.tsv
====================================================
Data lines read: 1
Data lines passed: 1

Line Check: label May Be Blank in a Node File

The label field may be blank in a KGTK node file.

cat examples/docs/validate-label-blank-node.tsv
id size color label
id1 large red
kgtk validate -i examples/docs/validate-label-blank-node.tsv
====================================================
Data lines read: 1
Data lines passed: 1

Line Check: node2 May Not Be Blank in an Edge File

The node2 field may not be blank in a KGTK edge file.

cat examples/docs/validate-node2-blank-edge.tsv
node1 label node2 id
line1 isa id1
kgtk validate -i examples/docs/validate-node2-blank-edge.tsv
====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to blank fields: 1

Line Check: node2 May Be Blank in a Node File

The node2 field may be blank in a KGTK node file.

cat examples/docs/validate-node2-blank-node.tsv
id size color node2
id1 large red
kgtk validate -i examples/docs/validate-node2-blank-node.tsv
====================================================
Data lines read: 1
Data lines passed: 1

Line Check: id May Be Blank in an Edge File

The id field may be blank in a KGTK edge file.

cat examples/docs/validate-id-blank-edge.tsv
node1 label node2 id
line1 isa line
kgtk validate -i examples/docs/validate-id-blank-edge.tsv
====================================================
Data lines read: 1
Data lines passed: 1

Line Check: id May Not Be Blank in a Node File

The id field may not be blank in a KGTK node file.

cat examples/docs/validate-id-blank-node.tsv
id size color
large red
kgtk validate -i examples/docs/validate-id-blank-node.tsv
====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to blank fields: 1

Value Check: Numbers and Quantities

Numbers are dimensionless. They may be integers (decimal, binary, octal, or hexadecimal), floating point (with or without exponential), or imaginary.

Quanties are numbers with an attached tolerance and/or dimension. The dimemsion may be indicated by SI units or by a QNode (a Wikidata QID or Q identifier).

By default, standard Wikidata QNodes are allowed as dimension qualifiers in quantities. When --allow-lax-qnodes=FALSE (the default), a QNode is an initial Q followed by an initial digit other than 0, followed by zero or more digits 0-9.

Lines with invalid numbers quantities are excluded by default.

kgtk cat -i examples/docs/validate-numbers-and-quantities.tsv
node1 label node2
line1 invalid 9x
line2 invalid 9[8,10j]
line3 invalid --9
line4 valid 9
line5 valid 9m
line6 valid 9Q12345
line7 invalid 9Q012345
line8 invalid 9Q123_45
line9 invalid 9Q123-45
line10 invalid 9Q123az
line11 invalid 9Q123AZ
kgtk validate -i examples/docs/validate-numbers-and-quantities.tsv
====================================================
Data lines read: 11
Data lines passed: 3
Data lines excluded due to invalid values: 8
Data errors reported: 8

The following error is reported on standard error: Errors detected

Value Check: Lax QNodes in Quantities

By default, standard QNodes (Wikidata QIDs or Q identifiers) are allowed as dimension qualifiers in quantities. When --allow-lax-qnodes=FALSE (the default), a QNode is an initial Q followed by an initial digit other than 0, followed by zero or more digits 0-9.

When --allow-lax-qnodes=TRUE, the QNode pattern is generalized with the addition of -, _, and upper- and lower-case alphas (a-zA-Z) after the initial Q.

kgtk cat -i examples/docs/validate-lax-qnodes-in-quantities.tsv
node1 label node2
line6 valid 9Q12345
line7 valid 9Q012345
line8 valid 9Q123_45
line9 valid 9Q123-45
line10 valid 9Q123az
line11 valid 9Q123AZ
kgtk validate -i examples/docs/validate-lax-qnodes-in-quantities.tsv \
         --allow-lax-qnodes
====================================================
Data lines read: 6
Data lines passed: 6

Value Check: Strings

Strings begin and end with double quotes (").

Strings that start with a double quote but do not end with one are invalid.

Internal double quotes in a string must be escaped with backslash (\") when --allow-lax-strings=FALSE (the default), otherwise the string is invalid.

Tab characters inside a string must be represented by \t whens the tab character is the column separator (controlled by --column-separator).

List separators (|) must be escaped (\|) inside a string when --escape-list-separators=True (the default).

Invalid strings are excluded by default.

kgtk cat -i examples/docs/validate-strings.tsv
node1 label node2
line1 invalid "xxx
line2 valid "xxx\"yyy"
line3 invalid "xxx"yyy"
line4 valid "xxx\\yyy"
line5 valid "xxx\tyyy"
kgtk validate -i examples/docs/validate-strings.tsv
====================================================
Data lines read: 5
Data lines passed: 3
Data lines excluded due to invalid values: 2
Data errors reported: 2

The following error is reported on standard error: Errors detected

Value Check: Lax Strings

Strings with internal double quote characters(") that are not escaped (\") are considered valid when --allow-lax-strings=TRUE.

kgtk clean can convert lax strings into strict KGTK strings.

kgtk cat -i examples/docs/validate-lax-strings.tsv
node1 label node2
line1 invalid "xxx
line2 valid "xxx\"yyy"
line3 valid "xxx"yyy"
line4 valid "xxx\\yyy"
line5 valid "xxx\tyyy"
kgtk validate -i examples/docs/validate-lax-strings.tsv \
              --allow-lax-strings
====================================================
Data lines read: 5
Data lines passed: 4
Data lines excluded due to invalid values: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Value Check: Language-Qualified Strings

KGTK language-qualified strings begin with single quotes ('). They end with single quotes (') followed by an at sign (@) and a language qualifier (e.g., en). Example: 'abc'@en.

Language-qualified strings that start with a single quote but do not end with one, followed by at sign and the language qualifier, are invalid.

Internal single quotes in a language-qualified string must be escaped with backslash (\') when --allow-lax-lq-strings=FALSE (the default), otherwise the language-qualified string is invalid.

Tab characters inside a language-qualified string must be represented by \t whens the tab character is the column separator (controlled by --column-separator).

List separators (|) must be escaped (\|) inside a language-qualified string when --escape-list-separators=True (the default).

The language qualifier is an ISO 639-3 (or ISO 639-5) two- or three-character language code when --allow-wikidata-lq-strings=FALSE (the default). The language qualifiers are validated against internal tables of ISO 639-3 (or ISO 639-5) codes and additional language codes.

When --additional-language-codes is specified it overrides the internal table of additional language codes.

By default, --allow-language-suffixes=FALSE. When --allow-language-suffixes=TRUE, the language qualifier may be followed by a language suffix, which is a dash (-) followed by a string matching the pattern [-a-zA-Z0-9]+.

Invalid language-qualified strings are excluded by default.

kgtk cat -i examples/docs/validate-language-qualified-strings.tsv
node1 label node2
line1 valid 'abc'@en
line2 valid 'a\'bc'@en
line3 invalid 'a'bc'@en
line4 invalid 'abc'@en-gb
line5 invalid 'abc'@xxx
kgtk validate -i examples/docs/validate-language-qualified-strings.tsv
====================================================
Data lines read: 5
Data lines passed: 2
Data lines excluded due to invalid values: 3
Data errors reported: 3

The following error is reported on standard error: Errors detected

Value Check: Language-Qualified Strings with Suffixes

When --allow-language-suffixes=TRUE, the language qualifier may be followed by a language suffix, which is a dash (-) followed by a string matching the pattern [-a-zA-Z0-9]+.

kgtk cat -i examples/docs/validate-language-qualified-strings-with-suffixes.tsv
node1 label node2
line1 valid 'abc'@en-gb
kgtk validate -i examples/docs/validate-language-qualified-strings-with-suffixes.tsv \
              --allow-language-suffixes
====================================================
Data lines read: 1
Data lines passed: 1

Value Check: Lax Language-Qualified Strings

KGTK language-qualified strings with internal double quote characters(") that are not escaped (\") are considered valid when --allow-lax-lq-strings=TRUE.

kgtk clean can convert lax language-qualified strings into strict KGTK strings.

kgtk cat -i examples/docs/validate-lax-language-qualified-strings.tsv
node1 label node2
line1 valid 'abc'@en
line1 valid 'a'bc'@en
kgtk validate -i examples/docs/validate-lax-language-qualified-strings.tsv \
              --allow-lax-lq-strings TRUE
====================================================
Data lines read: 2
Data lines passed: 2

Value Check: Wikidata Language-Qualified Strings

When --allow-wikidata-lq-strings=TRUE, the language qualifier may be two or more alpha characters, optionally followed by a language suffix (a dash (-) followed by a string matching the pattern [-a-zA-Z0-9]+). The language qualifier is not validated against known values.

kgtk cat -i examples/docs/validate-wikidata-language-qualified-strings.tsv
node1 label node2
line1 valid 'abc'@english
line2 valid 'abc'@english-gb
kgtk validate -i examples/docs/validate-wikidata-language-qualified-strings.tsv \
              --allow-wikidata-lq-strings
====================================================
Data lines read: 2
Data lines passed: 2

Value Check: Language Qualified Strings with Additional Language Codes

kgtk cat -i examples/docs/validate-language-qualified-strings-with-addl-codes.tsv
node1 label node2
line1 valid 'abc'@xxx
line2 valid 'abc'@yyy
kgtk validate -i examples/docs/validate-language-qualified-strings-with-addl-codes.tsv \
              --additional-language-codes xxx yyy
====================================================
Data lines read: 2
Data lines passed: 2

Value Check: Location Coordinates

KGTk location coodinates values start with the at sign (@), followed by the latitude and longitude separated by a slash (/).

Latitude and longitude are indegrees. They may be integers or floating point numbers.

When --allow-lax-coordinates=FALSE (the dafault), latitude and longitude may not include exponents. When --allow-lax-coordinates=TRUE, latitude and longitude may be floating point numbers with exponents.

When --allow-out-of-range-coordinates=FALSE (the default), the latitude and longitude must fit within specified ranges. When `--allow-out-of-range-coordinates=TRUE), the following checks are not applied.

--minimum-valid-lat (default -90.00) is the minimum valid latitide. When --clamp-minimum-lat=FALSE (the default), a latitude value less than the minimum value will result in an error. When --clamp-minimum-lat=TRUE, a latitude value less than the minimum value will be set to the minimum value.

--maximum-valid-lat (default 90.00) is the maximum valid latitide. When --clamp-maximum-lat=FALSE (the default), a latitude value less than the maximum value will result in an error. When --clamp-maximum-lat=TRUE, a latitude value greater than the maximum value will be set to the maximum value.

--minimum-valid-lon (default -180.00) is the minimum valid latitide. When --clamp-minimum-lon=FALSE (the default), a longitude value less than the minimum value will result in an error. When --clamp-minimum-lon=TRUE, a longitude value less than the minimum value will be set to the minimum value.

--maximum-valid-lon (default 180.00) is the maximum valid latitide. When --clamp-maximum-lon=FALSE (the default), a longitude value less than the maximum value will result in an error. When --clamp-maximum-lon=TRUE, a longitude value greater than the maximum value will be set to the maximum value.

kgtk clean can update KGTK latitudes or longitudes with clamped values.

kgtk cat -i examples/docs/validate-location-coordinates.tsv
node1 label node2
line1 valid @34/118
line2 valid @33.9803/118.4517
line3 invalid @33.9803/118.4517e1
line4 invalid @100/118
line5 invalid @-100/118
line6 invalid @34/200
line7 invalid @34/-200
kgtk validate -i examples/docs/validate-location-coordinates.tsv
====================================================
Data lines read: 7
Data lines passed: 2
Data lines excluded due to invalid values: 5
Data errors reported: 5

The following error is reported on standard error: Errors detected

Value Check: Allow Lax Location Coordinates

kgtk cat -i examples/docs/validate-location-coordinates.tsv
node1 label node2
line1 valid @34/118
line2 valid @33.9803/118.4517
line3 invalid @33.9803/118.4517e1
line4 invalid @100/118
line5 invalid @-100/118
line6 invalid @34/200
line7 invalid @34/-200
kgtk validate -i examples/docs/validate-location-coordinates.tsv \
              --allow-lax-coordinates
====================================================
Data lines read: 7
Data lines passed: 2
Data lines excluded due to invalid values: 5
Data errors reported: 5

The following error is reported on standard error: Errors detected

Value Check: Allow Out of Range Location Coordinates

kgtk cat -i examples/docs/validate-location-coordinates.tsv
node1 label node2
line1 valid @34/118
line2 valid @33.9803/118.4517
line3 invalid @33.9803/118.4517e1
line4 invalid @100/118
line5 invalid @-100/118
line6 invalid @34/200
line7 invalid @34/-200
kgtk validate -i examples/docs/validate-location-coordinates.tsv \
              --allow-out-of-range-coordinates
====================================================
Data lines read: 7
Data lines passed: 6
Data lines excluded due to invalid values: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Value Check: Clamp Out of Range Location Coordinates

kgtk cat -i examples/docs/validate-location-coordinates.tsv
node1 label node2
line1 valid @34/118
line2 valid @33.9803/118.4517
line3 invalid @33.9803/118.4517e1
line4 invalid @100/118
line5 invalid @-100/118
line6 invalid @34/200
line7 invalid @34/-200
kgtk validate -i examples/docs/validate-location-coordinates.tsv \
              --clamp-minimum-lat --clamp-maximum-lat \
              --clamp-minimum-lon --clamp-maximum-lon
====================================================
Data lines read: 7
Data lines passed: 6
Data lines excluded due to invalid values: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Value Check: Dates with Month or Day Zero

Wikidata uses day 0 on date/time values with coarser than day granularity. Wikidata uses month 0 on date/time values with coarser than month granularity. If these date strings are imported into KGTK files without modification, the result is a date/time string that does not meet KGTK's ISO 8601 requirement.

kgtk cat -i examples/docs/validate-date-with-day-zero.tsv
node1 label node2
john woke ^2020-05-00T00:00
john woke ^2020-00-00T00:00
kgtk validate -i examples/docs/validate-date-with-day-zero.tsv

This results in the following summary:

====================================================
Data lines read: 2
Data lines passed: 0
Data lines excluded due to invalid values: 2
Data errors reported: 2

The following error is reported on standard error: Errors detected

Value Check: Allow Dates with Month or Day Zero

Instruct the validator to accept month or day 00, even though this is not allowed by ISO 6801.

kgtk cat -i examples/docs/validate-date-with-day-zero.tsv
node1 label node2
john woke ^2020-05-00T00:00
john woke ^2020-00-00T00:00
kgtk validate -i examples/docs/validate-date-with-day-zero.tsv \
              --allow-month-or-day-zero

This results in no error messages, and the following summary:

====================================================
Data lines read: 2
Data lines passed: 2

Info

Wikidata use day 0 on date/time values with coarser than day granularity. Wikidata uses month 0 on date/time values with coarser than month granularity.

Value Check: Dates with End of Day Markers (24:00) Allowed by Dafault

KGTK uses ISO 8601 dates. Prior to the 2019 revision of this standard, ISO 8601-1:2019, "24:00" could be used to indicate midnight at the end of a day. The 2019 revision disallowed this usage, but KGTK continues to support it, as end-of-day markers may appear in earlier sources, such as Wikidata.

kgtk cat -i examples/docs/validate-date-with-end-of-day.tsv
node1 label node2
john woke ^2020-05-01T24:00
kgtk validate -i examples/docs/validate-date-with-end-of-day.tsv

This results in no error messages, and the following summary:

====================================================
Data lines read: 1
Data lines passed: 1

Value Check: Disallow Dates with End of Day Marker (24:00)

Instruct the validator to disallow the end-of-day marker (24:00), in conformity to the current (ISO 8601)[https://en.wikipedia.org/wiki/ISO_8601) standard.

kgtk cat -i examples/docs/validate-date-with-end-of-day.tsv
node1 label node2
john woke ^2020-05-01T24:00
kgtk validate -i examples/docs/validate-date-with-end-of-day.tsv \
              --allow-end-of-day False

This results in the following summary:

====================================================
Data lines read: 1
Data lines passed: 0
Data lines excluded due to invalid values: 1
Data errors reported: 1

The following error is reported on standard error: Errors detected

Value Check: Minimum Valid Year (1583 by Default)

The KGTK File Specification v2 uses ISO 8601 date format. ISO 8601 is based on the Gregorian calendar, which started on 15 October 1582. The default minimum valid year in ISO 8601 is 1583.

Extending the Gregorian calendar before its start date is called the proleptic Gregorian calendar. ISO 8601 can be used to represent dates prior to year 1583, and has special rules for representing the year 1 BC and earlier years. KGTK generally follows these rules. The following points should be noted:

  • The year 1 BC is represented as the year 0000.
  • An optional + may be used in front of year 0000 and later years.
  • The year 2 BC and earlier years require minus signs (-) in front of the year number.
  • The year 2 BC is represented as year -0001
  • KGTK allows dates with more than four digits in the year, but only in ISO 8601 extended mode (with dashes (-) between date components and colons (:) between time components, see the --force-iso8601-extended and --require-iso8601-extended examples, below)

--minimum-valid-year is used to specify the minimum allowed year. The default value is 1583.

--ignore-minimum-year, when TRUE, disables the minimym valid year check. The default for this option is FALSE.

--clamp-minimum-year, when TRUE, forces all years below the minimum value to be set to the minium value. The default for this option is FALSE.

kgtk cat -i examples/docs/validate-date-with-minimum-year.tsv
node1 label node2
john born ^1583-01-01T00:00
jack born ^1582-01-01T00:00
jorge born ^0922-01-01T00:00
jerry born ^0000-01-01T00:00
jon born ^+0000-01-01T00:00
jared born ^-0001-01-01T00:00
jimmy born ^-10001-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-minimum-year.tsv

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 1
Data lines excluded due to invalid values: 6
Data errors reported: 6

The following error message is sent to stderr. The return status is 1.

Errors detected

Value Check: Change the Minimum Valid Year

Suppose we want to exclude all dates before the year 1000. Here's our sample data:

kgtk cat -i examples/docs/validate-date-with-minimum-year.tsv
node1 label node2
john born ^1583-01-01T00:00
jack born ^1582-01-01T00:00
jorge born ^0922-01-01T00:00
jerry born ^0000-01-01T00:00
jon born ^+0000-01-01T00:00
jared born ^-0001-01-01T00:00
jimmy born ^-10001-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-minimum-year.tsv \
              --minimum-valid-year 1000

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 2
Data lines excluded due to invalid values: 5
Data errors reported: 5

The following error message is sent to stderr. The return status is 1.

Errors detected

Value Check: Clamp the Minimum Valid Year

Suppose we want to validate all records, converting any negative dates to year 0000. This will not make a significant difference to kgtk validate compared to ignoring the minimum valid year check (see the example below), but clamping may be useful in other contexts.

kgtk cat -i examples/docs/validate-date-with-minimum-year.tsv
node1 label node2
john born ^1583-01-01T00:00
jack born ^1582-01-01T00:00
jorge born ^0922-01-01T00:00
jerry born ^0000-01-01T00:00
jon born ^+0000-01-01T00:00
jared born ^-0001-01-01T00:00
jimmy born ^-10001-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-minimum-year.tsv \
              --minimum-valid-year 0000 \
              --clamp-minimum-year

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 7

Value Check: Ignore the Minimum Valid Year Check

kgtk cat -i examples/docs/validate-date-with-minimum-year.tsv
node1 label node2
john born ^1583-01-01T00:00
jack born ^1582-01-01T00:00
jorge born ^0922-01-01T00:00
jerry born ^0000-01-01T00:00
jon born ^+0000-01-01T00:00
jared born ^-0001-01-01T00:00
jimmy born ^-10001-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-minimum-year.tsv \
              --ignore-minimum-year

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 7

Value Check: Maximum Valid Year (2100 by Default)

The KGTK File Specification v 2 uses ISO 8601 date format. ISO 8601 is based on the Gregorian calendar, which started on 15 October 1582. The default maximum valid year in ISO 8601 is 9999, although additional digits can be used incertain circumstances.

KGTK somewhat arbitrarily has a default maximum date of 2100. The rationale is that dates beyond that are unlikely in most datasets.

kgtk cat -i examples/docs/validate-date-with-maximum-year.tsv
node1 label node2
john date ^2099-01-01T00:00
jack date ^2100-01-01T00:00
jack date ^2101-01-01T00:00
jorge born ^9999-01-01T00:00
jon born ^+9999-01-01T00:00
jared born ^10000-01-01T00:00
jared born ^+10000-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-maximum-year.tsv

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 2
Data lines excluded due to invalid values: 5
Data errors reported: 5

The following error message is sent to stderr. The return status is 1.

Errors detected

Value Check: Changing the Maximum Valid Year

Let's change the maximum valid year to 9999:

kgtk cat -i examples/docs/validate-date-with-maximum-year.tsv
node1 label node2
john date ^2099-01-01T00:00
jack date ^2100-01-01T00:00
jack date ^2101-01-01T00:00
jorge born ^9999-01-01T00:00
jon born ^+9999-01-01T00:00
jared born ^10000-01-01T00:00
jared born ^+10000-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-maximum-year.tsv \
              --maximum-valid-year 9999

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 5
Data lines excluded due to invalid values: 2
Data errors reported: 2

The following error message is sent to stderr. The return status is 1.

Errors detected

Value Check: Changing the Maximum Valid Year #2

Let's change the maximum valid year to 99999:

kgtk cat -i examples/docs/validate-date-with-maximum-year.tsv
node1 label node2
john date ^2099-01-01T00:00
jack date ^2100-01-01T00:00
jack date ^2101-01-01T00:00
jorge born ^9999-01-01T00:00
jon born ^+9999-01-01T00:00
jared born ^10000-01-01T00:00
jared born ^+10000-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-maximum-year.tsv \
              --maximum-valid-year 99999

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 7

Value Check: Ignoring the Maximum Valid Year

kgtk cat -i examples/docs/validate-date-with-maximum-year.tsv
node1 label node2
john date ^2099-01-01T00:00
jack date ^2100-01-01T00:00
jack date ^2101-01-01T00:00
jorge born ^9999-01-01T00:00
jon born ^+9999-01-01T00:00
jared born ^10000-01-01T00:00
jared born ^+10000-01-01T00:00
kgtk validate -i examples/docs/validate-date-with-maximum-year.tsv \
              --ignore-maximum-year

This results in the following summary:

====================================================
Data lines read: 7
Data lines passed: 7