clean-data
Summary¶
Validate and clean the data in a KGTK file, optionally decompressing the input files and compressing the output file.
Input and output files may be (de)compressed using a algorithm selected by the file extension: .bz2 .gz .lz4 .xy
Default Rules¶
By default, the following rules apply:
- errors that occur while processing a KGTK file's column header line cause an immediate exit:
- An empty column name
- A duplicate column name
- A missing required column name for an edge or node file
- An ambiguous required column name (e.g.,
id
andID
are both present) - empty data lines are silently ignored and not passed through.
- data lines containing only whitespace are silently ignored and not passed through.
- data lines with empty required fields (node1 and node2 for KGTK edge files, id for KGTK node files) are silently ignored and not passed through.
- data lines that have too few fields cause a complaint to be issued, and are not passed through.
- data lines that have too many fields cause a complaint to be issued, and are not passed through.
- lines with data value validation errors cause a complaint to be issued, and are not passed through.
These defaults may be changed through expert options.
Action Codes¶
Action keyword | Action when condition detected |
---|---|
PASS | Silently allow the data line to pass through |
REPORT | Report the data line and let it pass through |
EXCLUDE | Silently exclude (ignore) the data line |
COMPLAIN | Report the data line and exclude (ignore) it |
ERROR | Raise a ValueError |
EXIT | sys.exit(1) |
Usage¶
usage: kgtk clean-data [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
[--reject-file REJECT_FILE] [-v [optional True|False]]
Validate a KGTK file and output a clean copy. Empty lines, whitespace lines, comment lines, and lines with empty required fields are silently skipped. Header errors cause an immediate exception. Data value errors are reported and the line containing them skipped.
Additional options are shown in expert help.
kgtk --expert clean-data --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--reject-file REJECT_FILE
Reject file (Optional, use '-' for stdout.)
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Expert Usage¶
usage: kgtk clean-data [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
[--reject-file REJECT_FILE]
[--errors-to-stdout [optional True|False] |
--errors-to-stderr [optional True|False]]
[--show-options [optional True|False]]
[-v [optional True|False]]
[--very-verbose [optional True|False]]
[--column-separator COLUMN_SEPARATOR]
[--input-format INPUT_FORMAT]
[--compression-type COMPRESSION_TYPE]
[--error-limit ERROR_LIMIT]
[--use-mgzip [optional True|False]]
[--mgzip-threads MGZIP_THREADS]
[--gzip-in-parallel [optional True|False]]
[--gzip-queue-size GZIP_QUEUE_SIZE]
[--implied-label IMPLIED_LABEL]
[--use-graph-cache-envar [optional True|False]]
[--ignore-stale-graph-cache [optional True|False]]
[--graph-cache GRAPH_CACHE]
[--graph-cache-fetchmany-size GRAPH_CACHE_FETCHMANY_SIZE]
[--graph-cache-filter-batch-size GRAPH_CACHE_FILTER_BATCH_SIZE]
[--mode {NONE,EDGE,NODE,AUTO}]
[--input-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]]
[--no-input-header [optional True|False]]
[--supply-missing-column-names [optional True|False]]
[--number-of-columns COUNT]
[--require-column-names REQUIRE_COLUMN_NAMES [REQUIRE_COLUMN_NAMES ...]]
[--no-additional-columns [optional True|False]]
[--unquote-csv-column-names [optional True|False]]
[--header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--prohibit-whitespace-in-column-names [optional True|False]]
[--initial-skip-count INITIAL_SKIP_COUNT]
[--every-nth-record EVERY_NTH_RECORD]
[--record-limit RECORD_LIMIT] [--tail-count TAIL_COUNT]
[--repair-and-validate-lines [optional True|False]]
[--repair-and-validate-values [optional True|False]]
[--blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--fill-short-lines [optional True|False]]
[--invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--truncate-long-lines [TRUNCATE_LONG_LINES]]
[--whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--additional-language-codes [ADDITIONAL_LANGUAGE_CODES ...]]
[--allow-lax-qnodes [ALLOW_LAX_QNODES]]
[--allow-language-suffixes [ALLOW_LANGUAGE_SUFFIXES]]
[--allow-lax-strings [ALLOW_LAX_STRINGS]]
[--allow-lax-lq-strings [ALLOW_LAX_LQ_STRINGS]]
[--allow-wikidata-lq-strings [ALLOW_WIKIDATA_LQ_STRINGS]]
[--require-iso8601-extended [REQUIRE_ISO8601_EXTENDED]]
[--force-iso8601-extended [FORCE_ISO8601_EXTENDED]]
[--allow-month-or-day-zero [ALLOW_MONTH_OR_DAY_ZERO]]
[--repair-month-or-day-zero [REPAIR_MONTH_OR_DAY_ZERO]]
[--allow-end-of-day [ALLOW_END_OF_DAY]]
[--minimum-valid-year MINIMUM_VALID_YEAR]
[--clamp-minimum-year [CLAMP_MINIMUM_YEAR]]
[--ignore-minimum-year [IGNORE_MINIMUM_YEAR]]
[--maximum-valid-year MAXIMUM_VALID_YEAR]
[--clamp-maximum-year [CLAMP_MAXIMUM_YEAR]]
[--ignore-maximum-year [IGNORE_MAXIMUM_YEAR]]
[--validate-fromisoformat [VALIDATE_FROMISOFORMAT]]
[--allow-lax-coordinates [ALLOW_LAX_COORDINATES]]
[--repair-lax-coordinates [REPAIR_LAX_COORDINATES]]
[--allow-out-of-range-coordinates [ALLOW_OUT_OF_RANGE_COORDINATES]]
[--minimum-valid-lat MINIMUM_VALID_LAT]
[--clamp-minimum-lat [CLAMP_MINIMUM_LAT]]
[--maximum-valid-lat MAXIMUM_VALID_LAT]
[--clamp-maximum-lat [CLAMP_MAXIMUM_LAT]]
[--minimum-valid-lon MINIMUM_VALID_LON]
[--clamp-minimum-lon [CLAMP_MINIMUM_LON]]
[--maximum-valid-lon MAXIMUM_VALID_LON]
[--clamp-maximum-lon [CLAMP_MAXIMUM_LON]]
[--modulo-repair-lon [MODULO_REPAIR_LON]]
[--escape-list-separators [ESCAPE_LIST_SEPARATORS]]
Validate a KGTK file and output a clean copy. Empty lines, whitespace lines, comment lines, and lines with empty required fields are silently skipped. Header errors cause an immediate exception. Data value errors are reported and the line containing them skipped.
Additional options are shown in expert help.
kgtk --expert clean-data --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--reject-file REJECT_FILE
Reject file (Optional, use '-' for stdout.)
Error and feedback messages:
Send error messages and feedback to stderr or stdout, control the amount of feedback and debugging messages.
--errors-to-stdout [optional True|False]
Send errors to stdout instead of stderr.
(default=False).
--errors-to-stderr [optional True|False]
Send errors to stderr instead of stdout.
(default=False).
--show-options [optional True|False]
Print the options selected (default=False).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
--very-verbose [optional True|False]
Print additional progress messages (default=False).
File options:
Options affecting processing.
--column-separator COLUMN_SEPARATOR
Column separator (default=<TAB>).
--input-format INPUT_FORMAT
Specify the input format (default=None).
--compression-type COMPRESSION_TYPE
Specify the compression type (default=None).
--error-limit ERROR_LIMIT
The maximum number of errors to report before failing
(default=1000)
--use-mgzip [optional True|False]
Execute multithreaded gzip. (default=False).
--mgzip-threads MGZIP_THREADS
Multithreaded gzip thread count. (default=3).
--gzip-in-parallel [optional True|False]
Execute gzip in parallel. (default=False).
--gzip-queue-size GZIP_QUEUE_SIZE
Queue size for parallel gzip. (default=1000).
--implied-label IMPLIED_LABEL
When specified, imply a label colum with the specified
value (default=None).
--use-graph-cache-envar [optional True|False]
use KGTK_GRAPH_CACHE if --graph-cache is not
specified. (default=True).
--ignore-stale-graph-cache [optional True|False]
Ignore the graph cache if the file exists with a
differen size or modificatin time. (default=True).
--graph-cache GRAPH_CACHE
When specified, look for input files in a graph cache.
(default=None).
--graph-cache-fetchmany-size GRAPH_CACHE_FETCHMANY_SIZE
Graph cache transfer buffer size. (default=1000).
--graph-cache-filter-batch-size GRAPH_CACHE_FILTER_BATCH_SIZE
Graph cache filter batch size. (default=1000).
--mode {NONE,EDGE,NODE,AUTO}
Determine the KGTK file mode
(default=KgtkReaderMode.AUTO).
Header parsing:
Options affecting header parsing.
--input-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...], --force-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]
Supply input column names when the input file does not
have a header record (--no-input-header=True), or
forcibly override the column names when a header row
exists (--no-input-header=False) (default=None).
--no-input-header [optional True|False]
When the input file does not have a header record,
specify --no-input-header=True and --input-column-
names. When the input file does have a header record
that you want to forcibly override, specify --input-
column-names and --no-input-header=False. --no-input-
header has no effect when --input-column-names has not
been specified. (default=False).
--supply-missing-column-names [optional True|False]
Supply column names that are missing. (default=False).
--number-of-columns COUNT
The expected number of columns in the header.
(default=None).
--require-column-names REQUIRE_COLUMN_NAMES [REQUIRE_COLUMN_NAMES ...]
The list of column names required in the input file.
(default=None).
--no-additional-columns [optional True|False]
When True, do not allow any column names other than
the required column names. When --require-column-names
is not specified, then disallow columns other than
[node1, label, node2, id] (or aliases) for an edge
file, and [id] for a node file. (default=False).
--unquote-csv-column-names [optional True|False]
Remove double quotes from the outside of column names.
(default=True).
--header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a header error is detected.
Only ERROR or EXIT are supported
(default=ValidationAction.EXIT).
--unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a column name is unsafe
(default=ValidationAction.REPORT).
--prohibit-whitespace-in-column-names [optional True|False]
Prohibit whitespace in column names. (default=False).
Pre-validation sampling:
Options affecting pre-validation data line sampling.
--initial-skip-count INITIAL_SKIP_COUNT
The number of data records to skip initially
(default=do not skip).
--every-nth-record EVERY_NTH_RECORD
Pass every nth record (default=pass all records).
--record-limit RECORD_LIMIT
Limit the number of records read (default=no limit).
--tail-count TAIL_COUNT
Pass this number of records (default=no tail
processing).
Line parsing:
Options affecting data line parsing.
--repair-and-validate-lines [optional True|False]
Repair and validate lines (default=True).
--repair-and-validate-values [optional True|False]
Repair and validate values (default=True).
--blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a line with a blank node1,
node2, or id field (per mode) is detected
(default=ValidationAction.EXCLUDE).
--comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a comment line is detected
(default=ValidationAction.EXCLUDE).
--empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when an empty line is detected
(default=ValidationAction.EXCLUDE).
--fill-short-lines [optional True|False]
Fill missing trailing columns in short lines with
empty values (default=False).
--invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a data cell value is invalid
(default=ValidationAction.COMPLAIN).
--long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a long line is detected
(default=ValidationAction.COMPLAIN).
--prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a data cell contains a
prohibited list (default=ValidationAction.COMPLAIN).
--short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a short line is detected
(default=ValidationAction.COMPLAIN).
--truncate-long-lines [TRUNCATE_LONG_LINES]
Remove excess trailing columns in long lines
(default=False).
--whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a whitespace line is detected
(default=ValidationAction.EXCLUDE).
Data value parsing:
Options controlling the parsing and processing of KGTK data values.
--additional-language-codes [ADDITIONAL_LANGUAGE_CODES ...]
Additional language codes. (default=use internal
list).
--allow-lax-qnodes [ALLOW_LAX_QNODES]
Allow qnode suffixes in quantities to include alphas
and dash as well as digits. (default=False).
--allow-language-suffixes [ALLOW_LANGUAGE_SUFFIXES]
Allow language identifier suffixes starting with a
dash. (default=False).
--allow-lax-strings [ALLOW_LAX_STRINGS]
Do not check if double quotes are backslashed inside
strings. (default=False).
--allow-lax-lq-strings [ALLOW_LAX_LQ_STRINGS]
Do not check if single quotes are backslashed inside
language qualified strings. (default=False).
--allow-wikidata-lq-strings [ALLOW_WIKIDATA_LQ_STRINGS]
Allow Wikidata language qualifiers. (default=False).
--require-iso8601-extended [REQUIRE_ISO8601_EXTENDED]
Require colon(:) and hyphen(-) in dates and times.
(default=False).
--force-iso8601-extended [FORCE_ISO8601_EXTENDED]
Force colon (:) and hyphen(-) in dates and times.
(default=False).
--allow-month-or-day-zero [ALLOW_MONTH_OR_DAY_ZERO]
Allow month or day zero in dates. (default=False).
--repair-month-or-day-zero [REPAIR_MONTH_OR_DAY_ZERO]
Repair month or day zero in dates. (default=False).
--allow-end-of-day [ALLOW_END_OF_DAY]
Allow 24:00:00 to represent the end of the day.
(default=True).
--minimum-valid-year MINIMUM_VALID_YEAR
The minimum valid year in dates. (default=1583).
--clamp-minimum-year [CLAMP_MINIMUM_YEAR]
Clamp years at the minimum value. (default=False).
--ignore-minimum-year [IGNORE_MINIMUM_YEAR]
Ignore the minimum year constraint. (default=False).
--maximum-valid-year MAXIMUM_VALID_YEAR
The maximum valid year in dates. (default=2100).
--clamp-maximum-year [CLAMP_MAXIMUM_YEAR]
Clamp years at the maximum value. (default=False).
--ignore-maximum-year [IGNORE_MAXIMUM_YEAR]
Ignore the maximum year constraint. (default=False).
--validate-fromisoformat [VALIDATE_FROMISOFORMAT]
Validate that datetime.fromisoformat(...) can parse
this date and time. This checks that the
year/month/day combination is valid. The year must be
in the range 1..9999, inclusive. (default=False).
--allow-lax-coordinates [ALLOW_LAX_COORDINATES]
Allow coordinates using scientific notation.
(default=False).
--repair-lax-coordinates [REPAIR_LAX_COORDINATES]
Allow coordinates using scientific notation.
(default=False).
--allow-out-of-range-coordinates [ALLOW_OUT_OF_RANGE_COORDINATES]
Allow coordinates that don't make sense.
(default=False).
--minimum-valid-lat MINIMUM_VALID_LAT
The minimum valid latitude. (default=-90.000000).
--clamp-minimum-lat [CLAMP_MINIMUM_LAT]
Clamp latitudes at the minimum value. (default=False).
--maximum-valid-lat MAXIMUM_VALID_LAT
The maximum valid latitude. (default=90.000000).
--clamp-maximum-lat [CLAMP_MAXIMUM_LAT]
Clamp latitudes at the maximum value. (default=False).
--minimum-valid-lon MINIMUM_VALID_LON
The minimum valid longitude. (default=-180.000000).
--clamp-minimum-lon [CLAMP_MINIMUM_LON]
Clamp longitudes at the minimum value.
(default=False).
--maximum-valid-lon MAXIMUM_VALID_LON
The maximum valid longitude. (default=180.000000).
--clamp-maximum-lon [CLAMP_MAXIMUM_LON]
Clamp longitudes at the maximum value.
(default=False).
--modulo-repair-lon [MODULO_REPAIR_LON]
Wrap longitude to (-180.0,180.0]. (default=False).
--escape-list-separators [ESCAPE_LIST_SEPARATORS]
Escape all list separators instead of splitting on
them. (default=False).
Examples¶
Sample Data¶
Suppose that file1.tsv
contains the following table in KGTK format:
kgtk cat -i examples/docs/clean-data-file1.tsv
node1 | label | node2 |
---|---|---|
john | woke | ^2020-05-02T00:00 |
john | woke | ^2020-05-00T00:00 |
john | slept | ^2020-05-02T24:00 |
lionheart | born | ^1157-09-08T00:00 |
year0001 | starts | ^0001-01-01T00:00 |
year9999 | ends | ^9999-12-31T11:59:59 |
Clean the data, using default options¶
kgtk clean-data -i examples/docs/clean-data-file1.tsv
Standard output will get the following data:
node1 | label | node2 |
---|---|---|
john | woke | ^2020-05-02T00:00 |
john | slept | ^2020-05-02T24:00 |
The following complaints will be issued on standard error:
Data line 2:
john woke ^2020-05-00T00:00
col 2 (node2) value '^2020-05-00T00:00' is an Invalid Date and Times
Data line 4:
lionheart born ^1157-09-08T00:00
col 2 (node2) value '^1157-09-08T00:00' is an Invalid Date and Times
Data line 5:
year0001 starts ^0001-01-01T00:00
col 2 (node2) value '^0001-01-01T00:00' is an Invalid Date and Times
Data line 6:
year9999 ends ^9999-12-31T11:59:59
col 2 (node2) value '^9999-12-31T11:59:59' is an Invalid Date and Times
The second data line was excluded because it contained "00" in the day field, which violates the ISO 8601 specification.
The fourth data line was excluded because year 1157 is before the start of the ISO 8601 normal era, year 1583.
The fifth data line was excluded because year 1157 is before the start of the ISO 8601 normal era, year 1583.
The sixth data line was excluded because year 9999 is after the sanity check 2100 cutoff.
Repair Month or Day "00"¶
Change day "00" to day "01:
kgtk clean-data -i examples/docs/clean-data-file1.tsv \
--repair-month-or-day-zero
Standard output will get the following data, and other errors will be issued:
node1 | label | node2 |
---|---|---|
john | woke | ^2020-05-02T00:00 |
john | woke | ^2020-05-01T00:00 |
john | slept | ^2020-05-02T24:00 |
Data line 4:
lionheart born ^1157-09-08T00:00
col 2 (node2) value '^1157-09-08T00:00' is an Invalid Date and Times
Data line 5:
year0001 starts ^0001-01-01T00:00
col 2 (node2) value '^0001-01-01T00:00' is an Invalid Date and Times
Data line 6:
year9999 ends ^9999-12-31T11:59:59
col 2 (node2) value '^9999-12-31T11:59:59' is an Invalid Date and Times
Accept Years from 1 AD¶
kgtk clean-data -i examples/docs/clean-data-file1.tsv \
--repair-month-or-day-zero \
--minimum-valid-year 1
Standard output will get the following data, and other errors will be issued:
node1 | label | node2 |
---|---|---|
john | woke | ^2020-05-02T00:00 |
john | woke | ^2020-05-01T00:00 |
john | slept | ^2020-05-02T24:00 |
lionheart | born | ^1157-09-08T00:00 |
year0001 | starts | ^0001-01-01T00:00 |
Data line 6:
year9999 ends ^9999-12-31T11:59:59
col 2 (node2) value '^9999-12-31T11:59:59' is an Invalid Date and Times
Note
Year 0001 is the earliest year that can be processed by the Python standard library date and time modules.
Accept Years up to 9999 AD¶
kgtk clean-data -i examples/docs/clean-data-file1.tsv \
--repair-month-or-day-zero \
--minimum-valid-year 1 \
--maximum-valid-year 9999
Standard output will get the following data, and no other errors will be issued:
node1 | label | node2 |
---|---|---|
john | woke | ^2020-05-02T00:00 |
john | woke | ^2020-05-01T00:00 |
john | slept | ^2020-05-02T24:00 |
lionheart | born | ^1157-09-08T00:00 |
year0001 | starts | ^0001-01-01T00:00 |
year9999 | ends | ^9999-12-31T11:59:59 |
Note
Year 9999 is the latest year that can be processed by the Python standard library date and time modules.