sample
Overview¶
kgtk sample
samples a KGTK file, dividing it into an output file and an optional reject file.
The simplest way to use this command is sample the input file by a fraction.
kgtk sample --probability frac
supplies the sampling probability that an
input record will be passed to the primary output file. The probability value
ranges from 0.0 to 1.0, with 1 being the default. The sampling probability is
applied to each record (edge or node) in the input file independently. The
number of records in the output file might not be exactly the same as the
fraction times the number of records in the input file; occasionally, this
size difference may be significant.
Another simple way to use this command is to specify the number of records to
be included in the sample. kgtk sample --sample-size n
specifies a sample
size of n
(which must be positive). The output file will contain a sample
of n
records unless the input file has fewer than n
records. Candidate
records for the final sample will be buffered in memory as the input file is
processed, thus a significant amount of time and memory may be needed if n
is very large.
Alternatively, --input-size N
and --sample-size n
may be provided. The
sampling probability will be computed as n/N
. The number of output records
may not exactly match the sample size unless --exact
is also
specified. --exact
required more time and memory to process, which might be
significant on very large input files, but less time and memory than is
required when --input-size N
has not been specified.
The input size, if specified, must be positive. The sample size, if specified, must be positive.
This command defaults to --mode=NONE
since it doesn't attach special meaning
to particular columns.
Usage¶
usage: kgtk sample [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
[--reject-file REJECT_FILE] [--probability PROBABILITY]
[--seed SEED] [--input-size INPUT_SIZE]
[--sample-size SAMPLE_SIZE] [--exact [True|False]]
[-v [optional True|False]]
This utility randomly samples a KGTK file, dividing it into an optput file and an optional reject file. The probability of an input record being passed to the output file is controlled by `--probability n`, where `n` ranges from 0 to 1.
This command defaults to --mode=NONE so it will work with TSV files that do not follow KGTK column naming conventions.
Additional options are shown in expert help.
kgtk --expert sample --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--reject-file REJECT_FILE
The KGTK reject file for records that fail the filter.
(Optional, use '-' for stdout.)
--probability PROBABILITY
The probability of passing an input record to the
output file (default=1).
--seed SEED The optional random number generator seed
(default=None).
--input-size INPUT_SIZE
The optional number of input records (default=None).
--sample-size SAMPLE_SIZE
The optional desired number of output records
(default=None).
--exact [True|False] Ensure that exactly the desired sample size is
extracted when --input-size and --sample-size are
supplied. (default=False).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
All of the examples in this section specify an integer seed to the random number generator in order to provide repeatable sampling. For production use, you may prefer to omit the seed.
Sampling .1 Probability with a Fixed Seed¶
Quickly sample one tenth of the input records.
kgtk sample -i examples/docs/sample-example1.tsv \
--probability .1 --seed 123
node1 | label | node2 | id |
---|---|---|---|
red | property | True | |
red | isa | rgbcolor | |
green | maxoccurs | 1 | |
rgbcolor | maxval | 1.0 | |
rgbcolor | requires | green | |
rgbcolor | isa | colorclass | |
colorname | node1_type | symbol | |
colorname | node2_values | green |
Note
Omit --seed 123
to obtain a nonrepeatable sample.
Sampling an Approximate Number of Records Unbuffered¶
Given the number of input records, quickly sample a specified number of output records. The resulting sample might not be the exact size requested.
kgtk sample -i examples/docs/sample-example1.tsv \
--input-size 47 --sample-size 5 \
--seed 123
node1 | label | node2 | id |
---|---|---|---|
red | property | True | |
red | isa | rgbcolor | |
green | maxoccurs | 1 | |
rgbcolor | maxval | 1.0 | |
rgbcolor | requires | green | |
rgbcolor | isa | colorclass | |
colorname | node1_type | symbol | |
colorname | node2_values | green |
Note
Omit --seed 123
to obtain a nonrepeatable sample.
Sampling an Exact Number of Records Unbuffered¶
Given the number of input records, sample an exact number of output records. Additional time and memory is required to plan which records to include in the output sample.
kgtk sample -i examples/docs/sample-example1.tsv \
--input-size 47 --sample-size 5 --exact \
--seed 123
node1 | label | node2 | id |
---|---|---|---|
green | property | True | |
green | maxoccurs | 1 | |
blue | property | True | |
rgbcolor | isa | colorclass | |
colorname | node2_values | yellow |
Note
Omit --seed 123
to obtain a nonrepeatable sample.
Sampling an Exact Number of Records Buffered¶
Sample a specified number of output records without knowing the number of input records. The sampled records are buffered in memory until the input file has been read; a large amount of memory may be needed when the sample size is large.
kgtk sample -i examples/docs/sample-example1.tsv \
--sample-size 5 \
--seed 123
node1 | label | node2 | id |
---|---|---|---|
blue | maxoccurs | 1 | |
roundshape | datatype | True | |
green | isa | rgbcolor | |
colorname | isa | colorclass | |
colorname | node2_values | blue |
Note
The --exact
option is ignored when --sample-size n
is specified and
--input-size N
is not specified.
Note
Omit --seed 123
to obtain a nonrepeatable sample.