sample

Overview

kgtk sample samples a KGTK file, dividing it into an output file and an optional reject file.

The simplest way to use this command is to sample the input file by a fraction. kgtk sample --probability frac supplies the probability that an input record will be passed to the primary output file. The probability ranges from 0.0 to 1.0, with 1 being the default. The sampling probability is applied to each record (edge or node) in the input file independently, so the number of records in the output file will generally not equal the fraction times the number of records in the input file; occasionally, the difference may be significant.
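To illustrate why the output size varies, here is a minimal Python sketch of per-record probability sampling. It is not the kgtk implementation; the function name `bernoulli_sample` and the in-memory lists standing in for the output and reject files are assumptions made for this example.

```python
import random

def bernoulli_sample(records, probability, seed=None):
    """Split records into (kept, rejected) with one independent draw per record.

    Hypothetical helper for illustration; not part of kgtk.
    """
    rng = random.Random(seed)
    kept, rejected = [], []
    for record in records:
        # rng.random() is in [0.0, 1.0), so probability 1.0 keeps every record
        # and probability 0.0 keeps none.
        if rng.random() < probability:
            kept.append(record)      # would go to the primary output file
        else:
            rejected.append(record)  # would go to the optional reject file
    return kept, rejected

records = [f"row{i}" for i in range(47)]
kept, rejected = bernoulli_sample(records, 0.1, seed=123)
# The two sizes always add up to 47, but len(kept) is only 4.7 on average.
print(len(kept), len(rejected))
```

Because each record is decided independently, two runs with different seeds can return samples of different sizes from the same input.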

Another simple way to use this command is to specify the number of records to be included in the sample. kgtk sample --sample-size n specifies a sample size of n (which must be positive). The output file will contain a sample of n records unless the input file has fewer than n records. Candidate records for the final sample are buffered in memory as the input file is processed, so a significant amount of time and memory may be needed if n is very large.
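One standard way to draw a fixed-size sample from an input of unknown length is reservoir sampling, which buffers n candidate records in memory. The sketch below shows that technique as an illustration of why memory use grows with n. Whether kgtk sample uses this exact algorithm is an assumption, and `reservoir_sample` is a hypothetical name.

```python
import random

def reservoir_sample(records, sample_size, seed=None):
    """Return up to sample_size records chosen uniformly from an iterable.

    Illustrative reservoir sampling (Algorithm R); not taken from kgtk.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rng = random.Random(seed)
    reservoir = []  # the in-memory buffer of candidate records
    for index, record in enumerate(records):
        if index < sample_size:
            # Fill the buffer with the first sample_size records.
            reservoir.append(record)
        else:
            # Replace a buffered candidate with probability sample_size / (index + 1).
            slot = rng.randint(0, index)  # randint is inclusive at both ends
            if slot < sample_size:
                reservoir[slot] = record
    return reservoir

sample = reservoir_sample((f"row{i}" for i in range(47)), 5, seed=123)
print(len(sample))  # 5, because the input has at least 5 records
```

If the input has fewer than `sample_size` records, the buffer never fills and every record is returned.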

Alternatively, --input-size N and --sample-size n may be provided together. The sampling probability will be computed as n/N. The number of output records may not exactly match the sample size unless --exact is also specified. --exact requires more time and memory to process, which might be significant on very large input files, but less time and memory than is required when --input-size N has not been specified.
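The difference between the approximate and exact modes can be sketched in Python as follows. This is a sketch under stated assumptions, not the kgtk implementation: the function names are hypothetical, and the exact mode is modelled by choosing the record positions to keep before reading the input.

```python
import random

def approximate_sample(records, input_size, sample_size, seed=None):
    """Keep each record with probability n/N; the result size can vary."""
    if input_size <= 0 or sample_size <= 0:
        raise ValueError("input_size and sample_size must be positive")
    rng = random.Random(seed)
    probability = sample_size / input_size
    return [r for r in records if rng.random() < probability]

def exact_sample(records, input_size, sample_size, seed=None):
    """Plan which positions to keep, then stream the records once.

    Returns exactly sample_size records when records has input_size items.
    rng.sample raises ValueError if sample_size exceeds input_size.
    """
    if input_size <= 0 or sample_size <= 0:
        raise ValueError("input_size and sample_size must be positive")
    rng = random.Random(seed)
    # The plan costs memory proportional to sample_size, not to input_size.
    chosen = set(rng.sample(range(input_size), sample_size))
    return [r for i, r in enumerate(records) if i in chosen]

rows = [f"row{i}" for i in range(47)]
print(len(approximate_sample(rows, 47, 5, seed=123)))  # near 5, not guaranteed
print(len(exact_sample(rows, 47, 5, seed=123)))        # 5
```

In this sketch neither mode buffers the sampled records themselves, which is consistent with the exact mode costing less than sampling without a known input size.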

The input size and the sample size, if specified, must each be positive.

This command defaults to --mode=NONE since it doesn't attach special meaning to particular columns.

Usage

usage: kgtk sample [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
                   [--reject-file REJECT_FILE] [--probability PROBABILITY]
                   [--seed SEED] [--input-size INPUT_SIZE]
                   [--sample-size SAMPLE_SIZE] [--exact [True|False]]
                   [-v [optional True|False]]

This utility randomly samples a KGTK file, dividing it into an output file and an optional reject file. The probability of an input record being passed to the output file is controlled by `--probability n`, where `n` ranges from 0 to 1.

This command defaults to --mode=NONE so it will work with TSV files that do not follow KGTK column naming conventions.

Additional options are shown in expert help.
kgtk --expert sample --help

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  --reject-file REJECT_FILE
                        The KGTK reject file for records that fail the filter.
                        (Optional, use '-' for stdout.)
  --probability PROBABILITY
                        The probability of passing an input record to the
                        output file (default=1).
  --seed SEED           The optional random number generator seed
                        (default=None).
  --input-size INPUT_SIZE
                        The optional number of input records (default=None).
  --sample-size SAMPLE_SIZE
                        The optional desired number of output records
                        (default=None).
  --exact [True|False]  Ensure that exactly the desired sample size is
                        extracted when --input-size and --sample-size are
                        supplied. (default=False).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

All of the examples in this section specify an integer seed to the random number generator in order to provide repeatable sampling. For production use, you may prefer to omit the seed.
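The effect of a fixed seed can be shown with Python's standard `random` module: the same integer seed reproduces the same sequence of draws. This is only an analogy for how a seeded generator behaves; the helper name `draws` is hypothetical, and the values kgtk draws for a given seed are not shown here.

```python
import random

def draws(seed, count=5):
    """Return count pseudo-random numbers from a generator seeded with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

# The same integer seed always reproduces the same sequence.
print(draws(123) == draws(123))  # True

# With seed=None the generator is seeded from an operating system source,
# so two runs are not expected to match.
unseeded_a, unseeded_b = draws(None), draws(None)
```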

Sampling .1 Probability with a Fixed Seed

Quickly sample one tenth of the input records.

kgtk sample -i examples/docs/sample-example1.tsv \
            --probability .1 --seed 123
node1 label node2 id
red property True
red isa rgbcolor
green maxoccurs 1
rgbcolor maxval 1.0
rgbcolor requires green
rgbcolor isa colorclass
colorname node1_type symbol
colorname node2_values green

Note

Omit --seed 123 to obtain a nonrepeatable sample.

Sampling an Approximate Number of Records Unbuffered

Given the number of input records, quickly sample a specified number of output records. The resulting sample might not be the exact size requested.

kgtk sample -i examples/docs/sample-example1.tsv \
            --input-size 47 --sample-size 5 \
            --seed 123
node1 label node2 id
red property True
red isa rgbcolor
green maxoccurs 1
rgbcolor maxval 1.0
rgbcolor requires green
rgbcolor isa colorclass
colorname node1_type symbol
colorname node2_values green

Note

Omit --seed 123 to obtain a nonrepeatable sample.

Sampling an Exact Number of Records Unbuffered

Given the number of input records, sample an exact number of output records. Additional time and memory are required to plan which records to include in the output sample.

kgtk sample -i examples/docs/sample-example1.tsv \
            --input-size 47 --sample-size 5 --exact \
            --seed 123
node1 label node2 id
green property True
green maxoccurs 1
blue property True
rgbcolor isa colorclass
colorname node2_values yellow

Note

Omit --seed 123 to obtain a nonrepeatable sample.

Sampling an Exact Number of Records Buffered

Sample a specified number of output records without knowing the number of input records. The sampled records are buffered in memory until the input file has been read; a large amount of memory may be needed when the sample size is large.

kgtk sample -i examples/docs/sample-example1.tsv \
            --sample-size 5 \
            --seed 123
node1 label node2 id
blue maxoccurs 1
roundshape datatype True
green isa rgbcolor
colorname isa colorclass
colorname node2_values blue

Note

The --exact option is ignored when --sample-size n is specified and --input-size N is not specified.

Note

Omit --seed 123 to obtain a nonrepeatable sample.