import-wikidata
This command imports a Wikidata dump in JSON format (compressed with gzip or bz2) into KGTK format, generating three files:
- A nodes file containing all Qnodes and Pnodes in Wikidata
- An edges file containing all the statements in Wikidata
- A qualifiers file containing all qualifiers on statements in Wikidata
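Once an import has finished, a quick look at the first rows of each output can confirm that the three files were written as expected. The sketch below assumes the outputs are uncompressed, tab-separated text files with a header row; the helper name `peek_kgtk_outputs` is hypothetical and not part of KGTK.

```shell
# Hypothetical helper (not a KGTK command): print the header row and the
# first three data rows of every file passed as an argument.
# Assumes uncompressed, tab-separated output files.
peek_kgtk_outputs() {
  for f in "$@"; do
    echo "== $f"
    head -n 4 "$f"
  done
}

# Example call, using the file names given to --node, --edge and --qual:
# peek_kgtk_outputs nodefile.tsv edgefile.tsv qualfile.tsv
```

Compressed outputs would need to be decompressed first, for example with `zcat` for gzip files.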
Usage
usage: kgtk import-wikidata [-h] [-i INPUT_FILE] [--procs PROCS]
[--max-size-per-mapper-queue MAX_SIZE_PER_MAPPER_QUEUE]
[--mapper-batch-size MAPPER_BATCH_SIZE]
[--single-mapper-queue [True/False]]
[--collector-batch-size COLLECTOR_BATCH_SIZE]
[--use-shm [True/False]]
[--collector-queue-per-proc-size COLLECTOR_QUEUE_PER_PROC_SIZE]
[--node NODE_FILE] [--edge MINIMAL_EDGE_FILE]
[--qual MINIMAL_QUAL_FILE]
[--split-alias-file SPLIT_ALIAS_FILE]
[--split-en-alias-file SPLIT_EN_ALIAS_FILE]
[--split-datatype-file SPLIT_DATATYPE_FILE]
[--split-description-file SPLIT_DESCRIPTION_FILE]
[--split-en-description-file SPLIT_EN_DESCRIPTION_FILE]
[--split-label-file SPLIT_LABEL_FILE]
[--split-en-label-file SPLIT_EN_LABEL_FILE]
[--split-reference-file SPLIT_REFERENCE_FILE]
[--split-sitelink-file SPLIT_SITELINK_FILE]
[--split-en-sitelink-file SPLIT_EN_SITELINK_FILE]
[--split-type-file SPLIT_TYPE_FILE]
[--split-property-edge-file SPLIT_PROPERTY_EDGE_FILE]
[--split-property-qual-file SPLIT_PROPERTY_QUAL_FILE]
[--limit LIMIT] [--nth NTH] [--lang LANG]
[--source SOURCE] [--deprecated]
[--use-python-cat [True/False]]
[--interleave [True/False]]
[--parse-aliases [True/False]]
[--parse-descriptions [True/False]]
[--parse-labels [True/False]]
[--parse-sitelinks [True/False]]
[--parse-claims [True/False]]
[--parse-references [True/False]]
[--fail-if-missing [True/False]]
[--all-languages [True/False]]
[--warn-if-missing [True/False]]
[--progress-interval PROGRESS_INTERVAL]
[--use-mgzip-for-input [True/False]]
[--use-mgzip-for-output [True/False]]
[--mgzip-threads-for-input MGZIP_THREADS_FOR_INPUT]
[--mgzip-threads-for-output MGZIP_THREADS_FOR_OUTPUT]
[--value-hash-width VALUE_HASH_WIDTH]
[--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
[--clean [True/False]]
[--clean-verbose [True/False]]
[--invalid-edge-file INVALID_EDGE_FILE]
[--invalid-qual-file INVALID_QUAL_FILE]
[--skip-validation [True/False]]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
input path file (may be .bz2) (May be omitted or '-'
for stdin.)
--procs PROCS number of processes to run in parallel, default 2
--max-size-per-mapper-queue MAX_SIZE_PER_MAPPER_QUEUE
max depth of server queues, default 4
--mapper-batch-size MAPPER_BATCH_SIZE
How many statements to queue in a batch to a worker.
(default=5)
--single-mapper-queue [True/False]
If true, use a single queue for worker tasks. If
false, each worker has its own task queue.
(default=False).
--collector-batch-size COLLECTOR_BATCH_SIZE
How many statements to queue in a batch to the
collector. (default=5)
--use-shm [True/False]
If true, use ShmQueue. (default=False).
--collector-queue-per-proc-size COLLECTOR_QUEUE_PER_PROC_SIZE
collector queue depth per proc, default 2
--node NODE_FILE, --node-file NODE_FILE
path to output node file
--edge MINIMAL_EDGE_FILE, --edge-file MINIMAL_EDGE_FILE, --minimal-edge-file MINIMAL_EDGE_FILE
path to output edge file with minimal data
--qual MINIMAL_QUAL_FILE, --qual-file MINIMAL_QUAL_FILE, --minimal-qual-file MINIMAL_QUAL_FILE
path to output qual file with minimal data
--split-alias-file SPLIT_ALIAS_FILE
path to output split alias file
--split-en-alias-file SPLIT_EN_ALIAS_FILE
path to output split English alias file
--split-datatype-file SPLIT_DATATYPE_FILE
path to output split datatype file
--split-description-file SPLIT_DESCRIPTION_FILE
path to output split description file
--split-en-description-file SPLIT_EN_DESCRIPTION_FILE
path to output split English description file
--split-label-file SPLIT_LABEL_FILE
path to output split label file
--split-en-label-file SPLIT_EN_LABEL_FILE
path to output split English label file
--split-reference-file SPLIT_REFERENCE_FILE
path to output split reference file
--split-sitelink-file SPLIT_SITELINK_FILE
path to output split sitelink file
--split-en-sitelink-file SPLIT_EN_SITELINK_FILE
path to output split English sitelink file
--split-type-file SPLIT_TYPE_FILE, --split-entity-type-file SPLIT_TYPE_FILE
path to output split entity type file
--split-property-edge-file SPLIT_PROPERTY_EDGE_FILE
path to output split property edge file
--split-property-qual-file SPLIT_PROPERTY_QUAL_FILE
path to output split property qualifier file
--limit LIMIT number of lines of input file to run on, default runs
on all
--nth NTH Process every nth line, default processes all lines
--lang LANG languages to extract, comma separated, default en
--source SOURCE wikidata version number, default: wikidata
--deprecated option to include deprecated statements, not included
by default
--use-python-cat [True/False]
If true, use portable code to combine file fragments.
(default=False).
--interleave [True/False]
If true, output the edges and qualifiers in a single
file (the edge file). (default=False).
--parse-aliases [True/False]
If true, parse aliases. (default=True).
--parse-descriptions [True/False]
If true, parse descriptions. (default=True).
--parse-labels [True/False]
If true, parse labels. (default=True).
--parse-sitelinks [True/False]
If true, parse sitelinks. (default=True).
--parse-claims [True/False]
If true, parse claims. (default=True).
--parse-references [True/False]
If true, parse references in claims. (default=True).
--fail-if-missing [True/False]
If true, fail if expected data is missing.
(default=True).
--all-languages [True/False]
If true, override --lang and import aliases,
descriptions, and labels in all languages.
(default=False).
--warn-if-missing [True/False]
If true, print a warning message if expected data is
missing. (default=True).
--progress-interval PROGRESS_INTERVAL
How often to report progress. (default=500000)
--use-mgzip-for-input [True/False]
If true, use the multithreaded gzip package, mgzip,
for input. (default=False).
--use-mgzip-for-output [True/False]
If true, use the multithreaded gzip package, mgzip,
for output. (default=False).
--mgzip-threads-for-input MGZIP_THREADS_FOR_INPUT
The number of threads per mgzip input stream.
(default=3).
--mgzip-threads-for-output MGZIP_THREADS_FOR_OUTPUT
The number of threads per mgzip output stream.
(default=3).
--value-hash-width VALUE_HASH_WIDTH
How many characters should be used in a value hash?
(default=8)
--claim-id-hash-width CLAIM_ID_HASH_WIDTH
How many characters should be used to hash the claim
ID? 0 means do not hash the claim ID. (default=0)
--clean [True/False] If true, clean the input values before writing them.
(default=False).
--clean-verbose [True/False]
If true, give verbose feedback when cleaning input
values. (default=False).
--invalid-edge-file INVALID_EDGE_FILE
path to output edges with invalid input values
--invalid-qual-file INVALID_QUAL_FILE
path to output qual edges with invalid input values
--skip-validation [True/False]
If true, skip output record validation.
(default=False).
Examples
Import the entire Wikidata dump into KGTK format, extracting English labels, descriptions, and aliases.
kgtk import-wikidata -i wikidata-all-20200504.json.bz2 --node nodefile.tsv --edge edgefile.tsv --qual qualfile.tsv
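Before running a full import, it can be useful to try the command on a small slice of the dump. The sketch below uses the documented `--limit` option to stop after a fixed number of input lines; the function name, the limit of 10000, and the `sample-*.tsv` output names are illustrative assumptions.

```shell
# Sketch: import only the first 10000 input lines into throwaway files.
# The function name and the sample-*.tsv file names are hypothetical;
# -i, --limit, --node, --edge and --qual are the options listed under Usage.
import_wikidata_sample() {
  kgtk import-wikidata \
    -i "$1" \
    --limit 10000 \
    --node sample-nodes.tsv \
    --edge sample-edges.tsv \
    --qual sample-quals.tsv
}

# Example call:
# import_wikidata_sample wikidata-all-20200504.json.bz2
```

The documented `--nth` option can be used in a similar way to process only every nth line of the input.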
The following command adds options for running the import in parallel, again extracting English only:
kgtk --debug --timing --progress import-wikidata \
-i wikidata-all-20200504.json.gz \
--node nodefile.tsv \
--edge edgefile.tsv \
--qual qualfile.tsv \
--use-mgzip-for-input True \
--use-mgzip-for-output True \
--use-shm True \
--procs 6 \
--mapper-batch-size 5 \
--max-size-per-mapper-queue 3 \
--single-mapper-queue True \
--collector-batch-size 10 \
--collector-queue-per-proc-size 3 \
--progress-interval 500000 --fail-if-missing False
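The examples above rely on the default of `--lang`, which is `en`. To extract labels, descriptions, and aliases in other languages, `--lang` accepts a comma-separated list. The sketch below wraps this in a small function; the function name and the choice of `en,es` in the example call are illustrative assumptions, while the options themselves are the ones listed under Usage.

```shell
# Sketch: import with an explicit, comma-separated language list.
# The function name is hypothetical; $1 is the dump file, $2 the languages.
import_wikidata_langs() {
  kgtk import-wikidata \
    -i "$1" \
    --lang "$2" \
    --node nodefile.tsv \
    --edge edgefile.tsv \
    --qual qualfile.tsv
}

# Example call (English and Spanish):
# import_wikidata_langs wikidata-all-20200504.json.gz en,es
```

According to the option descriptions, `--all-languages True` overrides `--lang` and imports aliases, descriptions, and labels in every language.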