import-wikidata
This command will import a Wikidata dump in json format (compressed in gzip or bz2) into KGTK format, generating 3 files:
- A nodes file containing all Qnodes and Pnodes in Wikidata
- An edges file containing all the statements in Wikidata
- A qualifiers file containing all qualifiers on statements in Wikidata
Usage¶
usage: kgtk import-wikidata [-h] [-i INPUT_FILE] [--procs PROCS]
[--max-size-per-mapper-queue MAX_SIZE_PER_MAPPER_QUEUE]
[--mapper-batch-size MAPPER_BATCH_SIZE]
[--single-mapper-queue [True/False]]
[--collect-results [True/False]]
[--collect-seperately [True/False]]
[--collector-batch-size COLLECTOR_BATCH_SIZE]
[--use-shm [True/False]]
[--collector-queue-per-proc-size COLLECTOR_QUEUE_PER_PROC_SIZE]
[--node NODE_FILE] [--edge DETAILED_EDGE_FILE]
[--minimal-edge-file MINIMAL_EDGE_FILE]
[--qual DETAILED_QUAL_FILE]
[--minimal-qual-file MINIMAL_QUAL_FILE]
[--node-file-id-only [True/False]]
[--split-alias-file SPLIT_ALIAS_FILE]
[--split-en-alias-file SPLIT_EN_ALIAS_FILE]
[--split-datatype-file SPLIT_DATATYPE_FILE]
[--split-description-file SPLIT_DESCRIPTION_FILE]
[--split-en-description-file SPLIT_EN_DESCRIPTION_FILE]
[--split-label-file SPLIT_LABEL_FILE]
[--split-en-label-file SPLIT_EN_LABEL_FILE]
[--split-sitelink-file SPLIT_SITELINK_FILE]
[--split-en-sitelink-file SPLIT_EN_SITELINK_FILE]
[--split-type-file SPLIT_TYPE_FILE]
[--split-property-edge-file SPLIT_PROPERTY_EDGE_FILE]
[--split-property-qual-file SPLIT_PROPERTY_QUAL_FILE]
[--limit LIMIT] [--lang LANG] [--source SOURCE]
[--deprecated] [--explode-values [True/False]]
[--use-python-cat [True/False]]
[--keep-temp-files [True/False]]
[--skip-processing [True/False]]
[--skip-merging [True/False]]
[--interleave [True/False]]
[--entry-type-edges [True/False]]
[--alias-edges [True/False]]
[--datatype-edges [True/False]]
[--description-edges [True/False]]
[--label-edges [True/False]]
[--sitelink-edges [True/False]]
[--sitelink-verbose-edges [True/False]]
[--sitelink-verbose-qualifiers [True/False]]
[--parse-aliases [True/False]]
[--parse-descriptions [True/False]]
[--parse-labels [True/False]]
[--parse-sitelinks [True/False]]
[--parse-claims [True/False]]
[--fail-if-missing [True/False]]
[--all-languages [True/False]]
[--warn-if-missing [True/False]]
[--progress-interval PROGRESS_INTERVAL]
[--use-kgtkwriter [True/False]]
[--use-mgzip-for-input [True/False]]
[--use-mgzip-for-output [True/False]]
[--mgzip-threads-for-input MGZIP_THREADS_FOR_INPUT]
[--mgzip-threads-for-output MGZIP_THREADS_FOR_OUTPUT]
[--value-hash-width VALUE_HASH_WIDTH]
[--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
[--skip-validation [True/False]]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
input path file (may be .bz2) (May be omitted or
'-' for stdin.)
--procs PROCS number of processes to run in parallel, default 2
--max-size-per-mapper-queue MAX_SIZE_PER_MAPPER_QUEUE
max depth of server queues, default 4
--mapper-batch-size MAPPER_BATCH_SIZE
How many statements to queue in a batch to a
worker. (default=5)
--single-mapper-queue [True/False]
If true, use a single queue for worker tasks. If
false, each worker has its own task queue.
(default=False).
--collect-results [True/False]
If true, collect the results before writing to
disk. If false, write results to disk, then
concatenate. (default=False).
--collect-seperately [True/False]
If true, collect the node, edge, and qualifier
results using seperate processes. If false, collect
the results with a single process. (default=False).
--collector-batch-size COLLECTOR_BATCH_SIZE
How many statements to queue in a batch to the
collector. (default=5)
--use-shm [True/False]
If true, use ShmQueue. (default=False).
--collector-queue-per-proc-size COLLECTOR_QUEUE_PER_PROC_SIZE
collector queue depth per proc, default 2
--node NODE_FILE, --node-file NODE_FILE
path to output node file
--edge DETAILED_EDGE_FILE, --edge-file DETAILED_EDGE_FILE, --detailed-edge-file DETAILED_EDGE_FILE
path to output edge file with detailed data
--minimal-edge-file MINIMAL_EDGE_FILE
path to output edge file with minimal data
--qual DETAILED_QUAL_FILE, --qual-file DETAILED_QUAL_FILE, --detailed-qual-file DETAILED_QUAL_FILE
path to output qualifier file with full data
--minimal-qual-file MINIMAL_QUAL_FILE
path to output qualifier file with minimal data
--node-file-id-only [True/False]
Option to write only the node ID in the node file.
(default=False)
--split-alias-file SPLIT_ALIAS_FILE
path to output split alias file
--split-en-alias-file SPLIT_EN_ALIAS_FILE
path to output split English alias file
--split-datatype-file SPLIT_DATATYPE_FILE
path to output split datatype file
--split-description-file SPLIT_DESCRIPTION_FILE
path to output splitdescription file
--split-en-description-file SPLIT_EN_DESCRIPTION_FILE
path to output split English description file
--split-label-file SPLIT_LABEL_FILE
path to output split label file
--split-en-label-file SPLIT_EN_LABEL_FILE
path to output split English label file
--split-sitelink-file SPLIT_SITELINK_FILE
path to output split sitelink file
--split-en-sitelink-file SPLIT_EN_SITELINK_FILE
path to output split English sitelink file
--split-type-file SPLIT_TYPE_FILE, --split-entity-type-file SPLIT_TYPE_FILE
path to output split entry type file
--split-property-edge-file SPLIT_PROPERTY_EDGE_FILE
path to output split property edge file
--split-property-qual-file SPLIT_PROPERTY_QUAL_FILE
path to output split property qualifier file
--limit LIMIT number of lines of input file to run on, default
runs on all
--lang LANG languages to extract, comma separated, default en
--source SOURCE wikidata version number, default: wikidata
--deprecated option to include deprecated statements, not
included by default
--explode-values [True/False]
If true, create columns with exploded value
information. (default=True).
--use-python-cat [True/False]
If true, use portable code to combine file
fragments. (default=False).
--keep-temp-files [True/False]
If true, keep temporary files (for debugging).
(default=False).
--skip-processing [True/False]
If true, skip processing the input file (for
debugging). (default=False).
--skip-merging [True/False]
If true, skip merging temporary files (for
debugging). (default=False).
--interleave [True/False]
If true, output the edges and qualifiers in a
single file (the edge file). (default=False).
--entry-type-edges [True/False]
If true, create edge records for the entry type
field. (default=False).
--alias-edges [True/False]
If true, create edge records for aliases.
(default=False).
--datatype-edges [True/False]
If true, create edge records for property
datatypes. (default=False).
--description-edges [True/False]
If true, create edge records for descriptions.
(default=False).
--label-edges [True/False]
If true, create edge records for labels.
(default=False).
--sitelink-edges [True/False]
If true, create edge records for sitelinks.
(default=False).
--sitelink-verbose-edges [True/False]
If true, create edge records for sitelink details
(lang, site, badges). (default=False).
--sitelink-verbose-qualifiers [True/False]
If true, create qualifier records for sitelink
details (lang, site, badges). (default=False).
--parse-aliases [True/False]
If true, parse aliases. (default=True).
--parse-descriptions [True/False]
If true, parse descriptions. (default=True).
--parse-labels [True/False]
If true, parse labels. (default=True).
--parse-sitelinks [True/False]
If true, parse sitelinks. (default=True).
--parse-claims [True/False]
If true, parse claims. (default=True).
--fail-if-missing [True/False]
If true, fail if expected data is missing.
(default=True).
--all-languages [True/False]
If true, override --lang and import aliases,
dscriptions, and labels in all languages.
(default=False).
--warn-if-missing [True/False]
If true, print a warning message if expected data
is missing. (default=True).
--progress-interval PROGRESS_INTERVAL
How often to report progress. (default=500000)
--use-kgtkwriter [True/False]
If true, use KgtkWriter instead of csv.writer.
(default=True).
--use-mgzip-for-input [True/False]
If true, use the multithreaded gzip package, mgzip,
for input. (default=False).
--use-mgzip-for-output [True/False]
If true, use the multithreaded gzip package, mgzip,
for output. (default=False).
--mgzip-threads-for-input MGZIP_THREADS_FOR_INPUT
The number of threads per mgzip input streama.
(default=3).
--mgzip-threads-for-output MGZIP_THREADS_FOR_OUTPUT
The number of threads per mgzip output streama.
(default=3).
--value-hash-width VALUE_HASH_WIDTH
How many characters should be used in a value hash?
(default=8)
--claim-id-hash-width CLAIM_ID_HASH_WIDTH
How many characters should be used to hash the
claim ID? 0 means do not hash the claim ID.
(default=0)
--skip-validation [True/False]
If true, skip output record validation.
(default=False).
Examples¶
Import the entire wikidata dump into kgtk format, extracting english labels, descriptions and aliases.
kgtk import-wikidata -i wikidata-all-20200504.json.bz2 --node nodefile.tsv --edge edgefile.tsv --qual qualfile.tsv
The following command includes optimizations for running the import process in parallel for English:
kgtk --debug --timing --progress import-wikidata \
-i wikidata-all-20200504.json.gz \
--node nodefile.tsv \
--edge edgefile.tsv \
--qual qualfile.tsv \
--use-mgzip-for-input True \
--use-mgzip-for-output True \
--use-shm True \
--procs 6 \
--mapper-batch-size 5 \
--max-size-per-mapper-queue 3 \
--single-mapper-queue True \
--collect-results True \
--collect-seperately True\
--collector-batch-size 10 \
--collector-queue-per-proc-size 3 \
--progress-interval 500000 --fail-if-missing False