lexicalize
Overview¶
kgtk lexicalize
builds English sentences from KGTK edge files.
The primary purpose of this command is to construct inputs for text-based distance vector analysis. However, it may also prove useful for explaining the contents of local subsets of Knowledge Graphs.
Input Files¶
kgtk lexicalize
has a primary input file which contains:
- label properties (entity labels)
- description properties
- isa properties
- has properties
- property values
There may also be one or more entity input files that contain additional
entity labels. These files are specified with the --entity-label-file
option.
Optimization for Presorted Input¶
Normally, the primary input file is loaded into memory before any output is produced. This can lead to performance problems when the amount of memory required exceeds the amount that is available.
If the primary input file is presorted on node1
, a presorted processing mode may be used
to minimize memory consumption. This mode ignores the --add-entity-labels-from-input
option,
so entity labels must be supplied in separate entity label files when using presorted input.
--presorted
(default FALSE) is used to indicate that the presorted input
processing mode should be used. The input file will be checked as it is read
to ensure that it is properly sorted. If it is not, an error will occur and
processing will stop.
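The idea behind presorted processing can be sketched in Python as follows. This is a minimal illustration, not the code used by kgtk lexicalize: the function name groups_from_presorted is hypothetical, and plain string comparison of node1 values is an assumed sort order. Only the edges for one node1 value are held in memory at a time, and an out-of-order value raises an error.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Edge = Tuple[str, str, str]  # (node1, label, node2)


def groups_from_presorted(edges: Iterable[Edge]) -> Iterator[Tuple[str, List[Edge]]]:
    """Yield (node1, edges) groups one at a time, checking the sort order.

    Hypothetical helper: assumes node1 values are ordered by plain string
    comparison, which may differ from the ordering KGTK itself checks.
    """
    previous = None
    # groupby only merges adjacent rows, so memory use is bounded by one group.
    for node1, group in groupby(edges, key=lambda edge: edge[0]):
        if previous is not None and node1 < previous:
            raise ValueError(
                f"input is not sorted on node1: {node1!r} follows {previous!r}"
            )
        previous = node1
        yield node1, list(group)


if __name__ == "__main__":
    sample = [
        ("Q1", "P31", "Q5"),
        ("Q1", "P21", "Q6581097"),
        ("Q2", "P31", "Q5"),
    ]
    for node1, group in groups_from_presorted(sample):
        print(node1, len(group))
```

Because each group is released before the next one is read, the memory needed depends on the largest single entity rather than on the size of the whole file.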
Output File¶
The output file is a KGTK file containing sentence
properties
constructed during lexicalization. An optional explanation
column gives a brief summary of how the sentence was constructed.
node1 | label | node2 | explaination |
---|---|---|---|
Q75952970 | sentence | "It is a census in Austria-Hungary." | "isa(\'census\'->\'a census\')+property_values(\'country Austria-Hungary\'->[\'in Austria-Hungary\'])" |
Q75952971 | sentence | "Philippe Greenway, born 1991, is a human and male." | "label(\'Philippe Greenway\')+description(\'born 1991\')+isa(\'human\',\'male\'->\'a human and male\')" |
Q75952972 | sentence | "Sir Patrick Hastings, Peerage person ID=426177, is a human and male." | "label(\'Sir Patrick Hastings\')+description(\'Peerage person ID=426177\')+isa(\'human\',\'male\'->\'a human and male\')" |
Q75952973 | sentence | "Philip Maitland Gore Anley, Peerage person ID=426178, is a human and male." | "label(\'Philip Maitland Gore Anley\')+description(\'Peerage person ID=426178\')+isa(\'human\',\'male\'->\'a human and male\')" |
Q75952974 | sentence | "It is a star." | "isa(\'star\'->\'a star\')" |
Q75952975 | sentence | "Sarah Louise Anley, died 2010, is a female and human." | "label(\'Sarah Louise Anley\')+description(\'died 2010\')+isa(\'female\',\'human\'->\'a female and human\')" |
The --sentence-label
option provides the relationship name
in the label
column of the output file. The default value is "sentence".
The --explain
option controls whether or not an explanation column
is included in the output file. The default is not to include explanations
in the output file.
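For downstream text analysis, the sentences can be pulled out of the output file with a few lines of Python. The sketch below assumes the output is a tab-separated KGTK file with node1, label, and node2 columns, as shown above. The function name read_sentences is hypothetical, and the sketch only strips the surrounding double quotes; it does not undo backslash escapes inside the string.

```python
import csv
import io
from typing import Dict, TextIO


def read_sentences(stream: TextIO, sentence_label: str = "sentence") -> Dict[str, str]:
    """Map each node1 value to its sentence text (hypothetical helper)."""
    # KGTK files are tab-separated; quoting is disabled so that the double
    # quotes around KGTK strings are kept as ordinary characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences: Dict[str, str] = {}
    for row in reader:
        if row["label"] != sentence_label:
            continue
        text = row["node2"]
        # Strip the enclosing double quotes of a KGTK string, if present.
        if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
            text = text[1:-1]
        sentences[row["node1"]] = text
    return sentences


if __name__ == "__main__":
    sample = io.StringIO(
        'node1\tlabel\tnode2\n'
        'Q75952974\tsentence\t"It is a star."\n'
    )
    print(read_sentences(sample))
```

If a different --sentence-label value was used when running the command, pass the same value as sentence_label.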
Entity Label Loading¶
When entity label files are provided, they are read before the primary
input file is processed. Edges where the value in the label
column
matches one of the properties in the --label-properties
list (default ['label'])
are loaded into memory in the entity label dictionary.
When --add-entity-labels-from-input
is TRUE, and the input file is
not presorted, any edges in the primary input file where the value in
the label
column matches one of the properties in the --label-properties
list will also be added to the entity label dictionary.
id | node1 | label | node2 |
---|---|---|---|
Q11247242-label-en | Q11247242 | label | 'Steady & Co.'@en |
Q11247279-label-en | Q11247279 | label | 'Ford'@en |
Q11247470-label-en | Q11247470 | label | 'commanding officer'@en |
Q1124841-label-en | Q1124841 | label | 'PFC Lokomotiv Plovdiv'@en |
Q1124849-label-en | Q1124849 | label | 'Verve Records'@en |
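The selection step described above amounts to a filter on the label column. The following Python sketch illustrates it under simple assumptions: edges are already parsed into (node1, label, node2) tuples, and the function name select_label_edges is hypothetical rather than part of KGTK.

```python
from typing import Iterable, List, Sequence, Tuple

Edge = Tuple[str, str, str]  # (node1, label, node2)


def select_label_edges(edges: Iterable[Edge],
                       label_properties: Sequence[str] = ("label",)) -> List[Edge]:
    """Keep only the edges whose label column names a label property."""
    wanted = set(label_properties)
    return [edge for edge in edges if edge[1] in wanted]


if __name__ == "__main__":
    edges = [
        ("Q75952971", "P31", "Q5"),
        ("Q75952971", "label", "'Philippe Greenway'@en"),
        ("Q5", "label", "'human'@en"),
    ]
    # Prints the two label edges and skips the P31 edge.
    for edge in select_label_edges(edges):
        print(edge)
```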
Entity Label Dictionary and Priority¶
Entity label edges are used to build a dictionary that maps the node1
value of
each entity label edge to the node2
value of the same edge.
When there are multiple entity label edges for a given node1
value,
only one node2
value is retained. The following priority is used:
- If there are any
node2
values which are language-qualified strings for the English language (language code en
and no language suffix), the last-seen such value is retained.
- Otherwise, the first
node2
value seen is retained.
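The priority rule can be expressed compactly in Python. This is a sketch of the rule as stated above, not KGTK's implementation: the names LQ_STRING, is_plain_english, and build_label_dict are hypothetical, the regular expression is a simplified pattern for KGTK language-qualified strings, and the German and French labels in the usage example are made up for illustration.

```python
import re
from typing import Dict, Iterable, Tuple

# Simplified pattern for a language-qualified string such as 'human'@en
# or 'colour'@en-gb (the optional "-gb" part is the language suffix).
LQ_STRING = re.compile(
    r"^'(?P<text>.*)'@(?P<lang>[a-zA-Z]+)(?P<suffix>-[a-zA-Z0-9-]+)?$",
    re.DOTALL,
)


def is_plain_english(value: str) -> bool:
    """True for language code en with no language suffix."""
    match = LQ_STRING.match(value)
    return (match is not None
            and match.group("lang") == "en"
            and match.group("suffix") is None)


def build_label_dict(label_edges: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build node1 -> node2, preferring the last-seen plain English label."""
    labels: Dict[str, str] = {}
    for node1, node2 in label_edges:
        if is_plain_english(node2):
            labels[node1] = node2  # a later English value replaces any earlier one
        elif node1 not in labels:
            labels[node1] = node2  # first value seen is kept as the fallback
    return labels


if __name__ == "__main__":
    edges = [
        ("Q5", "'Mensch'@de"),
        ("Q5", "'human'@en"),
        ("Q42", "'Erde'@de"),
        ("Q42", "'Terre'@fr"),
    ]
    print(build_label_dict(edges))
```

A non-English value can never displace an English one in this sketch, because the fallback branch only fires when no value has been stored yet for that node1.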
Property List Defaults¶
Sentences are built by assembling labels, descriptions, and other properties under a hard-coded template. Each of the property list options supplies property values into a particular slot in the sentence generation template.
Property Option | Default List |
---|---|
--description-properties | description |
--has-properties | |
--isa-properties | P21 P31 P39 P106 P279 |
--label-properties | label |
--property-values | P17 |
Wikidata Property | Label |
---|---|
P17 | country |
P21 | sex or gender |
P31 | instance of |
P39 | position held |
P106 | occupation |
P279 | subclass of |
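To make the slot structure concrete, here is a rough Python approximation of the template, inferred only from the sample sentences on this page. It is not the code used by kgtk lexicalize. The function name build_sentence and its parameters are hypothetical, the has values are joined in the order given, and the only property value handled is a country (P17). It does not attempt to choose between "a" and "an".

```python
from typing import Optional, Sequence


def build_sentence(label: Optional[str] = None,
                   description: Optional[str] = None,
                   isa: Sequence[str] = (),
                   has: Sequence[str] = (),
                   country: Optional[str] = None) -> str:
    """Assemble a sentence from already-resolved labels (illustrative only)."""
    # Label slot: fall back to "It" when the entity has no label.
    sentence = label if label else "It"
    # Description slot: set off by commas after the label.
    if label and description:
        sentence += f", {description},"
    # isa slot: values are sorted so the input order does not matter.
    if isa:
        sentence += " is a " + " and ".join(sorted(isa))
    # has slot: joined in the order supplied (an assumption of this sketch).
    if has:
        sentence += ", and has " + " and ".join(has)
    # Property value slot: only the P17 (country) phrasing is modelled.
    if country:
        sentence += f" in {country}"
    return sentence + "."


if __name__ == "__main__":
    print(build_sentence("Philippe Greenway", "born 1991",
                         isa=["male", "human"], country="Austria-Hungary"))
```

The sketch reproduces the sample sentences shown on this page for the inputs it models, but sentences built from other property lists may be phrased differently by the real command.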
Usage¶
usage: kgtk lexicalize [-h] [-i INPUT_FILE]
[--entity-label-file ENTITY_LABEL_FILE]
[-o OUTPUT_FILE]
[--label-properties [LABEL_PROPERTIES ...]]
[--description-properties [DESCRIPTION_PROPERTIES ...]]
[--language [LANGUAGE ...]]
[--isa-properties [ISA_PROPERTIES ...]]
[--has-properties [HAS_PROPERTIES ...]]
[--property-values [PROPERTY_VALUES ...]]
[--sentence-label SENTENCE_LABEL]
[--explain [True|False]] [--presorted [True|False]]
[--add-entity-labels-from-input [True|False]]
[-v [optional True|False]]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
--entity-label-file ENTITY_LABEL_FILE
The entity label file(s) (Optional, use '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--label-properties [LABEL_PROPERTIES ...]
The label properties. (default=['label'])
--description-properties [DESCRIPTION_PROPERTIES ...]
The description properties. (default=['description'])
--language [LANGUAGE ...]
The label and description language. (default='en')
--isa-properties [ISA_PROPERTIES ...]
The isa properties. (default=['P21', 'P31', 'P39',
'P106', 'P279'])
--has-properties [HAS_PROPERTIES ...]
The has properties. (default=[])
--property-values [PROPERTY_VALUES ...]
The property values. (default=['P17'])
--sentence-label SENTENCE_LABEL
The relationship to write in the output file.
(default=sentence)
--explain [True|False]
When true, include an explanation column that tells
how the sentence was constructed. (default=False).
--presorted [True|False]
When true, the input file is presorted on node1.
(default=False).
--add-entity-labels-from-input [True|False]
When true, extract entity labels from the unsorted
input file. (default=False).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
Examples¶
One isa
Property, Separate Labels¶
The following input file has a single entity with a single isa
relationship
(P31
, instance of
).
kgtk cat -i examples/docs/lexicalize-one-isa-input.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
The following label file has the labels needed by the input file:
kgtk cat -i examples/docs/lexicalize-one-isa-labels.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Q5-label-en | Q5 | label | 'human'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-one-isa-input.tsv \
--entity-label-file examples/docs/lexicalize-one-isa-labels.tsv
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway is a human." |
One isa
Property, a Single File¶
The following input file has a single entity with a single isa
relationship.
The matching labels are in the same file.
kgtk cat -i examples/docs/lexicalize-one-isa-combined.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Q5-label-en | Q5 | label | 'human'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-one-isa-combined.tsv \
--add-entity-labels-from-input
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway is a human." |
Two isa
Properties¶
The following input file has a single entity with two isa
relationships
(P31
, instance of
, and P21
, sex or gender
).
kgtk cat -i examples/docs/lexicalize-two-isas-input.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Q75952971-P21-Q6581097-018e8019-0 | Q75952971 | P21 | Q6581097 |
Here are the matching labels:
kgtk cat -i examples/docs/lexicalize-two-isas-labels.tsv
id | node1 | label | node2 |
---|---|---|---|
Q5-label-en | Q5 | label | 'human'@en |
Q6581097-label-en | Q6581097 | label | 'male'@en |
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-two-isas-input.tsv \
--entity-label-file examples/docs/lexicalize-two-isas-labels.tsv
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway is a human and male." |
Two isa
Properties Reordered¶
The following input file has a single entity with two isa
relationships (P31
, instance of
, and P21
, sex or gender
).
The order of the isa
relationships in the input file is different from the order in the example above.
kgtk cat -i examples/docs/lexicalize-two-isas-reordered.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P21-Q6581097-018e8019-0 | Q75952971 | P21 | Q6581097 |
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-two-isas-reordered.tsv \
--entity-label-file examples/docs/lexicalize-two-isas-labels.tsv
The output sentence is the same, because the properties are collected and sorted internally during processing.
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway is a human and male." |
Two isa
Properties, Presorted Input¶
When an input file is presorted on the node1
column, and the labels are
read from an external file, kgtk lexicalize
can use an optimized
implementation that reduces the amount of memory it requires to process
large files.
The following input file has a single entity with two isa
relationships (P31
, instance of
, and P21
, sex or gender
).
There is only one node1
value, so we can use this as an example of
presorted input in a degenerate case.
kgtk cat -i examples/docs/lexicalize-two-isas-input.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Q75952971-P21-Q6581097-018e8019-0 | Q75952971 | P21 | Q6581097 |
Here are the matching labels:
kgtk cat -i examples/docs/lexicalize-two-isas-labels.tsv
id | node1 | label | node2 |
---|---|---|---|
Q5-label-en | Q5 | label | 'human'@en |
Q6581097-label-en | Q6581097 | label | 'male'@en |
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-two-isas-input.tsv \
--presorted \
--entity-label-file examples/docs/lexicalize-two-isas-labels.tsv
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway is a human and male." |
Two isa
Properties and Description¶
The following input file has a single entity with two isa
relationships (P31
, instance of
, and P21
, sex or gender
)
and a description
property. The input file also contains the matching labels.
kgtk cat -i examples/docs/lexicalize-two-isas-and-description.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Q75952971-P21-Q6581097-018e8019-0 | Q75952971 | P21 | Q6581097 |
Q5-label-en | Q5 | label | 'human'@en |
Q6581097-label-en | Q6581097 | label | 'male'@en |
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Q75952971-description-en | Q75952971 | description | 'born 1991'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-two-isas-and-description.tsv \
--add-entity-labels-from-input
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway, born 1991, is a human and male." |
Two isa
Properties, a Description, and a Property Value P17¶
The following input file has a single entity with two isa
relationships (P31
, instance of
, and P21
, sex or gender
),
a description
property, and a property value (P17
, country
). The input file also contains the matching labels.
kgtk cat -i examples/docs/lexicalize-two-isas-and-property-value-P17.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Q75952971-P21-Q6581097-018e8019-0 | Q75952971 | P21 | Q6581097 |
Q5-label-en | Q5 | label | 'human'@en |
Q6581097-label-en | Q6581097 | label | 'male'@en |
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Q75952971-description-en | Q75952971 | description | 'born 1991'@en |
Q75952971-P17-Q28513-33ddd57d-0 | Q75952971 | P17 | Q28513 |
Q28513-label-en | Q28513 | label | 'Austria-Hungary'@en |
P17-label-en | P17 | label | 'country'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-two-isas-and-property-value-P17.tsv \
--add-entity-labels-from-input
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway, born 1991, is a human and male in Austria-Hungary." |
Note
The code that produces a readable sentence for this example is
hard coded for property value P17.
The use of other property
values might require changes to the code in order to
produce reasonable sentences.
Two isa
Properties and Property Values P569 and P570¶
The following input file has a single entity with two isa
relationships (P31
, instance of
, and P21
, sex or gender
)
and two property values (P569
, date of birth
, and P570
, date of death
).
The input file also contains the matching labels.
kgtk cat -i examples/docs/lexicalize-two-isas-and-property-values-P569-P570.tsv
id | node1 | label | node2 |
---|---|---|---|
Q75952971-P31-Q5-d020ba0c-0 | Q75952971 | P31 | Q5 |
Q75952971-P21-Q6581097-018e8019-0 | Q75952971 | P21 | Q6581097 |
Q5-label-en | Q5 | label | 'human'@en |
Q6581097-label-en | Q6581097 | label | 'male'@en |
Q75952971-label-en | Q75952971 | label | 'Philippe Greenway'@en |
Q75952971-P569-52d50d-5fe626e5-0 | Q75952971 | P569 | ^1876-05-07T00:00:00Z/11 |
Q75952971-P570-5f1346-586e365b-0 | Q75952971 | P570 | ^1957-02-26T00:00:00Z/11 |
P569-label-en | P569 | label | 'date of birth'@en |
P570-label-en | P570 | label | 'date of death'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-two-isas-and-property-values-P569-P570.tsv \
--add-entity-labels-from-input \
--property-values P569 P570
node1 | label | node2 |
---|---|---|
Q75952971 | sentence | "Philippe Greenway is a human and male date of birth ^1876-05-07T00:00:00Z/11 and date of death ^1957-02-26T00:00:00Z/11." |
Note
At present, changes to the code would be needed to improve the quality of the output.
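Until then, one possible workaround is to rewrite the date values before or after lexicalization. The Python sketch below is not part of KGTK. It assumes date values have the form shown above (a leading ^, an ISO-style timestamp, and an optional /precision suffix) and that the precision follows the Wikidata convention, where 9 means year, 10 means month, and 11 means day. The function name readable_date is hypothetical.

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Matches values such as ^1876-05-07T00:00:00Z/11 (precision is optional).
KGTK_DATE = re.compile(
    r"^\^(?P<year>[+-]?\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T[^/]*(?:/(?P<precision>\d+))?$"
)


def readable_date(value: str) -> str:
    """Render a KGTK date value as text; return other values unchanged."""
    match = KGTK_DATE.match(value)
    if match is None:
        return value
    year = int(match.group("year"))
    month = int(match.group("month"))
    day = int(match.group("day"))
    # Assumption: a missing precision is treated as day precision (11).
    precision = int(match.group("precision")) if match.group("precision") else 11
    if precision >= 11 and 1 <= month <= 12 and day >= 1:
        return f"{day} {MONTHS[month - 1]} {year}"
    if precision == 10 and 1 <= month <= 12:
        return f"{MONTHS[month - 1]} {year}"
    # Year precision or coarser: fall back to the bare year.
    return str(year)


if __name__ == "__main__":
    print(readable_date("^1876-05-07T00:00:00Z/11"))
```

Values that are not dates, such as Q-node identifiers, pass through untouched, so the function can be applied to every node2 value of the chosen properties.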
Complex Example: Q75992564, a Music Track¶
Here is a more complex-looking example. Most of the work is done by the description property.
kgtk cat -i examples/docs/lexicalize-Q75992564.tsv
id | node1 | label | node2 | rank | node2;wikidatatype |
---|---|---|---|---|---|
Q75992564-P136-Q83440-5a7171a8-0 | Q75992564 | P136 | Q83440 | normal | wikibase-item |
Q75992564-P1433-Q2598379-92fd18d2-0 | Q75992564 | P1433 | Q2598379 | normal | wikibase-item |
Q75992564-P1433-Q75998294-3aa16bcc-0 | Q75992564 | P1433 | Q75998294 | normal | wikibase-item |
Q75992564-P1476-7e22ec-4db02c7e-0 | Q75992564 | P1476 | 'Because of You'@en | normal | monolingualtext |
Q75992564-P1552-Q109940-802348fe-0 | Q75992564 | P1552 | Q109940 | normal | wikibase-item |
Q75992564-P1552-Q155171-582f91ae-0 | Q75992564 | P1552 | Q155171 | normal | wikibase-item |
Q75992564-P1552-Q15975575-a54239e8-0 | Q75992564 | P1552 | Q15975575 | normal | wikibase-item |
Q75992564-P162-Q229430-4d2b5fa7-0 | Q75992564 | P162 | Q229430 | normal | wikibase-item |
Q75992564-P162-Q7821969-674ded3a-0 | Q75992564 | P162 | Q7821969 | normal | wikibase-item |
Q75992564-P175-Q229430-aefd7965-0 | Q75992564 | P175 | Q229430 | normal | wikibase-item |
Q75992564-P175-Q483507-a1ad6642-0 | Q75992564 | P175 | Q483507 | normal | wikibase-item |
Q75992564-P1889-Q400557-94928f8d-0 | Q75992564 | P1889 | Q400557 | normal | wikibase-item |
Q75992564-P2550-Q868569-cbe8942c-0 | Q75992564 | P2550 | Q868569 | normal | wikibase-item |
Q75992564-P31-Q55850593-156262eb-0 | Q75992564 | P31 | Q55850593 | ||
Q75992564-P31-Q55850593-156262eb-0 | Q75992564 | P31 | Q55850593 | normal | wikibase-item |
Q75992564-P4404-69fc83-9222fd1e-0 | Q75992564 | P4404 | "1c5c05d6-dff6-440a-ba1f-54000e2d04bd" | normal | external-id |
Q75992564-P4404-fa8703-43654e34-0 | Q75992564 | P4404 | "102eb099-3119-4d08-80f5-06511af875a4" | normal | external-id |
Q75992564-description-en | Q75992564 | description | 'vocal track by Reba in duet with Kelly Clarkson; 2007 studio recording; cover version'@en | ||
Q75992564-directed_pagerank-31405997 | Q75992564 | directed_pagerank | 8.715560990765719e-09 | ||
Q75992564-in_degree-2-0000 | Q75992564 | in_degree | 2 | ||
Q75992564-isa-Q55850593-0000 | Q75992564 | isa | Q55850593 | ||
Q75992564-label-en | Q75992564 | label | 'Because of You'@en | ||
Q75992564-out_degree-16-0000 | Q75992564 | out_degree | 16 | ||
Q75992564-undirected_pagerank-31405997 | Q75992564 | undirected_pagerank | 8.715560990765719e-09 | ||
Q75992564-vertex_in_degree-31405995 | Q75992564 | vertex_in_degree | 2 | ||
Q75992564-vertex_in_degree-31405995 | Q75992564 | vertex_in_degree | 2 | ||
Q75992564-vertex_out_degree-31405996 | Q75992564 | vertex_out_degree | 13 | ||
Q75992564-vertex_out_degree-31405996 | Q75992564 | vertex_out_degree | 13 | ||
Q55850593-label-en | Q55850593 | label | 'music track with vocals'@en | ||
P1552-label-en | P1552 | label | 'has quality'@en | ||
Q109940-label-en | Q109940 | label | 'duet'@en | ||
Q155171-label-en | Q155171 | label | 'cover version'@en | ||
Q15975575-label-en | Q15975575 | label | 'studio recording'@en |
Convert this data to a sentence:
kgtk lexicalize --input-file examples/docs/lexicalize-Q75992564.tsv \
--add-entity-labels-from-input
node1 | label | node2 |
---|---|---|
Q75992564 | sentence | "Because of You, vocal track by Reba in duet with Kelly Clarkson; 2007 studio recording; cover version, is a music track with vocals." |
Complex Example: Q75992564, a Music Track, with has
Properties and Omitting the Description¶
Let's try adding P1552
(has quality
) to the list of has
properties
and omitting the description property.
kgtk cat -i examples/docs/lexicalize-Q75992564.tsv
id | node1 | label | node2 | rank | node2;wikidatatype |
---|---|---|---|---|---|
Q75992564-P136-Q83440-5a7171a8-0 | Q75992564 | P136 | Q83440 | normal | wikibase-item |
Q75992564-P1433-Q2598379-92fd18d2-0 | Q75992564 | P1433 | Q2598379 | normal | wikibase-item |
Q75992564-P1433-Q75998294-3aa16bcc-0 | Q75992564 | P1433 | Q75998294 | normal | wikibase-item |
Q75992564-P1476-7e22ec-4db02c7e-0 | Q75992564 | P1476 | 'Because of You'@en | normal | monolingualtext |
Q75992564-P1552-Q109940-802348fe-0 | Q75992564 | P1552 | Q109940 | normal | wikibase-item |
Q75992564-P1552-Q155171-582f91ae-0 | Q75992564 | P1552 | Q155171 | normal | wikibase-item |
Q75992564-P1552-Q15975575-a54239e8-0 | Q75992564 | P1552 | Q15975575 | normal | wikibase-item |
Q75992564-P162-Q229430-4d2b5fa7-0 | Q75992564 | P162 | Q229430 | normal | wikibase-item |
Q75992564-P162-Q7821969-674ded3a-0 | Q75992564 | P162 | Q7821969 | normal | wikibase-item |
Q75992564-P175-Q229430-aefd7965-0 | Q75992564 | P175 | Q229430 | normal | wikibase-item |
Q75992564-P175-Q483507-a1ad6642-0 | Q75992564 | P175 | Q483507 | normal | wikibase-item |
Q75992564-P1889-Q400557-94928f8d-0 | Q75992564 | P1889 | Q400557 | normal | wikibase-item |
Q75992564-P2550-Q868569-cbe8942c-0 | Q75992564 | P2550 | Q868569 | normal | wikibase-item |
Q75992564-P31-Q55850593-156262eb-0 | Q75992564 | P31 | Q55850593 | ||
Q75992564-P31-Q55850593-156262eb-0 | Q75992564 | P31 | Q55850593 | normal | wikibase-item |
Q75992564-P4404-69fc83-9222fd1e-0 | Q75992564 | P4404 | "1c5c05d6-dff6-440a-ba1f-54000e2d04bd" | normal | external-id |
Q75992564-P4404-fa8703-43654e34-0 | Q75992564 | P4404 | "102eb099-3119-4d08-80f5-06511af875a4" | normal | external-id |
Q75992564-description-en | Q75992564 | description | 'vocal track by Reba in duet with Kelly Clarkson; 2007 studio recording; cover version'@en | ||
Q75992564-directed_pagerank-31405997 | Q75992564 | directed_pagerank | 8.715560990765719e-09 | ||
Q75992564-in_degree-2-0000 | Q75992564 | in_degree | 2 | ||
Q75992564-isa-Q55850593-0000 | Q75992564 | isa | Q55850593 | ||
Q75992564-label-en | Q75992564 | label | 'Because of You'@en | ||
Q75992564-out_degree-16-0000 | Q75992564 | out_degree | 16 | ||
Q75992564-undirected_pagerank-31405997 | Q75992564 | undirected_pagerank | 8.715560990765719e-09 | ||
Q75992564-vertex_in_degree-31405995 | Q75992564 | vertex_in_degree | 2 | ||
Q75992564-vertex_in_degree-31405995 | Q75992564 | vertex_in_degree | 2 | ||
Q75992564-vertex_out_degree-31405996 | Q75992564 | vertex_out_degree | 13 | ||
Q75992564-vertex_out_degree-31405996 | Q75992564 | vertex_out_degree | 13 | ||
Q55850593-label-en | Q55850593 | label | 'music track with vocals'@en | ||
P1552-label-en | P1552 | label | 'has quality'@en | ||
Q109940-label-en | Q109940 | label | 'duet'@en | ||
Q155171-label-en | Q155171 | label | 'cover version'@en | ||
Q15975575-label-en | Q15975575 | label | 'studio recording'@en |
kgtk lexicalize --input-file examples/docs/lexicalize-Q75992564.tsv \
--add-entity-labels-from-input \
--has-properties P1552 \
--description-properties ""
node1 | label | node2 |
---|---|---|
Q75992564 | sentence | "Because of You is a music track with vocals, and has cover version and studio recording and duet." |