KGTK Text Embedding Utilities¶
Assumptions¶
The input is a KGTK edge file.
Usage¶
usage: kgtk text-embedding [-h] [-i INPUT_FILE]
[-m {bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} [{bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} ...]]
[--sentence-property SENTENCE_PROPERTY]
[--output-property OUTPUT_PROPERTIES]
[--batch-size BATCH_SIZE] [-o OUTPUT_FILE]
[--output-data-format {w2v,kgtk}]
[-v [optional True|False]]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-m {bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} [{bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} ...], --model {bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} [{bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} ...]
the model to used for embedding
--sentence-property SENTENCE_PROPERTY
The name of the property with sentence for each Qnode.
Default is 'sentence'
--output-property OUTPUT_PROPERTIES
The output property name used to record the embedding.
Default is `output_properties`. This argument is only
valid for output in kgtk format.
--batch-size BATCH_SIZE
The number of sentences to be processed at a time.
Default is 100000. Set this value to '-1' to process
the whole file as one batch
-o OUTPUT_FILE, --out-file OUTPUT_FILE
output path for the text embedding file, by default it
will be printed in console
--output-data-format {w2v,kgtk}
output format, can either be `w2v` or `kgtk`. If
choose `w2v`, the output will be a text file, with
each row contains the qnode and the vector
representation, separated by a space. The first line
is the number of qnodes and dimension of vectors,
separated by space
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
The output is an edge file where each node appears once; a user defined property is used to store the embedding, and the value is a string containing the embedding. For example:
To create a sentence for a Qnode using the properties, please run the kgtk lexicalize command.
An example input sentence is “Saint David, patron saint of Wales is a human, Catholic priest, Catholic bishop, and has date of death, religion and canonization status”
subject | predicate | object |
---|---|---|
Q1 | text_embedding | “0.222, 0.333, ..” |
Q2 | text_embedding | “0.444, 0.555, ..” |
Run¶
You can call the functions directly with given args as
kgtk text-embedding \
-input-file / -i <string> \ # * optional, path to the file
--model / -m <list_of_string> \ # optional, default is `bert-base-wikipedia-sections-mean-tokens`
--output-data-format <string> {w2v, kgtk}, default is `kgtk`
--output-property <string> \ # optional, default is "text_embedding"
--out-file/ -o <string> \ by default embeddings to console
--sentence-property \ The name of the property with sentence for each Qnode. Default is 'sentence'
--batch-size The number of sentences to be processed at a time. Default is 100000. Set this value to '-1' to process the whole file as one batch
Example 1:¶
For easiest running, just give the input file and let it write output to output_embeddings.csv
at current folder
using kgtk text-embedding -i input_file.csv -o output_embeddings.csv
Example 2:¶
Running with more specific parameters:
kgtk text-embedding --debug \
--input-file test_edges_file.tsv \
--model bert-base-wikipedia-sections-mean-tokens bert-base-nli-cls-token \
--sentence-property sentence
--input-file / -i (input file)¶
The path to the input file. For example: input_file1.csv
, it also support to send like < input_file1.csv
--model/ -m Embedding_Model(s)¶
The embedding models want to apply on the sentences. If multiple models given, they will be applied to the same data one by one and output the results with all models.
currently followly 16 models are pretrained and could be used. If not specified, the default model will be bert-base-wikipedia-sections-mean-tokens
bert-base-nli-cls-token
bert-base-nli-max-tokens
bert-base-nli-mean-tokens
bert-base-nli-stsb-mean-tokens
bert-base-wikipedia-sections-mean-tokens
bert-large-nli-cls-token
bert-large-nli-max-tokens
bert-large-nli-mean-tokens
bert-large-nli-stsb-mean-tokens
distilbert-base-nli-mean-tokens
distilbert-base-nli-stsb-mean-tokens
distiluse-base-multilingual-cased
roberta-base-nli-mean-tokens
roberta-base-nli-stsb-mean-tokens
roberta-large-nli-mean-tokens
roberta-large-nli-stsb-mean-tokens
--output-property¶
the property used to record the embedding. If not given, the program will use the edge(property) name as text_embedding
.
This option is only available when output format is set to kgtk
.
--sentence-property¶
The name of the property with sentence for each Qnode. Default is 'sentence'
--batch-size¶
The number of sentences to be processed at a time. Default is 100000. Set this value to '-1' to process the whole file as one batch
Output files¶
There will be 3 part of files:
Logger file¶
If passed with global parameter --debug
, an extra debugging logger file will be stored at user's home directory.
Embedding Vectors¶
This will have all the embedded vectors values for each Q nodes. This will be print on stddout and can be redirected to a file. Note: There will only texet embedding related things outputed, please run other commands
If output as kgtk
, the output file will looks like:
Q1 | text_embedding | 0.2,0.3,0.4,0.5 |
Q2 | text_embedding | 0.3,0.4,-0.5,-0.6 |
The output will be a TSV file with 3 columns:
First column is the node name.
Second column is the property name as required, default is text_embedding
.
Third column is the embeded vecotrs.