KGTK Text Embedding Utilities

Assumptions

The input is a KGTK edge file.
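For reference, a minimal input edge file might look like the following (the Qnode and its sentence are illustrative; the property name is configurable with --sentence-property):

node1 label node2
Q1 sentence “Saint David, patron saint of Wales is a human, Catholic priest, Catholic bishop, and has date of death, religion and canonization status”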

Usage

usage: kgtk text-embedding [-h] [-i INPUT_FILE]
                           [-m {bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} [{bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} ...]]
                           [--sentence-property SENTENCE_PROPERTY]
                           [--output-property OUTPUT_PROPERTIES]
                           [--batch-size BATCH_SIZE] [-o OUTPUT_FILE]
                           [--output-data-format {w2v,kgtk}]
                           [-v [optional True|False]]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -m {bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} [{bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} ...], --model {bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} [{bert-base-nli-cls-token,bert-base-nli-max-tokens,bert-base-nli-mean-tokens,bert-base-nli-stsb-mean-tokens,bert-base-wikipedia-sections-mean-tokens,bert-large-nli-cls-token,bert-large-nli-max-tokens,bert-large-nli-mean-tokens,bert-large-nli-stsb-mean-tokens,distilbert-base-nli-mean-tokens,distilbert-base-nli-stsb-mean-tokens,distiluse-base-multilingual-cased,roberta-base-nli-mean-tokens,roberta-base-nli-stsb-mean-tokens,roberta-large-nli-mean-tokens,roberta-large-nli-stsb-mean-tokens,sentence-transformers/all-distilroberta-v1} ...]
                        the model to use for embedding
  --sentence-property SENTENCE_PROPERTY
                        The name of the property with sentence for each Qnode.
                        Default is 'sentence'
  --output-property OUTPUT_PROPERTIES
                        The output property name used to record the embedding.
                        Default is `output_properties`. This argument is only
                        valid for output in kgtk format.
  --batch-size BATCH_SIZE
                        The number of sentences to be processed at a time.
                        Default is 100000. Set this value to '-1' to process
                        the whole file as one batch
  -o OUTPUT_FILE, --out-file OUTPUT_FILE
                        output path for the text embedding file, by default it
                        will be printed in console
  --output-data-format {w2v,kgtk}
                        output format, can either be `w2v` or `kgtk`. If
                        choose `w2v`, the output will be a text file, with
                        each row contains the qnode and the vector
                        representation, separated by a space. The first line
                        is the number of qnodes and dimension of vectors,
                        separated by space

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).
Computes embeddings of nodes from their properties using a pre-trained language model.

To create a sentence for a Qnode from its properties, run the kgtk lexicalize command first. An example input sentence is “Saint David, patron saint of Wales is a human, Catholic priest, Catholic bishop, and has date of death, religion and canonization status”.

The output is an edge file in which each node appears once; a user-defined property records the embedding, and the value is a string containing the vector. For example:

subject predicate object
Q1 text_embedding “0.222, 0.333, ..”
Q2 text_embedding “0.444, 0.555, ..”

Run

You can invoke the command directly with the following arguments:

kgtk text-embedding \
    --input-file / -i <string> \     # optional, path to the input file; may be omitted or '-' for stdin
    --model / -m <list_of_string> \  # optional, default is `bert-base-wikipedia-sections-mean-tokens`
    --output-data-format <string> \  # optional, `w2v` or `kgtk`, default is `kgtk`
    --output-property <string> \     # optional, default is `text_embedding`
    --out-file / -o <string> \       # optional, by default embeddings are printed to the console
    --sentence-property <string> \   # optional, the property holding the sentence for each Qnode, default is `sentence`
    --batch-size <int>               # optional, sentences processed per batch, default is 100000; set to -1 to process the whole file as one batch
Example 1:

For the simplest run, give only the input file and write the output to output_embeddings.csv in the current folder:

kgtk text-embedding -i input_file.csv -o output_embeddings.csv

Example 2:

Running with more specific parameters:

kgtk --debug text-embedding \
    --input-file test_edges_file.tsv \
    --model bert-base-wikipedia-sections-mean-tokens bert-base-nli-cls-token \
    --sentence-property sentence
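
A further sketch, assuming a hypothetical input file wikidata_edges.tsv, writes the embeddings in w2v format instead of kgtk format:

kgtk text-embedding \
    --input-file wikidata_edges.tsv \
    --output-data-format w2v \
    -o output_embeddings.txt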

--input-file / -i (input file)

The path to the input file, for example input_file1.csv. Input can also be piped from standard input, as in kgtk text-embedding < input_file1.csv.

--model/ -m Embedding_Model(s)

The embedding model(s) to apply to the sentences. If multiple models are given, they are applied to the same data one after another, and results are output for all models.

The following pre-trained models are available. If not specified, the default model is bert-base-wikipedia-sections-mean-tokens:

bert-base-nli-cls-token
bert-base-nli-max-tokens
bert-base-nli-mean-tokens
bert-base-nli-stsb-mean-tokens
bert-base-wikipedia-sections-mean-tokens
bert-large-nli-cls-token
bert-large-nli-max-tokens
bert-large-nli-mean-tokens
bert-large-nli-stsb-mean-tokens
distilbert-base-nli-mean-tokens
distilbert-base-nli-stsb-mean-tokens
distiluse-base-multilingual-cased
roberta-base-nli-mean-tokens
roberta-base-nli-stsb-mean-tokens
roberta-large-nli-mean-tokens
roberta-large-nli-stsb-mean-tokens
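
For example, to apply two of the models above in a single run (file names hypothetical), pass both names to --model; each model's embeddings will appear in the output:

kgtk text-embedding -i input_file.tsv \
    --model distilbert-base-nli-mean-tokens roberta-base-nli-mean-tokens \
    -o output_embeddings.tsv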

--output-property

The property name used to record the embedding. If not given, the property name defaults to text_embedding. This option is only available when the output format is set to kgtk.
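
A sketch with a custom property name (my_embedding is hypothetical):

kgtk text-embedding -i input_file.tsv \
    --output-property my_embedding \
    --output-data-format kgtk \
    -o output_embeddings.tsv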

--sentence-property

The name of the property holding the sentence for each Qnode. Default is 'sentence'.

--batch-size

The number of sentences to be processed at a time. Default is 100000. Set this value to '-1' to process the whole file as one batch.
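
For instance, a sketch combining stdin input with a smaller batch size (file names hypothetical):

kgtk text-embedding --batch-size 1000 -o output_embeddings.tsv < input_edges.tsv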

Output files

The following output files may be produced:

Logger file

If the global parameter --debug is passed, an extra debugging log file will be stored in the user's home directory.
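
For example, a sketch enabling the extra log file (file names hypothetical):

kgtk --debug text-embedding -i input_file.tsv -o output_embeddings.tsv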

Embedding Vectors

This contains the embedding vector for each Qnode. The vectors are printed to stdout and can be redirected to a file. Note: only the text-embedding edges are output; run other KGTK commands to obtain any other edges you need.

If the output format is kgtk, the output file will look like:

Q1 text_embedding 0.2,0.3,0.4,0.5
Q2 text_embedding 0.3,0.4,-0.5,-0.6

The output will be a TSV file with 3 columns: the first column is the node name; the second is the property name (default text_embedding); the third is the embedding vector.
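
If the output format is w2v, the first line gives the number of qnodes and the vector dimension, and each following row contains a qnode and its vector, separated by spaces. A hypothetical example with two qnodes and 4-dimensional vectors:

2 4
Q1 0.2 0.3 0.4 0.5
Q2 0.3 0.4 -0.5 -0.6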