
graph-embeddings

Overview

Given a KGTK format file, this command computes embeddings for the entities in the file. The embeddings are derived from the structure of the graph, i.e., the nodes and the relations between them. The embedding algorithm and the output format are specified by the user.

Input format

The input is a KGTK format .tsv file in which each line contains information about nodes and relations, separated by tabs into columns that hold the node and relation data. For example:

id node1 relation node2 node1;label node2;label relation;label relation;dimension source sentence
/c/en/000-/r/RelatedTo-/c/en/112-0000 /c/en/000 /r/RelatedTo /c/en/112 000 112 related to CN ...

For further format details, please refer to the KGTK data specification.
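
If you just want a small file to experiment with, the sketch below writes a toy edge file whose leading columns mirror the example above (id, node1, relation, node2). The file name and the exact column set are illustrative assumptions; consult the KGTK data specification for the columns your data actually requires.

import csv

# Hypothetical toy input whose columns mirror the example above.
rows = [
    ("e1", "/c/en/000", "/r/RelatedTo", "/c/en/112"),
    ("e2", "/c/en/112", "/r/RelatedTo", "/c/en/000"),
]

with open("toy_input.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "node1", "relation", "node2"])  # header row
    writer.writerows(rows)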

Output format

There are three supported formats: glove, w2v, and kgtk.

glove format

When using this format, the output is a .tsv file where each line holds the embedding for one node: the node, followed by the components of its embedding vector, one per column, all separated by tabs. For example:

"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...

w2v format (default)

When using this format, the output is a .tsv file that is almost identical to the glove format; the only difference is that the word2vec format has an additional first line giving the shape of the embedding matrix (e.g., "9 4" for 9 entities with 4 dimensions), with the two values separated by a tab. w2v is the default output format. For example:

16213 100
"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...
"home" -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ...

Here 16213 is the number of nodes and 100 is the dimension of each node embedding.
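
Since w2v is the default format, here is a minimal loading sketch in plain Python (the file name is an assumption). Note that tools expecting the classic word2vec text format may assume space-separated rather than tab-separated values, so a conversion step could be needed before using them.

import numpy as np

with open("output_file.tsv") as f:                  # assumed file name
    n_nodes, dim = map(int, f.readline().split())   # header line, e.g. "16213 100"
    embeddings = {}
    for line in f:
        parts = line.rstrip("\n").split("\t")
        embeddings[parts[0]] = np.array(parts[1:], dtype=float)

assert len(embeddings) == n_nodes
assert all(v.shape == (dim,) for v in embeddings.values())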

kgtk format

When using this format, the output is a KGTK .tsv file where each line is the embedding for a node. Each line has three columns: the first holds the entity node, the second its embedding type (here, graph_embeddings), and the third the entity's embedding as a comma-separated list of components. For example:

Q5 graph_embeddings -0.014022544,-0.062030070,-0.012535412,...
Q6 graph_embeddings -0.014021411,-0.090830070,-0.012534120,...
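
A sketch for parsing this three-column layout, assuming the embedding column is a comma-separated list of components (as in Example 4 below) and that any KGTK header line has been suppressed with --no-output-header or skipped:

import numpy as np

embeddings = {}
with open("output_file.tsv") as f:                  # assumed file name
    for line in f:
        node, emb_type, values = line.rstrip("\n").split("\t")
        if emb_type == "graph_embeddings":          # keep only embedding rows
            embeddings[node] = np.array([float(x) for x in values.split(",")])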

Algorithm

The embedding algorithm is selected with the operator (-op) parameter. By default it is ComplEx; it can be switched to TransE, DistMult, or RESCAL. The operator name is case insensitive, so, for example, users can pass complex to select that method. For more details and pointers, see this documentation page.

Usage

You can invoke the command directly with the following arguments:

usage: kgtk graph-embeddings [-h] [-i INPUT_FILE] [-o OUTPUT_FILE] [-l] [-T]
                             [-ot] [-r True|False] [-d] [-s]
                             [-c dot|cos|l2|squared_l2]
                             [-op RESCAL|DistMult|ComplEx|TransE] [-e]
                             [-b True|False] [-w] [-bs]
                             [-lf ranking|logistic|softmax] [-lr] [-ef]
                             [-dr True|False] [-ge True|False]
                             [--no-output-header [True|False]]
                             [-v [optional True|False]]

Generate graph embeddings in KGTK TSV format; PyTorch-BigGraph is used as the low-level implementation.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  -l , --log            Setting the log path [Default: None]
  -T , --temporary_directory 
                        Specify the directory location in which to store
                        temporary files
  -ot , --output_format 
                        Output format for the embeddings [Default: w2v]
                        Choice: kgtk | w2v | glove
  -r True|False, --retain_temporary_data True|False
                        When operating on the graph, some temporary files are
                        generated; set True to retain these files
  -d , --dimension      Dimension of the real space the embeddings live in
                        [Default: 100]
  -s , --init_scale     Generate the initial embeddings with this standard
                        deviation [Default: 0.001]. If no initial embeddings
                        are provided, they are generated by sampling each
                        dimension from a centered normal distribution having
                        this standard deviation.
  -c dot|cos|l2|squared_l2, --comparator dot|cos|l2|squared_l2
                        How the embeddings of the two sides of an edge (after
                        having already undergone some processing) are compared
                        to each other to produce a score [Default: dot]
                        Choice: dot|cos|l2|squared_l2
  -op RESCAL|DistMult|ComplEx|TransE, --operator RESCAL|DistMult|ComplEx|TransE
                        The transformation to apply to the embedding of one of
                        the sides of the edge (typically the right-hand one)
                        before comparing it with the other one. It reflects
                        which model the embedding uses. [Default: ComplEx]
  -e , --num_epochs     The number of times the training loop iterates over
                        all the edges. [Default: 100]
  -b True|False, --bias True|False
                        Whether to use a bias [Default: False]. If enabled,
                        withhold the first dimension of the embeddings from
                        the comparator and instead use it as a bias, added
                        back to the score. Makes sense for the logistic and
                        softmax loss functions.
  -w , --workers        The number of worker processes for training. If not
                        given, set to CPU count.
  -bs , --batch_size    The number of edges per batch. [Default: 1000]
  -lf ranking|logistic|softmax, --loss_fn ranking|logistic|softmax
                        How the scores of positive edges and their
                        corresponding negatives are evaluated. [Default:
                        ranking] Choice: ranking|logistic|softmax
  -lr , --learning_rate 
                        The learning rate for the optimizer. [Default: 0.1]
  -ef , --eval_fraction 
                        The fraction of edges withheld from training and used
                        to track evaluation metrics during training.
                        [Default: 0.0, train on all edges]
  -dr True|False, --dynamic_relations True|False
                        Whether to use dynamic relations (for graphs with a
                        large number of relations) [Default: True]
  -ge True|False, --global_emb True|False
                        Whether to use a global embedding; if enabled, add to
                        each embedding a vector that is common to all the
                        entities of a certain type. This vector is learned
                        during training. [Default: False]
  --no-output-header [True|False]
                        When true, do not write a header to the output file
                        (default=False).

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).

Examples

Example 1

For the simplest run, just give the input file and the output file; the embeddings are written to output_file.tsv in the current folder:

kgtk graph-embeddings -i input_file.tsv  -o output_file.tsv

The output_file.tsv may look like:

172131 100
"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...
"home" -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ...

Example 2

Running with more specific parameters (TransE algorithm and 200-dimensional vectors):

kgtk graph-embeddings \
    --input-file input_file.tsv \
    --output-file output_file.tsv \
    --dimension 200 \
    --comparator dot \
    --operator TransE \
    --loss_fn softmax \
    --learning_rate 0.1

The output_file.tsv may look like:

172131 200
"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...
"home" -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ...

Example 3

Using the glove format to generate graph embeddings:

kgtk graph-embeddings \
    --input-file input_file.tsv \
    --output-file output_file.tsv \
    --output_format glove

The output_file.tsv may look like:

"work" -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ...
"home" -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ...

Example 4

Using the kgtk format to generate graph embeddings:

kgtk graph-embeddings \
    --input-file input_file.tsv \
    --output-file output_file.tsv \
    --output_format kgtk --no-output-header

The output_file.tsv may look like:

"work" graph_embeddings -0.014022544,-0.062030070,-0.012535412,-0.023111001,-0.038317516 ...
"home" graph_embeddings -0.014021411,-0.090830070,-0.012534120,-0.073111301,-0.068317516 ...