graph-embeddings¶
Overview¶
Given a KGTK-format file, this command computes embeddings for the entities in that file. The embeddings are learned from the structure of the graph, that is, from the nodes and the relations between them. The embedding method and the training parameters are specified by the user.
Input format¶
The input is a KGTK-format .tsv file in which each line describes one edge: two nodes and the relation between them. Each line is split by tabs into columns that hold the node and relation data. For example:
id | node1 | relation | node2 | node1;label | node2;label | relation;label | relation;dimension | source | sentence |
---|---|---|---|---|---|---|---|---|---|
/c/en/000-/r/RelatedTo-/c/en/112-0000 | /c/en/000 | /r/RelatedTo | /c/en/112 | 000 | 112 | related to | | CN | ... |
For further format details, please refer to the KGTK data specification.
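As a rough illustration of how such a file can be consumed, the sketch below reads edges with Python's standard csv module. The column names node1, relation, and node2 are taken from the sample header above; the in-memory sample and the helper name read_edges are made up for this example and are not part of KGTK.

```python
import csv
import io

# Tiny in-memory stand-in for a KGTK edge file: tab-separated, header row first.
sample = (
    "id\tnode1\trelation\tnode2\n"
    "e1\t/c/en/000\t/r/RelatedTo\t/c/en/112\n"
)

def read_edges(handle):
    """Yield (node1, relation, node2) triples from a KGTK-style TSV stream."""
    # QUOTE_NONE keeps quote characters inside values untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["node1"], row["relation"], row["node2"]

edges = list(read_edges(io.StringIO(sample)))
print(edges)
```

To read a real file, an open file handle can be passed in place of the io.StringIO object.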
Output format¶
There are three supported formats: glove, w2v, and kgtk.
glove format¶
When using this format, the output is a .tsv file where each line is the embedding for a node. Each line holds a single node followed by the components of its embedding, one per column, all separated by tabs. For example:
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
w2v format (default)¶
When using this format, the output is a .tsv file that is almost the same as the glove format. The only difference is that the word2vec format starts with a line giving the shape of the embedding (e.g., "9 4" for 9 entities with 4 dimensions), with the two values separated by a tab. w2v is the default output format. For example:
16213 | 100 |
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Here 16213 is the number of nodes and 100 is the number of dimensions of each node embedding.
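A minimal sketch of reading this layout back in Python, assuming node identifiers contain no whitespace so that each line can be split on tabs or spaces. The function name parse_w2v and the three-dimensional sample values are illustrative only.

```python
def parse_w2v(lines):
    """Parse w2v-style text: a 'count dim' header, then one node per line."""
    it = iter(lines)
    count, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        node, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"{node}: expected {dim} values, got {len(values)}")
        vectors[node] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} nodes, found {len(vectors)}")
    return vectors

# Made-up sample: 2 nodes, 3 dimensions each.
sample = [
    "2\t3",
    '"work"\t-0.014\t-0.062\t-0.012',
    '"home"\t-0.014\t-0.090\t-0.012',
]
vectors = parse_w2v(sample)
print(len(vectors), len(vectors['"work"']))
```

Checking the header against the body catches truncated files early. The same loop without the header handling would read the glove layout.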
kgtk format¶
When using this format, the output is a .tsv file where each line is the embedding for a node. Each line has 3 columns: the first column is the entity node, the second column is its embedding type (here graph_embeddings), and the third column holds the entity's embedding. For example:
Q5 | graph_embeddings |
Q6 | graph_embeddings |
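The sketch below shows one way to load this three-column layout. It assumes the third column is a comma-separated list of numbers, as in the sample output of Example 4; the helper name parse_kgtk_embeddings and the vector values are invented for illustration.

```python
def parse_kgtk_embeddings(lines):
    """Map node -> vector from rows of 'node<TAB>graph_embeddings<TAB>v1,v2,...'."""
    vectors = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip a header row, blank lines, or rows describing other properties.
        if len(fields) != 3 or fields[1] != "graph_embeddings":
            continue
        node, _, value = fields
        vectors[node] = [float(v) for v in value.split(",")]
    return vectors

# Made-up rows; real vectors have as many components as --dimension.
sample = [
    "Q5\tgraph_embeddings\t-0.014,-0.062,-0.012",
    "Q6\tgraph_embeddings\t-0.014,-0.090,-0.012",
]
embeddings = parse_kgtk_embeddings(sample)
print(sorted(embeddings))
```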
Algorithm¶
The algorithm is selected with the operator (-op) parameter. The default is ComplEx. It can be switched to TransE, DistMult, or RESCAL. The operator value is case-insensitive: for example, users can pass the string complex to select the ComplEx method. For more details and pointers, see this documentation page.
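To give a feel for how these methods differ, here is a simplified sketch in plain Python. It follows the description in the help text: an operator transforms the embedding of one side of an edge, and the result is then compared with the other side (a dot product is used here). The function names and toy vectors are assumptions made for this example, not the code KGTK or PyTorch-BigGraph actually runs. RESCAL, which applies a full matrix per relation, is left out for brevity.

```python
def translation(vec, rel):
    """TransE-style operator: shift the embedding by a relation vector."""
    return [v + r for v, r in zip(vec, rel)]

def diagonal(vec, rel):
    """DistMult-style operator: scale each dimension by a relation weight."""
    return [v * r for v, r in zip(vec, rel)]

def complex_diagonal(vec, rel):
    """ComplEx-style operator: treat the first half of each vector as real
    parts and the second half as imaginary parts, then multiply element-wise."""
    half = len(vec) // 2
    out_re, out_im = [], []
    for i in range(half):
        z = complex(vec[i], vec[half + i]) * complex(rel[i], rel[half + i])
        out_re.append(z.real)
        out_im.append(z.imag)
    return out_re + out_im

def dot(a, b):
    """Dot-product comparator."""
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional embeddings for the two sides of an edge and its relation.
lhs = [1.0, 0.0, 0.0, 1.0]
rhs = [0.5, 0.5, 0.5, 0.5]
rel = [1.0, 2.0, 0.0, 1.0]

scores = {
    "TransE": dot(lhs, translation(rhs, rel)),
    "DistMult": dot(lhs, diagonal(rhs, rel)),
    "ComplEx": dot(lhs, complex_diagonal(rhs, rel)),
}
print(scores)
```

The same pair of nodes receives a different score under each operator, which is why the choice of operator determines which embedding model is being trained.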
Usage¶
You can run the command directly with the following arguments:
usage: kgtk graph-embeddings [-h] [-i INPUT_FILE] [-o OUTPUT_FILE] [-l] [-T]
[-ot] [-r True|False] [-d] [-s]
[-c dot|cos|l2|squared_l2]
[-op RESCAL|DistMult|ComplEx|TransE] [-e]
[-b True|False] [-w] [-bs]
[-lf ranking|logistic|softmax] [-lr] [-ef]
[-dr True|False] [-ge True|False]
[--no-output-header [True|False]]
[-v [optional True|False]]
Generate graph embedding in kgtk tsv format, here we use PytorchBigGraph as low-level implementation
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
-l , --log Setting the log path [Default: None]
-T , --temporary_directory
Sepecify the directory location to store temporary
file
-ot , --output_format
Outputformat for embeddings [Default: w2v] Choice:
kgtk | w2v | glove
-r True|False, --retain_temporary_data True|False
When opearte graph, some tempory files will be
generated, set True to retain these files
-d , --dimension Dimension of the real space the embedding live in
[Default: 100]
-s , --init_scale Generating the initial embedding with this standard
deviation [Default: 0.001]If no initial embeddings are
provided, they are generated by sampling each
dimensionfrom a centered normal distribution having
this standard deviation.
-c dot|cos|l2|squared_l2, --comparator dot|cos|l2|squared_l2
How the embeddings of the two sides of an edge (after
having already undergone some processing) are compared
to each other to produce a score[Default: dot],Choice:
dot|cos|l2|squared_l2
-op RESCAL|DistMult|ComplEx|TransE, --operator RESCAL|DistMult|ComplEx|TransE
The transformation to apply to the embedding of one of
the sides of the edge (typically the right-hand one)
before comparing it with the other one. It
reflectswhich model that embedding uses.
[Default:ComplEx]
-e , --num_epochs The number of times the training loop iterates over
all the edges.[Default:100]
-b True|False, --bias True|False
Whether use the bias choice [Default: False],If
enabled, withhold the first dimension of the
embeddings from the comparator and instead use it as a
bias, adding back to the score. Makes sense for
logistic and softmax loss functions.
-w , --workers The number of worker processes for training. If not
given, set to CPU count.
-bs , --batch_size The number of edges per batch.[Default:1000]
-lf ranking|logistic|softmax, --loss_fn ranking|logistic|softmax
How the scores of positive edges and their
corresponding negatives are evaluated.[Default:
ranking], Choice: ranking|logistic|softmax
-lr , --learning_rate
The learning rate for the optimizer.[Default: 0.1]
-ef , --eval_fraction
The fraction of edges withheld from training and used
to track evaluation metrics during training.
[Defalut:0.0 training all edges ]
-dr True|False, --dynamic_relations True|False
Whether use dynamic relations (when graphs with a
large number of relations) [Default: True]
-ge True|False, --global_emb True|False
Whether use global embedding, if enabled, add to each
embedding a vector that is common to all the entities
of a certain type. This vector is learned during
training.[Default: False]
--no-output-header [True|False]
When true, do not write a header to the output file
(default=False).
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
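The four --comparator choices can be pictured with a few lines of plain Python. This is only a sketch: the function names are illustrative, and the sign convention for l2 and squared_l2 (negated distance, so that a higher score always means a closer pair) is an assumption rather than something stated in the help text.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cos(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def squared_l2(a, b):
    """Negated squared Euclidean distance (sign convention is an assumption)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def l2(a, b):
    """Negated Euclidean distance (sign convention is an assumption)."""
    return -math.sqrt(-squared_l2(a, b))

COMPARATORS = {"dot": dot, "cos": cos, "l2": l2, "squared_l2": squared_l2}

a, b = [3.0, 4.0], [3.0, 0.0]
results = {name: fn(a, b) for name, fn in COMPARATORS.items()}
print(results)
```

dot is sensitive to vector length, cos ignores it, and the two distance-based comparators measure how far apart the vectors are.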
Examples¶
Example 1¶
For the simplest run, give only the input file and the output file. The embeddings are written to output_file.tsv in the default w2v format:
kgtk graph-embeddings -i input_file.tsv -o output_file.tsv
The output_file.tsv may look like:
172131 | 100 |
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Example 2¶
Running with more specific parameters (TransE algorithm and 200-dimensional vectors):
kgtk graph-embeddings \
--input-file input_file.tsv \
--output-file output_file.tsv \
--dimension 200 \
--comparator dot \
--operator TransE \
--loss_fn softmax \
--learning_rate 0.1
The output_file.tsv may look like:
172131 | 200 |
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Example 3¶
Generating graph embeddings in glove format:
kgtk graph-embeddings \
--input-file input_file.tsv \
--output-file output_file.tsv \
--output_format glove
The output_file.tsv may look like:
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Example 4¶
Generating graph embeddings in kgtk format, without a header line:
kgtk graph-embeddings \
--input-file input_file.tsv \
--output-file output_file.tsv \
--output_format kgtk --no-output-header
The output_file.tsv may look like:
"work" | graph_embeddings | -0.014022544,-0.062030070,-0.012535412,-0.023111001,-0.038317516 ... |
"home" | graph_embeddings | -0.014021411,-0.090830070,-0.012534120,-0.073111301,-0.068317516 ... |