graph-embeddings¶
Overview¶
Given a kgtk format file, this command will compute the the embeddings of this files' entities. We are using structure of nodes and their relations to compute embeddings of nodes. The set of metrics to compute are specified by the user.
Input format¶
The input is an kgtk format .tsv file where each line of these files contains information about nodes and relation. Each line is separated by tabs into columns which contains the node and relation data. For example:
id | node1 | relation | node2 | node1;label | node2;label | relation;label | relation;dimension | source | sentence |
---|---|---|---|---|---|---|---|---|---|
/c/en/000-/r/RelatedTo-/c/en/112-0000 | /c/en/000 | /r/RelatedTo | /c/en/112 | 000 | 112 | related to | CN | ... |
For further format details, please refer to the KGTK data specification.
Output format¶
There are three supported formats: glove, w2v, and kgtk.
glove format¶
When using this format, the output is a .tsv file where each line is the embedding for a node. Each line is represented by a single node followed respectively by the components of its embedding, each in a different column, all separated by tabs. For example:
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
w2v format (default)¶
When using this format, the output is a .tsv file which it is almost the same as glove format, the only difference is that the word2vec format has a first line which indicates the shape of the embedding (e.g., "9 4" for 9 entities with 4 dimensions), each column of first line is separated by tabs. Here we use w2v as our default output format. For example:
16213 | 100 |
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Here 16231 represents the number of nodes, 100 represents the dimension number of each node embedding.
kgtk format¶
When using this format, the output is a .tsv file where each line is the embedding for a node. Each line has 3 columns, first column represents entity node, second node represent its embedding type (here is graph_embeddings
), third column represents the entity's embeddings. For example:
Q5 | graph_embeddings |
Q6 | graph_embeddings |
Algorithm¶
The algorithm is defined with the operator
(-op
) parameter. By default, it is ComplEx
. It could be switched to: TransE
, DistMult
, or RESCAL
. The operator
is case insensitive, for example, users can input the string like complex
to assign embedding method. For more details and pointers, see this documentation page.
Usage¶
You can call the functions directly with given args as
usage: kgtk graph-embeddings [-h] [-i INPUT_FILE_PATH] [-o OUTPUT_FILE_PATH]
[-l] [-T] [-ot] [-r True|False] [-d] [-s]
[-c dot|cos|l2|squared_l2]
[-op linear|diagonal|complex_diagonal|translation]
[-e] [-b True|False] [-w] [-bs]
[-lf ranking|logistic|softmax] [-lr] [-ef]
[-dr True|False] [-ge True|False]
[-v [optional True|False]]
[--column-separator COLUMN_SEPARATOR]
[--input-format INPUT_FORMAT]
[--compression-type COMPRESSION_TYPE]
[--error-limit ERROR_LIMIT]
[--use-mgzip [optional True|False]]
[--mgzip-threads MGZIP_THREADS]
[--gzip-in-parallel [optional True|False]]
[--gzip-queue-size GZIP_QUEUE_SIZE]
[--mode {NONE,EDGE,NODE,AUTO}]
[--force-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]]
[--header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--skip-header-record [optional True|False]]
[--unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--initial-skip-count INITIAL_SKIP_COUNT]
[--every-nth-record EVERY_NTH_RECORD]
[--record-limit RECORD_LIMIT]
[--tail-count TAIL_COUNT]
[--repair-and-validate-lines [optional True|False]]
[--repair-and-validate-values [optional True|False]]
[--blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--fill-short-lines [optional True|False]]
[--invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
[--truncate-long-lines [TRUNCATE_LONG_LINES]]
[--whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE_PATH, --input-file INPUT_FILE_PATH
The KGTK input file. (default=-)
-o OUTPUT_FILE_PATH, --output-file OUTPUT_FILE_PATH
The KGTK output file. (default=-).
-l , --log Setting the log path [Default: None]
-T , --temporary_directory
Sepecify the directory location to store temporary
file
-ot , --output_format
Outputformat for embeddings [Default: w2v] Choice: kgtk
| w2v | glove
-r True|False, --retain_temporary_data True|False
When opearte graph, some tempory files will be
generated, set True to retain these files
-d , --dimension Dimension of the real space the embedding live in
[Default: 100]
-s , --init_scale Generating the initial embedding with this standard
deviation [Default: 0.001]If no initial embeddings are
provided, they are generated by sampling each
dimensionfrom a centered normal distribution having
this standard deviation.
-c dot|cos|l2|squared_l2, --comparator dot|cos|l2|squared_l2
How the embeddings of the two sides of an edge (after
having already undergone some processing) are compared
to each other to produce a score[Default: dot],Choice:
dot|cos|l2|squared_l2
-op RESCAL|DistMult|ComplEx|TransE, --operator RESCAL|DistMult|ComplEx|TransE
The transformation to apply to the embedding of one of
the sides of the edge (typically the right-hand one)
before comparing it with the other one. It reflects
which model that embedding uses. [Default:ComplEx]
-e , --num_epochs The number of times the training loop iterates over
all the edges.[Default:100]
-b True|False, --bias True|False
Whether use the bias choice [Default: False],If
enabled, withhold the first dimension of the
embeddings from the comparator and instead use it as a
bias, adding back to the score. Makes sense for
logistic and softmax loss functions.
-w , --workers The number of worker processes for training. If not
given, set to CPU count.
-bs , --batch_size The number of edges per batch.[Default:1000]
-lf ranking|logistic|softmax, --loss_fn ranking|logistic|softmax
How the scores of positive edges and their
corresponding negatives are evaluated.[Default:
ranking], Choice: ranking|logistic|softmax
-lr , --learning_rate
The learning rate for the optimizer.[Default: 0.1]
-ef , --eval_fraction
The fraction of edges withheld from training and used
to track evaluation metrics during training.
[Defalut:0.0 training all edges ]
-dr True|False, --dynamic_relaitons True|False
Whether use dynamic relations (when graphs with a
large number of relations) [Default: True]
-ge True|False, --global_emb True|False
Whether use global embedding, if enabled, add to each
embedding a vector that is common to all the entities
of a certain type. This vector is learned during
training.[Default: False]
-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
File options:
Options affecting processing.
--column-separator COLUMN_SEPARATOR
Column separator (default=<TAB>).
--input-format INPUT_FORMAT
Specify the input format (default=None).
--compression-type COMPRESSION_TYPE
Specify the compression type (default=None).
--error-limit ERROR_LIMIT
The maximum number of errors to report before failing
(default=1000)
--use-mgzip [optional True|False]
Execute multithreaded gzip. (default=False).
--mgzip-threads MGZIP_THREADS
Multithreaded gzip thread count. (default=3).
--gzip-in-parallel [optional True|False]
Execute gzip in parallel. (default=False).
--gzip-queue-size GZIP_QUEUE_SIZE
Queue size for parallel gzip. (default=1000).
--mode {NONE,EDGE,NODE,AUTO}
Determine the KGTK file mode
(default=KgtkReaderMode.AUTO).
Header parsing:
Options affecting header parsing.
--force-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]
Force the column names (default=None).
--header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a header error is detected.
Only ERROR or EXIT are supported
(default=ValidationAction.EXIT).
--skip-header-record [optional True|False]
Skip the first record when forcing column names
(default=False).
--unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a column name is unsafe
(default=ValidationAction.REPORT).
Pre-validation sampling:
Options affecting pre-validation data line sampling.
--initial-skip-count INITIAL_SKIP_COUNT
The number of data records to skip initially
(default=do not skip).
--every-nth-record EVERY_NTH_RECORD
Pass every nth record (default=pass all records).
--record-limit RECORD_LIMIT
Limit the number of records read (default=no limit).
--tail-count TAIL_COUNT
Pass this number of records (default=no tail
processing).
Line parsing:
Options affecting data line parsing.
--repair-and-validate-lines [optional True|False]
Repair and validate lines (default=False).
--repair-and-validate-values [optional True|False]
Repair and validate values (default=False).
--blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a line with a blank node1,
node2, or id field (per mode) is detected
(default=ValidationAction.EXCLUDE).
--comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a comment line is detected
(default=ValidationAction.EXCLUDE).
--empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when an empty line is detected
(default=ValidationAction.EXCLUDE).
--fill-short-lines [optional True|False]
Fill missing trailing columns in short lines with
empty values (default=False).
--invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a data cell value is invalid
(default=ValidationAction.COMPLAIN).
--long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a long line is detected
(default=ValidationAction.COMPLAIN).
--prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a data cell contains a
prohibited list (default=ValidationAction.COMPLAIN).
--short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a short line is detected
(default=ValidationAction.COMPLAIN).
--truncate-long-lines [TRUNCATE_LONG_LINES]
Remove excess trailing columns in long lines
(default=False).
--whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
The action to take when a whitespace line is detected
(default=ValidationAction.EXCLUDE).
Examples¶
Example 1¶
For easiest running, just give the input file and let it write its output to output_embeddings.csv
at current folder
kgtk graph-embeddings -i input_file.tsv -o output_file.tsv
The output_file.tsv may look like:
172131 | 100 |
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Example 2¶
Running with more specific parameters (TransE algorithm and 200-dimensional vectors):
kgtk graph-embeddings
--input-file input_file.tsv \
--output-file output_file.tsv \
--dimension 200 \
--comparator dot \
--operator translation \
--loss_fn softmax \
--learning_rate 0.1
The output_file.tsv
may look like:
172131 | 100 |
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Example 3¶
Using glove format to generate graph embeddings
kgtk graph-embeddings
--input-file input_file.tsv \
--output-file output_file.tsv \
--output_format glove
The output_file.tsv
may look like:
"work" | -0.014022544 -0.062030070 -0.012535412 -0.023111001 -0.038317516 ... |
"home" | -0.014021411 -0.090830070 -0.012534120 -0.073111301 -0.068317516 ... |
Example 4¶
Using kgtk format to generate graph embeddings
kgtk graph-embeddings
--input-file input_file.tsv \
--output-file output_file.tsv \
--output_format kgtk --no-output-headers
The output_file.tsv
may look like:
"work" | graph_embeddings | -0.014022544,-0.062030070,-0.012535412,-0.023111001,-0.038317516 ... |
"home" | graph_embeddings | -0.014021411,-0.090830070,-0.012534120,-0.073111301,-0.068317516 ... |