convert-embeddings-format
Overview¶
The convert-embeddings-command converts a KGTK edge file to word2vec format or Google Projector files format. Currently, only these two file formats are supported, more file formats will be added in a later revision of the command.
The vectors produced by the text-embeddings and the graph-embeddings
commands are stored as ,
(comma) separated string in node2
.
Word2vec Format¶
The word2vec is a text file, where the first line has the number of vectors in the file and the dimension of each vector.
There is one line per entity, with entity id and the vector components separated by space
.
Google Projector File(s) Format¶
Google Projector allows users to visualize high dimensional vectors. Google projector
requires two files, an embeddings
file and a metadata
file.
The embeddings
file contains vectors for each entity, with vector components separated by tabs
.
The metadata
file contains information about each entity like label, description, counts etc. The order of entities in the metadata
file should be same as the embeddings
file.
Usage¶
usage: kgtk convert-embeddings-format [-h] [-i INPUT_FILE]
[--node-file NODE_FILE] [-o OUTPUT_FILE]
[--metadata-file METADATA_FILE]
[--input-property INPUT_PROPERTY]
[--output-format OUTPUT_FORMAT]
[--metadata-columns METADATA_COLUMNS]
Converts KGTK edge embeddings file to word2vec or Google Projector format.Takes an optional node file for Google Project format to create a metadata file. Processes only top 10,000 rows from the edge file for Google Projector as it only accepts 10,000 rows.
Additional options are shown in expert help.
kgtk --expert convert-embeddings-format --help
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
KGTK input files (May be omitted or '-' for stdin.)
--node-file NODE_FILE
The KGTK node file for creating Google Projector
Metadata. All the columns in the node file will be
added to the metadata file by default. You can
customise this with the --metadata-columns option.
(Optional)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--metadata-file METADATA_FILE
The output metadata file for Google Projector. If
--output-format == gprojector and--metadata-file is
not specified, a file named
`kgtk_embeddings_gprojector_metadata.tsv` will be
created in USER_HOME (Optional)
--input-property INPUT_PROPERTY
The property name for embeddings in the input KGTK
edge file. (default=embeddings).
--output-format OUTPUT_FORMAT
The desired output file format: word2vec|gprojector
(default=word2vec)
--metadata-columns METADATA_COLUMNS
A comma separated string of columns names in the input
file or the --node-file to be used for creating the
metadata file for Google projector. Only to be used
when --output-format == 'gprojector'. If --node-file
is specified, the command will look for --metadata-
columns in the --node-file, otherwise input file.The
command will throw an error if the columns specified
are not in either of the files.
Examples¶
Convert KGTK to word2vec format¶
kgtk convert-embeddings-format -i examples/doc/convert_embeddings_edge.tsv --input-property graph_embeddings -o embeddings_word2vec.txt
Let's look at the output word2vec file ,
>>>head embeddings_word2vec.txt
19 30
Q494335 -0.162911773 0.071842454 -0.223435551 -0.289004564 0.834948838 ...
Q1278301 0.039679553 -0.115788229 -0.179974616 0.590080559 0.158913493 ...
Q611586 -0.015744781 -0.020170633 -0.313573331 0.515067458 0.039014913 ...
Q816369 0.488982528 -0.719077468 0.109514274 0.301486224 -0.402110636 ...
Q26833575 0.095825180 -0.207610607 -0.293776900 0.226735979 0.113529690 ...
Convert KGTK to gprojector format, use all columns in the node file for metadata¶
kgtk convert-embeddings-format \
-i examples/doc/convert_embeddings_edge.tsv \
--node-file examples/doc/convert_embeddings_node.tsv \
--output-format gprojector \
--input-property graph_embeddings \
--metadata-file gprojector_metadata.tsv \
-o embeddings_gprojector.tsv
Let's take a look at the metadata
and embeddings
file ,
>>> head gprojector_metadata.tsv
id | label | type | type_label |
---|---|---|---|
Q494335 | Tours University | Q3551775 | university in France |
Q1278301 | Robert Bouline | Q5 | human |
Q611586 | William Monahan | Q5 | human |
Q816369 | rated voltage | Q25428 | voltage |
>>> head embeddings_gprojector.tsv
-0.162911773 | 0.071842454 | -0.223435551 | -0.289004564 | 0.834948838 | -0.373376131 | 1.436196566 | -0.942946911 | ... |
---|---|---|---|---|---|---|---|---|
0.039679553 | -0.115788229 | -0.179974616 | 0.590080559 | 0.158913493 | 0.008464743 | 0.712676883 | 0.380636603 | ... |
-0.015744781 | -0.020170633 | -0.313573331 | 0.515067458 | 0.039014913 | -0.114478707 | 0.770638645 | 0.304640383 | ... |
0.488982528 | -0.719077468 | 0.109514274 | 0.301486224 | -0.402110636 | 0.291337997 | 0.829619348 | -0.365474463 | ... |
0.095825180 | -0.207610607 | -0.293776900 | 0.226735979 | 0.113529690 | 0.536592960 | 0.747583449 | -0.221452087 | ... |
Convert KGTK to gprojector format, use columns: label
and type_label
in the node file for metadata¶
kgtk convert-embeddings-format \
-i examples/doc/convert_embeddings_edge.tsv \
--node-file examples/doc/convert_embeddings_node.tsv \
--metadata-columns label,type_label \
--output-format gprojector \
--input-property graph_embeddings \
--metadata-file gprojector_metadata.tsv \
-o embeddings_gprojector.tsv
Let's take a look at the customized metadata file
>>> head gprojector_metadata.tsv
label | type_label |
---|---|
Tours University | university in France |
Robert Bouline | human |
William Monahan | human |
rated voltage | voltage |
France-Guernsey border | international border |
Convert KGTK to gprojector format, use columns: node1_label
and type
in the edge file for metadata¶
kgtk convert-embeddings-format \
-i examples/doc/convert_embeddings_edge.tsv \
--metadata-columns node1_label,type \
--output-format gprojector \
--input-property graph_embeddings \
--metadata-file gprojector_metadata.tsv \
-o embeddings_gprojector.tsv
Let's take a look at the customized metadata file
>>> head gprojector_metadata.tsv
node1_label | type |
---|---|
Tours University | university in France |
Robert Bouline | human |
William Monahan | human |
rated voltage | voltage |
France-Guernsey border | international border |