{"id":37026840,"url":"https://github.com/dwslab/jrdf2vec","last_synced_at":"2026-01-14T03:10:14.208Z","repository":{"id":38208500,"uuid":"246927552","full_name":"dwslab/jRDF2Vec","owner":"dwslab","description":"A high-performance Java Implementation of RDF2Vec","archived":false,"fork":false,"pushed_at":"2023-02-16T04:58:04.000Z","size":176737,"stargazers_count":42,"open_issues_count":22,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-07-13T01:34:09.615Z","etag":null,"topics":["cli-app","embeddings","generating-walks","knowledge-graph-embeddings","rdf2vec","semantic-web"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dwslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-12T20:44:19.000Z","updated_at":"2024-12-18T13:51:56.000Z","dependencies_parsed_at":"2023-02-01T20:00:56.440Z","dependency_job_id":null,"html_url":"https://github.com/dwslab/jRDF2Vec","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/dwslab/jRDF2Vec","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwslab%2FjRDF2Vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwslab%2FjRDF2Vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwslab%2FjRDF2Vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwslab%2FjRDF2Vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dwslab","download_url":"https://codeload.github.com/dwslab/jRDF2Vec/tar.gz/refs/hea
ds/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwslab%2FjRDF2Vec/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28408814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T01:52:23.358Z","status":"online","status_checked_at":"2026-01-14T02:00:06.678Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli-app","embeddings","generating-walks","knowledge-graph-embeddings","rdf2vec","semantic-web"],"created_at":"2026-01-14T03:10:13.708Z","updated_at":"2026-01-14T03:10:14.189Z","avatar_url":"https://github.com/dwslab.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# jRDF2Vec\n[![Java CI](https://github.com/dwslab/jRDF2Vec/workflows/Java%20CI/badge.svg)](https://github.com/dwslab/jRDF2Vec/actions)\n[![Python Tests](https://github.com/dwslab/jRDF2Vec/actions/workflows/python.yml/badge.svg)](https://github.com/dwslab/jRDF2Vec/actions/workflows/python.yml)\n[![Publish Docker image](https://github.com/dwslab/jRDF2Vec/actions/workflows/publish-docker.yml/badge.svg)](https://github.com/dwslab/jRDF2Vec/actions/workflows/publish-docker.yml)\n[![Coverage 
Status](https://coveralls.io/repos/github/dwslab/jRDF2Vec/badge.svg?branch=master)](https://coveralls.io/github/dwslab/jRDF2Vec?branch=master)\n[![License](https://img.shields.io/github/license/dwslab/jRDF2Vec)](https://github.com/dwslab/jRDF2Vec/blob/master/LICENSE)\n\n\njRDF2Vec is a Java implementation of \u003ca href=\"http://rdf2vec.org/\"\u003eRDF2Vec\u003c/a\u003e. \nIt supports multi-threaded, in-memory (or disk-access-based) walk generation and training.\nYou can generate embeddings for any `NT`, `NQ`, `OWL/XML`, [`RDF HDT`](http://www.rdfhdt.org/), \n[`TDB 1`](https://jena.apache.org/documentation/tdb/), or `TTL` file.\n\nFound a bug? Don't hesitate to \u003ca href=\"https://github.com/dwslab/jRDF2Vec/issues\"\u003eopen an issue\u003c/a\u003e.\n\n**How to cite?**\n```\nPortisch, Jan; Hladik, Michael; Paulheim, Heiko. RDF2Vec Light - A Lightweight Approach for Knowledge Graph Embeddings. Proceedings of the ISWC 2020 Posters \u0026 Demonstrations. 2020. [to appear]\n```\nAn open-access version of the paper is available [here](https://arxiv.org/pdf/2009.07659.pdf).\n\n## How to use the jRDF2Vec Command-Line Interface?\nDownload this project and execute `mvn clean install`.\nAlternatively, you can download the packaged JAR of the latest successful commit \n\u003ca href=\"https://github.com/dwslab/jRDF2Vec/tree/jars/jars\"\u003ehere\u003c/a\u003e. 
\n\n### System Requirements\n- Java 8 or later.\n- Python 3.8 or later with the dependencies described in [requirements.txt](/src/main/resources/requirements.txt) installed.\u003cbr\u003e \n  (Conda users can directly use the [environment.yml](/src/main/resources/environment.yml) file.)\n\nYou can check if you set up the environment (Python 3 + dependencies) correctly by running:\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -checkInstallation\n```\nThe command line output will list missing requirements or print `Installation is ok ✔`.\n\n\n### Command-Line Interface (jRDF2Vec CLI) for Training and Walk Generation\nUse the resulting jar from the `target` directory.\n\n*Minimal Example*\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -graph ./kg_file.hdt\n```\n\n#### Required Parameters\n- `-graph \u003cgraph_file\u003e`\u003cbr/\u003e\nThe file containing the knowledge graph for which you want to generate embeddings. The `\u003cgraph_file\u003e` can be any triple file, HDT file, a directory which contains NT files, or a TDB1 directory.\n\n#### Optional Parameters\n*jRDF2Vec* follows the \u003ca href=\"https://en.wikipedia.org/wiki/Convention_over_configuration\"\u003econvention over \nconfiguration\u003c/a\u003e design paradigm to increase usability. You can override the default values by setting one or more optional parameters.\n\n**Parameters for the Walk Configuration**\n- `-onlyWalks`\u003cbr\u003e\nIf added to the call, this switch will deactivate the training part so that only walks are generated. If training parameters are specified, they are ignored. The walk generation also works with the `-light` parameter.\n- `-light \u003centity_file\u003e`\u003cbr/\u003e\nIf you intend to use *RDF2VecLight*, you have to use this switch followed by the path to the file describing the entities for which you require an embedding space. 
The file should contain one entity (full URI) per line.\n- `-numberOfWalks \u003cnumber\u003e` (default: `100`)\u003cbr/\u003e\nThe number of walks to be performed per entity.\n- `-depth \u003cdepth\u003e` (default: `4`)\u003cbr/\u003e\n  This parameter controls the depth of each walk. Depth is defined as the number of hops. Hence, you can also set an odd number. A depth of 1 leads to a sentence in the form `\u003cs p o\u003e`.\n- `-walkGenerationMode \u003cMID_WALKS | MID_WALKS_DUPLICATE_FREE | RANDOM_WALKS | RANDOM_WALKS_DUPLICATE_FREE\u003e` \n(default for light: `MID_WALKS`, default for classic: `RANDOM_WALKS_DUPLICATE_FREE`)\u003cbr/\u003e\nThis parameter determines the mode for the walk generation (multiple walk generation algorithms are available). \n- `-threads \u003cnumber_of_threads\u003e` (default: `(# of available processors) / 2`)\u003cbr/\u003e\nThis parameter allows you to set the number of threads that shall be used for the walk generation as well as for the training.\n- `-walkDirectory \u003cdirectory where walk files shall be generated/reside\u003e`\u003cbr/\u003e\nThe directory where the walks shall be generated into. In case of `-onlyTraining`, the directory where the walks reside.\n- `-embedText`\u003cbr\u003e\nIf added to the call, this switch will also generate walks that contain textual fragments of datatype properties.\n\n**Parameters for the Training Configuration**\n- `-onlyTraining`\u003cbr/\u003e\nIf added to the call, this switch will deactivate the walk generation part so that only the training is performed. The parameter `-walkDirectory` must be set. If walk generation parameters are specified, they are ignored.\n- `-trainingMode \u003ccbow | sg\u003e` (default: `sg`) \u003cbr/\u003e\nThis parameter controls the mode to be used for the word2vec training. 
Allowed values are `cbow` and `sg`.\n- `-dimension \u003csize_of_vector\u003e` (default: `200`)\u003cbr/\u003e\nThis parameter allows you to control the size of the resulting vectors (e.g. 100 for 100-dimensional vectors).\n- `-minCount \u003cnumber\u003e` (default: `1`)\u003cbr/\u003e\nThis parameter controls the minimum word count for the word2vec training. Unlike in the gensim defaults, this parameter is set to 1 by default because for knowledge graph embeddings, a vector for each node/arc is desired.\n- `-noVectorTextFileGeneration` | `-vectorTextFileGeneration`\u003cbr/\u003e\nA switch which indicates whether a text file with the vectors shall be persisted on the disk. This is enabled by default. Use `-noVectorTextFileGeneration` to disable the file generation.\n- `-sample \u003crate\u003e` (default: `0.0`)\u003cbr/\u003e\nThe threshold for configuring which higher-frequency words are randomly downsampled. According to the gensim framework, a useful range is (0, 1e-5).\n- `-window \u003cwindow_size\u003e` (default: `5`)\u003cbr/\u003e\nThe size of the window in the training process.\n- `-epochs \u003cnumber_of_epochs\u003e` (default: `5`)\u003cbr/\u003e\nThe number of epochs to use in training.\n- `-port \u003cport_number\u003e` (default: `1808`)\u003cbr/\u003e\nThe port that shall be used for the server.\n\n**Advanced Parameters**\n- `-continue \u003cexisting_walk_directory\u003e`\u003cbr/\u003e\n  In some cases, old walks need to be re-used (e.g. if the program was interrupted after 48h). \n  With the `-continue` option, the walk generation can be continued; this means that old walks will be re-used and only\n  missing walks are generated. This does not work for MID_WALKS (and flavors). 
If you do not need to generate additional \n  walks use `-onlyTraining` instead.\n  \n\n### Command-Line Interface (jRDF2Vec CLI) - Additional Services\nBesides generating walks and training embeddings, the CLI offers additional services which are described below.\n\n#### Generating a Vector Text File\n\n*(1) Full Vocabulary*\u003cbr/\u003e\njRDF2vec is compatible with the \u003ca href=\"https://github.com/mariaangelapellegrino/Evaluation-Framework\"\u003eevaluation \nframework for KG embeddings (GEval)\u003c/a\u003e. \nThe latter framework requires the vectors to be present in a text file. If you have a gensim model or vector file, \nyou can use the following command to generate this file:\n\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -generateTextVectorFile ./path-to-your-model-or-vector-file\n```\nYou can find the file (named `vectors.txt`) in the directory where the model/vector file is located.\nIf you want to specify the file name/path yourself, you can use option `-newFile \u003cfile_path\u003e`.\n\n*(2) Subset of the  Vocabulary*\u003cbr/\u003e\nIf you want to write a `vectors.txt` file that contains only a subset of the vocabulary, you can additionally \nspecify the entities of interest using the `-light \u003centity_file\u003e` option (The `\u003centity_file\u003e` should contain one entity \n(full URI) per line.):\n\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -generateTextVectorFile ./path-to-your-model-or-vector-file -light ./path-to-entity-file\n```\nYou can find the file (named `vectors.txt`) in the directory where the model/vector file is located.\nIf you want to specify the file name/path yourself, you can use option `-newFile \u003cfile_path\u003e`.\nIf the vector concepts contain surrounding tags that you want to remove in the process, use option `-noTags`.\nThis command also works if `./path-to-your-model-or-vector-file` is an existing vector text file that shall be reduced.\n\n#### Generating a Vocabulary Text File\njRDF2vec provides 
functionality to print all concepts for which a vector has been trained.\nOne word of the vocabulary will be printed per line to a file named `vocabulary.txt`.\nThe model or vector file needs to be specified. If you have a gensim model or vector file, you can\nuse the following command to generate this file:\n\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -generateVocabularyFile ./path-to-your-model-or-vector-file\n```\n\n#### Converting a Text Vector File\njRDF2vec generates a `vectors.txt` file where one line represents a vector. This is the format also used by \n[GloVe](https://github.com/stanfordnlp/GloVe), for instance. \nIn some cases, however, other file formats are required. You can use jRDF2vec to convert text vector files to other\ncommon formats. The vector file does not have to be generated by jRDF2vec.\n\n*(1) Converting to w2v Format*\u003cbr/\u003e\nTo create a word2vec formatted file from the text file, you can use the following command:\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -convertToW2V \u003ctxt_file_path\u003e \u003cnew_file.w2v\u003e\n```\n\n*(2) Converting to kv Format*\u003cbr/\u003e\nThe provided txt file (first parameter) can be either in `txt` format or in `w2v` format. Make sure you use the \ncorrect file ending (`.txt`/`.w2v`).\n\nYou can run the command as follows:\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -convertToKv \u003ctxt_file_path\u003e \u003cnew_file.kv\u003e\n```\n\n*(3) Converting to Tensorflow Projector Format*\u003cbr/\u003e\nIf you want to visualize your embedding space by using the [Tensorflow Projector](http://projector.tensorflow.org/),\nyou can do so by converting your `vectors.txt` file to the two files required by the tool. Use the following command:\n```\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -convertToTfProjector \u003ctxt_file_path\u003e [\u003cvectors.tsv\u003e \u003cmetadata.tsv\u003e]\n```\nTwo additional `.tsv` files will be generated. 
You can find them in the same directory where `\u003ctxt_file_path\u003e` is \nlocated.\n\nOptionally, you can specify the paths of the files to be written as indicated in the command above.\n\n#### Analyzing the Embedding Vocabulary\nFor RDF2Vec, it is not always guaranteed that all concepts in the graph appear in the embedding space. For example,\nsome concepts may only appear in the object position of statements and may never be reached by random walks.\nIn addition, the word2vec configuration parameters may filter out infrequent words depending on the configuration (see\n`-minCount` above, for example). To analyze such rare cases, you can use the `-analyzeVocab` function specified\nas follows:\n\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -analyzeVocab \u003cmodel\u003e \u003ctraining_file|entity_file\u003e\n```\n- `\u003cmodel\u003e` refers to any model representation such as gensim model file, `.kv` file, or `.txt` file. Just make sure\n  you use the correct file endings.\n  \n- `\u003ctraining_file|entity_file\u003e` refers either to the NT/TTL etc. file that has been used to train the model *or* to a \n  text file containing the concepts you want to check (one concept per line in the text file, make sure the file ending is \n  `.txt`).\n  \n\nA report will be printed. For large models, you may want to redirect that into a file (`[...] \u0026\u003e somefile.txt`).\n\n#### Merge of All Walk Files Into One\nBy default, jRDF2vec serializes walks in different gzipped files. If you require a single,\nuncompressed file, you can use the `-mergeWalks` keyword. 
You need to provide a\n`-walkDirectory \u003cdir\u003e` and you can optionally specify the output file using `-o \u003cfile_path\u003e`.\n(Files not ending with `.gz` in `\u003cdir\u003e` will be skipped.)\n\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -mergeWalks -walkDirectory \u003cdir\u003e -o \u003cfile_to_write\u003e\n```\n\n#### Converting the Graph File\n\n*(1) Converting to PajekNet*\u003cbr/\u003e\nTo create a graph file in the \u003ca href=\"https://gephi.org/users/supported-graph-formats/pajek-net-format/\"\u003ePajekNet format\u003c/a\u003e (e.g. for graph analysis), you can use the following command:\n```bash\njava -jar jrdf2vec-1.1-SNAPSHOT.jar -convertToPajek \u003cgraph\u003e \u003cfile_to_write\u003e\n```\n\n## How to use jRDF2Vec as a library in Java projects?\nStable releases are available through the Maven Central repository:\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003ede.uni-mannheim.informatik.dws\u003c/groupId\u003e\n    \u003cartifactId\u003ejrdf2vec\u003c/artifactId\u003e\n    \u003cversion\u003e1.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Run jRDF2Vec using Docker\n\n[![Publish Docker image](https://github.com/dwslab/jRDF2Vec/actions/workflows/publish-docker.yml/badge.svg)](https://github.com/dwslab/jRDF2Vec/actions/workflows/publish-docker.yml) \n\nOptionally, Docker can be used to run jRDF2Vec. 
This functionality has been added by \u003ca href=\"https://github.com/vemonet\"\u003eVincent Emonet\u003c/a\u003e.\n\n### Run\n\nThe Docker image can be used with the same arguments as the JAR file; refer to the documentation above for more details on the different jRDF2Vec arguments.\n\nTest run to get the help message:\n\n```bash\ndocker run -it --rm ghcr.io/dwslab/jrdf2vec -help\n```\n\nThe best way to mount your local files in the docker container is to mount a folder on `/data` in the container:\n\n* On Linux and MacOS: use `$(pwd)` to mount the current working directory\n* On Windows:  use `${PWD}` to mount the current working directory (and write the command on one line)\n\nHere is an example generating embeddings using sample config files for DBpedia found in [`src/test/resources`](https://github.com/dwslab/jRDF2Vec/tree/master/src/test/resources) in this repository. Use this command from the root folder of this repository on Linux or MacOS, change the `$(pwd)` to `${PWD}` for Windows:\n\n```bash\ndocker run -it --rm \\\n  -v $(pwd):/data \\\n  ghcr.io/dwslab/jrdf2vec \\\n  -light /data/src/test/resources/sample_dbpedia_entity_file.txt \\\n  -graph /data/src/test/resources/sample_dbpedia_nt_file.nt\n```\n\n\u003e Embeddings will be generated in the folders `walks` and `python_server` from where you ran the command.\n\n### Build\n\nA new docker image is automatically built and published to the GitHub Container Registry by a [GitHub Actions workflow](https://github.com/dwslab/jRDF2Vec/actions/workflows/publish-docker.yml): \n\n* The `latest` image tag is updated every time a commit is pushed to the `master` branch\n* A new image tag is created for every new release published following the scheme `v0.0.0`\n\nBuild from source code:\n\n```bash\ndocker build -t ghcr.io/dwslab/jrdf2vec .\n```\n\n## Developer Documentation\nThe most recent JavaDoc sites generated from the latest commit can be found \u003ca 
href=\"https://dwslab.github.io/jRDF2Vec/\"\u003ehere\u003c/a\u003e.\u003cbr/\u003e\n\n## Special Applications\n\n### Ordered RDF2Vec (\"Putting RDF2vec in Order\")\nThe following steps are necessary to obtain ordered RDF2vec embeddings (see publication [Putting RDF2vec in Order](https://arxiv.org/pdf/2108.05280.pdf) for conceptual details).\n\n**Step 1: Generate Walks**\u003cbr/\u003e\nRun jRDF2Vec to generate only walks (option [`-onlyWalks`](#optional-parameters)) on your desired dataset.\n\n**Step 2: Merge the Walks into a single, uncompressed file**\u003cbr/\u003e\nBy default, jRDF2Vec serializes the walks in multiple gzipped files. For this application, however, we need a single,\nuncompressed walk file.\n\nYou can use the [corresponding jRDF2Vec command line service](#merge-of-all-walk-files-into-one) to do so.\n\n**Step 3: Compile wang2vec**\u003cbr/\u003e\nDownload the C implementation of [wang2vec from GitHub](https://github.com/wlin12/wang2vec).\nCompile the files with `make`.\n\n**Step 4: Run and have fun**\u003cbr/\u003e\nRun the compiled wang2vec implementation on the merged walk file from step 2. In case you receive a `segfault` error,\nset the capping parameter to 1 (`-cap 1`).\n\n*Call Syntax*\u003cbr/\u003e\n```bash\n./word2vec -train \u003cyour walk file\u003e -output \u003cdesired file to be written\u003e -type \u003c2 (cwindow) or 3 (structured skipgram)\u003e -size \u003cvector size\u003e -threads \u003cnumber of threads\u003e -min-count 0 -cap 1  \n```\n\n*Exemplary Call*\u003cbr/\u003e\n```bash\n./word2vec -train walks.txt -output v100.txt -type 3 -size 100 -threads 4 -min-count 0 -cap 1  \n```\n\n**Not working? Contact us or open an issue.**\n\nPlease do not forget to cite the corresponding papers:\n\n```\n(1)  Portisch, Jan; Paulheim, Heiko. Putting RDF2vec in Order. In: Proceedings of the International Semantic Web \nConference - Posters and Demos, ISWC 2021. 2021. \n\n(2) Ling, Wang; Dyer, Chris; Black, Alan; Trancoso, Isabel. 
Two/too simple adaptations of word2vec for syntax \nproblems. In: NAACL HLT 2015. pp. 1299–1304. ACL (2015)\n```\n\n\n## Frequently Asked Questions (FAQs)\n**I have Python installed, but it is not accessible via command `python`. How to resolve this?**\u003cbr/\u003e\nCreate a file `python_command.txt` in directory `./python_server` (created when first running the jar). Write the command\nto call Python 3 in the first line of the file.\n\n**The program starts and immediately shuts down. Nothing seems to happen.**\u003cbr/\u003e\nMake sure your system is set up correctly; in particular, check that you have installed Python 3 and the required \ndependencies.\n\n**Can I run the command multiple times in parallel on the same machine?**\u003cbr/\u003e\nYes, you can. You need to make sure that for each command, you use (1) a different `-port` and (2) a different \n`-walkDirectory`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdwslab%2Fjrdf2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdwslab%2Fjrdf2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdwslab%2Fjrdf2vec/lists"}