{"id":13611000,"url":"https://github.com/predict-idlab/pyRDF2Vec","last_synced_at":"2025-04-13T01:34:01.305Z","repository":{"id":38365655,"uuid":"191751197","full_name":"predict-idlab/pyRDF2Vec","owner":"predict-idlab","description":"🐍 Python Implementation and Extension of RDF2Vec","archived":false,"fork":false,"pushed_at":"2024-11-01T21:09:00.000Z","size":7065,"stargazers_count":246,"open_issues_count":27,"forks_count":51,"subscribers_count":15,"default_branch":"main","last_synced_at":"2024-11-03T02:20:32.422Z","etag":null,"topics":["cbow","embeddings","feature-matrix","knowledge-graph","machine-learning","rdf2vec","rdflib","skip-gram","unsupervised-learning","walking-strategy","word2vec"],"latest_commit_sha":null,"homepage":"https://pyrdf2vec.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/predict-idlab.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.rst","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-13T11:36:12.000Z","updated_at":"2024-10-14T20:35:15.000Z","dependencies_parsed_at":"2023-02-17T19:16:08.241Z","dependency_job_id":"4bc9531e-ab32-43e1-9861-e01842cee090","html_url":"https://github.com/predict-idlab/pyRDF2Vec","commit_stats":{"total_commits":1233,"total_committers":9,"mean_commits":137.0,"dds":0.2392538523925385,"last_synced_commit":"940ef534cd44698dfb625a0f55a47b781a8dacae"},"previous_names":["predict-idlab/pyrdf2vec","ibcnservices/pyrdf2vec"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/predict-idlab%2FpyRDF2Vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FpyRDF2Vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FpyRDF2Vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FpyRDF2Vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/predict-idlab","download_url":"https://codeload.github.com/predict-idlab/pyRDF2Vec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223558463,"owners_count":17165134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cbow","embeddings","feature-matrix","knowledge-graph","machine-learning","rdf2vec","rdflib","skip-gram","unsupervised-learning","walking-strategy","word2vec"],"created_at":"2024-08-01T19:01:50.734Z","updated_at":"2024-11-07T17:31:06.158Z","avatar_url":"https://github.com/predict-idlab.png","language":"Python","funding_links":[],"categories":["图数据处理"],"sub_categories":[],"readme":"\n.. 
raw:: html\n\n   \u003cp align=\"center\"\u003e\n       \u003cimg width=\"100%\" src=\"assets/embeddings.svg\"\u003e\n   \u003c/p\u003e\n   \u003cp align=\"center\"\u003e\n       \u003ca href=\"https://www.ugent.be/ea/idlab/en\"\u003e\n           \u003cimg src=\"assets/imec-idlab.svg\" alt=\"Logo\" width=350\u003e\n       \u003c/a\u003e\n   \u003c/p\u003e\n   \u003cp align=\"center\"\u003e\n       \u003ca href=\"https://pypi.org/project/pyrdf2vec/\"\u003e\n           \u003cimg src=\"https://img.shields.io/pypi/pyversions/pyrdf2vec.svg\" alt=\"Python Versions\"\u003e\n       \u003c/a\u003e\n       \u003ca href=\"https://pypi.org/project/pyrdf2vec\"\u003e\n           \u003cimg src=\"https://img.shields.io/pypi/v/pyrdf2vec?logo=pypi\u0026color=1082C2\" alt=\"Version\"\u003e\n       \u003c/a\u003e\n       \u003ca href=\"https://pypi.org/project/pyrdf2vec\"\u003e\n           \u003cimg src=\"https://img.shields.io/pypi/dm/pyrdf2vec.svg?logo=pypi\u0026color=1082C2\" alt=\"Downloads\"\u003e\n       \u003c/a\u003e\n       \u003ca href=\"https://github.com/IBCNServices/pyRDF2Vec/blob/main/LICENSE\"\u003e\n           \u003cimg src=\"https://img.shields.io/github/license/IBCNServices/pyRDF2vec\" alt=\"License\"\u003e\n       \u003c/a\u003e\n   \u003c/p\u003e\n   \u003cp align=\"center\"\u003e\n       \u003ca href=\"https://github.com/IBCNServices/pyRDF2Vec/actions\"\u003e\n           \u003cimg src=\"https://github.com/IBCNServices/pyRDF2Vec/workflows/CI/badge.svg\" alt=\"Actions Status\"\u003e\n       \u003c/a\u003e\n        \u003ca href=\"https://pyrdf2vec.readthedocs.io/en/latest/?badge=latest\"\u003e\n           \u003cimg src=\"https://readthedocs.org/projects/pyrdf2vec/badge/?version=latest\" alt=\"Documentation Status\"\u003e\n       \u003c/a\u003e\n        \u003ca href=\"https://codecov.io/gh/IBCNServices/pyRDF2Vec?branch=main\"\u003e\n           \u003cimg src=\"https://codecov.io/gh/IBCNServices/pyRDF2Vec/coverage.svg?branch=main\u0026precision=2\" alt=\"Coverage 
Status\"\u003e\n       \u003c/a\u003e\n       \u003ca href=\"https://github.com/psf/black\"\u003e\n           \u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" alt=\"Code style: black\"\u003e\n       \u003c/a\u003e\n   \u003c/p\u003e\n   \u003cp align=\"center\"\u003ePython implementation and extension of \u003ca\n   href=\"http://rdf2vec.org/\"\u003eRDF2Vec\u003c/a\u003e \u003cb\u003eto create a 2D feature matrix from\n   a Knowledge Graph\u003c/b\u003e for downstream ML tasks.\u003c/p\u003e\n\n--------------\n\n.. raw:: html\n\n   \u003cp align=\"center\"\u003e\n     \u003cimg width=\"100%\" src=\"./assets/header.svg\" alt=\"text\"\u003e\n   \u003c/p\u003e\n\n.. rdf2vec-begin\n\nWhat is RDF2Vec?\n----------------\n\nRDF2Vec is an unsupervised technique that builds on\n`Word2Vec \u003chttps://en.wikipedia.org/wiki/Word2vec\u003e`__, where an\nembedding is learned per word, in two ways:\n\n1. **the word based on its context**: Continuous Bag-of-Words (CBOW);\n2. **the context based on a word**: Skip-Gram (SG).\n\nTo create these embeddings, RDF2Vec first extracts walks of a certain depth\nfrom a Knowledge Graph; these walks form the \"sentences\" that are fed to\nWord2Vec.\n\nThis repository contains an implementation of the algorithm in \"RDF2Vec:\nRDF Graph Embeddings and Their Applications\" by Petar Ristoski, Jessica\nRosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim\n(`[paper] \u003chttp://semantic-web-journal.net/content/rdf2vec-rdf-graph-embeddings-and-their-applications-0\u003e`__\n`[original\ncode] \u003chttp://data.dws.informatik.uni-mannheim.de/rdf2vec/\u003e`__).\n\n**Recently,** `a book about RDF2Vec \u003chttp://rdf2vec.org/book\u003e`__ **was published by Heiko Paulheim, Jan Portisch, and Petar Ristoski. The book is a great introduction to what RDF2Vec is, and what can be done with it. The examples in the book use pyRDF2Vec, so it is recommended to have a look at it!**\n\n.. rdf2vec-end\n.. 
getting-started-begin\n\nGetting Started\n---------------\n\nFor most use cases, here is how ``pyRDF2Vec`` should be used to generate\nembeddings and get literals from a given Knowledge Graph (KG) and entities:\n\n.. code:: python\n\n   import pandas as pd\n\n   from pyrdf2vec import RDF2VecTransformer\n   from pyrdf2vec.embedders import Word2Vec\n   from pyrdf2vec.graphs import KG\n   from pyrdf2vec.walkers import RandomWalker\n\n   # Read a TSV file containing the entities we want to classify.\n   data = pd.read_csv(\"samples/countries-cities/entities.tsv\", sep=\"\\t\")\n   entities = [entity for entity in data[\"location\"]]\n   print(entities)\n   # [\n   #    \"http://dbpedia.org/resource/Belgium\",\n   #    \"http://dbpedia.org/resource/France\",\n   #    \"http://dbpedia.org/resource/Germany\",\n   # ]\n\n   # Define our knowledge graph (here: DBpedia SPARQL endpoint).\n   knowledge_graph = KG(\n       \"https://dbpedia.org/sparql\",\n       skip_predicates={\"www.w3.org/1999/02/22-rdf-syntax-ns#type\"},\n       literals=[\n           [\n               \"http://dbpedia.org/ontology/wikiPageWikiLink\",\n               \"http://www.w3.org/2004/02/skos/core#prefLabel\",\n           ],\n           [\"http://dbpedia.org/ontology/humanDevelopmentIndex\"],\n       ],\n   )\n   # Create our transformer, setting the embedding \u0026 walking strategy.\n   transformer = RDF2VecTransformer(\n       Word2Vec(epochs=10),\n       walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],\n       # verbose=1\n   )\n   # Get our embeddings.\n   embeddings, literals = transformer.fit_transform(knowledge_graph, entities)\n   print(embeddings)\n   # [\n   #     array([ 1.5737595e-04,  1.1333118e-03, -2.9838676e-04,  ..., -5.3064007e-04,\n   #             4.3192197e-04,  1.4529384e-03], dtype=float32),\n   #     array([-5.9027621e-04,  6.1689125e-04, -1.1987977e-03,  ...,  1.1066757e-03,\n   #            -1.0603866e-05,  6.6087965e-04], dtype=float32),\n   #     array([ 
7.9996325e-04,  7.2907173e-04, -1.9482171e-04,  ...,  5.6251377e-04,\n   #             4.1435464e-04,  1.4478950e-04], dtype=float32)\n   # ]\n\n   print(literals)\n   # [\n   #     [('1830 establishments in Belgium', 'States and territories established in 1830',\n   #       'Western European countries', ..., 'Member states of the Organisation\n   #       internationale de la Francophonie', 'Member states of the Union for the\n   #       Mediterranean', 'Member states of the United Nations'), 0.919],\n   #     [('Group of Eight nations', 'Southwestern European countries', '1792\n   #       establishments in Europe', ..., 'Member states of the Union for the\n   #       Mediterranean', 'Member states of the United Nations', 'Transcontinental\n   #       countries'), 0.891],\n   #     [('Germany', 'Group of Eight nations', 'Articles containing video clips', ...,\n   #       'Member states of the European Union', 'Member states of the Union for the\n   #       Mediterranean', 'Member states of the United Nations'), 0.939]\n   # ]\n\nIf you are using a dataset other than MUTAG (where the entities of interest\nhave no parents in the KG), it is **highly recommended** to specify\n``with_reverse=True`` (defaults to ``False``) in the walking strategy (e.g.,\n``RandomWalker``). 
This parameter **allows Word2Vec** to have a better\nlearning window for an entity based on its parents and children and thus\n**predict test data with better accuracy**.\n\nFor a more concrete walkthrough, a blog post with a tutorial on how to use\n``pyRDF2Vec`` is available `here\n\u003chttps://towardsdatascience.com/how-to-create-representations-of-entities-in-a-knowledge-graph-using-pyrdf2vec-82e44dad1a0\u003e`__.\n\n**NOTE:** this blog post uses an older version of ``pyRDF2Vec``, so some\ncommands need to be adapted.\n\nIf you run the above snippet, you will not necessarily get the same\nembeddings, because the random state is not fixed by default. However,\nreproducible embeddings remain possible (**SEE:** `FAQ \u003c#faq\u003e`__).\n\nInstallation\n~~~~~~~~~~~~\n\n``pyRDF2Vec`` can be installed in three ways:\n\n1. from `PyPI \u003chttps://pypi.org/project/pyrdf2vec\u003e`__ using ``pip``:\n\n.. code:: bash\n\n   pip install pyRDF2vec\n\n2. from any compatible Python dependency manager (e.g., ``poetry``):\n\n.. code:: bash\n\n   poetry add pyRDF2vec\n\n3. from source:\n\n.. code:: bash\n\n   git clone https://github.com/IBCNServices/pyRDF2Vec.git\n   cd pyRDF2Vec\n   pip install .\n\n\nIntroduction\n~~~~~~~~~~~~\n\nTo create embeddings for a list of entities, there are two steps to do\nbeforehand:\n\n1. **use a KG**;\n2. **define a walking strategy**.\n\nFor more elaborate examples, check the `examples\n\u003chttps://github.com/IBCNServices/pyRDF2Vec/blob/main/examples\u003e`__ folder.\n\nIf no sampling strategy is defined, ``UniformSampler`` is used. Similarly for\nthe embedding techniques, ``Word2Vec`` is used by default.\n\nUse a Knowledge Graph\n~~~~~~~~~~~~~~~~~~~~~\n\nTo use a KG, you can initialize it in three ways:\n\n1. **From an endpoint server using SPARQL**:\n\n.. 
code:: python\n\n   from pyrdf2vec.graphs import KG\n\n   # Define the DBpedia endpoint server, as well as a set of predicates to\n   # exclude from this KG and a list of predicate chains to fetch the literals.\n   KG(\n       \"https://dbpedia.org/sparql\",\n       skip_predicates={\"www.w3.org/1999/02/22-rdf-syntax-ns#type\"},\n       literals=[\n           [\n               \"http://dbpedia.org/ontology/wikiPageWikiLink\",\n               \"http://www.w3.org/2004/02/skos/core#prefLabel\",\n           ],\n           [\"http://dbpedia.org/ontology/humanDevelopmentIndex\"],\n       ],\n   )\n\n2. **From a file using RDFLib**:\n\n.. code:: python\n\n   from pyrdf2vec.graphs import KG\n\n   # Define the MUTAG KG, as well as a set of predicates to exclude from\n   # this KG and a list of predicate chains to get the literals.\n   KG(\n       \"samples/mutag/mutag.owl\",\n       skip_predicates={\"http://dl-learner.org/carcinogenesis#isMutagenic\"},\n       literals=[\n           [\n               \"http://dl-learner.org/carcinogenesis#hasBond\",\n               \"http://dl-learner.org/carcinogenesis#inBond\",\n           ],\n           [\n               \"http://dl-learner.org/carcinogenesis#hasAtom\",\n               \"http://dl-learner.org/carcinogenesis#charge\",\n           ],\n       ],\n   )\n\n3. **From scratch**:\n\n.. 
code:: python\n\n   from pyrdf2vec.graphs import KG, Vertex\n\n   GRAPH = [\n       [\"Alice\", \"knows\", \"Bob\"],\n       [\"Alice\", \"knows\", \"Dean\"],\n       [\"Dean\", \"loves\", \"Alice\"],\n   ]\n   URL = \"http://pyRDF2Vec\"\n   CUSTOM_KG = KG()\n\n   for row in GRAPH:\n       subj = Vertex(f\"{URL}#{row[0]}\")\n       obj = Vertex(f\"{URL}#{row[2]}\")\n       pred = Vertex(f\"{URL}#{row[1]}\", predicate=True, vprev=subj, vnext=obj)\n       CUSTOM_KG.add_walk(subj, pred, obj)\n\nDefine Walking Strategies With Their Sampling Strategy\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nAll supported walking strategies can be found on the\n`Wiki\npage \u003chttps://github.com/IBCNServices/pyRDF2Vec/wiki/Walking-Strategies\u003e`__.\n\nSince the number of walks grows exponentially with the depth,\nexhaustively extracting all walks quickly becomes infeasible for larger\nKnowledge Graphs. To avoid this issue, `sampling strategies\n\u003chttp://www.heikopaulheim.com/docs/wims2017.pdf\u003e`__ can be applied. These\nextract a fixed maximum number of walks per entity and sample the walks\naccording to a certain metric.\n\nFor example, if one wants to extract a maximum of 10 walks of a maximum depth\nof 4 for each entity using the random walking strategy and PageRank sampling\nstrategy, the following code snippet can be used:\n\n.. code:: python\n\n   from pyrdf2vec.samplers import PageRankSampler\n   from pyrdf2vec.walkers import RandomWalker\n\n   walkers = [RandomWalker(4, 10, PageRankSampler())]\n\n.. getting-started-end\n\nSpeed up the Extraction of Walks\n--------------------------------\n\nThe extraction of walks can take hours, or even days or more, in some cases. 
That's\nwhy it is important to use certain attributes and optimize ``pyRDF2Vec``\nparameters as much as possible according to your use cases.\n\nThis section aims to help you set up these parameters with some advice.\n\nConfigure the ``n_jobs`` attribute to use multiple processors\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nBy default, multiprocessing is disabled (``n_jobs=1``). If your machine allows\nit, it is recommended to use multiprocessing by increasing the number of\nprocessors used for the extraction of walks:\n\n.. code:: python\n\n   from pyrdf2vec import RDF2VecTransformer\n   from pyrdf2vec.walkers import RandomWalker\n\n   RDF2VecTransformer(walkers=[RandomWalker(4, 10, n_jobs=4)])\n\nIn the above snippet, the random walking strategy will use 4 processors to\nextract the walks, whether for a local or remote KG.\n\n**WARNING: using a large number of processors may violate the policy of some\nSPARQL endpoint servers**. With multiprocessing, each processor sends a SPARQL\nrequest to the same server to fetch the hops of the entity it is processing.\nSince these requests may take place within a short time, the server could\nconsider them a Denial-of-Service (DoS) attack. These risks are multiplied in\nthe absence of a cache and when there is a large number of entities to process.\n\nBundle SPARQL requests\n~~~~~~~~~~~~~~~~~~~~~~\n\nBy default, SPARQL request bundling is disabled\n(``mul_req=False``). However, if you are using a remote KG and have a large\nnumber of entities, this option can greatly speed up the extraction of walks:\n\n.. 
code:: python\n\n   import pandas as pd\n\n   from pyrdf2vec import RDF2VecTransformer\n   from pyrdf2vec.graphs import KG\n   from pyrdf2vec.walkers import RandomWalker\n\n   data = pd.read_csv(\"samples/countries-cities/entities.tsv\", sep=\"\\t\")\n\n   RDF2VecTransformer(walkers=[RandomWalker(4, 10)]).fit_transform(\n       KG(\"https://dbpedia.org/sparql\", mul_req=True),\n       [entity for entity in data[\"location\"]],\n   )\n\nIn the above snippet, the KG instructs its internal connector to fetch the\nhops of the specified entities asynchronously. These hops\nwill then be stored in cache and be accessed by the walking strategy to\naccelerate the extraction of walks for these entities.\n\n**WARNING: bundling SPARQL requests for a number of entities considered too\nlarge may violate the policy of some SPARQL endpoint servers**. As with the\nuse of multiprocessing (which can be combined with ``mul_req``), sending a\nlarge number of SPARQL requests simultaneously could be seen by a server as a\nDoS attack. Be aware that the number of entities you have in your file\ncorresponds to the number of simultaneous requests that will be made and\nstored in cache.\n\nModify the Cache Settings\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\nBy default, ``pyRDF2Vec`` uses a cache that provides a `Least Recently Used\n(LRU) \u003chttps://www.interviewcake.com/concept/java/lru-cache\u003e`__ policy, with a\nsize that can hold 1024 entries, and a Time To Live (TTL) of 1200 seconds. For\nsome use cases, you would probably want to modify the `cache policy\n\u003chttps://cachetools.readthedocs.io/en/stable/\u003e`__, increase (or decrease) the\ncache size and/or change the TTL:\n\n.. 
code:: python\n\n   import pandas as pd\n   from cachetools import MRUCache\n\n   from pyrdf2vec import RDF2VecTransformer\n   from pyrdf2vec.graphs import KG\n   from pyrdf2vec.walkers import RandomWalker\n\n   data = pd.read_csv(\"samples/countries-cities/entities.tsv\", sep=\"\\t\")\n\n   RDF2VecTransformer(walkers=[RandomWalker(4, 10)]).fit_transform(\n       KG(\"https://dbpedia.org/sparql\", cache=MRUCache(maxsize=2048)),\n       [entity for entity in data[\"location\"]],\n   )\n\nModify the Walking Strategy Settings\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nBy default, ``pyRDF2Vec`` uses ``[RandomWalker(2, None, UniformSampler())]`` as\nwalking strategy. Using a greater maximum depth implies a longer extraction\ntime for walks. In addition, using ``max_walks=None`` extracts more walks\nand is faster in most cases than specifying a number (**SEE:** `FAQ \u003c#faq\u003e`__).\n\nIn some cases, using another sampling strategy can speed up the extraction of\nwalks by assigning a higher weight to some paths than others:\n\n.. code:: python\n\n   import pandas as pd\n\n   from pyrdf2vec import RDF2VecTransformer\n   from pyrdf2vec.graphs import KG\n   from pyrdf2vec.samplers import PageRankSampler\n   from pyrdf2vec.walkers import RandomWalker\n\n   data = pd.read_csv(\"samples/countries-cities/entities.tsv\", sep=\"\\t\")\n\n   RDF2VecTransformer(\n       walkers=[RandomWalker(2, None, PageRankSampler())]\n   ).fit_transform(\n       KG(\"https://dbpedia.org/sparql\"),\n       [entity for entity in data[\"location\"]],\n   )\n\nSet Up a Local Server\n~~~~~~~~~~~~~~~~~~~~~\n\nLoading large RDF files into memory can cause memory issues. Remote KGs serve\nas a solution for larger KGs, but **using a public endpoint will be slower**\ndue to overhead caused by HTTP requests. 
For that reason, it is better to set\nup your own local server and use that for your \"Remote\" KG.\n\nTo set up such a server, a tutorial is available `on our wiki\n\u003chttps://github.com/IBCNServices/pyRDF2Vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint\u003e`__.\n\nDocumentation\n-------------\n\nFor more information on how to use ``pyRDF2Vec``, `visit our online documentation\n\u003chttps://pyrdf2vec.readthedocs.io/en/latest/\u003e`__ which is automatically updated\nwith the latest version of the ``main`` branch.\n\nThere, you can learn more about the available modules and their\nfunctions.\n\nContributions\n-------------\n\nYour help in the development of ``pyRDF2Vec`` is more than welcome.\n\n.. raw:: html\n\n   \u003cp align=\"center\"\u003e\n     \u003cimg width=\"85%\" src=\"./assets/architecture.png\" alt=\"architecture\"\u003e\n   \u003c/p\u003e\n\nThe architecture of ``pyRDF2Vec`` makes it easy to create new extraction and\nsampling strategies, and new embedding techniques. To better understand\nhow you can help through pull requests and/or issues, please take a look\nat the `CONTRIBUTING\n\u003chttps://github.com/IBCNServices/pyRDF2Vec/blob/main/CONTRIBUTING.rst\u003e`__\nfile.\n\nFAQ\n---\n\nHow to Ensure the Generation of Similar Embeddings?\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n``pyRDF2Vec``'s walking strategies, sampling strategies and Word2Vec work with\nrandomness. To get reproducible embeddings, you first need to **use a seed** to\nensure determinism:\n\n.. code:: bash\n\n   PYTHONHASHSEED=42 python foo.py\n\nIn addition, you must **also specify a random state** for the walking strategy,\nwhich will implicitly use it for the sampling strategy:\n\n.. 
code:: python\n\n   from pyrdf2vec.walkers import RandomWalker\n\n   RandomWalker(2, None, random_state=42)\n\n**NOTE:** the ``PYTHONHASHSEED`` (e.g., 42) is to ensure determinism.\n\nFinally, to ensure random determinism for Word2Vec, you must **specify a single\nworker**:\n\n.. code:: python\n\n   from pyrdf2vec.embedders import Word2Vec\n\n   Word2Vec(workers=1)\n\n**NOTE:** using the ``n_jobs`` and ``mul_req`` parameters does not affect the\nrandom determinism.\n\nWhy the Extraction Time of Walks is Faster if ``max_walks=None``?\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nCurrently, **the BFS function** (using the breadth-first search algorithm) is used\nwhen ``max_walks=None``, which is significantly **faster** than the DFS function\n(using the depth-first search algorithm) **and extracts more walks**.\n\nWe hope that this algorithmic complexity issue will be solved in the next\nrelease of ``pyRDF2Vec``.\n\nHow to Silence the tcmalloc Warning When Using FastText With Medium/Large KGs?\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nSet the ``TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD`` environment variable to a\nhigh value.\n\nReferencing\n-----------\n\nIf you use ``pyRDF2Vec`` in a scholarly article, we would appreciate a\ncitation:\n\n.. 
code:: bibtex\n\n    @inproceedings{pyrdf2vec,\n      title        = {pyRDF2Vec: A Python Implementation and Extension of RDF2Vec},\n      author       = {Steenwinckel, Bram and Vandewiele, Gilles and Agozzino, Terencio and Ongenae, Femke},\n      year         = 2023,\n      publisher    = {Springer Nature Switzerland},\n      booktitle    = {European Semantic Web Conference},\n      doi          = {10.1007/978-3-031-33455-9_28},\n      url          = {https://arxiv.org/abs/2205.02283},\n      pages        = {471--483},\n    }\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpredict-idlab%2FpyRDF2Vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpredict-idlab%2FpyRDF2Vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpredict-idlab%2FpyRDF2Vec/lists"}