{"id":15159715,"url":"https://github.com/graphaware/neo4j-nlp","last_synced_at":"2025-09-30T10:30:55.568Z","repository":{"id":57724713,"uuid":"56810670","full_name":"graphaware/neo4j-nlp","owner":"graphaware","description":"NLP Capabilities in Neo4j","archived":true,"fork":false,"pushed_at":"2021-05-05T19:43:11.000Z","size":22071,"stargazers_count":335,"open_issues_count":0,"forks_count":82,"subscribers_count":48,"default_branch":"master","last_synced_at":"2024-09-27T21:41:46.020Z","etag":null,"topics":["algorithms","graph-database","machine-learning","neo4j","nlp","opennlp","stanford-corenlp"],"latest_commit_sha":null,"homepage":"https://hume.graphaware.com/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/graphaware.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-04-21T23:00:16.000Z","updated_at":"2024-09-18T19:17:09.000Z","dependencies_parsed_at":"2022-09-11T04:30:51.565Z","dependency_job_id":null,"html_url":"https://github.com/graphaware/neo4j-nlp","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphaware%2Fneo4j-nlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphaware%2Fneo4j-nlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphaware%2Fneo4j-nlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphaware%2Fneo4j-nlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/graphaware","download_url":"https://codeload.github.com/graphaware/neo4j
-nlp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234724891,"owners_count":18877279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms","graph-database","machine-learning","neo4j","nlp","opennlp","stanford-corenlp"],"created_at":"2024-09-26T21:41:45.366Z","updated_at":"2025-09-30T10:30:48.860Z","avatar_url":"https://github.com/graphaware.png","language":"Java","funding_links":[],"categories":["人工智能"],"sub_categories":[],"readme":"\n## GraphAware Natural Language Processing Has Been Retired\nAs of May 2021, this [repository has been retired](https://graphaware.com/framework/2021/05/06/from-graphaware-framework-to-graphaware-hume.html).\n\n---\n\n# GraphAware Natural Language Processing\n\nThis [Neo4j](https://neo4j.com) plugin offers Graph Based Natural Language Processing capabilities.\n\nThe main module, this module, provide a common interface for underlying text processors as well as a\n**Domain Specific Language** built atop stored procedures and functions making your Natural Language Processing\nworkflow developer friendly.\n\nIt comes in 2 versions, Community (open-sourced) and Enterprise with the following NLP features :\n\n## Feature Matrix\n\n| | Community Edition | Enterprise Edition |\n| --- | :---: | :---: |\n| Text information Extraction | ✔ | ✔ |\n| Multi-languages in the same database | | ✔ |\n| Custom NamedEntityRecognition model builder | | ✔ |\n| ConceptNet5 Enricher | ✔ | ✔ |\n| Microsoft Concept Enricher | ✔ | ✔ |\n| Keyword Extraction | ✔ | ✔ |\n| TextRank Summarization | ✔ | ✔ 
|\n| Topics Extraction | | ✔ |\n| Word Embeddings (Word2Vec) | ✔ | ✔ |\n| Similarity Computation | ✔ | ✔ |\n| PDF Parsing | ✔ | ✔ |\n| Apache Spark Binding for Distributed Algorithms | | ✔ |\n| Doc2Vec implementation | | ✔ |\n| User Interface | | ✔ |\n| ML Prediction capabilities | | ✔ |\n| Entity Merging | | ✔ |\n\nTwo NLP processor implementations are available, respectively [Stanford NLP](https://github.com/graphaware/neo4j-nlp-stanfordnlp) and\n[OpenNLP](https://github.com/graphaware/neo4j-nlp-opennlp) (OpenNLP receives less frequent updates, StanfordNLP is recommended).\n\n\n## Installation\n\n*From version 3.5.1.53.15 you need to download the language models, see below*\n\nFrom the [GraphAware plugins directory](https://products.graphaware.com), download the following `jar` files :\n\n* `neo4j-framework` (the JAR for this is labeled \"graphaware-server-enterprise-all\")\n* `neo4j-nlp`\n* `neo4j-nlp-stanfordnlp`\n* The language model to be downloaded from `https://stanfordnlp.github.io/CoreNLP/#download`\n\nand copy them in the `plugins` directory of Neo4j.\n\n*Take care that the version numbers of the framework you are using match with the version of Neo4J\nyou are using*.  This is a common setup problem.  
For example, if you are using Neo4j 3.4.0 or above, all\nof the JARs you download should contain 3.4 in their version number.\n\n`plugins/` directory example :\n\n```\n-rw-r--r--  1 ikwattro  staff    58M Oct 11 11:15 graphaware-nlp-3.5.1.53.14.jar\n-rw-r--r--@ 1 ikwattro  staff    13M Aug 22 15:22 graphaware-server-community-all-3.5.1.53.jar\n-rw-r--r--  1 ikwattro  staff    16M Oct 11 11:28 nlp-stanfordnlp-3.5.1.53.14.jar\n-rw-r--r--@ 1 ikwattro  staff   991M Oct 11 11:45 stanford-english-corenlp-2018-10-05-models.jar\n```\n\nAppend the following configuration to the `neo4j.conf` file in the `config/` directory:\n\n```\n  dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware\n  com.graphaware.runtime.enabled=true\n  com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper\n  dbms.security.procedures.whitelist=ga.nlp.*\n```\n\nStart or restart your Neo4j database.\n\nNote: both concrete text processors are quite memory-hungry - you will need to dedicate sufficient memory to the Neo4j heap space.\n\nAdditionally, the following indexes and constraints are recommended to improve performance:\n\n```\nCREATE CONSTRAINT ON (n:AnnotatedText) ASSERT n.id IS UNIQUE;\nCREATE CONSTRAINT ON (n:Tag) ASSERT n.id IS UNIQUE;\nCREATE CONSTRAINT ON (n:Sentence) ASSERT n.id IS UNIQUE;\nCREATE INDEX ON :Tag(value);\n```\n\nOr use the dedicated procedure :\n\n```\nCALL ga.nlp.createSchema()\n```\n\nDefine which language you will use in this database :\n\n```\nCALL ga.nlp.config.setDefaultLanguage('en')\n```\n\n### Quick Documentation in Neo4j Browser\n\nOnce the extension is loaded, you can see basic documentation on all available procedures by running\nthis Cypher query:\n\n```\nCALL dbms.procedures() YIELD name, signature, description\nWHERE name =~ 'ga.nlp.*'\nRETURN name, signature, description ORDER BY name asc;\n```\n\n## Getting Started\n\n### Text extraction\n\n#### Pipelines and components\n\nThe text extraction phase is done with a Natural Language Processing 
pipeline, each pipeline has a list of enabled components.\n\nFor example, the basic `tokenizer` pipeline has the following components :\n\n\n* Sentence Segmentation\n* Tokenization\n* StopWords Removal\n* Stemming\n* Part Of Speech Tagging\n* Named Entity Recognition\n\nIt is mandatory to create your pipeline first :\n\n```\nCALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', name: 'customStopWords', processingSteps: {tokenize: true, ner: true, dependency: false}, stopWords: '+,result, all, during', \nthreadNumber: 20})\n```\n\nThe available optional parameters (default values are in brackets):\n* `name`: desired name of a new pipeline\n* `textProcessor`: to which text processor should the new pipeline be added\n* `processingSteps`: pipeline configuration (available in both Stanford and OpenNLP unless stated otherwise)\n  * `tokenize` (default: true): perform tokenization\n  * `ner` (default: true): Named Entity Recognition\n  * `sentiment` (default: false): run sentiment analysis on sentences\n  * `coref` (default: false): Coreference Resolution (identify multiple mentions of the same entity, such as \"Barack Obama\" and \"he\")\n  * `relations` (default: false): run relations identification between two tokens\n  * `dependency`  (default: false, StanfordNLP only): extract typed dependencies (ex.: amod - adjective modifier, conj - conjunct, ...)\n  * `cleanxml`  (default: false, StanfordNLP only): remove XML tags\n  * `truecase`  (default: false, StanfordNLP only): recognizes the \"true\" case of tokens (how they would be capitalized in well-edited text)\n  * `customNER`: list of custom NER model identifiers (as a string, model identifiers separated by “,”)\n* `stopWords`: specify words that are required to be ignored (if the list starts with +, the following words are appended to the default stopwords list, otherwise the default list is overwritten)\n* `threadNumber` (default: 4): for multi-threading\n* 
`excludedNER`: (default: none) specify a list of NE to not be recognized in upper case, for example for excluding `NER_Money` and `NER_O` on the Tag nodes, use ['O', 'MONEY']\n\n\nTo set a pipeline as a default pipeline:\n\n```\nCALL ga.nlp.processor.pipeline.default(\u003cyour-pipeline-name\u003e)\n```\n\nTo delete a pipeline, use this command:\n\n```\nCALL ga.nlp.processor.removePipeline(\u003cpipeline-name\u003e, \u003ctext-processor\u003e)\n```\n\nTo see details of all existing pipelines:\n\n```\nCALL ga.nlp.processor.getPipelines()\n```\n\n\n#### Example\n\nLet's take the following text as example :\n\n```\nScores of people were already lying dead or injured inside a crowded Orlando nightclub,\nand the police had spent hours trying to connect with the gunman and end the situation without further violence.\nBut when Omar Mateen threatened to set off explosives, the police decided to act, and pushed their way through a\nwall to end the bloody standoff.\n```\n\n**Simulate your original corpus**\n\nCreate a node with the text, this node will represent your original corpus or knowledge graph :\n\n```\nCREATE (n:News)\nSET n.text = \"Scores of people were already lying dead or injured inside a crowded Orlando nightclub,\nand the police had spent hours trying to connect with the gunman and end the situation without further violence.\nBut when Omar Mateen threatened to set off explosives, the police decided to act, and pushed their way through a\nwall to end the bloody standoff.\";\n```\n\n**Perform the text information extraction**\n\nThe extraction is done via the `annotate` procedure which is the entry point to text information extraction\n\n```\nMATCH (n:News)\nCALL ga.nlp.annotate({text: n.text, id: id(n)})\nYIELD result\nMERGE (n)-[:HAS_ANNOTATED_TEXT]-\u003e(result)\nRETURN result\n```\n\nAvailable parameters of `annotate` procedure:\n  * `text`: text to annotate represented as a string\n  * `id`: specify ID that will be used as `id` property of the new 
AnnotatedText node\n  * `textProcessor` (default: \"Stanford\"; if not available, then the first entry in the list of available text processors)\n  * `pipeline` (default: tokenizer)\n  * `checkLanguage` (default: true): run language detection on provided text and check whether it's supported\n\nThis procedure will link your original `:News` node to an `:AnnotatedText` node which is the entry point for the graph\nbased NLP of this particular News. The original text is broken down into words, parts of speech, and functions.\nThis analysis of the text acts as a starting point for the later steps.\n\n![annotated text](https://github.com/graphaware/neo4j-nlp/raw/master/docs/image1.png)\n\n**Running a batch of annotations**\n\nIf you have a big set of data to annotate, we recommend using [APOC](https://github.com/neo4j-contrib/neo4j-apoc-procedures) :\n\n```\nCALL apoc.periodic.iterate(\n\"MATCH (n:News) RETURN n\",\n\"CALL ga.nlp.annotate({text: n.text, id: id(n)})\nYIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]-\u003e(result)\", {batchSize:1, iterateList:true})\n```\n\nIt is **important** to keep the `batchSize` and `iterateList` options as shown in the example. 
Running the annotation\nprocedure in parallel will create deadlocks.\n\n### Enrich your original knowledge\n\nWe integrate external knowledge bases in order to enrich the knowledge of your current data.\n\nAs of now, two implementations are available :\n\n* ConceptNet5\n* Microsoft Concept Graph\n\nThese enrichers extend the meaning of tokens (Tag nodes) in the graph.\n\n```\nMATCH (n:Tag)\nCALL ga.nlp.enrich.concept({enricher: 'conceptnet5', tag: n, depth:1, admittedRelationships:[\"IsA\",\"PartOf\"]})\nYIELD result\nRETURN result\n```\n\nThe available parameters (default values are in brackets):\n* `tag`: tag to be enriched\n* `enricher` (`\"conceptnet5\"`): choose `microsoft` or `conceptnet5`\n* `depth` (`2`): how deep to go in concept hierarchy\n* `admittedRelationships`: choose desired concept relationship types; please refer to the [ConceptNet Documentation](http://conceptnet.io/) for details\n* `pipeline`: choose pipeline name to be used for cleansing of concepts before storing them to your DB; your system default pipeline is used otherwise\n* `filterByLanguage` (`true`): allow only concepts of languages specified in `outputLanguages`; if no languages are specified, the same language as `tag` is required\n* `outputLanguages` (`[]`): return only concepts with specified languages\n* `relDirection` (`\"out\"`): desired direction of relationships in concept hierarchy (`\"in\"`, `\"out\"`, `\"both\"`)\n* `minWeight` (`0.0`): minimal admitted concept relationship weight\n* `limit` (`10`): maximal number of concepts per `tag`\n* `splitTag` (`false`): if `true`, `tag` is first tokenised and then individual tokens enriched\n\nTags now have `IS_RELATED_TO` relationships to other enriched concepts.\n\n![annotated text](https://github.com/graphaware/neo4j-nlp/raw/master/docs/image2.png)\n\n## List of available procedures\n\n### Keyword Extraction\n\n```\nMATCH (a:AnnotatedText)\nCALL ga.nlp.ml.textRank({annotatedText: a, stopwords: '+,other,email', 
useDependencies: true})\nYIELD result RETURN result\n```\n\n`annotatedText` is a mandatory parameter which refers to the annotated document that is required to be analyzed.\n\nAvailable optional parameters (default values are in brackets):\n\n* `keywordLabel` (Keyword): label name of the keyword nodes\n* `useDependencies` (true): use universal dependencies to enrich extracted keywords and key phrases by tags related through COMPOUND and AMOD relationships\n* `dependenciesGraph` (false): use universal dependencies for creating tag co-occurrence graph (default is false, which means that a natural word flow is used for building co-occurrences)\n* `cleanKeywords` (true): run cleaning procedure\n* `topXTags` (1/3): set a fraction of highest-rated tags that will be used as keywords / key phrases\n* `respectSentences` (false): respect or not sentence boundaries for co-occurrence graph building\n* `respectDirections` (false): respect or not directions in co-occurrence graph (how the words follow each other)\n* `iterations` (30): number of PageRank iterations\n* `damp` (0.85): PageRank damping factor\n* `threshold` (0.0001): PageRank convergence threshold\n* `removeStopwords` (true): use a stopwords list for co-occurrence graph building and final cleaning of keywords\n* `stopwords`: customize stopwords list (if the list starts with `+`, the following words are appended to the default stopwords list, otherwise the default list is overwritten)\n* `admittedPOSs`: specify which POS labels are considered as keyword candidates; needed when using different language than English\n* `forbiddenPOSs`: specify list of POS labels to be ignored when constructing co-occurrence graph; needed when using different language than English\n* `forbiddenNEs`: specify list of NEs to be ignored\n\nFor a detailed `TextRank` algorithm description, please refer to our blog post about\n[Unsupervised Keyword 
Extraction](https://graphaware.com/neo4j/2017/10/03/efficient-unsupervised-topic-extraction-nlp-neo4j.html).\n\nUsing universal dependencies for keyword enrichment (`useDependencies` option) can result in keywords with unnecessary level of detail, for example a keyword *space shuttle logistics program*. In many use cases we might be interested to also know that given document speaks generally about *space shuttle* (or *logistic program*). To do that, run post-processing with one of these options:\n* `direct` - each key phrase of *n* number of tags is checked against all key phrases from all documents with *1 \u003c m \u003c n* number of tags; if the former contains the latter key phrase, then a `DESCRIBES` relationship is created from the *m*-keyphrase to all annotated texts of the *n*-keyphrase\n* `subgroups` - the same procedure as for `direct`, but instead of connecting higher level keywords directly to *AnnotatedTexts*, they are connected to the lower level keywords with `HAS_SUBGROUP` relationships\n```\n// Important note: create subsequent indices to optimise the post-process method performance\nCREATE INDEX ON :Keyword(numTerms)\nCREATE INDEX ON :Keyword(value)\n\nCALL ga.nlp.ml.textRank.postprocess({keywordLabel: \"Keyword\", method: \"subgroups\"})\nYIELD result\nRETURN result\n```\n`keywordLabel` is an optional argument set by default to *\"Keyword\"*.\n\nThe postprocess operation by default is processing on all keywords, which can be very heavy on large graphs. 
You can specify the annotatedText on which to apply the postprocess operation with the `annotatedText` argument :\n\n```\nMATCH (n:AnnotatedText) WITH n LIMIT 100\nCALL ga.nlp.ml.textRank.postprocess({annotatedText: n, method:'subgroups'}) YIELD result RETURN count(n)\n```\n\nExample for running it efficiently on the full set of Keywords with APOC :\n\n```\nCALL apoc.periodic.iterate(\n'MATCH (n:AnnotatedText) RETURN n',\n'CALL ga.nlp.ml.textRank.postprocess({annotatedText: n, method:\"subgroups\"}) YIELD result RETURN count(n)',\n{batchSize: 1, iterateList:false}\n)\n```\n\n### TextRank Summarization\n\nAn approach similar to keyword extraction can be used to implement simple summarization. A densely connected graph of sentences is created, with Sentence-Sentence relationships representing their similarity based on shared words (the number of shared words relative to the sum of the logarithms of the number of words in each sentence). PageRank is then used as a centrality measure to rank the relative importance of sentences in the document.\n\nTo run this algorithm:\n```\nMATCH (a:AnnotatedText)\nCALL ga.nlp.ml.textRank.summarize({annotatedText: a}) YIELD result\nRETURN result\n```\nAvailable parameters:\n* `annotatedText`\n* `iterations` (30): number of PageRank iterations\n* `damp` (0.85): PageRank damping factor\n* `threshold` (0.0001): PageRank convergence threshold\n\nThe summarisation procedure saves new properties to Sentence nodes: `summaryRelevance` (PageRank value of the given sentence) and `summaryRank` (ranking; 1 = highest ranked sentence). 
Example query for retrieving summary:\n```\nmatch (n:Kapitel)-[:HAS_ANNOTATED_TEXT]-\u003e(a:AnnotatedText)\nwhere id(n) = 233\nmatch (a)-[:CONTAINS_SENTENCE]-\u003e(s:Sentence)\nwith a, count(*) as nSentences\nmatch (a)-[:CONTAINS_SENTENCE]-\u003e(s:Sentence)-[:HAS_TAG]-\u003e(t:Tag)\nwith a, s, count(distinct t) as nTags, (CASE WHEN nSentences*0.1 \u003e 10 THEN 10 ELSE toInteger(nSentences*0.1) END) as nLimit\nwhere nTags \u003e 4\nwith a, s, nLimit\norder by s.summaryRank\nwith a, collect({text: s.text, pos: s.sentenceNumber})[..nLimit] as summary\nunwind summary as sent\nreturn sent.text\norder by sent.pos\n```\n\n### Sentiment Detection\n\nYou can also determine whether the text presented is positive, negative, or neutral.  This procedure\nrequires an AnnotatedText node, which is produced by `ga.nlp.annotate` above.\n\n```\nMATCH (t:MyNode)-[]-(a:AnnotatedText) \nCALL ga.nlp.sentiment(a) YIELD result \nRETURN result;\n```\n\nThis procedure will simply return \"SUCCESS\" when it is successful, but it will apply the `:POSITIVE`, \n`:NEUTRAL` or `:NEGATIVE` label to each Sentence.  As a result, when sentiment detection is complete,\nyou can query for the sentiment of sentences as such:\n\n```\nMATCH (s:Sentence)\nRETURN s.text, labels(s)\n```\n\n### Language Detection\n\n```\nCALL ga.nlp.detectLanguage(\"What language is this in?\") \nYIELD result return result\n```\n\n### NLP based filter\n\n```\nCALL ga.nlp.filter({text:'On 8 May 2013,\n    one week before the Pakistani election, the third author,\n    in his keynote address at the Sentiment Analysis Symposium, \n    forecast the winner of the Pakistani election. The chart\n    in Figure 1 shows varying sentiment on the candidates for \n    prime minister of Pakistan in that election. 
The next day, \n    the BBC’s Owen Bennett Jones, reporting from Islamabad, wrote \n    an article titled Pakistan Elections: Five Reasons Why the \n    Vote is Unpredictable, in which he claimed that the election \n    was too close to call. It was not, and despite his being in Pakistan, \n    the outcome of the election was exactly as we predicted.', filter: 'Owen Bennett Jones/PERSON, BBC, Pakistan/LOCATION'}) YIELD result \nreturn result\n```\n\n### Cosine similarity computation\n\nOnce tags are extracted from all the news or other nodes containing some text, it is possible to compute similarities between them using content based similarity. \nDuring this process, each annotated text is described using the TF-IDF encoding format. TF-IDF is an established technique from the field of information retrieval and stands for Term Frequency-Inverse Document Frequency. \nText documents can be TF-IDF encoded as vectors in a multidimensional Euclidean space. The space dimensions correspond to the tags, previously extracted from the documents. The coordinates of a given document in each dimension (i.e., for each tag) are calculated as a product of two sub-measures: term frequency and inverse document frequency.\n\n```\nMATCH (a:AnnotatedText) \n//WHERE ...\nWITH collect(a) as nodes\nCALL ga.nlp.ml.similarity.cosine({input: \u003clist_of_annotated_texts\u003e[, query: \u003ctfidf_query\u003e, relationshipType: \"CUSTOM_SIMILARITY\", ...]}) YIELD result\nRETURN result\n```\n\nAvailable parameters (default values are in brackets):\n* `input`: list of input nodes - AnnotatedTexts\n* `relationshipType` (SIMILARITY_COSINE): type of similarity relationship, use it along with `query`\n* `query`: specify your own query for extracting *tf* and *idf* in form `... 
RETURN id(Tag), tf, idf`\n* `propertyName` (value): name of an existing node property (array of numerical values) which contains already prepared document vector\n\n\n### Word2vec\n\nWord2vec is a shallow two-layer neural network model used to produce word embeddings (words represented as multidimensional semantic vectors) and it is one of the models used in [ConceptNet Numberbatch](https://github.com/commonsense/conceptnet-numberbatch).\n\nTo add source model (vectors) into a Lucene index\n```\nCALL ga.nlp.ml.word2vec.addModel(\u003cpath_to_source_dir\u003e, \u003cpath_to_index\u003e, \u003cidentifier\u003e)\n```\n* `\u003cpath_to_source_dir\u003e` is a full path to the directory with source vectors to be indexed\n* `\u003cpath_to_index\u003e` is a full path where the index will be stored\n* `\u003cidentifier\u003e` is a custom string that uniquely identifies the model\n\nTo list available models:\n```\nCALL ga.nlp.ml.word2vec.listModels\n```\n\nThe model can now be used to compute cosine similarities between words:\n```\nWITH ga.nlp.ml.word2vec.wordVector('äpple', 'swedish-numberbatch') AS appleVector,\nga.nlp.ml.word2vec.wordVector('frukt', 'swedish-numberbatch') AS fruitVector\nRETURN ga.nlp.ml.similarity.cosine(appleVector, fruitVector) AS similarity\n```\n* 1st parameter: word\n* 2nd parameter: model identifier\n\nOr you can ask directly for a word2vec of a node which has a word stored in property `value`:\n```\nMATCH (n1:Tag), (n2:Tag)\nWHERE ...\nWITH ga.nlp.ml.word2vec.vector(n1, \u003cmodel_name\u003e) AS vector1,\nga.nlp.ml.word2vec.vector(n2, \u003cmodel_name\u003e) AS vector2\nRETURN ga.nlp.ml.similarity.cosine(vector1, vector2) AS similarity\n```\n\nWe can also permanently store the word2vec vectors to Tag nodes:\n```\nCALL ga.nlp.ml.word2vec.attach({query:'MATCH (t:Tag) RETURN t', modelName:'swedish-numberbatch'})\n```\n* `query`: query which returns tags to which embedding vectors should be attached\n* `modelName`: model to use\n\nYou can also get 
the nearest neighbors with the following procedure :\n\n```\nCALL ga.nlp.ml.word2vec.nn('analyzed', 10, 'fasttext') YIELD word, distance RETURN word, distance\n```\n\nFor large models, for example the full fasttext model for English with approximately 2 million words, it will be inefficient to compute the nearest neighbors on the fly.\n\nYou can load the model into memory in order to have faster nearest neighbors (fasttext 1M word vectors generally take 27 seconds when they need to be read from disk, ~300ms in memory) :\n\nMake sure to have sufficient heap memory dedicated to Neo4j :\n\n```\ndbms.memory.heap.initial_size=3000m\ndbms.memory.heap.max_size=5000m\n```\n\nLoad the model into memory :\n\n```\nCALL ga.nlp.ml.word2vec.load(\u003cmaxNeighbors\u003e, \u003cmodelName\u003e)\n```\n\nAnd retrieve it with\n\n```\nCALL ga.nlp.ml.word2vec.nn(\u003cword\u003e,\u003cmaxNeighbors\u003e,\u003cmodelName\u003e)\n```\n\n#### Using other models\n\nYou can use any word embedding model as long as the following is true :\n\n- Every line contains the word + the vector\n- The file has a `.txt` extension\n\nFor example, you can load the models from fasttext and just rename the file from `.vec` to `.txt` : https://fasttext.cc/docs/en/english-vectors.html\n\n### Parsing PDF Documents\n\n```\nCALL ga.nlp.parser.pdf(\"file:///Users/ikwattro/_graphs/nlp/import/myfile.pdf\") YIELD number, paragraphs\n```\n\nThe procedure returns rows with columns `number` being the page number and `paragraphs` being a `List\u003cString\u003e` of paragraph texts.\n\nYou can also pass an `http` or `https` url to the procedure for loading a file from a remote location.\n\n#### Exclude content from the pdf\n\nIn some cases, pdf documents contain recurring useless content such as page footers. You can exclude it from the parsing by\npassing a list of regexes defining the parts to exclude :\n\n```\nCALL ga.nlp.parser.pdf(\"myfile.pdf\", [\"^[0-9]$\",\"^Licensed to\"])\n```\n\n#### Use a different user Agent than 
TIKA\n\nTIKA can be recognized as a crawler and be denied access to some sites hosting PDFs. You can override this by passing a `UserAgent` option :\n\n```\nCALL ga.nlp.parser.pdf($url, [], {UserAgent: 'Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7.2) Gecko/20040803'})\n```\n\n### Extras\n\n#### Parsing raw content from a file\n\n```\nRETURN ga.nlp.parse.raw(\u003cpath-to-file\u003e) AS content\n```\n\n#### Storing only certain Tag/Tokens\n\nIn certain situations, it is useful to store only certain values instead of the full graph. Note, though, that this might reduce the ability to extract insights (with textRank, for example) :\n\n```\nCALL ga.nlp.processor.addPipeline({\nname:\"whitelist\",\nwhitelist:\"hello,john,ibm\",\ntextProcessor:\"com.graphaware.nlp.enterprise.processor.EnterpriseStanfordTextProcessor\",\nprocessingSteps:{tokenize:true, ner:true}})\n```\n\n```\nCALL ga.nlp.annotate({text:\"Hello, my name is John and I worked at IBM.\", id:\"test-123\", pipeline:\"whitelist\", checkLanguage:false})\nYIELD result\nRETURN result\n```\n\n### Parsing WebVTT\n\nWebVTT is the format for Web Video Text Tracks, such as Youtube Transcripts of videos : https://fr.wikipedia.org/wiki/WebVTT\n\n```\nCALL ga.nlp.parser.webvtt(\"url-to-transcript.vtt\") YIELD startTime, endTime, text\n```\n\n### Listing files from directory(ies)\n\n```\nCALL ga.nlp.utils.listFiles(\u003cpath-to-directory\u003e, \u003cextensionFilter\u003e)\n\n// eg:\n\nCALL ga.nlp.utils.listFiles(\"/Users/ikwattro/dev/papers\", \".pdf\") YIELD filePath RETURN filePath\n```\n\nThe above procedure lists files of the current directory only; if you need to walk the child directories as well, use `walkdir` :\n\n```\nCALL ga.nlp.utils.walkdir(\"/Users/ikwattro/dev/papers\", \".pdf\") YIELD filePath RETURN filePath\n```\n\n## Additional Procedures\n\n### ga.nlp.config.model.list()\n\nList stored models and their paths\n\n### ga.nlp.refreshPipeline(\u003cname\u003e)\n\nRemove and re-create a pipeline with the same 
configuration (useful, for example, when static NER files have changed)\n\n\n## License\n\nCopyright (c) 2013-2019 GraphAware\n\nGraphAware is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License\nas published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.\nThis program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied\nwarranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.\nYou should have received a copy of the GNU General Public License along with this program.\nIf not, see \u003chttp://www.gnu.org/licenses/\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraphaware%2Fneo4j-nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgraphaware%2Fneo4j-nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraphaware%2Fneo4j-nlp/lists"}