{"id":15014118,"url":"https://github.com/msg-systems/holmes-extractor","last_synced_at":"2025-04-07T06:09:21.698Z","repository":{"id":40589320,"uuid":"184127334","full_name":"msg-systems/holmes-extractor","owner":"msg-systems","description":"Information extraction from English and German texts based on predicate logic","archived":false,"fork":false,"pushed_at":"2022-07-08T16:04:34.000Z","size":1582,"stargazers_count":390,"open_issues_count":2,"forks_count":38,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-03-31T04:08:18.946Z","etag":null,"topics":["information-extraction","machine-learning","nlp","ontology","python","semantics","spacy","spacy-extension"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msg-systems.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-29T19:00:58.000Z","updated_at":"2025-03-24T23:01:50.000Z","dependencies_parsed_at":"2022-08-25T05:20:35.699Z","dependency_job_id":null,"html_url":"https://github.com/msg-systems/holmes-extractor","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg-systems%2Fholmes-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg-systems%2Fholmes-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg-systems%2Fholmes-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg-systems%2Fholmes-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msg-systems","download_url":"https://codeload.github.com/msg-systems/holmes-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247601448,"owners_count":20964864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-extraction","machine-learning","nlp","ontology","python","semantics","spacy","spacy-extension"],"created_at":"2024-09-24T19:45:13.045Z","updated_at":"2025-04-07T06:09:21.660Z","avatar_url":"https://github.com/msg-systems.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"Holmes\n======\nAuthor: \u003ca href=\"mailto:richard@explosion.ai\"\u003eRichard Paul Hudson, Explosion AI\u003c/a\u003e\n\n-   [1. Introduction](#introduction)\n    -   [1.1 The basic idea](#the-basic-idea)\n    -   [1.2 Installation](#installation)\n        -   [1.2.1 Prerequisites](#prerequisites)\n        -   [1.2.2 Library installation](#library-installation)\n        -   [1.2.3 Installing the spaCy and coreferee models](#installing-the-spacy-and-coreferee-models)\n        -   [1.2.4 Comments about deploying Holmes in an\n            enterprise\n            environment](#comments-about-deploying-holmes-in-an-enterprise-environment)\n        -   [1.2.5 Resource requirements](#resource-requirements)\n    -   [1.3 Getting started](#getting-started)\n-   [2. Word-level matching strategies](#word-level-matching-strategies)\n    -   [2.1 Direct matching](#direct-matching)\n    -   [2.2 Derivation-based matching](#derivation-based-matching)\n    -   [2.3 Named-entity matching](#named-entity-matching)\n    -   [2.4 Ontology-based matching](#ontology-based-matching)\n    -   [2.5 Embedding-based matching](#embedding-based-matching)\n    -   [2.6 Named-entity-embedding-based matching](#named-entity-embedding-based-matching)\n    -   [2.7 Initial-question-word matching](#initial-question-word-matching)\n-   [3. Coreference resolution](#coreference-resolution)\n-   [4. Writing effective search\n    phrases](#writing-effective-search-phrases)\n    -   4.1 [General comments](#general-comments)\n        -   [4.1.1 Lexical versus grammatical words](#lexical-versus-grammatical-words)\n        -   [4.1.2 Use of the present active](#use-of-the-present-active)\n        -   [4.1.3 Generic pronouns](#generic-pronouns)\n        -   [4.1.4 Prepositions](#prepositions)\n    -   [4.2 Structures not permitted in search\n        phrases](#structures-not-permitted-in-search-phrases)\n        -   [4.2.1 Multiple clauses](#multiple-clauses)\n        -   [4.2.2 Negation](#negation)\n        -   [4.2.3 Conjunction](#conjunction)\n        -   [4.2.4 Lack of lexical words](#lack-of-lexical-words)\n        -   [4.2.5 Coreferring pronouns](#coreferring-pronouns)\n    -   [4.3 Structures strongly discouraged in search\n        phrases](#structures-strongly-discouraged-in-search-phrases)\n        -   [4.3.1 Ungrammatical\n            expressions](#ungrammatical-expressions)\n        -   [4.3.2 Complex verb tenses](#complex-verb-tenses)\n        -   [4.3.3 Questions](#questions)\n        -   [4.3.4 Compound words](#compound-words)\n    -   [4.4 Structures to be used with caution in search\n        phrases](#structures-to-be-used-with-caution-in-search-phrases)\n        -   [4.4.1 Very complex\n            structures](#very-complex-structures)\n        -   [4.4.2 Deverbal noun phrases](#deverbal-noun-phrases)\n-   [5. Use cases and examples](#use-cases-and-examples)\n    -   [5.1 Chatbot](#chatbot)\n    -   [5.2 Structural extraction](#structural-extraction)\n    -   [5.3 Topic matching](#topic-matching)\n    -   [5.4 Supervised document classification](#supervised-document-classification)\n-   [6 Interfaces intended for public\n    use](#interfaces-intended-for-public-use)\n    -   [6.1 `Manager`](#manager)\n    -   [6.2 `manager.nlp`](#manager.nlp)\n    -   [6.3 `Ontology`](#ontology)\n    -   [6.4 `SupervisedTopicTrainingBasis`](#supervised-topic-training-basis)\n    (returned from `Manager.get_supervised_topic_training_basis()`)\n    -   [6.5 `SupervisedTopicModelTrainer`](#supervised-topic-model-trainer)\n    (returned from `SupervisedTopicTrainingBasis.train()`)\n    -   [6.6 `SupervisedTopicClassifier`](#supervised-topic-classifier)\n    (returned from `SupervisedTopicModelTrainer.classifier()` and\n    `Manager.deserialize_supervised_topic_classifier()`)\n    -   [6.7 Dictionary returned from\n        `Manager.match()`](#dictionary)\n    -   [6.8 Dictionary returned from\n        `Manager.topic_match_documents_against()`](#topic-match-dictionary)\n-   [7 A note on the license](#a-note-on-the-license)\n-   [8 Information for developers](#information-for-developers)\n    -   [8.1 How it works](#how-it-works)\n        - [8.1.1 Structural matching (chatbot and structural extraction)](#how-it-works-structural-matching)\n        - [8.1.2 Topic matching](#how-it-works-topic-matching)\n        - [8.1.3 Supervised document classification](#how-it-works-supervised-document-classification)\n    -   [8.2 Development and testing\n        guidelines](#development-and-testing-guidelines)\n    -   [8.3 Areas for further\n        development](#areas-for-further-development)\n        -   [8.3.1 Additional languages](#additional-languages)\n        -   [8.3.2 Use of machine learning to improve\n            matching](#use-of-machine-learning-to-improve-matching)\n        -   [8.3.3 Remove names from supervised document classification models](#remove-names-from-supervised-document-classification-models)\n        -   [8.3.4 Improve the performance of supervised document classification training](#improve-performance-of-supervised-document-classification-training)\n        -   [8.3.5 Explore the optimal hyperparameters for topic matching and supervised document classification](#explore-hyperparameters)\n    -   [8.4 Version history](#version-history)\n        -   [8.4.1 Version 2.0.x](#version-20x)\n        -   [8.4.2 Version 2.1.0](#version-210)\n        -   [8.4.3 Version 2.2.0](#version-220)\n        -   [8.4.4 Version 2.2.1](#version-221)\n        -   [8.4.5 Version 3.0.0](#version-300)\n        -   [8.4.6 Version 4.0.0](#version-400)\n\n\u003ca id=\"introduction\"\u003e\u003c/a\u003e\n### 1. Introduction\n\n\u003ca id=\"the-basic-idea\"\u003e\u003c/a\u003e\n#### 1.1 The basic idea\n\n**Holmes** is a Python 3 library (v3.6—v3.10) running on top of\n[spaCy](https://spacy.io/) (v3.1—v3.3) that supports a number of use cases\ninvolving information extraction from English and German texts. In all use cases, the information\nextraction is based on analysing the semantic relationships expressed by the component parts of\neach sentence:\n\n- In the [chatbot](#getting-started) use case, the system is configured using one or more **search phrases**.\nHolmes then looks for structures whose meanings correspond to those of these search phrases within\na searched **document**, which in this case corresponds to an individual snippet of text or speech\nentered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase\ncorresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.\n\n- The [structural extraction](#structural-extraction) use case uses exactly the same\n[structural matching](#how-it-works-structural-matching) technology as the chatbot use\ncase, but searching takes place with respect to a pre-existing document or documents that are typically much\nlonger than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to\ntake over a second company. The identities of the companies concerned could then be stored in a database.\n\n- The [topic matching](#topic-matching) use case aims to find passages in a document or documents whose meaning\nis close to that of another document, which takes on the role of the **query document**, or to that of a **query phrase** entered ad-hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or\nquery document, matches the documents being searched against each phraselet, and conflates the results to find\nthe most relevant passages within the documents. Because there is no strict requirement that every\nword with its own meaning in the query document match a specific word or words in the searched documents, more matches are found\nthan in the structural extraction use case, but the matches do not contain structured information that can be\nused in subsequent processing. The topic matching use case is demonstrated by [a website allowing searches within\nsix Charles Dickens novels (for English) and around 350 traditional stories (for German)](https://holmes-demo.explosion.services/).\n\n- The [supervised document classification](#supervised-document-classification) use case uses training data to\nlearn a classifier that assigns one or more **classification labels** to new documents based on what they are about.\nIt classifies a new document by matching it against phraselets that were extracted from the training documents in the\nsame way that phraselets are extracted from the query document in the topic matching use case. The technique is\ninspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component\nwords are related semantically rather than that just happen to be neighbours in the surface representation of a language.\n\nIn all four use cases, the **individual words** are matched using a [number of strategies](#word-level-matching-strategies).\nTo work out whether two grammatical structures that contain individually matching words correspond logically and\nconstitute a match, Holmes transforms the syntactic parse information provided by the [spaCy](https://spacy.io/) library\ninto semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to\nunderstand the intricacies of how this works, although there are some\n[important tips](#writing-effective-search-phrases) around writing effective search phrases for the chatbot and\nstructural extraction use cases that you should try and take on board.\n\nHolmes aims to offer generalist solutions that can be used more or less out of the box with\nrelatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases.\nAt its core lies a logical, programmed, rule-based system that describes how syntactic representations in each\nlanguage express semantic relationships. Although the supervised document classification use case does incorporate a\nneural network and although the spaCy library upon which Holmes builds has itself been pre-trained using machine\nlearning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use\ncases can be put to use out of the box without any training and that the supervised document classification use case\ntypically requires relatively little training data, which is a great advantage because pre-labelled training data is\nnot available for many real-world problems.\n\nHolmes has a long and complex history and we are now able to publish it under the MIT license thanks to the goodwill and openness of several companies. I, Richard Hudson, wrote the versions up to 3.0.0 while working at [msg systems](https://www.msg.group/en), a large international software consultancy based near Munich. In late 2021, I changed employers and now work for [Explosion](https://explosion.ai/), the creators of [spaCy](https://spacy.io/) and [Prodigy](https://prodi.gy/). Elements of the Holmes library are covered by a [US patent](https://patents.google.com/patent/US8155946B2/en) that I myself wrote in the early 2000s while working at a startup called Definiens that has since been acquired by [AstraZeneca](https://www.astrazeneca.com/). With the kind permission of both AstraZeneca and msg systems, I am now maintaining Holmes at Explosion and can offer it for the first time under a permissive license: anyone can now use Holmes under the terms of the MIT\nlicense without having to worry about the patent.\n\nThe library was originally developed at [msg systems](https://www.msg.group/en), but is now being maintained at [Explosion AI](https://explosion.ai). **Please direct any new issues or discussions to [the Explosion repository](https://github.com/explosion/holmes-extractor).**\n\n\u003ca id=\"installation\"\u003e\u003c/a\u003e\n#### 1.2 Installation\n\n\u003ca id=\"prerequisites\"\u003e\u003c/a\u003e\n##### 1.2.1 Prerequisites\n\nIf you do not already have [Python 3](https://realpython.com/installing-python/) and\n[pip](https://pypi.org/project/pip/) on your machine, you will need to install them\nbefore installing Holmes.\n\n\u003ca id=\"library-installation\"\u003e\u003c/a\u003e\n##### 1.2.2 Library installation\n\nInstall Holmes using the following commands:\n\n*Linux:*\n```\npip3 install holmes-extractor\n```\n\n*Windows:*\n```\npip install holmes-extractor\n```\n\nTo upgrade from a previous Holmes version, issue the following commands and then\n[reissue the commands to download the spaCy and coreferee models](#installing-the-spacy-and-coreferee-models) to ensure\nyou have the correct versions of them:\n\n*Linux:*\n```\npip3 install --upgrade holmes-extractor\n```\n\n*Windows:*\n```\npip install --upgrade holmes-extractor\n```\n\nIf you wish to use the examples and tests, clone the source code using\n\n```\ngit clone https://github.com/explosion/holmes-extractor\n```\n\nIf you wish to experiment with changing the source code, you can\noverride the installed code by starting Python (type `python3` (Linux) or `python`\n(Windows)) in the parent directory of the directory where your altered `holmes_extractor`\nmodule code is. If you have checked Holmes out of Git, this will be the `holmes-extractor` directory.\n\nIf you wish to uninstall Holmes again, this is achieved by deleting the installed\nfile(s) directly from the file system. These can be found by issuing the\nfollowing from the Python command prompt started from any directory **other**\nthan the parent directory of `holmes_extractor`:\n\n```\nimport holmes_extractor\nprint(holmes_extractor.__file__)\n```\n\n\u003ca id=\"installing-the-spacy-and-coreferee-models\"\u003e\u003c/a\u003e\n##### 1.2.3 Installing the spaCy and coreferee models\n\nThe spaCy and coreferee libraries that Holmes builds upon require\nlanguage-specific models that have to be downloaded separately before Holmes can be used:\n\n*Linux/English:*\n```\npython3 -m spacy download en_core_web_trf\npython3 -m spacy download en_core_web_lg\npython3 -m coreferee install en\n```\n\n*Linux/German:*\n```\npip3 install spacy-lookups-data # (from spaCy 3.3 onwards)\npython3 -m spacy download de_core_news_lg\npython3 -m coreferee install de\n```\n\n*Windows/English:*\n```\npython -m spacy download en_core_web_trf\npython -m spacy download en_core_web_lg\npython -m coreferee install en\n```\n\n*Windows/German:*\n```\npip install spacy-lookups-data # (from spaCy 3.3 onwards)\npython -m spacy download de_core_news_lg\npython -m coreferee install de\n```\n\nand if you plan to run the [regression tests](#development-and-testing-guidelines):\n\n*Linux:*\n```\npython3 -m spacy download en_core_web_sm\n```\n\n*Windows:*\n```\npython -m spacy download en_core_web_sm\n```\n\nYou specify a spaCy model for Holmes to use [when you instantiate the Manager facade class](#getting-started). `en_core_web_trf` and `de_core_web_lg` are the models that have been found to yield the best results for English and German respectively. Because `en_core_web_trf` does not have its own word vectors, but Holmes requires word vectors for [embedding-based-matching](#embedding-based-matching), the `en_core_web_lg` model is loaded as a vector source whenever `en_core_web_trf` is specified to the Manager class as the main model.\n\nThe `en_core_web_trf` model requires sufficiently more resources than the other models; in a siutation where resources are scarce, it may be a sensible compromise to use `en_core_web_lg` as the main model instead.\n\n\u003ca id=\"comments-about-deploying-holmes-in-an-enterprise-environment\"\u003e\u003c/a\u003e\n##### 1.2.4 Comments about deploying Holmes in an enterprise environment\n\nThe best way of integrating Holmes into a non-Python environment is to\nwrap it as a RESTful HTTP service and to deploy it as a\nmicroservice. See [here](https://github.com/explosion/holmes-extractor/blob/master/examples/example_search_EN_literature.py) for an example.\n\n\u003ca id=\"resource-requirements\"\u003e\u003c/a\u003e\n##### 1.2.5 Resource requirements\n\nBecause Holmes performs complex, intelligent analysis, it is inevitable that it requires more hardware resources than more traditional search frameworks. The use cases that involve loading documents — [structural extraction](#structural-extraction) and [topic matching](#topic-matching) — are most immediately applicable to large but not massive corpora (e.g. all the documents belonging to a certain organisation, all the patents on a certain topic, all the books by a certain author). For cost reasons, Holmes would not be an appropriate tool with which to analyse the content of the entire internet!\n\nThat said, Holmes is both vertically and horizontally scalable. With sufficient hardware, both these use cases can be applied to an essentially unlimited number of documents by running Holmes on multiple machines, processing a different set of documents on each one and conflating the results. Note that this strategy is already employed to distribute matching amongst multiple cores on a single machine: the [Manager](#manager) class starts a number of worker processes and distributes registered documents between them.\n\nHolmes holds loaded documents in memory, which ties in with its intended use with large but not massive corpora. The performance of document loading, [structural extraction](#structural-extraction) and [topic matching](#topic-matching) all degrade heavily if the operating system has to swap memory pages to secondary storage, because Holmes can require memory from a variety of pages to be addressed when processing a single sentence. This means it is important to supply enough RAM on each machine to hold all loaded documents.\n\nPlease note the [above comments](#installing-the-spacy-and-coreferee-models) about the relative resource requirements of the different models.\n\n\u003ca id=\"getting-started\"\u003e\u003c/a\u003e\n#### 1.3 Getting started\n\nThe easiest use case with which to get a quick basic idea of how Holmes works is the **chatbot** use case.\n\nHere one or more search phrases are defined to Holmes in advance, and the\nsearched documents are short sentences or paragraphs typed in\ninteractively by an end user. In a real-life setting, the extracted\ninformation would be used to\ndetermine the flow of interaction with the end user. For testing and\ndemonstration purposes, there is a console that displays\nits matched findings interactively. It can be easily and\nquickly started from the Python command line (which is itself started from the\noperating system prompt by typing `python3` (Linux) or `python` (Windows))\nor from within a [Jupyter notebook](https://jupyter.org/).\n\nThe following code snippet can be entered line for line into the Python command\nline, into a Jupyter notebook or into an IDE. It registers the fact that you are\ninterested in sentences about big dogs chasing cats and starts a\ndemonstration chatbot console:\n\n*English:*\n\n```\nimport holmes_extractor as holmes\nholmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)\nholmes_manager.register_search_phrase('A big dog chases a cat')\nholmes_manager.start_chatbot_mode_console()\n```\n\n*German:*\n\n```\nimport holmes_extractor as holmes\nholmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=1)\nholmes_manager.register_search_phrase('Ein großer Hund jagt eine Katze')\nholmes_manager.start_chatbot_mode_console()\n```\n\nIf you now enter a sentence that corresponds to the search phrase, the\nconsole will display a match:\n\n*English:*\n\n```\nReady for input\n\nA big dog chased a cat\n\n\nMatched search phrase with text 'A big dog chases a cat':\n'big'-\u003e'big' (Matches BIG directly); 'A big dog'-\u003e'dog' (Matches DOG directly); 'chased'-\u003e'chase' (Matches CHASE directly); 'a cat'-\u003e'cat' (Matches CAT directly)\n```\n\n*German:*\n\n```\nReady for input\n\nEin großer Hund jagte eine Katze\n\n\nMatched search phrase 'Ein großer Hund jagt eine Katze':\n'großer'-\u003e'groß' (Matches GROSS directly); 'Ein großer Hund'-\u003e'hund' (Matches HUND directly); 'jagte'-\u003e'jagen' (Matches JAGEN directly); 'eine Katze'-\u003e'katze' (Matches KATZE directly)\n```\n\nThis could easily have been achieved with a simple matching algorithm, so type\nin a few more complex sentences to convince yourself that Holmes is\nreally grasping them and that matches are still returned:\n\n*English:*\n\n```\nThe big dog would not stop chasing the cat\nThe big dog who was tired chased the cat\nThe cat was chased by the big dog\nThe cat always used to be chased by the big dog\nThe big dog was going to chase the cat\nThe big dog decided to chase the cat\nThe cat was afraid of being chased by the big dog\nI saw a cat-chasing big dog\nThe cat the big dog chased was scared\nThe big dog chasing the cat was a problem\nThere was a big dog that was chasing a cat\nThe cat chase by the big dog\nThere was a big dog and it was chasing a cat.\nI saw a big dog. My cat was afraid of being chased by the dog.\nThere was a big dog. His name was Fido. He was chasing my cat.\nA dog appeared. It was chasing a cat. It was very big.\nThe cat sneaked back into our lounge because a big dog had been chasing her.\nOur big dog was excited because he had been chasing a cat.\n```\n\n*German:*\n\n```\nDer große Hund hat die Katze ständig gejagt\nDer große Hund, der müde war, jagte die Katze\nDie Katze wurde vom großen Hund gejagt\nDie Katze wurde immer wieder durch den großen Hund gejagt\nDer große Hund wollte die Katze jagen\nDer große Hund entschied sich, die Katze zu jagen\nDie Katze, die der große Hund gejagt hatte, hatte Angst\nDass der große Hund die Katze jagte, war ein Problem\nEs gab einen großen Hund, der eine Katze jagte\nDie Katzenjagd durch den großen Hund\nEs gab einmal einen großen Hund, und er jagte eine Katze\nEs gab einen großen Hund. Er hieß Fido. Er jagte meine Katze\nEs erschien ein Hund. Er jagte eine Katze. Er war sehr groß.\nDie Katze schlich sich in unser Wohnzimmer zurück, weil ein großer Hund sie draußen gejagt hatte\nUnser großer Hund war aufgeregt, weil er eine Katze gejagt hatte\n```\n\nThe demonstration is not complete without trying other sentences that\ncontain the same words but do not express the same idea and observing that they\nare **not** matched:\n\n*English:*\n\n```\nThe dog chased a big cat\nThe big dog and the cat chased about\nThe big dog chased a mouse but the cat was tired\nThe big dog always used to be chased by the cat\nThe big dog the cat chased was scared\nOur big dog was upset because he had been chased by a cat.\nThe dog chase of the big cat\n```\n\n*German:*\n\n```\nDer Hund jagte eine große Katze\nDie Katze jagte den großen Hund\nDer große Hund und die Katze jagten\nDer große Hund jagte eine Maus aber die Katze war müde\nDer große Hund wurde ständig von der Katze gejagt\nDer große Hund entschloss sich, von der Katze gejagt zu werden\nDie Hundejagd durch den große Katze\n```\n\nIn the above examples, Holmes has matched a variety of different\nsentence-level structures that share the same meaning, but the base\nforms of the three words in the matched documents have always been the\nsame as the three words in the search phrase. Holmes provides\nseveral further strategies for matching at the individual word level. In\ncombination with Holmes's ability to match different sentence\nstructures, these can enable a search phrase to be matched to a document\nsentence that shares its meaning even where the two share no words and\nare grammatically completely different.\n\nOne of these additional word-matching strategies is [named-entity\nmatching](#named-entity-matching): special words can be included in search phrases\nthat match whole classes of names like people or places. Exit the\nconsole by typing `exit`, then register a second search phrase and\nrestart the console:\n\n*English:*\n\n```\nholmes_manager.register_search_phrase('An ENTITYPERSON goes into town')\nholmes_manager.start_chatbot_mode_console()\n```\n\n*German:*\n\n```\nholmes_manager.register_search_phrase('Ein ENTITYPER geht in die Stadt')\nholmes_manager.start_chatbot_mode_console()\n```\n\nYou have now registered your interest in people going into town and can\nenter appropriate sentences into the console:\n\n*English:*\n\n```\nReady for input\n\nI met Richard Hudson and John Doe last week. They didn't want to go into town.\n\n\nMatched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:\n'Richard Hudson'-\u003e'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'-\u003e'go' (Matches GO directly); 'into'-\u003e'into' (Matches INTO directly); 'town'-\u003e'town' (Matches TOWN directly)\n\nMatched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:\n'John Doe'-\u003e'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'-\u003e'go' (Matches GO directly); 'into'-\u003e'into' (Matches INTO directly); 'town'-\u003e'town' (Matches TOWN directly)\n```\n\n*German:*\n\n```\nReady for input\n\nLetzte Woche sah ich Richard Hudson und Max Mustermann. Sie wollten nicht mehr in die Stadt gehen.\n\n\nMatched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:\n'Richard Hudson'-\u003e'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'-\u003e'gehen' (Matches GEHEN directly); 'in'-\u003e'in' (Matches IN directly); 'die Stadt'-\u003e'stadt' (Matches STADT directly)\n\nMatched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:\n'Max Mustermann'-\u003e'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'-\u003e'gehen' (Matches GEHEN directly); 'in'-\u003e'in' (Matches IN directly); 'die Stadt'-\u003e'stadt' (Matches STADT directly)\n```\n\nIn each of the two languages, this last example demonstrates several\nfurther features of Holmes:\n\n-   It can match not only individual words, but also **multiword**\n    phrases like *Richard Hudson*.\n-   When two or more words or phrases are linked by **conjunction**\n    (*and* or *or*), Holmes extracts a separate match for each.\n-   When a sentence is **negated** (*not*), Holmes marks the match\n    accordingly.\n-   Like several of the matches yielded by the more complex entry\n    sentences in the above example about big dogs and cats, Holmes marks the\n    two matches as **uncertain**. This means that the search phrase was\n    not matched exactly, but rather in the context of some other, more\n    complex relationship ('wanting to go into town' is not the same\n    thing as 'going into town').\n\nFor more examples, please see [section 5](#use-cases-and-examples).\n\n\u003ca id=\"word-level-matching-strategies\"\u003e\u003c/a\u003e\n### 2. Word-level matching strategies\n\nThe following strategies are implemented with \n[one Python module per strategy](https://github.com/explosion/holmes-extractor/tree/master/holmes_extractor/word_matching). \nAlthough the standard library does not support adding bespoke strategies via the [Manager](#manager)\nclass, it would be relatively easy for anyone with Python programming skills to\nchange the code to enable this.\n\n\u003ca id=\"direct-matching\"\u003e\u003c/a\u003e\n#### 2.1 Direct matching (`word_match.type=='direct'`)\n\nDirect matching between search phrase words and document words is always\nactive. The strategy relies mainly on matching stem forms of words,\ne.g. matching English *buy* and *child* to *bought* and *children*,\nGerman *steigen* and *Kind* to *stieg* and *Kinder*. However, in order to\nincrease the chance of direct matching working when the parser delivers an\nincorrect stem form for a word, the raw-text forms of both search-phrase and\ndocument words are also taken into consideration during direct matching.\n\n\u003ca id=\"derivation-based-matching\"\u003e\u003c/a\u003e\n#### 2.2 Derivation-based matching (`word_match.type=='derivation'`)\n\nDerivation-based matching involves distinct but related words that typically\nbelong to different word classes, e.g. English *assess* and *assessment*,\nGerman *jagen* and *Jagd*. It is active by default but can be switched off using\nthe `analyze_derivational_morphology` parameter, which is set when instantiating the [Manager](#manager) class.\n\n\u003ca id=\"named-entity-matching\"\u003e\u003c/a\u003e\n#### 2.3 Named-entity matching (`word_match.type=='entity'`)\n\nNamed-entity matching is activated by inserting a special named-entity\nidentifier at the desired point in a search phrase in place of a noun,\ne.g.\n\n***An ENTITYPERSON goes into town*** (English)  \n***Ein ENTITYPER geht in die Stadt*** (German).\n\nThe supported named-entity identifiers depend directly on the named-entity information supplied\nby the spaCy models for each language (descriptions copied from an earlier version of the spaCy\ndocumentation):\n\n*English:*\n\n|Identifier           | Meaning|\n|---------------------| ------------------------------------------------------|\n|ENTITYNOUN           | Any noun phrase.|\n|ENTITYPERSON         | People, including fictional.|\n|ENTITYNORP           | Nationalities or religious or political groups.|\n|ENTITYFAC            | Buildings, airports, highways, bridges, etc.|\n|ENTITYORG            | Companies, agencies, institutions, etc.|\n|ENTITYGPE            | Countries, cities, states.|\n|ENTITYLOC            | Non-GPE locations, mountain ranges, bodies of water.|\n|ENTITYPRODUCT        | Objects, vehicles, foods, etc. (Not services.)|\n|ENTITYEVENT          | Named hurricanes, battles, wars, sports events, etc.|\n|ENTITYWORK_OF_ART    | Titles of books, songs, etc.|\n|ENTITYLAW            | Named documents made into laws.|\n|ENTITYLANGUAGE       | Any named language.|\n|ENTITYDATE           | Absolute or relative dates or periods.|\n|ENTITYTIME           | Times smaller than a day.|\n|ENTITYPERCENT        | Percentage, including \"%\".|\n|ENTITYMONEY          | Monetary values, including unit.|\n|ENTITYQUANTITY       | Measurements, as of weight or distance.|\n|ENTITYORDINAL        | \"first\", \"second\", etc.|\n|ENTITYCARDINAL       | Numerals that do not fall under another type.|\n\n\n*German:*\n\n|Identifier|                                Meaning|\n|----|----|\n|ENTITYNOUN |                               Any noun phrase.|\n|ENTITYPER   |                              Named person or family.|\n|ENTITYLOC    |                             Name of politically or  geographically defined                                        location (cities, provinces, countries, international regions, bodies of water,                                          mountains).|\n|ENTITYORG     |                            Named corporate, governmental, or other                                          organizational entity.|\n|ENTITYMISC     |                           Miscellaneous entities, e.g. events, nationalities, products or works of art.|\n\nWe have added `ENTITYNOUN` to the genuine named-entity identifiers. As\nit matches any noun phrase, it behaves in a similar fashion to [generic pronouns](#generic-pronouns).\nThe differences are that `ENTITYNOUN` has to match a specific noun phrase within a document\nand that this specific noun phrase is extracted and available for further processing. `ENTITYNOUN` is not supported within the topic matching use case.\n\n\u003ca id=\"ontology-based-matching\"\u003e\u003c/a\u003e\n#### 2.4 Ontology-based matching (`word_match.type=='ontology'`)\n\nAn ontology enables the user to define relationships between words that\nare then taken into account when matching documents to search phrases.\nThe three relevant relationship types are *hyponyms* (something is a\nsubtype of something), *synonyms* (something means the same as\nsomething) and *named individuals* (something is a specific instance of\nsomething). The three relationship types are exemplified in Figure 1:\n\n![Figure 1](https://github.com/explosion/holmes-extractor/blob/master/docs/ontology_example.png)\n\nOntologies are defined to Holmes using the [OWL ontology\nstandard](https://www.w3.org/OWL/) serialized using\n[RDF/XML](https://www.w3.org/2001/sw/wiki/RDF). Such ontologies\ncan be generated with a variety of tools. For the Holmes [examples](#use-cases-and-examples) and\n[tests](#development-and-testing-guidelines), the free tool\n[Protege](https://protege.stanford.edu/) was used. It is recommended\nthat you use Protege both to define your own ontologies and to browse\nthe ontologies that ship with the examples and tests. When saving an\nontology under Protege, please select *RDF/XML* as the format. Protege\nassigns standard labels for the hyponym, synonym and named-individual relationships\nthat Holmes [understands as defaults](#ontology) but that can also be\noverridden.\n\nOntology entries are defined using an Internationalized Resource\nIdentifier (IRI),\ne.g. `http://www.semanticweb.org/hudsonr/ontologies/2019/0/animals#dog`.\nHolmes only uses the final fragment for matching, which allows homonyms\n(words with the same form but multiple meanings) to be defined at\nmultiple points in the ontology tree.\n\nOntology-based matching gives the best results with Holmes when small\nontologies are used that have been built for specific subject domains\nand use cases. For example, if you are implementing a chatbot for a\nbuilding insurance use case, you should create a small ontology capturing the\nterms and relationships within that specific domain. On the other hand,\nit is not recommended to use large ontologies built\nfor all domains within an entire language such as\n[WordNet](https://wordnet.princeton.edu/). This is because the many\nhomonyms and relationships that only apply in narrow subject\ndomains will tend to lead to a large number of incorrect matches. For\ngeneral use cases, [embedding-based matching](#embedding-based-matching) will tend to yield better results.\n\nEach word in an ontology can be regarded as heading a subtree consisting\nof its hyponyms, synonyms and named individuals, those words' hyponyms,\nsynonyms and named individuals, and so on. With an ontology set up in the standard fashion that\nis appropriate for the [chatbot](#chatbot) and [structural extraction](#structural-extraction) use cases,\na word in a Holmes search phrase matches a word in a document if the document word is within the\nsubtree of the search phrase word. Were the ontology in Figure 1 defined to Holmes, in addition to the\n[direct matching strategy](#direct-matching), which would match each word to itself, the\nfollowing combinations would match:\n\n-   *animal* in a search phrase would match *hound*, *dog*, *cat*,\n    *pussy*, *puppy*, *Fido*, *kitten* and *Mimi Momo* in documents;\n-   *hound* in a search phrase would match *dog*, *puppy* and *Fido* in\n    documents;\n-   *dog* in a search phrase would match *hound*, *puppy* and *Fido* in\n    documents;\n-   *cat* in a search phrase would match *pussy*, *kitten* and *Mimi\n    Momo* in documents;\n-   *pussy* in a search phrase would match *cat*, *kitten* and *Mimi\n    Momo* in documents.\n\nEnglish phrasal verbs like *eat up* and German separable verbs like *aufessen*  must be defined as single items within ontologies. When Holmes is analysing a text and\ncomes across such a verb, the main verb and the particle are conflated into a single\nlogical word that can then be matched via an ontology. This means that *eat up* within\na text would match the subtree of *eat up* within the ontology but not the subtree of\n*eat* within the ontology.\n\nIf [derivation-based matching](#derivation-based-matching) is active, it is taken into account\non both sides of a potential ontology-based match. For example, if *alter* and *amend* are\ndefined as synonyms in an ontology, *alteration* and *amendment* would also match each other.\n\nIn situations where finding relevant sentences is more important than\nensuring the logical correspondence of document matches to search phrases,\nit may make sense to specify **symmetric matching** when defining the ontology.\nSymmetric matching is recommended for the [topic matching](#topic-matching) use case, but\nis unlikely to be appropriate for the [chatbot](#chatbot) or [structural extraction](#structural-extraction) use cases.\nIt means that the hypernym (reverse hyponym) relationship is taken into account as well as the\nhyponym and synonym relationships when matching, thus leading to a more symmetric relationship\nbetween documents and search phrases. An important rule applied when matching via a symmetric ontology is that a match path may not contain both hypernym and hyponym relationships, i.e. you cannot go back on yourself. Were the\nontology above defined as symmetric, the following combinations would match:\n\n-   *animal* in a search phrase would match *hound*, *dog*, *cat*,\n    *pussy*, *puppy*, *Fido*, *kitten* and *Mimi Momo* in documents;\n-   *hound* in a search phrase would match *animal*, *dog*, *puppy* and *Fido* in\n    documents;\n-   *dog* in a search phrase would match *animal*, *hound*, *puppy* and *Fido* in\n    documents;\n-   *puppy* in a search phrase would match *animal*, *dog* and *hound* in documents;\n-   *Fido* in a search phrase would match *animal*, *dog* and *hound* in documents;    \n-   *cat* in a search phrase would match *animal*, *pussy*, *kitten* and *Mimi\n    Momo* in documents;\n-   *pussy* in a search phrase would match *animal*, *cat*, *kitten* and *Mimi\n    Momo* in documents.\n-   *kitten* in a search phrase would match *animal*, *cat* and *pussy* in documents;\n-   *Mimi Momo* in a search phrase would match *animal*, *cat* and *pussy* in documents.\n\nIn the [supervised document classification](#supervised-document-classification) use case,\ntwo separate ontologies can be used:\n\n- The **structural matching** ontology is used to analyse the content of both training\nand test documents. Each word from a document that is found in the ontology is replaced by its most general hypernym\nancestor. It is important to realise that an ontology is only likely to work with structural matching for\nsupervised document classification if it was built specifically for the purpose: such an ontology\nshould consist of a number of separate trees representing the main classes of object in the documents\nto be classified. In the example ontology shown above, all words in the ontology would be replaced with *animal*; in an extreme case with a WordNet-style ontology, all nouns would end up being replaced with\n *thing*, which is clearly not a desirable outcome!\n\n- The **classification** ontology is used to capture relationships between classification labels: that a document\nhas a certain classification implies it also has any classifications to whose subtree that classification belongs.\nSynonyms should be used sparingly if at all in classification ontologies because they add to the complexity of the\nneural network without adding any value; and although it is technically possible to set up a classification\nontology to use symmetric matching, there is no sensible reason for doing so. Note that a label within the\nclassification ontology that is not directly defined as the label of any training document\n[has to be registered specifically](#supervised-topic-training-basis) using the\n`SupervisedTopicTrainingBasis.register_additional_classification_label()` method if it is to be taken into\naccount when training the classifier.\n\n\u003ca id=\"embedding-based-matching\"\u003e\u003c/a\u003e\n#### 2.5 Embedding-based matching (`word_match.type=='embedding'`)\n\nspaCy offers **word embeddings**:\nmachine-learning-generated numerical vector representations of words\nthat capture the contexts in which each word\ntends to occur. Two words with similar meaning tend to emerge with word\nembeddings that are close to each other, and spaCy can measure the\n**cosine similarity** between any two words' embeddings expressed as a decimal\nbetween 0.0 (no similarity) and 1.0 (the same word). Because *dog* and\n*cat* tend to appear in similar contexts, they have a similarity of\n0.80; *dog* and *horse* have less in common and have a similarity of\n0.62; and *dog* and *iron* have a similarity of only 0.25. Embedding-based matching\nis only activated for nouns, adjectives and adverbs because the results have been found to be\nunsatisfactory with other word classes.\n\nIt is important to understand that the fact that two words have similar\nembeddings does not imply the same sort of logical relationship between\nthe two as when [ontology-based matching](#ontology-based-matching) is used: for example, the\nfact that *dog* and *cat* have similar embeddings means neither that a\ndog is a type of cat nor that a cat is a type of dog. Whether or not\nembedding-based matching is nonetheless an appropriate choice depends on\nthe functional use case.\n\nFor the [chatbot](#chatbot), [structural extraction](#structural-extraction) and [supervised document classification](#supervised-document-classification) use cases, Holmes makes use of word-\nembedding-based similarities using a `overall_similarity_threshold` parameter defined globally on\nthe [Manager](#manager) class. A match is detected between a\nsearch phrase and a structure within a document whenever the geometric\nmean of the similarities between the individual corresponding word pairs\nis greater than this threshold. The intuition behind this technique is\nthat where a search phrase with e.g. six lexical words has matched a\ndocument structure where five of these words match exactly and only one\ncorresponds via an embedding, the similarity that should be required to match this sixth word is\nless than when only three of the words matched exactly and two of the other words also correspond\nvia embeddings.\n\nMatching a search phrase to a document begins by finding words\nin the document that match the word at the root (syntactic head) of the\nsearch phrase. Holmes then investigates the structure around each of\nthese matched document words to check whether the document structure matches\nthe search phrase structure in its entirity.\nThe document words that match the search phrase root word are normally found\nusing an index. However, if embeddings have to be taken into account when\nfinding document words that match a search phrase root word, **every** word in\n**every** document with a valid word class has to be compared for similarity to that\nsearch phrase root word. This has a very noticeable performance hit that renders all use cases\nexcept the [chatbot](#chatbot) use case essentially unusable.\n\nTo avoid the typically unnecessary performance hit that results from embedding-based matching\nof search phrase root words, it is controlled separately from embedding-based matching in general\nusing the `embedding_based_matching_on_root_words` parameter, which is set when instantiating the\n[Manager](#manager) class. You are advised to keep this setting switched off (value `False`) for most use cases.\n\nNeither the `overall_similarity_threshold` nor the `embedding_based_matching_on_root_words` parameter has any effect on the [topic matching](#topic-matching) use case. Here word-level embedding similarity thresholds are set using the `word_embedding_match_threshold` and  `initial_question_word_embedding_match_threshold` parameters when calling the [`topic_match_documents_against` function on the Manager class](#manager-topic-match-function).\n\n\u003ca id=\"named-entity-embedding-based-matching\"\u003e\u003c/a\u003e\n#### 2.6 Named-entity-embedding-based matching (`word_match.type=='entity_embedding'`)\n\nA named-entity-embedding based match obtains between a searched-document word that has a certain entity label and a search phrase or query document word whose embedding is sufficiently similar to the underlying meaning of that entity label, e.g. the word *individual* in a search phrase has a similar word embedding to the underlying meaning of the *PERSON* entity label. Note that named-entity-embedding-based matching is never active on root words regardless of the `embedding_based_matching_on_root_words` setting.\n\n\u003ca id=\"initial-question-word-matching\"\u003e\u003c/a\u003e\n#### 2.7 Initial-question-word matching (`word_match.type=='question'`)\n\nInitial-question-word matching is only active during [topic matching](#topic-matching). Initial question words in query phrases match entities in the searched documents that represent potential answers to the question, e.g. when comparing the query phrase *When did Peter have breakfast* to the searched-document phrase *Peter had breakfast at 8 a.m.*, the question word *When* would match the temporal adverbial phrase *at 8 a.m.*.\n\nInitial-question-word matching is switched on and off using the `initial_question_word_behaviour` parameter when calling the [`topic_match_documents_against` function on the Manager class](#manager-topic-match-function). It is only likely to be useful when topic matching is being performed in an interactive setting where the user enters short query phrases, as opposed to when it is being used to find documents on a similar topic to an pre-existing query document: initial question words are only processed at the beginning of the first sentence of the query phrase or query document.\n\nLinguistically speaking, if a query phrase consists of a complex question with several elements dependent on the main verb, a finding in a searched document is only an 'answer' if contains matches to all these elements. Because recall is typically more important than precision when performing topic matching with interactive query phrases, however, Holmes will match an initial question word to a searched-document phrase wherever they correspond semantically (e.g. wherever *when* corresponds to a temporal adverbial phrase) and each depend on verbs that themselves match at the word level. One possible strategy to filter out 'incomplete answers' would be to calculate the maximum possible score for a query phrase and reject topic matches that score below a threshold scaled to this maximum.\n\n\u003ca id=\"coreference-resolution\"\u003e\u003c/a\u003e\n### 3. Coreference resolution\n\nBefore Holmes analyses a searched document or query document, coreference resolution is performed using the [Coreferee](https://github.com/explosion/coreferee)\nlibrary running on top of spaCy.  This means that situations are recognised where pronouns and nouns that are located near one another within a text refer to the same entities. The information from one mention can then be applied to the analysis of further mentions:\n\nI saw a *big dog*. *It* was chasing a cat.   \nI saw a *big dog*. *The dog* was chasing a cat.\n\nCoreferee also detects situations where a noun refers back to a named entity:\n\nWe discussed *AstraZeneca*. *The company* had given us permission to publish this library under the MIT license.\n\nIf this example were to match the search phrase ***A company gives permission to publish something***, the\ncoreference information that the company under discussion is AstraZeneca is clearly\nrelevant and worth extracting in addition to the word(s) directly matched to the search\nphrase. Such information is captured in the [word_match.extracted_word](#dictionary) field.\n\n\u003ca id=\"writing-effective-search-phrases\"\u003e\u003c/a\u003e\n### 4. Writing effective search phrases\n\n\u003ca id=\"general-comments\"\u003e\u003c/a\u003e\n#### 4.1 General comments\n\nThe concept of search phrases has [already been introduced](#getting-started) and is relevant to the\nchatbot use case, the structural extraction use case and to [preselection](#preselection) within the supervised\ndocument classification use case.\n\n**It is crucial to understand that the tips and limitations set out in Section 4 do not apply in any way to query phrases in topic matching. If you are using Holmes for topic matching only, you can completely ignore this section!**\n\nStructural matching between search phrases and documents is not symmetric: there\nare many situations in which sentence X as a search phrase would match\nsentence Y within a document but where the converse would not be true.\nAlthough Holmes does its best to understand any search phrases, the\nresults are better when the user writing them follows certain patterns\nand tendencies, and getting to grips with these patterns and tendencies is\nthe key to using the relevant features of Holmes successfully.\n\n\u003ca id=\"lexical-versus-grammatical-words\"\u003e\u003c/a\u003e\n##### 4.1.1 Lexical versus grammatical words\n\nHolmes distinguishes between: **lexical words** like *dog*, *chase* and\n*cat* (English) or *Hund*, *jagen* and *Katze* (German) in the [initial\nexample above](#getting-started); and **grammatical words** like *a* (English)\nor *ein* and *eine* (German) in the initial example above. Only lexical words match\nwords in documents, but grammatical words still play a crucial role within a\nsearch phrase: they enable Holmes to understand it.\n\n***Dog chase cat*** (English)  \n***Hund jagen Katze*** (German)\n\ncontain the same lexical words as the search phrases in the [initial\nexample above](#getting-started), but as they are not grammatical sentences Holmes is\nliable to misunderstand them if they are used as search phrases. This is a major difference\nbetween Holmes search phrases and the search phrases you use instinctively with\nstandard search engines like Google, and it can take some getting used to.\n\n\u003ca id=\"use-of-the-present-active\"\u003e\u003c/a\u003e\n##### 4.1.2 Use of the present active\n\nA search phrase need not contain a verb:\n\n***ENTITYPERSON*** (English)  \n***A big dog*** (English)  \n***Interest in fishing*** (English)  \n***ENTITYPER*** (German)  \n***Ein großer Hund*** (German)  \n***Interesse am Angeln*** (German)\n\nare all perfectly valid and potentially useful search phrases.\n\nWhere a verb is present, however, Holmes delivers the best results when the verb\nis in the **present active**, as *chases* and *jagt* are in the [initial\nexample above](#getting-started). This gives Holmes the best chance of understanding\nthe relationship correctly and of matching the\nwidest range of document structures that share the target meaning.\n\n\u003ca id=\"generic-pronouns\"\u003e\u003c/a\u003e\n##### 4.1.3 Generic pronouns\n\nSometimes you may only wish to extract the object of a verb. For\nexample, you might want to find sentences that are discussing a cat\nbeing chased regardless of who is doing the chasing. In order to avoid a\nsearch phrase containing a passive expression like\n\n***A cat is chased*** (English)  \n***Eine Katze wird gejagt*** (German)\n\nyou can use a **generic pronoun**. This is a word that Holmes treats\nlike a grammatical word in that it is not matched to documents; its sole\npurpose is to help the user form a grammatically optimal search phrase\nin the present active. Recognised generic pronouns are English\n*something*, *somebody* and *someone* and German *jemand* (and inflected forms of *jemand*) and *etwas*:\nHolmes treats them all as equivalent. Using generic pronouns,\nthe passive search phrases above could be re-expressed as\n\n***Somebody chases a cat*** (English)  \n***Jemand jagt eine Katze*** (German).\n\n\u003ca id=\"prepositions\"\u003e\u003c/a\u003e\n##### 4.1.4 Prepositions\n\nExperience shows that different **prepositions** are often used with the\nsame meaning in equivalent phrases and that this can prevent search\nphrases from matching where one would intuitively expect it. For\nexample, the search phrases\n\n***Somebody is at the market*** (English)  \n***Jemand ist auf dem Marktplatz*** (German)\n\nwould fail to match the document phrases\n\n*Richard was in the market* (English)  \n*Richard war am Marktplatz* (German)\n\nThe best way of solving this problem is to define the prepositions in\nquestion as synonyms in an [ontology](#ontology-based-matching).\n\n\u003ca id=\"structures-not-permitted-in-search-phrases\"\u003e\u003c/a\u003e\n#### 4.2 Structures not permitted in search phrases\n\nThe following types of structures are prohibited in search phrases and\nresult in Python user-defined errors:\n\n\u003ca id=\"multiple-clauses\"\u003e\u003c/a\u003e\n##### 4.2.1 Multiple clauses\n\n***A dog chases a cat. A cat chases a dog*** (English)  \n***Ein Hund jagt eine Katze. Eine Katze jagt einen Hund*** (German)\n\nEach clause must be separated out into its own search phrase and\nregistered individually.\n\n\u003ca id=\"negation\"\u003e\u003c/a\u003e\n##### 4.2.2 Negation\n\n***A dog does not chase a cat.*** (English)  \n***Ein Hund jagt keine Katze.*** (German)\n\nNegative expressions are recognised as such in documents and the generated\nmatches marked as negative; allowing search phrases themselves to be\nnegative would overcomplicate the library without offering any benefits.\n\n\u003ca id=\"conjunction\"\u003e\u003c/a\u003e\n##### 4.2.3 Conjunction\n\n***A dog and a lion chase a cat.*** (English)  \n***Ein Hund und ein Löwe jagen eine Katze.*** (German)\n\nWherever conjunction occurs in documents, Holmes distributes the\ninformation among multiple matches as explained [above](#getting-started). In the\nunlikely event that there should be a requirement to capture conjunction explicitly\nwhen matching, this could be achieved by using the\n[`Manager.match()` function](#manager-match-function) and looking for situations\nwhere the document token objects are shared by multiple match objects.\n\n\u003ca id=\"lack-of-lexical-words\"\u003e\u003c/a\u003e\n##### 4.2.4 Lack of lexical words\n\n***The*** (English)  \n***Der*** (German)\n\nA search phrase cannot be processed if it does not contain any words\nthat can be matched to documents.\n\n\u003ca id=\"coreferring-pronouns\"\u003e\u003c/a\u003e\n##### 4.2.5 Coreferring pronouns\n\n***A dog chases a cat and he chases a mouse*** (English)  \n***Ein Hund jagt eine Katze und er jagt eine Maus*** (German)\n\nPronouns that corefer with nouns elsewhere in the search phrase are not permitted as this\nwould overcomplicate the library without offering any benefits.\n\n\u003ca id=\"structures-strongly-discouraged-in-search-phrases\"\u003e\u003c/a\u003e\n#### 4.3 Structures strongly discouraged in search phrases\n\nThe following types of structures are strongly discouraged in search\nphrases:\n\n\u003ca id=\"ungrammatical-expressions\"\u003e\u003c/a\u003e\n##### 4.3.1 Ungrammatical expressions\n\n***Dog chase cat*** (English)  \n***Hund jagen Katze*** (German)\n\nAlthough these will sometimes work, the results will be better if search\nphrases are expressed grammatically.\n\n\u003ca id=\"complex-verb-tenses\"\u003e\u003c/a\u003e\n##### 4.3.2 Complex verb tenses\n\n***A cat is chased by a dog*** (English)  \n***A dog will have chased a cat*** (English)  \n***Eine Katze wird durch einen Hund gejagt*** (German)  \n***Ein Hund wird eine Katze gejagt haben*** (German)\n\nAlthough these will sometimes work, the results will be better if verbs in\nsearch phrases are expressed in the present active.\n\n\u003ca id=\"questions\"\u003e\u003c/a\u003e\n##### 4.3.3 Questions\n\n***Who chases the cat?*** (English)  \n***Wer jagt die Katze?*** (German)\n\nAlthough questions are supported as query phrases in the\n[topic matching](#topic-matching) use case, they are not appropriate as search phrases.\nQuestions should be re-phrased as statements, in this case\n\n***Something chases the cat*** (English)  \n***Etwas jagt die Katze*** (German).\n\n\u003ca id=\"compound-words\"\u003e\u003c/a\u003e\n##### 4.3.4 Compound words (relates to German only)\n\n***Informationsextraktion*** (German)  \n***Ein Stadtmittetreffen*** (German)\n\nThe internal structure of German compound words is analysed within searched documents as well as\nwithin query phrases in the [topic matching](#topic-matching) use case, but not within search\nphrases. In search phrases, compound words should be reexpressed as genitive constructions even in cases\nwhere this does not strictly capture their meaning:\n\n***Extraktion der Information*** (German)  \n***Ein Treffen der Stadtmitte*** (German)\n\n\u003ca id=\"structures-to-be-used-with-caution-in-search-phrases\"\u003e\u003c/a\u003e\n#### 4.4 Structures to be used with caution in search phrases\n\nThe following types of structures should be used with caution in search\nphrases:\n\n\u003ca id=\"very-complex-structures\"\u003e\u003c/a\u003e\n##### 4.4.1 Very complex structures\n\n***A fierce dog chases a scared cat on the way to the theatre***\n(English)  \n***Ein kämpferischer Hund jagt eine verängstigte Katze auf dem\nWeg ins Theater*** (German)\n\nHolmes can handle any level of complexity within search phrases, but the\nmore complex a structure, the less likely it becomes that a document\nsentence will match it. If it is really necessary to match such complex relationships\nwith search phrases rather than with [topic matching](#topic-matching), they are typically better extracted by splitting the search phrase up, e.g.\n\n***A fierce dog*** (English)  \n***A scared cat*** (English)  \n***A dog chases a cat*** (English)  \n***Something chases something on the way to the theatre*** (English)  \n\n***Ein kämpferischer Hund*** (German)  \n***Eine verängstigte Katze*** (German)   \n***Ein Hund jagt eine Katze*** (German)  \n***Etwas jagt etwas auf dem Weg ins Theater*** (German)\n\nCorrelations between the resulting matches can then be established by\nmatching via the [`Manager.match()` function](#manager-match-function) and looking for\nsituations where the document token objects are shared across multiple match objects.\n\nOne possible exception to this piece of advice is when\n[embedding-based matching](#embedding-based-matching) is active. Because\nwhether or not each word in a search phrase matches then depends on whether\nor not other words in the same search phrase have been matched, large, complex\nsearch phrases can sometimes yield results that a combination of smaller,\nsimpler search phrases would not.\n\n\u003ca id=\"deverbal-noun-phrases\"\u003e\u003c/a\u003e\n##### 4.4.2 Deverbal noun phrases\n\n***The chasing of a cat*** (English)  \n***Die Jagd einer Katze*** (German)\n\nThese will often work, but it is generally better practice\nto use verbal search phrases like\n\n***Something chases a cat*** (English)  \n***Etwas jagt eine Katze*** (German)\n\nand to allow the corresponding nominal phrases to be matched via [derivation-based matching](#derivation-based-matching).\n\n\u003ca id=\"use-cases-and-examples\"\u003e\u003c/a\u003e\n### 5. Use cases and examples\n\n\u003ca id=\"chatbot\"\u003e\u003c/a\u003e\n#### 5.1 Chatbot\n\nThe chatbot use case has [already been introduced](#getting-started):\na predefined set of search phrases is used to extract\ninformation from phrases entered interactively by an end user, which in\nthis use case act as the documents.\n\nThe Holmes source code ships with two examples demonstrating the chatbot\nuse case, one for each language, with predefined ontologies. Having\n[cloned the source code and installed the Holmes library](#installation),\nnavigate to the `/examples` directory and type the following (Linux):\n\n*English:*\n\n    python3 example_chatbot_EN_insurance.py\n\n*German:*\n\n    python3 example_chatbot_DE_insurance.py\n\nor click on the files in Windows Explorer (Windows).\n\nHolmes matches syntactically distinct structures that are semantically\nequivalent, i.e. that share the same meaning. In a real chatbot use\ncase, users will typically enter equivalent information with phrases that\nare semantically distinct as well, i.e. that have different meanings.\nBecause the effort involved in registering a search phrase is barely\ngreater than the time it takes to type it in, it makes sense to register\na large number of search phrases for each relationship you are trying to\nextract: essentially *all ways people have been observed to express the\ninformation you are interested in* or *all ways you can imagine somebody\nmight express the information you are interested in*. To assist this,\nsearch phrases can be registered with labels that do not need\nto be unique: a label can then be used to express the relationship\nan entire group of search phrases is designed to extract. Note that when many search\nphrases have been defined to extract the same relationship, a single user entry\nis likely to be sometimes matched by multiple search phrases. This must be handled\nappropriately by the calling application.\n\nOne obvious weakness of Holmes in the chatbot setting is its sensitivity\nto correct spelling and, to a lesser extent, to correct grammar.\nStrategies for mitigating this weakness include:\n\n-   Defining common misspellings as synonyms in the ontology\n-   Defining specific search phrases including common misspellings\n-   Putting user entry through a spellchecker before submitting it to\n    Holmes\n-   Explaining the importance of correct spelling and grammar to users\n\n\u003ca id=\"structural-extraction\"\u003e\u003c/a\u003e\n#### 5.2 Structural extraction\n\nThe structural extraction use case uses [structural matching](#how-it-works-structural-matching) in the same way as the [chatbot](#chatbot) use case,\nand many of the same comments and tips apply to it. The principal differences are that pre-existing and\noften lengthy documents are scanned rather than text snippets entered ad-hoc by the user, and that the\nreturned match objects are not used to\ndrive a dialog flow; they are examined solely to extract and store structured information.\n\nCode for performing structural extraction would typically perform the following tasks:\n\n-   Initialize the Holmes manager object.\n-   Call `Manager.register_search_phrase()` several times to define a number of search phrases specifying the information to be extracted.\n-   Call `Manager.parse_and_register_document()` several times to load a number of documents within which to search.\n-   Call `Manager.match()` to perform the matching.\n-   Query the returned match objects to obtain the extracted information and store it in a database.\n\n\u003ca id=\"topic-matching\"\u003e\u003c/a\u003e\n#### 5.3 Topic matching\n\nThe topic matching use case matches a **query document**, or alternatively a **query phrase**\nentered ad-hoc by the user, against a set of documents pre-loaded into memory. The aim is to find the passages\nin the documents whose topic most closely corresponds to the topic of the query document; the output is\na ordered list of passages scored according to topic similarity. Additionally, if a query phrase contains an [initial question word](#initial-question-word-matching), the output will contain potential answers to the question.\n\nTopic matching queries may contain [generic pronouns](#generic-pronouns) and\n[named-entity identifiers](#named-entity-matching) just like search phrases, although the `ENTITYNOUN`\ntoken is not supported. However, an important difference from\nsearch phrases is that the topic matching use case places no\nrestrictions on the grammatical structures permissible within the query document.\n\nIn addition to the [Holmes demonstration website](https://holmes-demo.explosion.services/), the Holmes source code ships with [three examples](https://github.com/explosion/holmes-extractor/blob/master/examples/) demonstrating the topic matching use case with an English literature\ncorpus, a German literature corpus and a German legal corpus respectively. Users are encouraged to run these\nto get a feel for how they work.\n\nTopic matching uses a variety of strategies to find text passages that are relevant to the query. These include\nresource-hungry procedures like investigating semantic relationships and comparing embeddings. Because applying these\nacross the board would prevent topic matching from scaling, Holmes only attempts them for specific areas of the text\nthat less resource-intensive strategies have already marked as looking promising. This and the other interior workings\nof topic matching are explained [here](#how-it-works-topic-matching).\n\n\u003ca id=\"supervised-document-classification\"\u003e\u003c/a\u003e\n#### 5.4 Supervised document classification\n\nIn the supervised document classification use case, a classifier is trained with a number of documents that\nare each pre-labelled with a classification. The trained classifier then assigns one or more labels to new documents\naccording to what each new document is about. As explained [here](#ontology-based-matching), ontologies can be\nused both to enrichen the comparison of the content of the various documents and to capture implication\nrelationships between classification labels.\n\nA classifier makes use of a neural network (a [multilayer perceptron](https://machinelearningcatalogue.com/algorithm/alg_perceptron.html)) whose topology can either\nbe determined automatically by Holmes or [specified explicitly by the user](#supervised-topic-training-basis).\nWith a large number of training documents, the automatically determined topology can easily exhaust the memory\navailable on a typical machine; if there is no opportunity to scale up the memory, this problem can be\nremedied by specifying a smaller number of hidden layers or a smaller number of nodes in one or more of the layers.\n\nA trained document classification model retains no references to its training data. This is an advantage\nfrom a data protection viewpoint, although it\n[cannot presently be guaranteed](#remove-names-from-supervised-document-classification-models) that models will\nnot contain individual personal or company names.\n\n\u003ca id=\"preselection\"\u003e\u003c/a\u003e\nA typical problem with the execution of many document classification use cases is that a new classification label\nis added when the system is already live but that there are initially no examples of this new classification with\nwhich to train a new model. The best course of action in such a situation is to define search phrases which\n**preselect** the more obvious documents with the new classification using structural matching. Those documents that\nare not preselected as having the new classification label are then passed to the existing, previously trained\nclassifier in the normal way. When enough documents exemplifying the new classification have accumulated in the system,\nthe model can be retrained and the preselection search phrases removed.\n\nHolmes ships with an example [script](https://github.com/explosion/holmes-extractor/blob/master/examples/example_supervised_topic_model_EN.py) demonstrating supervised document classification for English with the\n[BBC Documents dataset](http://mlg.ucd.ie/datasets/bbc.html). The script downloads the documents (for\nthis operation and for this operation alone, you will need to be online) and places them in a working directory.\nWhen training is complete, the script saves the model to the working directory. If the model file is found\nin the working directory on subsequent invocations of the script, the training phase is skipped and the script\ngoes straight to the testing phase. This means that if it is wished to repeat the training phase, either the model\nhas to be deleted from the working directory or a new working directory has to be specified to the script.\n\nHaving [cloned the source code and installed the Holmes library](#installation),\nnavigate to the `/examples` directory. Specify a working directory at the top of the\n`example_supervised_topic_model_EN.py` file, then type `python3 example_supervised_topic_model_EN` (Linux)\nor click on the script in Windows Explorer (Windows).\n\nIt is important to realise that Holmes learns to classify documents according to the words or semantic\nrelationships they contain, taking any structural matching ontology into account in the process. For many\nclassification tasks, this is exactly what is required; but there are tasks (e.g. author attribution according\nto the frequency of grammatical constructions typical for each author) where it is not. For the right task,\nHolmes achieves impressive results. For the BBC Documents benchmark\nprocessed by the example script, Holmes performs slightly better than benchmarks available online\n(see e.g. [here](https://github.com/suraj-deshmukh/BBC-Dataset-News-Classification))\nalthough the difference is probably too slight to be significant, especially given that the different\ntraining/test splits were used in each case: Holmes has been observed to learn models that predict the\ncorrect result between 96.9% and 98.7% of the time. The range is explained by the fact that the behaviour\nof the neural network is not fully deterministic.\n\nThe interior workings of supervised document classification are explained [here](#how-it-works-supervised-document-classification).\n\n\u003ca id=\"interfaces-intended-for-public-use\"\u003e\u003c/a\u003e\n### 6 Interfaces intended for public use\n\n\u003ca id=\"manager\"\u003e\u003c/a\u003e\n#### 6.1 `Manager`\n\n``` {.python}\nholmes_extractor.Manager(self, model, *, overall_similarity_threshold=1.0,\n  embedding_based_matching_on_root_words=False, ontology=None,\n  analyze_derivational_morphology=True, perform_coreference_resolution=None,\n  number_of_workers=None, verbose=False)\n\nThe facade class for the Holmes library.\n\nParameters:\n\nmodel -- the name of the spaCy model, e.g. *en_core_web_trf*\noverall_similarity_threshold -- the overall similarity threshold for embedding-based\n  matching. Defaults to *1.0*, which deactivates embedding-based matching. Note that this\n  parameter is not relevant for topic matching, where the thresholds for embedding-based\n  matching are set on the call to *topic_match_documents_against*.\nembedding_based_matching_on_root_words -- determines whether or not embedding-based\n  matching should be attempted on search-phrase root tokens, which has a considerable\n  performance hit. Defaults to *False*. Note that this parameter is not relevant for topic\n  matching.\nontology -- an *Ontology* object. Defaults to *None* (no ontology).\nanalyze_derivational_morphology -- *True* if matching should be attempted between different\n  words from the same word family. Defaults to *True*.\nperform_coreference_resolution -- *True* if coreference resolution should be taken into account\n  when matching. Defaults to *True*.\nuse_reverse_dependency_matching -- *True* if appropriate dependencies in documents can be\n  matched to dependencies in search phrases where the two dependencies point in opposite\n  directions. Defaults to *True*.\nnumber_of_workers -- the number of worker processes to use, or *None* if the number of worker\n  processes should depend on the number of available cores. Defaults to *None*\nverbose -- a boolean value specifying whether multiprocessing messages should be outputted to\n  the console. Defaults to *False*\n```\n\n``` {.python}\nManager.register_serialized_document(self, serialized_document:bytes, label:str=\"\") -\u003e None\n\nParameters:\n\ndocument -- a preparsed Holmes document.\nlabel -- a label for the document which must be unique. Defaults to the empty string,\n    which is intended for use cases involving single documents (typically user entries).\n```\n\n\u003ca id=\"manager-register-serialized-documents-function\"\u003e\u003c/a\u003e\n``` {.python}\nManager.register_serialized_documents(self, document_dictionary:dict[str, bytes]) -\u003e None\n\nNote that this function is the most efficient way of loading documents.\n\nParameters:\n\ndocument_dictionary -- a dictionary from labels to serialized documents.\n```\n\n``` {.python}\nManager.parse_and_register_document(self, document_text:str, label:str='') -\u003e None\n\nParameters:\n\ndocument_text -- the raw document text.\nlabel -- a label for the document which must be unique. Defaults to the empty string,\n    which is intended for use cases involving single documents (typically user entries).\n```\n\n``` {.python}\nManager.remove_document(self, label:str) -\u003e None\n```\n\n``` {.python}\nManager.remove_all_documents(self, labels_starting:str=None) -\u003e None\n\nParameters:\n\nlabels_starting -- a string starting the labels of documents to be removed,\n    or 'None' if all documents are to be removed.\n```\n\n``` {.python}\nManager.list_document_labels(self) -\u003e List[str]\n\nReturns a list of the labels of the currently registered documents.\n```\n\n``` {.python}\nManager.serialize_document(self, label:str) -\u003e Optional[bytes]\n\nReturns a serialized representation of a Holmes document that can be\n  persisted to a file. If 'label' is not the label of a registered document,\n  'None' is returned instead.\n\nParameters:\n\nlabel -- the label of the document to be serialized.\n```\n\n``` {.python}\nManager.get_document(self, label:str='') -\u003e Optional[Doc]\n\nReturns a Holmes document. If *label* is not the label of a registered document, *None*\n  is returned instead.\n\nParameters:\n\nlabel -- the label of the document to be serialized.\n```\n\n``` {.python}\nManager.debug_document(self, label:str='') -\u003e None\n\nOutputs a debug representation for a loaded document.\n\nParameters:\n\nlabel -- the label of the document to be serialized.\n```\n\n``` {.python}\nManager.register_search_phrase(self, search_phrase_text:str, label:str=None) -\u003e SearchPhrase\n\nRegisters and returns a new search phrase.\n\nParameters:\n\nsearch_phrase_text -- the raw search phrase text.  \nlabel -- a label for the search phrase, which need not be unique.\n  If label==None, the assigned label defaults to the raw search phrase text.\n```\n\n``` {.python}\nManager.remove_all_search_phrases_with_label(self, label:str) -\u003e None\n```\n\n```\nManager.remove_all_search_phrases(self) -\u003e None\n```\n\n```\nManager.list_search_phrase_labels(self) -\u003e List[str]\n```\n\n\u003ca id=\"manager-match-function\"\u003e\u003c/a\u003e\n``` {.python}\nManager.match(self, search_phrase_text:str=None, document_text:str=None) -\u003e List[Dict]\n\nMatches search phrases to documents and returns the result as match dictionaries.\n\nParameters:\n\nsearch_phrase_text -- a text from which to generate a search phrase, or 'None' if the\n    preloaded search phrases should be used for matching.\ndocument_text -- a text from which to generate a document, or 'None' if the preloaded\n    documents should be used for matching.\n```\n\n\u003ca id=\"manager-topic-match-function\"\u003e\u003c/a\u003e\n``` {.python}\ntopic_match_documents_against(self, text_to_match:str, *,\n    use_frequency_factor:bool=True,\n    maximum_activation_distance:int=75,\n    word_embedding_match_threshold:float=0.8,\n    initial_question_word_embedding_match_threshold:float=0.7,\n    relation_score:int=300,\n    reverse_only_relation_score:int=200,\n    single_word_score:int=50,\n    single_word_any_tag_score:int=20,\n    initial_question_word_answer_score:int=600,\n    initial_question_word_behaviour:str='process',\n    different_match_cutoff_score:int=15,\n    overlapping_relation_multiplier:float=1.5,\n    embedding_penalty:float=0.6,\n    ontology_penalty:float=0.9,\n    relation_matching_frequency_threshold:float=0.25,\n    embedding_matching_frequency_threshold:float=0.5,\n    sideways_match_extent:int=100,\n    only_one_result_per_document:bool=False,\n    number_of_results:int=10,\n    document_label_filter:str=None,\n    tied_result_quotient:float=0.9) -\u003e List[Dict]:\n\nReturns a list of dictionaries representing the results of a topic match between an entered text\nand the loaded documents.\n\nProperties:\n\ntext_to_match -- the text to match against the loaded documents.\nuse_frequency_factor -- *True* if scores should be multiplied by a factor between 0 and 1\n  expressing how rare the words matching each phraselet are in the corpus. Note that,\n  even if this parameter is set to *False*, the factors are still calculated as they are \n  required for determining which relation and embedding matches should be attempted.\nmaximum_activation_distance -- the number of words it takes for a previous phraselet\n  activation to reduce to zero when the library is reading through a document.\nword_embedding_match_threshold -- the cosine similarity above which two words match where\n  the search phrase word does not govern an interrogative pronoun.\ninitial_question_word_embedding_match_threshold -- the cosine similarity above which two\n  words match where the search phrase word governs an interrogative pronoun.\nrelation_score -- the activation score added when a normal two-word relation is matched.\nreverse_only_relation_score -- the activation score added when a two-word relation\n  is matched using a search phrase that can only be reverse-matched.\nsingle_word_score -- the activation score added when a single noun is matched.\nsingle_word_any_tag_score -- the activation score added when a single word is matched\n  that is not a noun.\ninitial_question_word_answer_score -- the activation score added when a question word is\n  matched to an potential answer phrase.\ninitial_question_word_behaviour -- 'process' if a question word in the sentence\n  constituent at the beginning of *text_to_match* is to be matched to document phrases\n  that answer it and to matching question words; 'exclusive' if only topic matches that \n  answer questions are to be permitted; 'ignore' if question words are to be ignored.\ndifferent_match_cutoff_score -- the activation threshold under which topic matches are\n  separated from one another. Note that the default value will probably be too low if\n  *use_frequency_factor* is set to *False*.\noverlapping_relation_multiplier -- the value by which the activation score is multiplied\n  when two relations were matched and the matches involved a common document word.\nembedding_penalty -- a value between 0 and 1 with which scores are multiplied when the\n  match involved an embedding. The result is additionally multiplied by the overall\n  similarity measure of the match.\nontology_penalty -- a value between 0 and 1 with which scores are multiplied for each\n  word match within a match that involved the ontology. For each such word match,\n  the score is multiplied by the value (abs(depth) + 1) times, so that the penalty is\n  higher for hyponyms and hypernyms than for synonyms and increases with the\n  depth distance.\nrelation_matching_frequency_threshold -- the frequency threshold above which single\n  word matches are used as the basis for attempting relation matches.\nembedding_matching_frequency_threshold -- the frequency threshold above which single\n  word matches are used as the basis for attempting relation matches with\n  embedding-based matching on the second word.\nsideways_match_extent -- the maximum number of words that may be incorporated into a\n  topic match either side of the word where the activation peaked.\nonly_one_result_per_document -- if 'True', prevents multiple results from being returned\n  for the same document.\nnumber_of_results -- the number of topic match objects to return.\ndocument_label_filter -- optionally, a string with which document labels must start to\n  be considered for inclusion in the results.\ntied_result_quotient -- the quotient between a result and following results above which\n  the results are interpreted as tied.\n```\n\n``` {.python}\nManager.get_supervised_topic_training_basis(self, *, classification_ontology:Ontology=None,\n  overlap_memory_size:int=10, oneshot:bool=True, match_all_words:bool=False,\n  verbose:bool=True) -\u003e SupervisedTopicTrainingBasis:\n\nReturns an object that is used to train and generate a model for the\nsupervised document classification use case.\n\nParameters:\n\nclassification_ontology -- an Ontology object incorporating relationships between\n    classification labels, or 'None' if no such ontology is to be used.\noverlap_memory_size -- how many non-word phraselet matches to the left should be\n    checked for words in common with a current match.\noneshot -- whether the same word or relationship matched multiple times within a\n    single document should be counted once only (value 'True') or multiple times\n    (value 'False')\nmatch_all_words -- whether all single words should be taken into account\n          (value 'True') or only single words with noun tags (value 'False')          \nverbose -- if 'True', information about training progress is outputted to the console.\n```\n\n``` {.python}\nManager.deserialize_supervised_topic_classifier(self,\n  serialized_model:bytes, verbose:bool=False) -\u003e SupervisedTopicClassifier:\n\nReturns a classifier for the supervised document classification use case\nthat will use a supplied pre-trained model.\n\nParameters:\n\nserialized_model -- the pre-trained model as returned from `SupervisedTopicClassifier.serialize_model()`.\nverbose -- if 'True', information about matching is outputted to the console.\n```\n\n``` {.python}\nManager.start_chatbot_mode_console(self)\n\nStarts a chatbot mode console enabling the matching of pre-registered\n  search phrases to documents (chatbot entries) entered ad-hoc by the\n  user.\n```\n\n``` {.python}\nManager.start_structural_search_mode_console(self)\n\nStarts a structural extraction mode console enabling the matching of pre-registered\n  documents to search phrases entered ad-hoc by the user.\n```\n\n``` {.python}\nManager.start_topic_matching_search_mode_console(self,    \n  only_one_result_per_document:bool=False, word_embedding_match_threshold:float=0.8,\n  initial_question_word_embedding_match_threshold:float=0.7):\n\nStarts a topic matching search mode console enabling the matching of pre-registered\n  documents to query phrases entered ad-hoc by the user.\n\nParameters:\n\nonly_one_result_per_document -- if 'True', prevents multiple topic match\n  results from being returned for the same document.\nword_embedding_match_threshold -- the cosine similarity above which two words match where the  \n  search phrase word does not govern an interrogative pronoun.\ninitial_question_word_embedding_match_threshold -- the cosine similarity above which two\n  words match where the search phrase word governs an interrogative pronoun.\n```\n\n``` {.python}\nManager.close(self) -\u003e None\n\nTerminates the worker processes.\n```\n\n\u003ca id=\"manager.nlp\"\u003e\u003c/a\u003e\n#### 6.2 `manager.nlp`\n\n`manager.nlp` is the underlying spaCy [Language](https://spacy.io/api/language/) object on which both Coreferee and Holmes have been registered as custom pipeline components. The most efficient way of parsing documents for use with Holmes is to call [`manager.nlp.pipe()`](https://spacy.io/api/language/#pipe). This yields an iterable of documents that can then be loaded into Holmes via [`manager.register_serialized_documents()`](#manager-register-serialized-documents-function).\n\nThe [`pipe()` method](https://spacy.io/api/language#pipe) has an argument `n_process` that specifies the number of processors to use. With `_lg`, `_md` and `_sm` spaCy models, there are [some situations](https://github.com/explosion/spaCy/discussions/8402#multiprocessing) where it can make sense to specify a value other than 1 (the default). Note however that with transformer spaCy models (`_trf`) values other than 1 are not supported.\n\n\u003ca id=\"ontology\"\u003e\u003c/a\u003e\n#### 6.3 `Ontology`\n\n``` {.python}\nholmes_extractor.Ontology(self, ontology_path,\n  owl_class_type='http://www.w3.org/2002/07/owl#Class',\n  owl_individual_type='http://www.w3.org/2002/07/owl#NamedIndividual',\n  owl_type_link='http://www.w3.org/1999/02/22-rdf-syntax-ns#type',\n  owl_synonym_type='http://www.w3.org/2002/07/owl#equivalentClass',\n  owl_hyponym_type='http://www.w3.org/2000/01/rdf-schema#subClassOf',\n  symmetric_matching=False)\n\nLoads information from an existing ontology and manages ontology\nmatching.\n\nThe ontology must follow the W3C OWL 2 standard. Search phrase words are\nmatched to hyponyms, synonyms and instances from within documents being\nsearched.\n\nThis class is designed for small ontologies that have been constructed\nby hand for specific use cases. Where the aim is to model a large number\nof semantic relationships, word embeddings are likely to offer\nbetter results.\n\nHolmes is not designed to support changes to a loaded ontology via direct\ncalls to the methods of this class. It is also not permitted to share a single instance\nof this class between multiple Manager instances: instead, a separate Ontology instance\npointing to the same path should be created for each Manager.\n\nMatching is case-insensitive.\n\nParameters:\n\nontology_path -- the path from where the ontology is to be loaded,\nor a list of several such paths. See https://github.com/RDFLib/rdflib/.  \nowl_class_type -- optionally overrides the OWL 2 URL for types.  \nowl_individual_type -- optionally overrides the OWL 2 URL for individuals.  \nowl_type_link -- optionally overrides the RDF URL for types.  \nowl_synonym_type -- optionally overrides the OWL 2 URL for synonyms.  \nowl_hyponym_type -- optionally overrides the RDF URL for hyponyms.\nsymmetric_matching -- if 'True', means hypernym relationships are also taken into account.\n```\n\n\u003ca id=\"supervised-topic-training-basis\"\u003e\u003c/a\u003e\n#### 6.4 `SupervisedTopicTrainingBasis` (returned from `Manager.get_supervised_topic_training_basis()`)\n\nHolder object for training documents and their classifications from which one or more\n[SupervisedTopicModelTrainer](#supervised-topic-model-trainer) objects can be derived. This class is NOT threadsafe.\n\n``` {.python}\nSupervisedTopicTrainingBasis.parse_and_register_training_document(self, text:str, classification:str,\n  label:Optional[str]=None) -\u003e None\n\nParses and registers a document to use for training.\n\nParameters:\n\ntext -- the document text\nclassification -- the classification label\nlabel -- a label with which to identify the document in verbose training output,\n  or 'None' if a random label should be assigned.\n```\n\n``` {.python}\nSupervisedTopicTrainingBasis.register_training_document(self, doc:Doc, classification:str, \n  label:Optional[str]=None) -\u003e None\n\nRegisters a pre-parsed document to use for training.\n\nParameters:\n\ndoc -- the document\nclassification -- the classification label\nlabel -- a label with which to identify the document in verbose training output,\n  or 'None' if a random label should be assigned.\n```\n\n``` {.python}\nSupervisedTopicTrainingBasis.register_additional_classification_label(self, label:str) -\u003e None\n\nRegister an additional classification label which no training document possesses explicitly\n  but that should be assigned to documents whose explicit labels are related to the\n  additional classification label via the classification ontology.\n```\n\n``` {.python}\nSupervisedTopicTrainingBasis.prepare(self) -\u003e None\n\nMatches the phraselets derived from the training documents against the training\n  documents to generate frequencies that also include combined labels, and examines the\n  explicit classification labels, the additional classification labels and the\n  classification ontology to derive classification implications.\n\n  Once this method has been called, the instance no longer accepts new training documents\n  or additional classification labels.\n```\n\n\u003ca id=\"supervised-topic-training-basis-train\"\u003e\u003c/a\u003e\n``` {.python}\nSupervisedTopicTrainingBasis.train(\n        self,\n        *,\n        minimum_occurrences: int = 4,\n        cv_threshold: float = 1.0,\n        learning_rate: float = 0.001,\n        batch_size: int = 5,\n        max_epochs: int = 200,\n        convergence_threshold: float = 0.0001,\n        hidden_layer_sizes: Optional[List[int]] = None,\n        shuffle: bool = True,\n        normalize: bool = True\n    ) -\u003e SupervisedTopicModelTrainer:\n\nTrains a model based on the prepared state.\n\nParameters:\n\nminimum_occurrences -- the minimum number of times a word or relationship has to\n  occur in the context of the same classification for the phraselet\n  to be accepted into the final model.\ncv_threshold -- the minimum coefficient of variation with which a word or relationship has\n  to occur across the explicit classification labels for the phraselet to be\n  accepted into the final model.\nlearning_rate -- the learning rate for the Adam optimizer.\nbatch_size -- the number of documents in each training batch.\nmax_epochs -- the maximum number of training epochs.\nconvergence_threshold -- the threshold below which loss measurements after consecutive\n  epochs are regarded as equivalent. Training stops before 'max_epochs' is reached\n  if equivalent results are achieved after four consecutive epochs.\nhidden_layer_sizes -- a list containing the number of neurons in each hidden layer, or\n  'None' if the topology should be determined automatically.\nshuffle -- 'True' if documents should be shuffled during batching.\nnormalize -- 'True' if normalization should be applied to the loss function.\n```\n\n\u003ca id=\"supervised-topic-model-trainer\"\u003e\u003c/a\u003e\n#### 6.5 `SupervisedTopicModelTrainer` (returned from `SupervisedTopicTrainingBasis.train()`)\n\nWorker object used to train and generate models. This object could be removed from the public interface\n(`SupervisedTopicTrainingBasis.train()` could return a `SupervisedTopicClassifier` directly) but has\nbeen retained to facilitate testability.\n\nThis class is NOT threadsafe.\n\n``` {.python}\nSupervisedTopicModelTrainer.classifier(self)\n\nReturns a supervised topic classifier which contains no explicit references to the training data and that\ncan be serialized.\n```\n\n\u003ca id=\"supervised-topic-classifier\"\u003e\u003c/a\u003e\n#### 6.6 `SupervisedTopicClassifier` (returned from\n`SupervisedTopicModelTrainer.classifier()` and\n`Manager.deserialize_supervised_topic_classifier()`))\n\n``` {.python}\nSupervisedTopicClassifier.def parse_and_classify(self, text: str) -\u003e Optional[OrderedDict]:\n\nReturns a dictionary from classification labels to probabilities\n  ordered starting with the most probable, or *None* if the text did\n  not contain any words recognised by the model.\n\nParameters:\n\ntext -- the text to parse and classify.\n```\n\n``` {.python}\nSupervisedTopicClassifier.classify(self, doc: Doc) -\u003e Optional[OrderedDict]:\n\nReturns a dictionary from classification labels to probabilities\n  ordered starting with the most probable, or *None* if the text did\n  not contain any words recognised by the model.\n\n\nParameters:\n\ndoc -- the pre-parsed document to classify.\n```\n\n``` {.python}\nSupervisedTopicClassifier.serialize_model(self) -\u003e str\n\nReturns a serialized model that can be reloaded using\n  *Manager.deserialize_supervised_topic_classifier()*\n```\n\n\u003ca id=\"dictionary\"\u003e\u003c/a\u003e\n#### 6.7 Dictionary returned from `Manager.match()`\n\n``` {.python}\nA text-only representation of a match between a search phrase and a\ndocument. The indexes refer to tokens.\n\nProperties:\n\nsearch_phrase_label -- the label of the search phrase.\nsearch_phrase_text -- the text of the search phrase.\ndocument -- the label of the document.\nindex_within_document -- the index of the match within the document.\nsentences_within_document -- the raw text of the sentences within the document that matched.\nnegated -- 'True' if this match is negated.\nuncertain -- 'True' if this match is uncertain.\ninvolves_coreference -- 'True' if this match was found using coreference resolution.\noverall_similarity_measure -- the overall similarity of the match, or\n  '1.0' if embedding-based matching was not involved in the match.  \nword_matches -- an array of dictionaries with the properties:\n\n  search_phrase_token_index -- the index of the token that matched from the search phrase.\n  search_phrase_word -- the string that matched from the search phrase.\n  document_token_index -- the index of the token that matched within the document.\n  first_document_token_index -- the index of the first token that matched within the document.\n    Identical to 'document_token_index' except where the match involves a multiword phrase.\n  last_document_token_index -- the index of the last token that matched within the document\n    (NOT one more than that index). Identical to 'document_token_index' except where the match\n    involves a multiword phrase.\n  structurally_matched_document_token_index -- the index of the token within the document that\n    structurally matched the search phrase token. Is either the same as 'document_token_index' or\n    is linked to 'document_token_index' within a coreference chain.\n  document_subword_index -- the index of the token subword that matched within the document, or\n    'None' if matching was not with a subword but with an entire token.\n  document_subword_containing_token_index -- the index of the document token that contained the\n    subword that matched, which may be different from 'document_token_index' in situations where a\n    word containing multiple subwords is split by hyphenation and a subword whose sense\n    contributes to a word is not overtly realised within that word.\n  document_word -- the string that matched from the document.\n  document_phrase -- the phrase headed by the word that matched from the document.\n  match_type -- 'direct', 'derivation', 'entity', 'embedding', 'ontology', 'entity_embedding'\n    or 'question'.\n  negated -- 'True' if this word match is negated.\n  uncertain -- 'True' if this word match is uncertain.\n  similarity_measure -- for types 'embedding' and 'entity_embedding', the similarity between the\n    two tokens, otherwise '1.0'.\n  involves_coreference -- 'True' if the word was matched using coreference resolution.\n  extracted_word -- within the coreference chain, the most specific term that corresponded to\n    the document_word.\n  depth -- the number of hyponym relationships linking 'search_phrase_word' and\n    'extracted_word', or '0' if ontology-based matching is not active. Can be negative\n    if symmetric matching is active.\n  explanation -- creates a human-readable explanation of the word match from the perspective of the\n    document word (e.g. to be used as a tooltip over it).\n```\n\n\u003ca id=\"topic-match-dictionary\"\u003e\u003c/a\u003e\n#### 6.8 Dictionary returned from `Manager.topic_match_documents_against()`\n\n``` {.python}\nA text-only representation of a topic match between a search text and a document.\n\nProperties:\n\ndocument_label -- the label of the document.\ntext -- the document text that was matched.\ntext_to_match -- the search text.\nrank -- a string representation of the scoring rank which can have the form e.g. '2=' in case of a tie.\nindex_within_document -- the index of the document token where the activation peaked.\nsubword_index -- the index of the subword within the document token where the activation peaked, or\n  'None' if the activation did not peak at a specific subword.\nstart_index -- the index of the first document token in the topic match.\nend_index -- the index of the last document token in the topic match (NOT one more than that index).\nsentences_start_index -- the token start index within the document of the sentence that contains\n  'start_index'\nsentences_end_index -- the token end index within the document of the sentence that contains\n  'end_index' (NOT one more than that index).\nsentences_character_start_index_in_document -- the character index of the first character of 'text'\n  within the document.\nsentences_character_end_index_in_document -- one more than the character index of the last\n  character of 'text' within the document.\nscore -- the score\nword_infos -- an array of arrays with the semantics:\n\n  [0] -- 'relative_start_index' -- the index of the first character in the word relative to\n    'sentences_character_start_index_in_document'.\n  [1] -- 'relative_end_index' -- one more than the index of the last character in the word\n    relative to 'sentences_character_start_index_in_document'.  \n  [2] -- 'type' -- 'single' for a single-word match, 'relation' if within a relation match\n    involving two words, 'overlapping_relation' if within a relation match involving three\n    or more words.\n  [3] -- 'is_highest_activation' -- 'True' if this was the word at which the highest activation\n    score reported in 'score' was achieved, otherwise 'False'.\n  [4] -- 'explanation' -- a human-readable explanation of the word match from the perspective of\n    the document word (e.g. to be used as a tooltip over it).\n\nanswers -- an array of arrays with the semantics:\n\n  [0] -- the index of the first character of a potential answer to an initial question word.\n  [1] -- one more than the index of the last character of a potential answer to an initial question\n    word.\n```\n\n\u003ca id=\"a-note-on-the-license\"\u003e\u003c/a\u003e\n### 7 A note on the license\n\nEarlier versions of Holmes could only be published under a restrictive license because of patent issues. As explained in the\n[introduction](#introduction), this is no longer the case thanks to the generosity of [AstraZeneca](https://www.astrazeneca.com/):\nversions from 4.0.0 onwards are licensed under the MIT license.\n\n\u003ca id=\"information-for-developers\"\u003e\u003c/a\u003e\n### 8 Information for developers\n\n\u003ca id=\"how-it-works\"\u003e\u003c/a\u003e\n#### 8.1 How it works\n\n\u003ca id=\"how-it-works-structural-matching\"\u003e\u003c/a\u003e\n##### 8.1.1 Structural matching (chatbot and structural extraction)\n\nThe word-level matching and the high-level operation of structural\nmatching between search-phrase and document subgraphs both work more or\nless as one would expect. What is perhaps more in need of further\ncomment is the semantic analysis code subsumed in the [parsing.py](https://github.com/explosion/holmes-extractor/blob/master/holmes_extractor/parsing.py)\nscript as well as in the `language_specific_rules.py` script for each\nlanguage.\n\n`SemanticAnalyzer` is an abstract class that is subclassed for each\nlanguage: at present by `EnglishSemanticAnalyzer` and\n`GermanSemanticAnalyzer`. These classes contain most of the semantic analysis code.\n`SemanticMatchingHelper` is a second abstract class, again with an concrete\nimplementation for each language, that contains semantic analysis code\nthat is required at matching time. Moving this out to a separate class family\nwas necessary because, on operating systems that spawn processes rather\nthan forking processes (e.g. Windows), `SemanticMatchingHelper` instances\nhave to be serialized when the worker processes are created: this would\nnot be possible for `SemanticAnalyzer` instances because not all\nspaCy models are serializable, and would also unnecessarily consume\nlarge amounts of memory.\n\nAt present, all functionality that is common\nto the two languages is realised in the two abstract parent classes.\nEspecially because English and German are closely related languages, it\nis probable that functionality will need to be moved from the abstract\nparent classes to specific implementing children classes if and when new\nsemantic analyzers are added for new languages.\n\nThe `HolmesDictionary` class is defined as a [spaCy extension\nattribute](https://spacy.io/usage/processing-pipelines#section-custom-components-attributes)\nthat is accessed using the syntax `token._.holmes`. The most important\ninformation in the dictionary is a list of `SemanticDependency` objects.\nThese are derived from the dependency relationships in the spaCy output\n(`token.dep_`) but go through a considerable amount of processing to\nmake them 'less syntactic' and 'more semantic'. To give but a few\nexamples:\n\n-   Where coordination occurs, dependencies are added to and from all\n    siblings.\n-   In passive structures, the dependencies are swapped around to capture\n    the fact that the syntactic subject is the semantic object and\n    vice versa.\n-   Relationships are added spanning main and subordinate clauses to\n    capture the fact that the syntactic subject of a main clause also\n    plays a semantic role in the subordinate clause.\n\nSome new semantic dependency labels that do not occur in spaCy outputs\nas values of `token.dep_` are added for Holmes semantic dependencies.\nIt is important to understand that Holmes semantic dependencies are used\nexclusively for matching and are therefore neither intended nor required\nto form a coherent set of linguistic theoretical entities or relationships;\nwhatever works best for matching is assigned on an ad-hoc basis.\n\nFor each language, the `match_implication_dict` dictionary maps search-phrase semantic dependencies\nto matching document semantic dependencies and is responsible for the [asymmetry of matching\nbetween search phrases and documents](#general-comments).\n\n\u003ca id=\"how-it-works-topic-matching\"\u003e\u003c/a\u003e\n##### 8.1.2 Topic matching\n\nTopic matching involves the following steps:\n\n1. The query document or query phrase is parsed and a number of **phraselets**\nare derived from it. Single-word phraselets are extracted for every word (or subword in German) with its own meaning within the query phrase apart from a handful of stop words defined within the semantic matching helper (`SemanticMatchingHelper.topic_matching_phraselet_stop_lemmas`), which are\nconsistently ignored throughout the whole process.\n2. Two-word or **relation** phraselets are extracted from the query document or query phrase wherever certain grammatical structures\nare found. The structures that trigger two-word phraselets differ from language to language\nbut typically include verb-subject, verb-object and noun-adjective pairs as well as verb-noun and noun-noun relations spanning prepositions. Each relation phraselet\nhas a parent (governor) word or subword and a child (governed) word or subword. The relevant\nphraselet structures for a given language are defined in `SemanticMatchingHelper.phraselet_templates`.\n3. Both types of phraselet are assigned a **frequency factor** expressing how common or rare its word or words are in the corpus. Frequency factors are determined using a logarithmic calculation and range from 0.0 (very common) to 1.0 (very rare). Each word within a relation phraselet is also assigned its own frequency factor.\n4. Phraselet templates where the parent word belongs to a closed word class, e.g. prepositions, can be defined as 'reverse_only'. This signals that matching with derived phraselets should only be attempted starting from the child word rather than from the parent word as normal. Phraselets are also defined as reverse-only when th","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsg-systems%2Fholmes-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsg-systems%2Fholmes-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsg-systems%2Fholmes-extractor/lists"}