{"id":13747169,"url":"https://github.com/erre-quadro/spikex","last_synced_at":"2025-04-05T04:08:57.661Z","repository":{"id":48357639,"uuid":"278327802","full_name":"erre-quadro/spikex","owner":"erre-quadro","description":"SpikeX - SpaCy Pipes for Knowledge Extraction","archived":false,"fork":false,"pushed_at":"2021-07-30T07:49:16.000Z","size":3597,"stargazers_count":398,"open_issues_count":7,"forks_count":28,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-29T03:03:33.653Z","etag":null,"topics":["abbreviations-detection","acronym-recognition","clustering","entity-linking","named-entity-recognition","nlp","noun-phrase-extract","sentence-splitting","spacy","spacy-pipes","verb-phrase-extract","wikigraph","wikipedia","wikipedia-graph"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erre-quadro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-09T09:58:16.000Z","updated_at":"2025-03-20T07:00:12.000Z","dependencies_parsed_at":"2022-09-19T10:01:38.871Z","dependency_job_id":null,"html_url":"https://github.com/erre-quadro/spikex","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erre-quadro%2Fspikex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erre-quadro%2Fspikex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erre-quadro%2Fspikex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erre-quadro%2Fspikex/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erre-quadro","download_url":"https://codeload.github.com/erre-quadro/spikex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247284943,"owners_count":20913704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["abbreviations-detection","acronym-recognition","clustering","entity-linking","named-entity-recognition","nlp","noun-phrase-extract","sentence-splitting","spacy","spacy-pipes","verb-phrase-extract","wikigraph","wikipedia","wikipedia-graph"],"created_at":"2024-08-03T06:01:19.008Z","updated_at":"2025-04-05T04:08:57.633Z","avatar_url":"https://github.com/erre-quadro.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# SpikeX - SpaCy Pipes for Knowledge Extraction\n\nSpikeX is a collection of pipes ready to be plugged in a spaCy pipeline.\nIt aims to help in building knowledge extraction tools with almost-zero effort.\n\n[![Build Status](https://img.shields.io/azure-devops/build/erre-quadro/spikex/3/master?label=build\u0026logo=azure-pipelines\u0026style=flat-square)](https://dev.azure.com/erre-quadro/spikex/_build/latest?definitionId=3\u0026branchName=master)\n[![pypi Version](https://img.shields.io/pypi/v/spikex.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/spikex/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)\n\n## What's new in SpikeX 0.5.0\n\n**WikiGraph** has 
never been so lightning fast:\n- 🌕 **Performance mooning**, thanks to the adoption of a *sparse adjacency matrix* to handle pages graph, instead of using *igraph*\n- 🚀 **Memory optimization**, with a consumption cut by ~40% and a compressed size cut by ~20%, introducing new *bidirectional dictionaries* to manage data\n- 📖 **New APIs** for a faster and easier usage and interaction\n- 🛠 **Overall fixes**, for a better graph and a better pages matching \n \n## Pipes\n\n- **WikiPageX** links Wikipedia pages to chunks in text\n- **ClusterX** picks noun chunks in a text and clusters them based on a revisiting of the [Ball Mapper](https://arxiv.org/abs/1901.07410) algorithm, Radial Ball Mapper\n- **AbbrX** detects abbreviations and acronyms, linking them to their long form. It is based on [scispacy](https://github.com/allenai/scispacy/blob/master/scispacy/abbreviation.py)'s one with improvements\n- **LabelX** takes labelings of pattern matching expressions and catches them in a text, solving overlappings, abbreviations and acronyms\n- **PhraseX** creates a `Doc`'s underscore extension based on a custom attribute name and phrase patterns. 
Examples are **NounPhraseX** and **VerbPhraseX**, which extract noun phrases and verb phrases, respectively\n- **SentX** detects sentences in a text, based on [Splitta](https://github.com/dgillick/splitta) with refinements\n\n## Tools\n\n- **WikiGraph** with pages as leaves linked to categories as nodes\n- **Matcher** that inherits its interface from the [spaCy](https://github.com/explosion/spaCy/blob/master/spacy/matcher/matcher.pyx)'s one, but built using an engine made of RegEx which boosts its performance\n\n## Install SpikeX\n\nSome requirements are inherited from spaCy:\n\n- **spaCy version**: 2.3+\n- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual\n  Studio)\n- **Python version**: Python 3.6+ (only 64 bit)\n- **Package managers**: [pip](https://pypi.org/project/spikex/)\n\nSome dependencies use **Cython**, which needs to be installed before SpikeX:\n\n```bash\npip install cython\n```\n\nRemember that a virtual environment is always recommended, in order to avoid modifying system state.\n\n### pip\n\nAt this point, installing SpikeX via pip is a one-line command:\n\n```bash\npip install spikex\n```\n\n## Usage\n\n### Prerequisites\n\nSpikeX pipes work with spaCy, hence a model needs to be installed. Follow the official instructions [here](https://spacy.io/usage/models#download). The brand new spaCy 3.0 is supported!\n\n### WikiGraph\n\nA `WikiGraph` is built starting from some key components of Wikipedia: *pages*, *categories* and *relations* between them. \n\n#### Auto\n\nCreating a `WikiGraph` can take time, depending on how large its Wikipedia dump is. 
For this reason, we provide wikigraphs ready to be used:\n\n| Date | WikiGraph | Lang | Size (compressed) | Size (memory) | |\n| --- | --- | --- | --- | --- | --- |\n| 2021-05-20 | enwiki_core | EN | 1.3GB | 8GB | [![][dl]][enwiki_core_20210520] | \n| 2021-05-20 | simplewiki_core | EN | 20MB | 130MB | [![][dl]][simplewiki_core_20210520] |\n| 2021-05-20 | itwiki_core | IT | 208MB | 1.2GB | [![][dl]][itwiki_core_20210520] |\n| More coming... |\n\n[enwiki_core_20210520]: https://errequadrosrl-my.sharepoint.com/:u:/g/personal/paolo_arduin_errequadrosrl_onmicrosoft_com/EeIb238HAmtCruMvhzZdOl8BIEBU_09XV5FnHE4SVmYzBQ?Download=1\n[simplewiki_core_20210520]: https://errequadrosrl-my.sharepoint.com/:u:/g/personal/paolo_arduin_errequadrosrl_onmicrosoft_com/EWdpEV_R4JVEk_ZwvJTrAEUBsLpmJMxyWDa13sFOzQAo3Q?Download=1\n[itwiki_core_20210520]: https://errequadrosrl-my.sharepoint.com/:u:/g/personal/paolo_arduin_errequadrosrl_onmicrosoft_com/EcWYGXp5SUdGvFTHN9KQ_zkBW8Zu9p0hiwpC3oKyhibXtQ?Download=1\n\n[dl]: http://i.imgur.com/gQvPgr0.png\n\nSpikeX provides a command to shortcut downloading and installing a `WikiGraph` (Linux or macOS, Windows not supported yet):\n```bash\nspikex download-wikigraph simplewiki_core\n```\n\n#### Manual\n\nA `WikiGraph` can be created from command line, specifying which Wikipedia dump to take and where to save it:\n\n```bash\nspikex create-wikigraph \\\n  \u003cYOUR-OUTPUT-PATH\u003e \\\n  --wiki \u003cWIKI-NAME, default: en\u003e \\\n  --version \u003cDUMP-VERSION, default: latest\u003e \\\n  --dumps-path \u003cDUMPS-BACKUP-PATH\u003e\n```\n\nThen it needs to be packed and installed:\n\n```bash\nspikex package-wikigraph \\\n  \u003cWIKIGRAPH-RAW-PATH\u003e \\\n  \u003cYOUR-OUTPUT-PATH\u003e\n```\n\nFollow the instructions at the end of the packing process and install the distribution package in your virtual environment.\nNow you are ready to use your WikiGraph as you wish:\n\n```python\nfrom spikex.wikigraph import load as wg_load\n\nwg = 
wg_load(\"enwiki_core\")\npage = \"Natural_language_processing\"\ncategories = wg.get_categories(page, distance=1)\nfor category in categories:\n    print(category)\n\n\u003e\u003e\u003e Category:Speech_recognition\n\u003e\u003e\u003e Category:Artificial_intelligence\n\u003e\u003e\u003e Category:Natural_language_processing\n\u003e\u003e\u003e Category:Computational_linguistics\n\n```\n### Matcher\n\nThe **Matcher** is identical to spaCy's one, but faster when it comes to handling many patterns at once (on the order of thousands), so follow the official usage instructions [here](https://spacy.io/usage/rule-based-matching#matcher).\n\nA trivial example:\n```python\nfrom spikex.matcher import Matcher\nfrom spacy import load as spacy_load\n\nnlp = spacy_load(\"en_core_web_sm\")\nmatcher = Matcher(nlp.vocab)\nmatcher.add(\"TEST\", [[{\"LOWER\": \"nlp\"}]])\ndoc = nlp(\"I love NLP\")\nfor _, s, e in matcher(doc):\n  print(doc[s: e])\n\n\u003e\u003e\u003e NLP\n```\n\n### WikiPageX\n\nThe `WikiPageX` pipe uses a `WikiGraph` in order to find chunks in a text that match Wikipedia page titles.\n\n``` python\nfrom spacy import load as spacy_load\nfrom spikex.wikigraph import load as wg_load\nfrom spikex.pipes import WikiPageX\n\nnlp = spacy_load(\"en_core_web_sm\")\ndoc = nlp(\"An apple a day keeps the doctor away\")\nwg = wg_load(\"simplewiki_core\")\nwpx = WikiPageX(wg)\ndoc = wpx(doc)\nfor span in doc._.wiki_spans:\n  print(span._.wiki_pages)\n\n\u003e\u003e\u003e ['An']\n\u003e\u003e\u003e ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']\n\u003e\u003e\u003e ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']\n\u003e\u003e\u003e ['Day']\n\u003e\u003e\u003e ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']\n\u003e\u003e\u003e ['The']\n\u003e\u003e\u003e ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']\n``` \n\n### 
ClusterX\n\nThe `ClusterX` pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.\n\n``` python\nfrom spacy import load as spacy_load\nfrom spikex.pipes import ClusterX\n\nnlp = spacy_load(\"en_core_web_sm\")\ndoc = nlp(\"Grab this juicy orange and watch a dog chasing a cat.\")\nclusterx = ClusterX(min_score=0.65)\ndoc = clusterx(doc)\nfor cluster in doc._.cluster_chunks:\n  print(cluster)\n\n\u003e\u003e\u003e [this juicy orange]\n\u003e\u003e\u003e [a cat, a dog]\n```\n\n### AbbrX\n\nThe **AbbrX** pipe finds abbreviations and acronyms in the text, linking short and long forms together:\n\n```python\nfrom spacy import load as spacy_load\nfrom spikex.pipes import AbbrX\n\nnlp = spacy_load(\"en_core_web_sm\")\ndoc = nlp(\"a little snippet with an abbreviation (abbr)\")\nabbrx = AbbrX(nlp.vocab)\ndoc = abbrx(doc)\nfor abbr in doc._.abbrs:\n  print(abbr, \"-\u003e\", abbr._.long_form)\n\n\u003e\u003e\u003e abbr -\u003e abbreviation\n```\n\n### LabelX\n\nThe `LabelX` pipe matches and labels patterns in text, resolving overlaps, abbreviations and acronyms.\n\n```python\nfrom spacy import load as spacy_load\nfrom spikex.pipes import LabelX\n\nnlp = spacy_load(\"en_core_web_sm\")\ndoc = nlp(\"looking for a computer system engineer\")\npatterns = [\n  [{\"LOWER\": \"computer\"}, {\"LOWER\": \"system\"}],\n  [{\"LOWER\": \"system\"}, {\"LOWER\": \"engineer\"}],\n]\nlabelx = LabelX(nlp.vocab, [(\"TEST\", patterns)], validate=True, only_longest=True)\ndoc = labelx(doc)\nfor labeling in doc._.labelings:\n  print(labeling, f\"[{labeling.label_}]\")\n\n\u003e\u003e\u003e computer system engineer [TEST]\n```\n\n### PhraseX\n\nThe `PhraseX` pipe creates a custom `Doc` underscore extension, which is filled with matches from phrase patterns.\n\n```python\nfrom spacy import load as spacy_load\nfrom spikex.pipes import PhraseX\n\nnlp = spacy_load(\"en_core_web_sm\")\ndoc = nlp(\"I have Melrose and McIntosh apples, or Williams pears\")\npatterns = 
[\n  [{\"LOWER\": \"mcintosh\"}],\n  [{\"LOWER\": \"melrose\"}],\n]\nphrasex = PhraseX(nlp.vocab, \"apples\", patterns)\ndoc = phrasex(doc)\nfor apple in doc._.apples:\n  print(apple)\n\n\u003e\u003e\u003e Melrose\n\u003e\u003e\u003e McIntosh\n```\n### SentX\n\nThe **SentX** pipe splits sentences in a text. It modifies tokens' *is_sent_start* attribute, so it's mandatory to add it before *parser* pipe in the spaCy pipeline:\n\n```python\nfrom spacy import load as spacy_load\nfrom spikex.pipes import SentX\nfrom spikex.defaults import spacy_version\n\nif spacy_version \u003e= 3:\n  from spacy.language import Language\n\n  @Language.factory(\"sentx\")\n  def create_sentx(nlp, name):\n      return SentX()\n\nnlp = spacy_load(\"en_core_web_sm\")\nsentx_pipe = SentX() if spacy_version \u003c 3 else \"sentx\"\nnlp.add_pipe(sentx_pipe, before=\"parser\")\ndoc = nlp(\"A little sentence. Followed by another one.\")\nfor sent in doc.sents:\n  print(sent)\n\n\u003e\u003e\u003e A little sentence.\n\u003e\u003e\u003e Followed by another one.\n```\n\n## That's all folks\nFeel free to contribute and have fun!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferre-quadro%2Fspikex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferre-quadro%2Fspikex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferre-quadro%2Fspikex/lists"}