{"id":20874497,"url":"https://github.com/wikidata/strephit","last_synced_at":"2025-05-12T15:30:38.635Z","repository":{"id":73797833,"uuid":"50506296","full_name":"Wikidata/StrepHit","owner":"Wikidata","description":"An intelligent reading agent that understands text and translates it into Wikidata statements.","archived":false,"fork":false,"pushed_at":"2016-07-14T09:59:43.000Z","size":8542,"stargazers_count":115,"open_issues_count":17,"forks_count":14,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-04-01T07:01:50.401Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Wikidata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-01-27T12:38:09.000Z","updated_at":"2025-03-23T13:47:03.000Z","dependencies_parsed_at":"2023-03-25T11:18:05.230Z","dependency_job_id":null,"html_url":"https://github.com/Wikidata/StrepHit","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wikidata%2FStrepHit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wikidata%2FStrepHit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wikidata%2FStrepHit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wikidata%2FStrepHit/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Wikidata","download_url":"https://codeload.github.com/Wikidata/StrepHit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253765740,"owners_count":21960780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-18T06:33:10.688Z","updated_at":"2025-05-12T15:30:38.622Z","avatar_url":"https://github.com/Wikidata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StrepHit\n*StrepHit* is a **Natural Language Processing** pipeline that understands human language, extracts facts from text and produces **[Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) statements** with **references**.\n\n*StrepHit* is an IEG project **funded by the [Wikimedia Foundation](https://wikimediafoundation.org/wiki/Home)**.\n\n*StrepHit* will enhance the data quality of Wikidata by **suggesting references to validate statements**, and will help Wikidata become the gold-standard hub of the Open Data landscape.\n\n# Official Project Page\nhttps://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References\n\n# Documentation\nhttps://www.mediawiki.org/wiki/StrepHit\n\n# Features\n- **[Web spiders](strephit/web_sources_corpus)** to collect a biographical corpus from a [list of reliable sources](https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline#Biographies)\n- **[Corpus analysis](strephit/corpus_analysis)** to understand the most meaningful verbs 
\n- **[Extraction](strephit/extraction)** of sentences and semi-structured data from a corpus\n- Train an automatic classifier through **[crowdsourcing](strephit/annotation)**\n- **Extract facts** from text in 2 ways:\n    - [Supervised](strephit/classification)\n    - [Rule-based](strephit/rule_based)\n- Several **[utilities](strephit/commons)**, ranging from NLP tasks like *[tokenization](https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis))* and *[part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)*, to facilities for parallel processing, caching and logging\n\n# Pipeline\n1. Corpus Harvesting\n2. Corpus Analysis\n3. Sentence Extraction\n4. N-ary Relation Extraction\n5. Dataset Serialization\n\n# Get Ready\n- Install **[Python 2.7](https://www.python.org/downloads/)** and **[pip](https://pip.pypa.io/en/stable/installing/)**\n- Clone the repository and create the output folder:\n```\n$ git clone https://github.com/Wikidata/StrepHit.git\n$ mkdir StrepHit/output\n```\n- Install all the Python requirements (preferably in a [virtualenv](http://docs.python-guide.org/en/latest/dev/virtualenvs/))\n```\n$ cd StrepHit\n$ pip install -r requirements.txt\n```\n- Install [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)\n- Register for a free account on the [Dandelion APIs](https://dandelion.eu/accounts/register/?next=/docs/api/datatxt/nex/getting-started/)\n- Create the file `strephit/commons/secret_keys.py` with your API token. 
You can find it in [your dashboard](https://dandelion.eu/profile/dashboard/)\n```\nNEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'\nNEX_TOKEN = 'your API token here'\n```\n\n## Optional dependency\nIf you want to **[extract sentences](strephit/extraction/extract_sentences.py)** via __[syntactic parsing](https://en.wikipedia.org/wiki/Parsing)__, you will need to install:\n- [Java 8](http://www.java.com/en/download/)\n- [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/), through our utility:\n```\n$ python -m strephit commons download stanford_corenlp\n```\n\n# Command Line\nYou can run all the NLP pipeline components through a command line.\nDo not specify any argument, or use `--help` to see the available options.\nEach command can have a set of sub-commands, depending on its granularity.\n```\n$ python -m strephit                                                                             \nUsage: __main__.py [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --log-level \u003cTEXT CHOICE\u003e...\n  --cache-dir DIRECTORY\n  --help                        Show this message and exit.\n\nCommands:\n  annotation          Corpus annotation via crowdsourcing\n  classification      Roles classification\n  commons             Common utilities used by others\n  corpus_analysis     Corpus analysis module\n  extraction          Data extraction from the corpus\n  rule_based          Unsupervised fact extraction\n  side_projects       Side projects scripts\n  web_sources_corpus  Corpus retrieval from the web\n```\n\n# Get Started\n- Generate a dataset of Wikidata assertions (*[QuickStatements](https://tools.wmflabs.org/wikidata-todo/quick_statements.php)* syntax) from semi-structured data in the corpus (takes time, and a good internet connection):\n```\n$ python -m strephit extraction process_semistructured -p 1 samples/corpus.jsonlines\n```\n\n- Produce a ranking of meaningful verbs:\n```\n$ python -m strephit commons pos_tag samples/corpus.jsonlines bio en\n$ python -m 
strephit corpus_analysis rank_verbs output/pos_tagged.jsonlines bio en\n```\n\n- Extract sentences using the ranking and perform [Entity Linking](https://en.wikipedia.org/wiki/Entity_linking):\n```\n$ python -m strephit extraction extract_sentences samples/corpus.jsonlines output/verbs.json en\n$ python -m strephit commons entity_linking -p 1 output/sentences.jsonlines en\n```\n\n- Extract facts with the rule-based classifier:\n```\n$ python -m strephit rule_based classify output/entity_linked.jsonlines samples/lexical_db.json en\n```\n\n- Train the supervised classifier and extract facts:\n```\n$ python -m strephit annotation parse_results samples/crowdflower_results.csv\n$ python -m strephit classification train output/training_set.jsonlines en\n$ python -m strephit classification classify output/entity_linked.jsonlines output/classifier_model.pkl en\n```\n\n- Serialize the supervised classification results into a dataset of Wikidata assertions (*QuickStatements*):\n```\n$ python -m strephit commons serialize -p 1 output/supervised_classified.jsonlines samples/lexical_db.json en\n```\n\n**N.B.**: you will find all the output files in the `output` folder.\n\n## Note on Parallel Processing\nBy default, StrepHit uses as many processes as the number of CPU cores in the machine where it runs.\nAdd the `-p` parameter if you want to change the behavior.\n\nSet `-p 1` to **disable** parallel processing.\n\n# License\nThe source code is under the terms of the [GNU General Public License, version 3](http://www.gnu.org/licenses/gpl.html).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwikidata%2Fstrephit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwikidata%2Fstrephit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwikidata%2Fstrephit/lists"}