{"id":17202907,"url":"https://github.com/kermitt2/grisp","last_synced_at":"2025-04-13T21:12:15.855Z","repository":{"id":42014249,"uuid":"77175767","full_name":"kermitt2/grisp","owner":"kermitt2","description":"Knowledge Base stuff","archived":false,"fork":false,"pushed_at":"2025-02-23T21:46:35.000Z","size":89530,"stargazers_count":18,"open_issues_count":6,"forks_count":3,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-13T21:12:13.295Z","etag":null,"topics":["entity-fishing","nerd","wikidata","wikipedia"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kermitt2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-12-22T21:04:52.000Z","updated_at":"2025-02-04T19:24:54.000Z","dependencies_parsed_at":"2025-01-20T00:35:25.432Z","dependency_job_id":"0ed0aabf-8bd3-4725-aa34-12c2eeadfc59","html_url":"https://github.com/kermitt2/grisp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fgrisp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fgrisp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fgrisp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kermitt2%2Fgrisp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kermitt2","download_url":"https://codeload.github.com/kermitt2/gri
sp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248782259,"owners_count":21160717,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["entity-fishing","nerd","wikidata","wikipedia"],"created_at":"2024-10-15T02:16:19.943Z","updated_at":"2025-04-13T21:12:15.817Z","avatar_url":"https://github.com/kermitt2.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GRISP\n\nPre-process the Language and Knowledge Base data for loading into [entity-fishing](https://github.com/kermitt2/entity-fishing).\n\n## Create entity-fishing Wikipedia and Wikidata preprocessed data\n\nThe sub-module `nerd-data` pre-processes the Wikidata JSON and Wikipedia XML dumps to create compiled data to be used by [entity-fishing](https://github.com/kermitt2/entity-fishing), a machine learning tool for extracting and disambiguating Wikidata entities in text and PDF at scale. \n\nThe pre-processing is an adaptation of the [WikipediaMiner 2.0](https://github.com/dnmilne/wikipediaminer) for the XML dump processing, which relies on Hadoop. 
The main modifications include the use of the [Sweble MediaWiki document parser](https://en.wikipedia.org/wiki/Sweble) for Wikipedia pages (the most comprehensive, reliable and fast MediaWiki parser in our tests, apart from MediaWiki itself), a complete review of the compiled statistics, processing of the Wikidata dump, the use of LMDB to avoid distributed data, additional extraction related to multilinguality, and various speed optimizations.\n\nThe Wikipedia pre-processing supports the current Wikipedia dumps (2022) and was successfully tested with English, French, German, Italian, Spanish, Arabic, Mandarin, Russian, Japanese, Portuguese and Farsi XML dumps. The Wikipedia XML dumps and additional required files are available at the Wikimedia Downloads [page](https://dumps.wikimedia.org/), as well as the Wikidata JSON dump.\n\n### Preliminary install of entity-fishing and GRISP\n\n[entity-fishing](https://github.com/kermitt2/entity-fishing) needs to be installed and built first, without the knowledge-base and language data:\n\n```console\ngit clone https://github.com/kermitt2/entity-fishing\ncd entity-fishing\n./gradlew clean build -x test\n```\n\nThe `-x test` option is important when building: it skips the tests, because no knowledge-base and language resource data are available for them yet. 
\n\nThen install and build GRISP:\n\n```console\ngit clone https://github.com/kermitt2/grisp\ncd grisp\nmvn clean install\n```\n\n**Note:** the current latest version of both GRISP and [entity-fishing](https://github.com/kermitt2/entity-fishing) is `0.0.6`.\n\n### Script for preparing the Wikidata and Wikipedia resources \n\nA script is available to:\n* download the different resources needed from Wikidata and Wikipedia for a set of specified languages\n* create csv translation files between languages\n* generate Wikidata property labels for each language\n* create the Wikidata knowledge base backbone and the language-specific mapping with Wikidata entities\n\nThe script has been tested on a Linux setup, but it is likely to work on macOS as well. To run the script:\n\n```console\ncd grisp/scripts/\n./wikipedia-resources.sh [install path of GRISP] [storage path of the data resources]\n```\n\nFor example:\n\n```console\n./wikipedia-resources.sh /home/lopez/grisp/ /media/lopez/data/wikipedia/latest/\n```\n\nThe script performs the steps listed above one after the other. By default all the languages are covered, but you can restrict processing to a subset of languages by modifying the script at the following line:\n\n```bash\ndeclare -a languages=(\"en\" \"de\" \"fr\" \"it\" \"es\" \"ar\" \"zh\" \"ja\" \"ru\" \"pt\" \"fa\" \"uk\" \"sv\" \"hi\" \"bn\")\n```\n\nNote that at least English `\"en\"` is mandatory for running [entity-fishing](https://github.com/kermitt2/entity-fishing) afterwards. \n\nBe aware that the data path must have enough storage: as of April 2022, 74GB are needed for the Wikidata dump and 70GB for all the language resources. To accommodate all resources, including the next Hadoop processing step, plan for 200GB for all the languages. 
\n\n### Hadoop processing of Wikipedia XML article dump files\n\nOnce all the required resources have been downloaded via the script described above, we can run the pre-processing of the Wikipedia dumps.\n\nParsing and processing the Wikipedia XML article dump files is computationally expensive and has to be parallelized, so we use a Hadoop process for this purpose. A pseudo-distributed mode (running the process on one machine with several CPU cores) is enough for a reasonable processing time. A \"real\" distributed mode has not been tested so far and is thus currently not supported. \n\nCreate the Hadoop job jar:\n\n```console\ncd grisp/nerd-data\nmvn clean package\n```\n\nThen see the instructions under [nerd-data/doc/hadoop.md](nerd-data/doc/hadoop.md) for running the Hadoop job and getting the csv result files.\n\nThis processing is an adaptation and optimization of the [WikipediaMiner 2.0](https://github.com/dnmilne/wikipediaminer) XML dump processing. It enables the support of the latest Wikipedia dump files. The processing is considerably faster than with WikipediaMiner and a single server is enough for processing the latest XML dumps in a reasonable time. For the December 2016 English Wikipedia XML dump: around 7 hours 30 minutes. For the December 2016 French and German Wikipedia XML dumps: around 2 hours 30 minutes (in pseudo-distributed mode, one server with an Intel Core i7-4790K CPU 4.00GHz Haswell, 16GB memory, 4 cores, 8 threads, SSD). \n\nWe think that it is still possible to significantly improve the processing time, lower the memory consumption, and avoid Hadoop completely, simply by optimizing the processing for a common single multi-threaded machine. But given that the current state of the library gives satisfactory performance, we leave these improvements for the future if necessary. 
\n\n### Final hierarchy of files \n\nHere is how the final data tree should look from the root directory (for 3 languages, additional languages follow the same pattern), ready to be loaded and further optimized in embedded databases by [entity-fishing](https://github.com/kermitt2/entity-fishing): \n\n```\n.\n├── de\n│   ├── articleParents.csv\n│   ├── categoryParents.csv\n│   ├── childArticles.csv\n│   ├── childCategories.csv\n│   ├── dewiki-latest-langlinks.sql.gz\n│   ├── dewiki-latest-page_props.sql.gz\n│   ├── dewiki-latest-pages-articles-multistream.xml.bz2\n│   ├── label.csv\n│   ├── page.csv\n│   ├── pageLabel.csv\n│   ├── pageLinkIn.csv\n│   ├── pageLinkOut.csv\n│   ├── redirectSourcesByTarget.csv\n│   ├── redirectTargetsBySource.csv\n│   ├── stats.csv\n│   ├── translations.csv\n│   ├── wikidata-properties.json\n│   └── wikidata.txt\n├── en\n│   ├── articleParents.csv\n│   ├── categoryParents.csv\n│   ├── childArticles.csv\n│   ├── childCategories.csv\n│   ├── enwiki-latest-langlinks.sql.gz\n│   ├── enwiki-latest-page_props.sql.gz\n│   ├── enwiki-latest-pages-articles-multistream.xml.bz2\n│   ├── label.csv\n│   ├── page.csv\n│   ├── pageLabel.csv\n│   ├── pageLinkIn.csv\n│   ├── pageLinkOut.csv\n│   ├── redirectSourcesByTarget.csv\n│   ├── redirectTargetsBySource.csv\n│   ├── stats.csv\n│   ├── translations.csv\n│   ├── wikidata-properties.json\n│   └── wikidata.txt\n├── fr\n│   ├── articleParents.csv\n│   ├── categoryParents.csv\n│   ├── childArticles.csv\n│   ├── childCategories.csv\n│   ├── frwiki-latest-langlinks.sql.gz\n│   ├── frwiki-latest-page_props.sql.gz\n│   ├── frwiki-latest-pages-articles-multistream.xml.bz2\n│   ├── label.csv\n│   ├── page.csv\n│   ├── pageLabel.csv\n│   ├── pageLinkIn.csv\n│   ├── pageLinkOut.csv\n│   ├── redirectSourcesByTarget.csv\n│   ├── redirectTargetsBySource.csv\n│   ├── stats.csv\n│   ├── translations.csv\n│   ├── wikidata-properties.json\n│   └── wikidata.txt\n└── wikidata\n    ├── wikidataIds.csv\n    └── latest-all.json.bz2\n\n```\n\nNote:\n\n- each language-specific directory is expected to contain 15 files, plus 3 Wikipedia dump files (the `.bz2` and `.gz` files),\n\n- the full Wikipedia article dump for each language must be present in the language-specific directory (e.g. `enwiki-latest-pages-articles-multistream.xml.bz2`, `enwiki-latest-pages-articles-multistream.xml.gz` or `enwiki-latest-pages-articles-multistream.xml`); it is required to generate definitions for entities, create training data and compute additional entity embeddings. The dump file can be compressed with `bz2` or `gzip`, or left uncompressed - all these variants should be loaded appropriately by entity-fishing,\n\n- the Wikidata identifiers csv file `wikidataIds.csv` and the full Wikidata JSON dump file `latest-all.json.bz2` are under a `wikidata` sub-directory, while the language-specific Wikidata mapping files `wikidata.txt` and `wikidata-properties.json` are installed in each language-specific sub-directory,\n\n- in entity-fishing, these files are loaded automatically when building the project or starting the service (if not already loaded); be sure to indicate the path to the files generated above in the entity-fishing config files.\n\n\n### More to come\n\nWe are considering generating more KB data to be mapped (geonames, geospecies, etc.) and better exploiting Wikidata labels and statements.\n\n## Credits\n\nMany thanks to David Milne for the Wikipedia XML dump processing. The present pre-processing of the Wikipedia data is originally a fork of a part of his project. \n\n## License\n\nGRISP is distributed under the [GPL 3.0 license](https://www.gnu.org/licenses/gpl-3.0.html). 
\n\nContact: Patrice Lopez (\u003cpatrice.lopez@science-miner.com\u003e)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkermitt2%2Fgrisp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkermitt2%2Fgrisp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkermitt2%2Fgrisp/lists"}