{"id":13622421,"url":"https://github.com/ICIJ/datashare","last_synced_at":"2025-04-15T06:30:27.737Z","repository":{"id":39568451,"uuid":"56667109","full_name":"ICIJ/datashare","owner":"ICIJ","description":"A self-hosted search engine for documents.","archived":false,"fork":false,"pushed_at":"2025-04-10T15:37:01.000Z","size":414224,"stargazers_count":626,"open_issues_count":58,"forks_count":57,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-04-10T16:53:41.705Z","etag":null,"topics":["datashare","docker","elasticsearch","extract","investigative-journalism","named-entity-recognition","text-extraction","web-gui"],"latest_commit_sha":null,"homepage":"https://datashare.icij.org","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ICIJ.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2016-04-20T07:52:07.000Z","updated_at":"2025-04-10T08:23:30.000Z","dependencies_parsed_at":"2023-10-16T19:52:34.275Z","dependency_job_id":"68af2544-19f4-4e09-ba60-03db39ca7b51","html_url":"https://github.com/ICIJ/datashare","commit_stats":null,"previous_names":[],"tags_count":830,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ICIJ%2Fdatashare","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ICIJ%2Fdatashare/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ICIJ%2Fdatashare/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ICIJ%2Fdatashare/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/ICIJ","download_url":"https://codeload.github.com/ICIJ/datashare/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249020568,"owners_count":21199581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datashare","docker","elasticsearch","extract","investigative-journalism","named-entity-recognition","text-extraction","web-gui"],"created_at":"2024-08-01T21:01:18.927Z","updated_at":"2025-04-15T06:30:22.728Z","avatar_url":"https://github.com/ICIJ.png","language":"Java","readme":"# Datashare [![CircleCI](https://circleci.com/gh/ICIJ/datashare.svg?style=shield)](https://circleci.com/gh/ICIJ/datashare) [![Crowdin](https://badges.crowdin.net/datashare/localized.svg)](https://crowdin.com/project/datashare)\n\n![Datashare: Better analyze information, in all its forms](https://i.imgur.com/9SPU1x2.png)\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://datashare-demo.icij.org\"\u003eDemo\u003c/a\u003e |\n\u003ca href=\"https://datashare.icij.org\"\u003eDownload\u003c/a\u003e |\n\u003ca href=\"https://github.com/ICIJ/datashare/wiki\"\u003eDocumentation\u003c/a\u003e |\n\u003ca href=\"https://icij.gitbook.io/datashare/\"\u003eUser Guide\u003c/a\u003e\n\u003c/p\u003e\n\n## Download\n\nDatashare is an open-source software developed by the International Consortium of Investigative Journalists (ICIJ). 
You can use it for free on your computer or install it on your server and analyse your documents with collaborative features.\n\nhttps://datashare.icij.org/\n\n## Follow new updates and features\n\n[@ICIJorg](https://twitter.com/ICIJorg) publishes video tweets of new features with the hashtag [#ICIJDatashare](https://twitter.com/hashtag/ICIJDatashare).\n\n## Frontend\n\nThis repository contains only the backend of Datashare.\n\nThe frontend is available at https://github.com/ICIJ/datashare-client.\n\n\n## Description\n\nDatashare is a free open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ). \n\nDatashare allows investigative journalists to:\n- access all their documents in one place locally on their computer while securing them from potential third-party interference\n- search PDFs, images, texts, spreadsheets, slides and any other files, simultaneously\n- automatically detect and filter by people, organizations and locations\n\n## Translation of the interface\n\nYou're welcome to suggest translations on Datashare's Crowdin https://crwd.in/datashare. 
Please contact us if you would like to add a language.\n\n## Installing and using\n\n### Using with elasticsearch\n\nYou can download the script at datashare.icij.org.\n\nTo access the web GUI, go to your documents folder, launch `path/to/datashare.sh`, then open http://localhost:8080 in your browser.\n\n### Using only Named Entity Recognition\n\nYou can use the Datashare Docker container solely for its HTTP-exposed name-finding API.\n\nJust run : \n\n    docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER\n\nA bit of explanation : \n- `-p 8080:8080` maps container port 8080 to host port 8080, so you can access Datashare at localhost:8080 (to access it at localhost:8081 instead, change this to `-p 8081:8080`)\n- `-m NER` runs Datashare in a stateless mode, without any index\n- `-v /path/to/dist:/home/datashare/dist` maps the directory where the NLP models will be read (and downloaded if they don't exist)\n\nThen query the server with curl : \n\n    curl -i localhost:8080/api/ner/findNames/CORENLP --data-binary @path/to/a/file.txt\n\nThe last path part (CORENLP) is the framework. You can choose among CORENLP, IXAPIPE, MITIE or OPENNLP.    \n\n### **Extract Text from Files** \n  \n*Implementations*\n  \n  - [TikaDocument](https://github.com/ICIJ/extract/blob/extractlib/extract-lib/src/main/java/org/icij/extract/document/TikaDocument.java) from ICIJ/extract \n  \n    [Apache Tika](https://tika.apache.org/) v1.18 (Apache Licence v2.0)\n  \n    with [Tesseract](https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM) v4.0 alpha \n\n\n*Support*\n\n  [Tika File Formats](https://tika.apache.org/1.18/formats.html)\n\n  \n### **Extract Persons, Organizations or Locations from Text** \n\nNote: languages other than the ones listed below are not supported. 
We encourage you to reach out to the maintainers of the original NLP projects to support your preferred language.\n   \n*Implementations*\n  \n  - `org.icij.datashare.text.nlp.corenlp.CorenlpPipeline` \n  \n    [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP) v3.8.0, \n    (Conditional Random Fields), \n    *Composite GPL v3+* \n\n  - `org.icij.datashare.text.nlp.ixapipe.IxapipePipeline` \n  \n    [Ixa Pipes Nerc](https://github.com/ixa-ehu/ixa-pipe-nerc) v1.6.1, \n    (Perceptron), \n    *Apache Licence v2.0*\n\n  - `org.icij.datashare.text.nlp.mitie.MitiePipeline` \n  \n    [MIT Information Extraction](https://github.com/mit-nlp/MITIE) v0.8, \n    (Structural Support Vector Machines), \n    *Boost Software License v1.0*\n\n  - `org.icij.datashare.text.nlp.opennlp.OpennlpPipeline` \n  \n    [Apache OpenNLP](https://opennlp.apache.org/) v1.6.0, \n    (Maximum Entropy), \n    *Apache Licence v2.0*\n\n  \n*Natural Language Processing Stages Support*\n\n| `NlpStage`       |\n|------------------|\n| `TOKEN`          |\n| `SENTENCE`       |\n| `POS`            |\n| `NER`            |\n\n*Named Entity Recognition Language Support*\n\n| *`NlpStage.NER`*           | `ENGLISH`  | `SPANISH`  | `GERMAN`  | `FRENCH`  | `CHINESE` |\n|---------------------------:|:----------:|:----------:|:---------:|:---------:|:---------:|\n| `NlpPipeline.Type.CORENLP` |     X      |      X     |      X    |  (w/ EN)  |     X     |\n| `NlpPipeline.Type.OPENNLP` |     X      |      X     |      -    |     X     |     -     |\n| `NlpPipeline.Type.IXAPIPE` |     X      |      X     |      X    |     -     |     -     |\n| `NlpPipeline.Type.MITIE`   |     X      |      X     |      X    |     -     |     -     |\n\n*Named Entity Categories Support*\n\n| `NamedEntity.Category` |\n|----------------------  |\n| `ORGANIZATION`         |\n| `PERSON`               |\n| `LOCATION`             |\n\n*Parts-of-Speech Language Support*\n\n|  *`NlpStage.POS`*          | `ENGLISH`  | `SPANISH`  | 
`GERMAN`  | `FRENCH`  |\n|---------------------------:|:----------:|:----------:|:---------:|:---------:|\n| `NlpPipeline.Type.CORE`    |     X      |      X     |     X     |     X     |\n| `NlpPipeline.Type.OPEN`    |     X      |      X     |     X     |     X     |\n| `NlpPipeline.Type.IXA`     |     X      |      X     |     X     |     X     |\n| `NlpPipeline.Type.MITIE`   |     -      |      -     |      -    |     -     |\n\n\n### **Store and Search Documents and Named Entities**\n\n *Implementations*\n  \n - `org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer`\n \n   [Elasticsearch](https://www.elastic.co/products/elasticsearch) v7.9.1, *Apache Licence v2.0*\n\n\n\n## Compilation / Build\n\nRequires \n[JDK 11](https://www.oracle.com/java/technologies/javase-jdk11-downloads.html),\n[Maven 3](http://maven.apache.org/download.cgi) and a running [PostgreSQL](https://www.postgresql.org/) database (hostname `postgres`) \nwith two databases `datashare` and `test` with write access for user `test` / password `test`. You'll need also a running\nelasticsearch instance with `elasticsearch` as hostname ; and a redis server named `redis` as well.\n\n```\nmvn validate\nmvn -pl commons-test -am install\nmvn -pl datashare-db liquibase:update\nmvn test\n```\n\n## Keeping the development environment up to date\n\nIt is important to keep `datashare` and `datashare-client` up to date by pulling from each repository's master branch. \n\nTo ensure that updates are registered, `make clean dist` must be run locally from each repository. 
\n\nIf dependencies have been updated on `datashare-client`, run `yarn` **before** `make clean dist`.\n\nIf the database models have changed within `datashare`, run the following commands **before** `make clean dist`:\n\n```\nsh datashare-db/scr/reset_datashare_db.sh\nmvn -pl commons-test -am install\nmvn -pl datashare-db liquibase:update\nmvn test\n```\n\n## License\n\nDatashare is released under the [GNU Affero General Public License](https://www.gnu.org/licenses/agpl-3.0.en.html)\n\n\n## Bug report, comment or (pull) request\n\nWe welcome feedback as well as contributions!\n\nFor any bug, question, comment or (pull) request, \n\nplease contact us at datashare@icij.org\n \n \n","funding_links":[],"categories":["Java","Analyse documents"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FICIJ%2Fdatashare","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FICIJ%2Fdatashare","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FICIJ%2Fdatashare/lists"}