https://github.com/ICIJ/extract
A cross-platform command line tool for parallelised content extraction and analysis.
ediscovery etl index solr tika
- Host: GitHub
- URL: https://github.com/ICIJ/extract
- Owner: ICIJ
- License: MIT
- Created: 2015-05-07T16:24:57.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-04-24T15:16:28.000Z (2 months ago)
- Last Synced: 2024-04-24T18:26:09.988Z (2 months ago)
- Topics: ediscovery, etl, index, solr, tika
- Language: Java
- Homepage:
- Size: 69.4 MB
- Stars: 233
- Watchers: 21
- Forks: 30
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- my-awesome-stars - ICIJ/extract - A cross-platform command line tool for parallelised content extraction and analysis. (Java)
README
# Extract
[![Circle CI](https://circleci.com/gh/ICIJ/extract.png?style=shield&circle-token=8eeca3ff612e883bd07464b23551fab215d1129d)](https://circleci.com/gh/ICIJ/extract)
A cross-platform command line tool for parallelized, distributed content extraction. It is built on top of [Apache Tika](https://tika.apache.org/) and was an essential part of the engineering behind the [Panama Papers](https://en.wikipedia.org/wiki/Panama_Papers), [Swiss Leaks](https://en.wikipedia.org/wiki/Swiss_Leaks) and [Luxembourg Leaks](https://en.wikipedia.org/wiki/Luxembourg_Leaks) investigations.
It supports Redis-backed queueing for distributed, parallel extraction and will write to Solr, plain text files or standard output.
For guidance and instructions, please see the [wiki](https://github.com/ICIJ/extract/wiki).
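The queue-backed, parallel extraction pattern described above can be sketched roughly as follows. This is an illustration only, not Extract's code or API: the class `QueueExtractionSketch`, the method `run` and the placeholder `extractText` are hypothetical names, an in-memory `LinkedBlockingQueue` stands in for the Redis-backed queue, and a stub stands in for the Apache Tika parsing step. Results go to standard output, one of the output targets mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: none of these names come from Extract itself.
class QueueExtractionSketch {

    // Hypothetical placeholder for a real parser call (Extract uses Apache Tika).
    static String extractText(String path) {
        return "text of " + path;
    }

    // Drains a shared queue of paths with `workers` threads and returns one
    // "path<TAB>text" line per document, sorted for deterministic output.
    // The in-memory queue is an assumption standing in for a Redis-backed one.
    static List<String> run(List<String> paths, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(paths);
        List<String> output = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String path;
                // poll() returns null once the queue is empty, which ends the worker.
                while ((path = queue.poll()) != null) {
                    output.add(path + "\t" + extractText(path));
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        List<String> sorted = new ArrayList<>(output);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) throws InterruptedException {
        // Write each extracted document to standard output.
        for (String line : run(List.of("a.pdf", "b.docx", "c.txt"), 2)) {
            System.out.println(line);
        }
    }
}
```

Because every worker pulls from the same queue, adding workers (or, with a shared external queue, adding machines) spreads the documents across them without any explicit partitioning step. The actual commands and options are documented in the wiki.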
## Credits and Collaboration
Initially developed by [Matthew Caruana Galizia](https://twitter.com/mcaruanagalizia) at [ICIJ](https://www.icij.org/).
We welcome contributions! Please submit pull requests or contact us directly.
## License
Copyright (c) 2018 International Consortium of Investigative Journalists. See `LICENSE`.