Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ICIJ/extract
A cross-platform command line tool for parallelised content extraction and analysis.
- Host: GitHub
- URL: https://github.com/ICIJ/extract
- Owner: ICIJ
- License: MIT
- Created: 2015-05-07T16:24:57.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2024-09-11T20:11:44.000Z (2 months ago)
- Last Synced: 2024-09-27T10:50:00.976Z (about 2 months ago)
- Topics: ediscovery, etl, index, solr, tika
- Language: Java
- Homepage:
- Size: 69.4 MB
- Stars: 238
- Watchers: 21
- Forks: 32
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Extract
[![Circle CI](https://circleci.com/gh/ICIJ/extract.png?style=shield&circle-token=8eeca3ff612e883bd07464b23551fab215d1129d)](https://circleci.com/gh/ICIJ/extract)
A cross-platform command line tool for parallelized, distributed content extraction. It is built on top of [Apache Tika](https://tika.apache.org/) and was an essential part of the engineering behind the [Panama Papers](https://en.wikipedia.org/wiki/Panama_Papers), [Swiss Leaks](https://en.wikipedia.org/wiki/Swiss_Leaks) and [Luxembourg Leaks](https://en.wikipedia.org/wiki/Luxembourg_Leaks) investigations.
It supports Redis-backed queueing for distributed, parallel extraction and can write its output to Solr, plain text files or standard output.
For guidance and instructions, please see the [wiki](https://github.com/ICIJ/extract/wiki).
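To make the queue-driven, parallel extraction idea more concrete, here is a minimal sketch of the general pattern in plain Java. It is not Extract's own code or API: the class `QueueExtractionSketch` and the methods `extractText` and `extractAll` are hypothetical names chosen for illustration, an in-memory `BlockingQueue` stands in for the Redis-backed queue, and reading a file as UTF-8 text stands in for a real parser such as Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueExtractionSketch {

    // Hypothetical stand-in for a real parser: reads the whole file as UTF-8 text.
    static String extractText(Path file) throws IOException {
        return Files.readString(file, StandardCharsets.UTF_8);
    }

    // Drains the queue with `workers` threads and returns a map of path -> extracted text.
    static Map<Path, String> extractAll(BlockingQueue<Path> queue, int workers)
            throws InterruptedException {
        Map<Path, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                Path next;
                // poll() returns null once the queue is empty, which ends this worker.
                while ((next = queue.poll()) != null) {
                    try {
                        results.put(next, extractText(next));
                    } catch (IOException e) {
                        System.err.println("Failed to extract " + next + ": " + e.getMessage());
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Create two small sample files so the sketch runs without any external input.
        Path dir = Files.createTempDirectory("extract-sketch");
        Path a = Files.writeString(dir.resolve("a.txt"), "first document");
        Path b = Files.writeString(dir.resolve("b.txt"), "second document");

        BlockingQueue<Path> queue = new LinkedBlockingQueue<>(List.of(a, b));
        Map<Path, String> results = extractAll(queue, 2);

        // Standard output is one of the output targets mentioned above.
        results.forEach((path, text) -> System.out.println(path.getFileName() + ": " + text));
    }
}
```

In this sketch each worker pulls the next path from a shared queue, so work is balanced without any coordination beyond the queue itself. With a shared external queue the same loop could in principle run on several machines, which is the role the Redis-backed queue is described as playing.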
## Credits and Collaboration
Initially developed by [Matthew Caruana Galizia](https://twitter.com/mcaruanagalizia) at [ICIJ](https://www.icij.org/).
We welcome contributions! Please submit pull requests or contact us directly.
## License
Copyright (c) 2018 International Consortium of Investigative Journalists. See `LICENSE`.