https://github.com/ICIJ/extract

A cross-platform command line tool for parallelised content extraction and analysis.

ediscovery etl index solr tika

# Extract

[![Circle CI](https://circleci.com/gh/ICIJ/extract.png?style=shield&circle-token=8eeca3ff612e883bd07464b23551fab215d1129d)](https://circleci.com/gh/ICIJ/extract)

A cross-platform command line tool for parallelized, distributed content extraction. It is built on top of [Apache Tika](https://tika.apache.org/) and was an essential part of the engineering behind the [Panama Papers](https://en.wikipedia.org/wiki/Panama_Papers), [Swiss Leaks](https://en.wikipedia.org/wiki/Swiss_Leaks) and [Luxembourg Leaks](https://en.wikipedia.org/wiki/Luxembourg_Leaks) investigations.

It supports Redis-backed queueing for distributed, parallel extraction and will write to Solr, plain text files or standard output.
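The queue-and-workers pattern described above can be illustrated with a minimal sketch. This is not Extract's own code or command line interface (Extract is a Java tool; see the wiki for real usage). It assumes an in-memory `queue.Queue` standing in for the Redis-backed queue, a plain dictionary standing in for the output sink (Solr, files or standard output), and a placeholder `extract_text` function where a real pipeline would call Apache Tika. All function names here are hypothetical.

```python
import queue
import threading
from pathlib import Path


def extract_text(path: Path) -> str:
    # Placeholder extractor (assumption): a real pipeline would hand the
    # file to Apache Tika here instead of reading it as plain text.
    return path.read_text(encoding="utf-8", errors="replace")


def worker(jobs: "queue.Queue[Path]", sink: dict, lock: threading.Lock) -> None:
    # Each worker pulls paths from the shared queue until it is empty.
    while True:
        try:
            path = jobs.get_nowait()
        except queue.Empty:
            return
        text = extract_text(path)
        # The lock guards the shared sink, which stands in for Solr,
        # a directory of text files, or standard output.
        with lock:
            sink[path.name] = text


def run_extraction(paths, num_workers: int = 4) -> dict:
    # In-memory queue standing in for a Redis-backed queue (assumption).
    jobs: "queue.Queue[Path]" = queue.Queue()
    for p in paths:
        jobs.put(p)

    sink: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, sink, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Calling `run_extraction(list_of_paths)` returns a mapping from file name to extracted text. In a distributed setup, the queue would live in a shared service so that workers on several machines can drain it, which is the role Redis plays in the description above.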

For guidance and instructions, please see the [wiki](https://github.com/ICIJ/extract/wiki).

## Credits and Collaboration

Initially developed by [Matthew Caruana Galizia](https://twitter.com/mcaruanagalizia) at [ICIJ](https://www.icij.org/).

We welcome contributions! Please submit pull requests or contact us directly.

## License

Copyright (c) 2018 International Consortium of Investigative Journalists. See `LICENSE`.