An open API service indexing awesome lists of open source software.

https://github.com/alephdata/ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
https://github.com/alephdata/ingest-file

document-extraction documents email-forensics excel forensics forensics-investigations metadata-extraction ocr

Last synced: 23 days ago
JSON representation

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

Awesome Lists containing this project

README

        

# ingestors

``ingestors`` extract useful information from documents of different types in
a structured standard format. It retains folder structures across directories,
compressed archives and emails. The extracted data is formatted as Follow the
Money (FtM) entities, ready for import into Aleph, or processing as an object
graph.

Supported file types:

* Plain text
* Images
* Web pages, XML documents
* PDF files
* Emails (Outlook, plain text)
* Archive files (ZIP, Rar, etc.)

Other features:

* Extendable and composable using classes and mixins.
* Generates FollowTheMoney objects to a database as result objects.
* Lightweight worker-style support for logging, failures and callbacks.
* Throughly tested.

## Development environment

For local development with a virtualenv:

```bash
python3 -mvenv .env
source .env/bin/activate
pip install -r requirements.txt
```

## Release procedure

```bash
git pull --rebase
make build
make test
source .env/bin/activate
bump2version {patch,minor,major} # pick the appropriate one
git push --atomic origin $(git branch --show-current) $(git describe --tags --abbrev=0)
```

## Usage

Ingestors are usually called in the context of Aleph. In order to run them
stand-alone, you can use the supplied docker compose environment. To enter
a working container, run:

```bash
make build
make shell
```

Inside the shell, you will find the `ingestors` command-line tool. During
development, it is convenient to call its debug mode using files present
in the user's home directory, which is mounted at `/host`:

```bash
ingestors debug /host/Documents/sample.xlsx
```

## License

As of release version 3.18.4 `ingest-file` is licensed under the AGPLv3 or later license. Previous versions were released under the MIT license.