https://github.com/alephdata/ingest-file
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
- Host: GitHub
- URL: https://github.com/alephdata/ingest-file
- Owner: alephdata
- License: agpl-3.0
- Created: 2017-03-08T15:12:06.000Z (about 8 years ago)
- Default Branch: main
- Last Pushed: 2025-05-02T11:56:11.000Z (27 days ago)
- Last Synced: 2025-05-07T03:03:45.335Z (23 days ago)
- Topics: document-extraction, documents, email-forensics, excel, forensics, forensics-investigations, metadata-extraction, ocr
- Language: Python
- Homepage:
- Size: 67.1 MB
- Stars: 62
- Watchers: 4
- Forks: 31
- Open Issues: 27
Metadata Files:
- Readme: README.md
- License: LICENSE
# ingestors
``ingestors`` extracts useful information from documents of different types into
a structured, standard format. It retains folder structures across directories,
compressed archives and emails. The extracted data is formatted as Follow the
Money (FtM) entities, ready for import into Aleph or for processing as an object
graph.

Supported file types:
* Plain text
* Images
* Web pages, XML documents
* PDF files
* Emails (Outlook, plain text)
* Archive files (ZIP, RAR, etc.)

Other features:
* Extendable and composable using classes and mixins.
* Writes the generated FollowTheMoney objects to a database as result objects.
* Lightweight worker-style support for logging, failures and callbacks.
* Thoroughly tested.

## Development environment
For local development with a virtualenv:
```bash
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
```

## Release procedure
```bash
git pull --rebase
make build
make test
source .env/bin/activate
bump2version {patch,minor,major} # pick the appropriate one
git push --atomic origin $(git branch --show-current) $(git describe --tags --abbrev=0)
```

## Usage
Ingestors are usually called in the context of Aleph. In order to run them
stand-alone, you can use the supplied docker compose environment. To enter
a working container, run:

```bash
make build
make shell
```

Inside the shell, you will find the `ingestors` command-line tool. During
development, it is convenient to call its debug mode using files present
in the user's home directory, which is mounted at `/host`:

```bash
ingestors debug /host/Documents/sample.xlsx
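
# A possible variation (a sketch, not part of the project's documented usage):
# run debug mode once per file. The glob below is an assumption about where
# your sample files live under /host; adjust it to your own layout.
for path in /host/Documents/*.xlsx; do
    [ -e "$path" ] || continue   # skip the literal pattern when nothing matches
    ingestors debug "$path"
done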
```

## License
As of release version 3.18.4, `ingest-file` is licensed under the AGPLv3 (or later). Previous versions were released under the MIT license.