{"id":26279725,"url":"https://github.com/alephdata/ingest-file","last_synced_at":"2025-05-07T03:04:18.709Z","repository":{"id":36954601,"uuid":"84333910","full_name":"alephdata/ingest-file","owner":"alephdata","description":"Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.","archived":false,"fork":false,"pushed_at":"2025-05-02T11:56:11.000Z","size":70349,"stargazers_count":62,"open_issues_count":27,"forks_count":31,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-07T03:03:45.335Z","etag":null,"topics":["document-extraction","documents","email-forensics","excel","forensics","forensics-investigations","metadata-extraction","ocr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alephdata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-08T15:12:06.000Z","updated_at":"2025-05-01T20:51:41.000Z","dependencies_parsed_at":"2023-09-28T10:59:49.232Z","dependency_job_id":"5ead24c0-eb6c-4fa4-8a24-ad0c03438140","html_url":"https://github.com/alephdata/ingest-file","commit_stats":null,"previous_names":[],"tags_count":203,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Fingest-file","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Fingest-file/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Fingest-file/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Fingest-file/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alephdata","download_url":"https://codeload.github.com/alephdata/ingest-file/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252804206,"owners_count":21806769,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-extraction","documents","email-forensics","excel","forensics","forensics-investigations","metadata-extraction","ocr"],"created_at":"2025-03-14T14:16:08.230Z","updated_at":"2025-05-07T03:04:18.681Z","avatar_url":"https://github.com/alephdata.png","language":"Python","readme":"# ingestors\n\n``ingestors`` extract useful information from documents of different types in\na structured standard format. It retains folder structures across directories,\ncompressed archives and emails. 
The extracted data is formatted as Follow the\nMoney (FtM) entities, ready for import into Aleph, or for processing as an object\ngraph.\n\nSupported file types:\n\n* Plain text\n* Images\n* Web pages, XML documents\n* PDF files\n* Emails (Outlook, plain text)\n* Archive files (ZIP, RAR, etc.)\n\nOther features:\n\n* Extendable and composable using classes and mixins.\n* Generates FollowTheMoney objects to a database as result objects.\n* Lightweight worker-style support for logging, failures and callbacks.\n* Thoroughly tested.\n\n## Development environment\n\nFor local development with a virtualenv:\n\n```bash\npython3 -m venv .env\nsource .env/bin/activate\npip install -r requirements.txt\n```\n\n## Release procedure\n\n```bash\ngit pull --rebase\nmake build\nmake test\nsource .env/bin/activate\nbump2version {patch,minor,major} # pick the appropriate one\ngit push --atomic origin $(git branch --show-current) $(git describe --tags --abbrev=0)\n```\n\n## Usage\n\nIngestors are usually called in the context of Aleph. In order to run them\nstand-alone, you can use the supplied docker compose environment. To enter\na working container, run:\n\n```bash\nmake build\nmake shell\n```\n\nInside the shell, you will find the `ingestors` command-line tool. During\ndevelopment, it is convenient to call its debug mode using files present\nin the user's home directory, which is mounted at `/host`:\n\n```bash\ningestors debug /host/Documents/sample.xlsx\n```\n\n## License\n\nAs of release version 3.18.4, `ingest-file` is licensed under the AGPLv3 or later license. Previous versions were released under the MIT license.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falephdata%2Fingest-file","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falephdata%2Fingest-file","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falephdata%2Fingest-file/lists"}