Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sorcero/ingestum
Read-only mirror of https://gitlab.com/sorcero/community/ingestum
ingestion monitoring pdf processing python recognition transformers
- Host: GitHub
- URL: https://github.com/sorcero/ingestum
- Owner: sorcero
- License: lgpl-3.0
- Created: 2021-07-24T00:58:55.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-01-23T20:42:15.000Z (almost 2 years ago)
- Last Synced: 2024-05-30T01:18:59.307Z (6 months ago)
- Topics: ingestion, monitoring, pdf, processing, python, recognition, transformers
- Language: Python
- Homepage: https://gitlab.com/sorcero/community/ingestum
- Size: 2.54 MB
- Stars: 7
- Watchers: 5
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# ingestum
[![Ingestum](docs/ingestum.png)](https://gitlab.com/sorcero/community/ingestum)
This library transforms common content formats into
documents that can be used by other pipelines and processes,
e.g. document comparison, search, or tagging. For example, it can take an
HTML file, remove all of its HTML tags, and extract only the
human-visible text. The resulting document is indistinguishable from
any other regular text document. This transformation process is called
`ingestion`.

To achieve this, the library relies on four main concepts:
1. [Sources](ingestum/sources/base.py), which refer to the common content formats that can be fed into the ingestion process, e.g. PDF, HTML, PNG, WAV, or feeds such as Twitter, ProQuest, or email.
2. [Documents](ingestum/documents/base.py), which refer to the final and intermediate states of an input _source_ during the ingestion process. A document can be transformed into other types of documents, as many times as needed, until it is ready for further processing.
3. [Transformers](ingestum/transformers/base.py), each of which is a single transformation function that can be applied to a given content type, e.g. removing all hyphens from a text document, or removing all occurrences of a particular tag from an HTML document.
4. [Conditionals](ingestum/conditionals/base.py), which are logical conditions that can be used to modify the behavior of a transformer.

## Installation
Follow the [Installation Guide](https://sorcero.gitlab.io/community/ingestum/installation.html) for instructions.
## Documentation
Follow the compiled [Documentation](https://sorcero.gitlab.io/community/ingestum/) for introduction, guides, examples, and references.
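As a rough illustration of how the four concepts described in this README (sources, documents, transformers, and conditionals) relate to one another, here is a minimal sketch that uses only the Python standard library. It does not use the ingestum API: every name in it (`HTMLSource`, `TextDocument`, `html_to_text`, `remove_hyphens`, `contains`, `apply_if`) is hypothetical and chosen for illustration only. Refer to the compiled documentation for the library's actual classes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable


@dataclass
class HTMLSource:
    """Hypothetical source: raw HTML content entering the ingestion process."""
    content: str


@dataclass
class TextDocument:
    """Hypothetical document: plain text, an intermediate or final state."""
    content: str


class _VisibleText(HTMLParser):
    """Collects text nodes while skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(source: HTMLSource) -> TextDocument:
    """Transformer: drop HTML tags and keep only the human-visible text."""
    parser = _VisibleText()
    parser.feed(source.content)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return TextDocument(" ".join("".join(parser.parts).split()))


def remove_hyphens(document: TextDocument) -> TextDocument:
    """Transformer: remove all hyphens from a text document."""
    return TextDocument(document.content.replace("-", ""))


def contains(needle: str) -> Callable[[TextDocument], bool]:
    """Conditional: true when the document contains the given substring."""
    return lambda document: needle in document.content


def apply_if(condition, transformer):
    """Wrap a transformer so that it only runs when the condition holds."""
    def wrapped(document: TextDocument) -> TextDocument:
        return transformer(document) if condition(document) else document
    return wrapped


# Source -> document -> (conditionally) transformed document.
source = HTMLSource("<h1>Title</h1>\n<p>A well-known text.</p><script>var x = 1;</script>")
document = html_to_text(source)
document = apply_if(contains("-"), remove_hyphens)(document)
print(document.content)
```

In this sketch each transformer returns a new document instead of modifying its input, so intermediate states can be inspected or reused. That is a design choice of the example and not a statement about how ingestum itself behaves.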
## Disclaimer
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the [GNU Lesser General Public License](LICENSE) for more details.