https://github.com/tracebloc/data-ingestors
tracebloc data pipeline for training/test dataset setup
data-ingestion data-pipeline data-preparation data-preprocessing-and-cleaning data-validation tracebloc
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/tracebloc/data-ingestors
- Owner: tracebloc
- License: apache-2.0
- Created: 2024-10-18T05:50:16.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2026-03-15T20:21:48.000Z (2 months ago)
- Last Synced: 2026-03-16T08:07:58.190Z (2 months ago)
- Topics: data-ingestion, data-pipeline, data-preparation, data-preprocessing-and-cleaning, data-validation, tracebloc
- Language: Python
- Homepage:
- Size: 4.86 MB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE
# Data Ingestors
Get your data into the [tracebloc](https://tracebloc.io/) training environment: validated, clean, and ready for model evaluation.
These pipelines handle the full data preparation workflow: validation, preprocessing, and secure transfer into your Kubernetes cluster. A metadata representation syncs to the tracebloc web app so you can manage datasets visually. Your raw data never leaves your infrastructure.
## How it works
```
Your raw data
      │
      ▼
┌─────────────────┐     ┌──────────────────────────────────┐
│  Data ingestor  │────►│  Your Kubernetes cluster         │
│                 │     │                                  │
│  Validates      │     │  Validated dataset               │
│  Preprocesses   │     │  (ready for training)            │
│  Transfers      │     │                                  │
└─────────────────┘     └──────────────┬───────────────────┘
                                       │
                                 Metadata only
                                       │
                                       ▼
                          ┌──────────────────────────┐
                          │  tracebloc web app       │
                          │  (dataset management UI) │
                          └──────────────────────────┘
```
Data stays on your infrastructure. Only metadata (structure, schema, statistics) syncs to the web app for dataset management and vendor guidance.
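To make the "metadata only" idea concrete, here is a minimal sketch of how a structure, schema, and statistics summary could be derived from a tabular dataset without carrying any raw rows. It uses only the Python standard library and is not the `tracebloc-ingestor` API; the function name `summarize_csv` and the shape of the returned dictionary are assumptions made for illustration.

```python
import csv
import io
import statistics


def _as_floats(values):
    """Return values converted to floats, or None if empty or any value is non-numeric."""
    try:
        return [float(v) for v in values] or None
    except (TypeError, ValueError):
        return None


def summarize_csv(text):
    """Summarize CSV text as row count, column types, and per-column statistics.

    Hypothetical helper for illustration only; not part of tracebloc-ingestor.
    The result contains aggregates, never the raw rows themselves.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        numbers = _as_floats(values)
        if numbers is not None:
            summary["columns"][name] = {
                "type": "numeric",
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.mean(numbers),
            }
        else:
            # Non-numeric (or empty) column: report only the distinct-value count.
            summary["columns"][name] = {"type": "text", "distinct": len(set(values))}
    return summary


if __name__ == "__main__":
    sample = "age,city\n30,Berlin\n40,Paris\n50,Berlin\n"
    print(summarize_csv(sample))
```

A summary like this describes the dataset well enough to manage it in a UI while the rows themselves stay where they are.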
## Supported data types
| Type | Examples |
|---|---|
| **Image** | Classification, detection, segmentation datasets |
| **Text / NLP** | Document classification, sentiment, named entities |
| **Tabular** | Structured CSV data, feature tables |
| **Time series** | Sequential measurements, forecasting datasets |
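As a rough illustration of what a validation step might check for the tabular case, the sketch below verifies that a CSV file has a label column, that every row has the same number of fields as the header, and that no label is empty. These checks, the function name `validate_table`, and the `label_column` parameter are assumptions for illustration; they are not the checks `tracebloc-ingestor` actually runs.

```python
import csv
import io


def validate_table(text, label_column):
    """Return a list of validation errors for CSV text; an empty list means valid.

    Hypothetical checks for illustration only; not part of tracebloc-ingestor.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if label_column not in header:
        return [f"missing label column: {label_column!r}"]
    label_index = header.index(label_column)
    errors = []
    for row in reader:
        # reader.line_num is the number of source lines read so far.
        line_number = reader.line_num
        if len(row) != len(header):
            errors.append(
                f"line {line_number}: expected {len(header)} fields, got {len(row)}"
            )
        elif not row[label_index].strip():
            errors.append(f"line {line_number}: empty label")
    return errors


if __name__ == "__main__":
    print(validate_table("f1,label\n1.0,cat\n2.0,dog\n", "label"))
    print(validate_table("f1,label\n1.0,\n2.0\n", "label"))
```

Returning a list of errors rather than raising on the first one lets you fix every problem in a dataset in a single pass.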
## Install
```bash
pip install tracebloc-ingestor
```
## Prerequisites
- Python 3.8+
- A [tracebloc account](https://ai.tracebloc.io/signup) with an active use case
- A running [tracebloc client](https://github.com/tracebloc/client) on your infrastructure
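The two local prerequisites can be checked with a short script. This is a sketch using only the standard library (`sys` and `importlib.metadata`, the latter available from Python 3.8); the function name `check_prerequisites` is hypothetical, and the script cannot verify your tracebloc account or whether a client is running.

```python
import sys
from importlib import metadata


def check_prerequisites(min_python=(3, 8), distribution="tracebloc-ingestor"):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper for illustration only; not part of tracebloc-ingestor.
    """
    problems = []
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {required}+ is required, found {found}")
    try:
        # Looks up the installed distribution by its PyPI name.
        metadata.version(distribution)
    except metadata.PackageNotFoundError:
        problems.append(
            f"{distribution} is not installed; run: pip install {distribution}"
        )
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print(issue)
    if not issues:
        print("Local prerequisites look fine.")
```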
For step-by-step data preparation instructions, see the [Prepare Data guide](https://docs.tracebloc.io/create-use-case/prepare-dataset).
## Links
[Platform](https://ai.tracebloc.io/) · [Docs](https://docs.tracebloc.io/) · [Data preparation guide](https://docs.tracebloc.io/create-use-case/prepare-dataset) · [Discord](https://discord.gg/tracebloc)
## License
Apache 2.0. See [LICENSE](LICENSE).
**Questions?** [support@tracebloc.io](mailto:support@tracebloc.io) or [open an issue](https://github.com/tracebloc/data-ingestors/issues).