https://github.com/tracebloc/data-ingestors

tracebloc data pipeline for training/test dataset setup

[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE) [![PyPI](https://img.shields.io/pypi/v/tracebloc-ingestor.svg)](https://pypi.org/project/tracebloc-ingestor/) [![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org) [![Platform](https://img.shields.io/badge/platform-tracebloc-00C9A7.svg)](https://ai.tracebloc.io)

# Data Ingestors 📊

Get your data into the [tracebloc](https://tracebloc.io/) training environment: validated, clean, and ready for model evaluation.

These pipelines handle the full data preparation workflow: validation, preprocessing, and secure transfer into your Kubernetes cluster. A metadata representation syncs to the tracebloc web app so you can manage datasets visually. Your raw data never leaves your infrastructure.

## How it works

```
   Your raw data
         │
         ▼
┌─────────────────┐     ┌──────────────────────────────────┐
│  Data ingestor  │────►│  Your Kubernetes cluster         │
│                 │     │                                  │
│  Validates      │     │  Validated dataset               │
│  Preprocesses   │     │  (ready for training)            │
│  Transfers      │     │                                  │
└─────────────────┘     └──────────────┬───────────────────┘
                                       │
                                 Metadata only
                                       │
                                       ▼
                          ┌──────────────────────────┐
                          │  tracebloc web app       │
                          │  (dataset management UI) │
                          └──────────────────────────┘
```

Data stays on your infrastructure. Only metadata (structure, schema, statistics) syncs to the web app for dataset management and vendor guidance.
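To make the "metadata only" idea concrete, here is a minimal sketch of what a schema-and-statistics summary of a CSV file could look like. It is an illustration under stated assumptions, not the format tracebloc actually syncs: the function name `summarize_csv` and the shape of the returned dictionary are hypothetical, and only the Python standard library is used.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Return schema and per-column statistics, never the raw rows.

    Hypothetical helper for illustration; not part of tracebloc-ingestor.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {"num_rows": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        # Ignore empty cells so they do not break numeric parsing.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        if not values:
            summary["columns"][name] = {"type": "empty"}
            continue
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # Non-numeric column: report only the type and distinct count.
            summary["columns"][name] = {
                "type": "string",
                "distinct": len(set(values)),
            }
            continue
        summary["columns"][name] = {
            "type": "number",
            "min": min(numbers),
            "max": max(numbers),
            "mean": statistics.mean(numbers),
        }
    return summary


sample = "age,city\n31,Berlin\n45,Munich\n28,Berlin\n"
print(summarize_csv(sample))
```

The summary holds counts, ranges, and types but no individual cell values, which is the property the diagram above describes.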

## Supported data types

| Type | Examples |
|---|---|
| **Image** | Classification, detection, segmentation datasets |
| **Text / NLP** | Document classification, sentiment, named entities |
| **Tabular** | Structured CSV data, feature tables |
| **Time series** | Sequential measurements, forecasting datasets |
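For tabular data, a quick local pre-check can catch obvious problems before you run an ingestor. The sketch below is only an illustration of the kind of validation involved; `validate_tabular` and its rules (required columns present, consistent cell counts, no empty required values) are assumptions, not the checks tracebloc-ingestor performs.

```python
import csv
import io


def validate_tabular(text, required_columns):
    """Return a list of human-readable problems; empty means the check passed.

    Hypothetical helper for illustration; not part of tracebloc-ingestor.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [
        "missing column: {}".format(c) for c in required_columns if c not in header
    ]
    for row in reader:
        line_no = reader.line_num  # physical line of the row just read
        # DictReader stores surplus cells under the None key and fills
        # absent cells with None, so both cases are detectable.
        if None in row or any(v is None for v in row.values()):
            problems.append("line {}: wrong number of cells".format(line_no))
            continue
        for col in required_columns:
            if col in row and row[col].strip() == "":
                problems.append("line {}: empty value in '{}'".format(line_no, col))
    return problems


print(validate_tabular("id,label\n1,cat\n2,\n", ["id", "label"]))
```

Returning a list of problems rather than raising on the first one lets you fix a file in a single pass.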

## Install

```bash
pip install tracebloc-ingestor
```

## Prerequisites

- Python 3.8+
- A [tracebloc account](https://ai.tracebloc.io/signup) with an active use case
- A running [tracebloc client](https://github.com/tracebloc/client) on your infrastructure
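
The first prerequisite and the install step can be checked from Python itself. This is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and it cannot verify your tracebloc account or whether the client is running.

```python
import sys


def check_environment(package="tracebloc-ingestor", min_python=(3, 8)):
    """Report whether the interpreter and the installed package look usable."""
    python_ok = sys.version_info >= min_python
    version = None
    if python_ok:
        # importlib.metadata is in the standard library from Python 3.8.
        from importlib import metadata

        try:
            version = metadata.version(package)
        except metadata.PackageNotFoundError:
            version = None  # package is not installed in this environment
    return {"python_ok": python_ok, "package_version": version}


print(check_environment())
```

A `package_version` of `None` means `pip install tracebloc-ingestor` has not been run in the active environment.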

For step-by-step data preparation instructions → [Prepare Data guide](https://docs.tracebloc.io/create-use-case/prepare-dataset)

## Links

[Platform](https://ai.tracebloc.io/) · [Docs](https://docs.tracebloc.io/) · [Data preparation guide](https://docs.tracebloc.io/create-use-case/prepare-dataset) · [Discord](https://discord.gg/tracebloc)

## License

Apache 2.0. See [LICENSE](LICENSE).

**Questions?** [support@tracebloc.io](mailto:support@tracebloc.io) or [open an issue](https://github.com/tracebloc/data-ingestors/issues).