{"id":34080187,"url":"https://github.com/tracebloc/data-ingestors","last_synced_at":"2026-06-10T08:01:11.360Z","repository":{"id":272907970,"uuid":"874596122","full_name":"tracebloc/data-ingestors","owner":"tracebloc","description":"tracebloc data pipeline for training/test dataset setup","archived":false,"fork":false,"pushed_at":"2026-06-09T10:48:40.000Z","size":6403,"stargazers_count":8,"open_issues_count":18,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-06-09T11:23:01.734Z","etag":null,"topics":["data-ingestion","data-pipeline","data-preparation","data-preprocessing-and-cleaning","data-validation","tracebloc"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tracebloc.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-18T05:50:16.000Z","updated_at":"2026-06-09T07:46:38.000Z","dependencies_parsed_at":"2025-04-02T10:25:41.844Z","dependency_job_id":"a59874b5-f816-41af-8242-6c54a69ccc12","html_url":"https://github.com/tracebloc/data-ingestors","commit_stats":null,"previous_names":["tracebloc/data-ingestors"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/tracebloc/data-ingestors","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tracebloc%2Fdata-ingestors","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tracebloc%2Fdata-ingestors/tags","releases_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tracebloc%2Fdata-ingestors/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tracebloc%2Fdata-ingestors/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tracebloc","download_url":"https://codeload.github.com/tracebloc/data-ingestors/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tracebloc%2Fdata-ingestors/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34142643,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-ingestion","data-pipeline","data-preparation","data-preprocessing-and-cleaning","data-validation","tracebloc"],"created_at":"2025-12-14T11:47:43.449Z","updated_at":"2026-06-10T08:01:11.319Z","avatar_url":"https://github.com/tracebloc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE) [![PyPI](https://img.shields.io/pypi/v/tracebloc-ingestor.svg)](https://pypi.org/project/tracebloc-ingestor/) [![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org) 
[![Platform](https://img.shields.io/badge/platform-tracebloc-00C9A7.svg)](https://ai.tracebloc.io)\n\n# Data Ingestors 📊\n\nMove your data into the [tracebloc](https://tracebloc.io/) training environment — validated, clean, and ready for model evaluation. **Your raw data never leaves your infrastructure.**\n\n## How it works\n\n```\nYour raw data\n     │\n     ▼\n┌──────────────────┐     ┌──────────────────────────────────┐\n│  Data ingestor   │────►│  Your Kubernetes cluster         │\n│                  │     │                                  │\n│  Validates       │     │  Validated dataset               │\n│  Preprocesses    │     │  (ready for training)            │\n│  Transfers       │     │                                  │\n└──────────────────┘     └──────────────┬───────────────────┘\n                                        │\n                               Metadata only\n                                        │\n                                        ▼\n                         ┌──────────────────────────┐\n                         │  tracebloc web app       │\n                         │  (dataset management UI) │\n                         └──────────────────────────┘\n```\n\nOnly metadata (schema, statistics, structure) syncs to the web app. 
Raw data stays put.\n\n## Supported data types\n\n| Type | Categories |\n|---|---|\n| **Image** | [`image_classification`](templates/image_classification), [`object_detection`](templates/object_detection), [`keypoint_detection`](templates/keypoint_detection), [`semantic_segmentation`](templates/semantic_segmentation) |\n| **Text / NLP** | [`text_classification`](templates/text_classification), [`masked_language_modeling`](templates/masked_language_modeling) |\n| **Tabular** | [`tabular_classification`](templates/tabular_classification), [`tabular_regression`](templates/tabular_regression) |\n| **Time series** | [`time_series_forecasting`](templates/time_series_forecasting), [`time_to_event_prediction`](templates/time_to_event_prediction) |\n\nEach template ships a sample dataset and an [example `ingest.yaml`](examples/yaml/) you can copy as a starting point.\n\n## Quickstart — declarative YAML (recommended)\n\nDescribe your dataset in ~8 lines of YAML, then `helm install`. The official ingestor image (this package, signed + SBOM-attested, published as `ghcr.io/tracebloc/ingestor`) runs it. No Dockerfile, no Python script.\n\n**1. One-time: add the chart repo on your workstation.**\n\n```bash\nhelm repo add tracebloc https://tracebloc.github.io/client\nhelm repo update\n```\n\nThe `tracebloc/client` parent chart bootstraps the cluster (jobs-manager, MySQL, RBAC). The `tracebloc/ingestor` subchart submits per-dataset ingestion runs against it.\n\n\u003e **Already installed the client via the one-liner (`bash \u003c(curl -fsSL https://tracebloc.io/i.sh)`)?** Use `--reset-then-reuse-values` so the helm upgrade doesn't drop the values the installer applied:\n\u003e\n\u003e ```bash\n\u003e helm upgrade \u003cworkspace\u003e tracebloc/client -n \u003cnamespace\u003e --reset-then-reuse-values\n\u003e ```\n\u003e\n\u003e Append `--version \u003cversion-number\u003e` to pin a specific chart version.\n\n**2. 
Stage your data on the cluster's shared PVC.**\n\nThe chart **doesn't transport data into the cluster** — it points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Before installing, get your raw files there. The simplest pattern for a small dataset is a throwaway `kubectl cp` Pod that mounts the PVC; for production you'd typically use an init container with cloud-storage sync. Full staging recipe + manifests → [`tracebloc/client/ingestor/README.md#stage-your-data-on-the-shared-pvc`](https://github.com/tracebloc/client/blob/develop/ingestor/README.md#stage-your-data-on-the-shared-pvc).\n\n**3. Write your `ingest.yaml`.**\n\nThe example below is for `image_classification`. **Other categories require different fields** — e.g. `tabular_classification` has no `images:` and instead needs a typed `schema:` block. Don't copy this one blindly; grab the matching file from [`examples/yaml/`](examples/yaml/) (one per category) and edit from there. Per-category sample data and READMEs live under [`templates/`](https://github.com/tracebloc/data-ingestors/tree/master/templates).\n\n```yaml\napiVersion: tracebloc.io/v1\nkind: IngestConfig\ncategory: image_classification\ntable: cats_dogs_train\nintent: train\ncsv: /data/shared/cats-dogs/labels.csv\nimages: /data/shared/cats-dogs/images/\nlabel: label\n```\n\nThe top-level shape (`apiVersion`, `kind`, `category`, `table`, `intent`, `label`) is the same for every category; the `category` field picks the validator set, file-extension defaults, and column conventions, and the data-source fields (`csv:`, `images:`, `schema:`, …) vary per category. The paths are *paths inside the ingestor Pod*, which is the PVC mount you populated in step 2.\n\n**4. 
Install once per dataset.**\n\n```bash\nhelm install my-cats-dogs tracebloc/ingestor \\\n  --namespace tracebloc \\\n  --set-file ingestConfig=./ingest.yaml\n```\n\nThe ingestor runs once: validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset. Customers never build an image, never write a Dockerfile, never track digest versions — the cluster's auto-upgrade flow keeps the official image current.\n\nFull chart docs (data-staging recipe, schema, every category, update model, verification, override knobs) → **[`tracebloc/client/ingestor/README.md`](https://github.com/tracebloc/client/blob/develop/ingestor/README.md)**.\n\n## Advanced: custom processors (legacy Python pattern)\n\nUse this when the declarative schema can't express what your data needs — typically when you have non-trivial preprocessing logic, a custom validator, or a `BaseProcessor` subclass.\n\n**1. Install the package.**\n\n```bash\npip install tracebloc-ingestor\n```\n\n**2. Pick a template + adapt the script.**\n\n```bash\ncp templates/image_classification/image_classification.py .\n```\n\nThe package exports `BaseIngestor`, `CSVIngestor`, `JSONIngestor`, plus validators (`FileTypeValidator`, `ImageResolutionValidator`, `TableNameValidator`, etc.) and the `Database` / `APIClient` helpers. See [`examples/`](examples) for working scripts.\n\n**3. 
Build + deploy as a Kubernetes Job.**\n\nThe legacy [`Dockerfile`](Dockerfile) and [`ingestor-job.yaml`](ingestor-job.yaml) remain the canonical pattern for custom-processor flows:\n\n```bash\ndocker build -t \u003cyour-registry\u003e/\u003cimage-name\u003e:latest .\ndocker push \u003cyour-registry\u003e/\u003cimage-name\u003e:latest\nkubectl apply -f ingestor-job.yaml\n```\n\nThe Job needs these environment variables (set in [`ingestor-job.yaml`](ingestor-job.yaml)):\n\n| Variable | What it is |\n|---|---|\n| `CLIENT_ID`, `CLIENT_PASSWORD` | Tracebloc client credentials |\n| `CLIENT_PVC` | PVC name shared with the client (must match `values.yaml`) |\n| `MYSQL_HOST` | Hostname of the client's MySQL service |\n| `SRC_PATH` | Where your raw data is mounted in the ingestor pod |\n| `LABEL_FILE` | Path to labels (e.g. `Xy_train.csv`) |\n| `TABLE_NAME` | Destination table name in the client database |\n| `TITLE` | *(optional)* Human-readable dataset name |\n| `LOG_LEVEL` | *(optional)* `INFO`, `WARNING`, `ERROR` |\n\n### Running custom-processor flows under Pod Security Standards (`restricted`)\n\nIf the namespace you're deploying into enforces the [`restricted`](https://kubernetes.io/docs/concepts/security/pod-security-standards/) Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), Pods created from the stock [`Dockerfile`](Dockerfile) and [`ingestor-job.yaml`](ingestor-job.yaml) are rejected at admission. (The declarative path's image is already PSA-restricted-compatible; this section only applies to custom Dockerfiles built from this repo.) Two changes are needed.\n\nCheck first:\n\n```bash\nkubectl get ns \u003cnamespace\u003e -o jsonpath='{.metadata.labels}' | jq\n```\n\nLook for `pod-security.kubernetes.io/enforce: restricted`. If the label is absent, the stock files work as-is and you can skip this section.\n\n**1. 
`Dockerfile` — drop root.** Append before `ENTRYPOINT`:\n\n```dockerfile\n# OpenShift-compatible: grant group write via GID 0\nRUN chgrp -R 0 /app \u0026\u0026 chmod -R g=u /app\nUSER 1001\n```\n\n**2. `ingestor-job.yaml` — add a hardened `securityContext`.** Both pod-level and container-level:\n\n```yaml\nspec:\n  template:\n    spec:\n      securityContext:                    # pod-level\n        runAsNonRoot: true\n        runAsUser: 1001\n        seccompProfile:\n          type: RuntimeDefault\n      containers:\n      - name: api\n        # ... existing container spec ...\n        securityContext:                  # container-level\n          allowPrivilegeEscalation: false\n          capabilities:\n            drop: [\"ALL\"]\n```\n\n### Subclassing BaseIngestor\n\nFor data that doesn't fit any of the existing templates, subclass `BaseIngestor`:\n\n```python\nfrom tracebloc_ingestor import BaseIngestor, FileTypeValidator\n\nclass MyIngestor(BaseIngestor):\n    validators = [FileTypeValidator(allowed=[\".parquet\"])]\n\n    def transform(self, record):\n        # your preprocessing\n        return record\n\nif __name__ == \"__main__\":\n    MyIngestor().ingest()\n```\n\n## Prerequisites\n\n- Python 3.8+\n- A [tracebloc account](https://ai.tracebloc.io/signup)\n- A running [tracebloc client](https://github.com/tracebloc/client) on your infrastructure\n\n## Links\n\n[Platform](https://ai.tracebloc.io/) · [Docs](https://docs.tracebloc.io/) · [Data preparation guide](https://docs.tracebloc.io/create-use-case/prepare-dataset) · [Discord](https://discord.gg/tracebloc)\n\nMaintainers: see [RELEASING.md](RELEASING.md) for the release procedure.\n\n## License\n\nApache 2.0 — see [LICENSE](LICENSE).\n\n**Questions?** [support@tracebloc.io](mailto:support@tracebloc.io) or [open an 
issue](https://github.com/tracebloc/data-ingestors/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftracebloc%2Fdata-ingestors","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftracebloc%2Fdata-ingestors","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftracebloc%2Fdata-ingestors/lists"}