{"id":34565733,"url":"https://github.com/datachain-ai/datachain","last_synced_at":"2026-04-30T00:07:38.785Z","repository":{"id":249816056,"uuid":"820144741","full_name":"datachain-ai/datachain","owner":"datachain-ai","description":"Data Memory: the operational data context layer for AI agents - typed, versioned datasets over images, video, docs and tables","archived":false,"fork":false,"pushed_at":"2026-04-28T18:24:05.000Z","size":16516,"stargazers_count":2737,"open_issues_count":76,"forks_count":139,"subscribers_count":17,"default_branch":"main","last_synced_at":"2026-04-28T19:32:32.194Z","etag":null,"topics":["ai-agents","claude-code","codex","data-context-layer","data-memory","data-processing","harness-engineering","knowledge-base","mlops","multimodal","pydantic","unstructured-data"],"latest_commit_sha":null,"homepage":"https://docs.datachain.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datachain-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.rst","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-06-25T22:29:35.000Z","updated_at":"2026-04-28T18:24:08.000Z","dependencies_parsed_at":"2026-02-15T05:03:03.294Z","dependency_job_id":null,"html_url":"https://github.com/datachain-ai/datachain","commit_stats":null,"previous_names":["iterative/datachain","datachain-ai/datachain"],"tags_count":252,"template":false,"template_full_name":null,"purl":"pkg:github/datachain-ai/datachain","repository_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/datachain-ai%2Fdatachain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datachain-ai%2Fdatachain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datachain-ai%2Fdatachain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datachain-ai%2Fdatachain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datachain-ai","download_url":"https://codeload.github.com/datachain-ai/datachain/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datachain-ai%2Fdatachain/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32448892,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T22:27:22.272Z","status":"ssl_error","status_checked_at":"2026-04-29T22:10:49.234Z","response_time":110,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","claude-code","codex","data-context-layer","data-memory","data-processing","harness-engineering","knowledge-base","mlops","multimodal","pydantic","unstructured-data"],"created_at":"2025-12-24T09:04:04.277Z","updated_at":"2026-04-30T00:07:38.779Z","avatar_url":"https://github.com/datachain-ai.png","language":"Python","funding_links":[],"categories":["NLP","Python","ai-agents","Memory 
Systems"],"sub_categories":["Usage"],"readme":"# ![DataChain](docs/assets/datachain.svg) DataChain - Data Memory for AI Agents\n\n[![PyPI](https://img.shields.io/pypi/v/datachain.svg)](https://pypi.org/project/datachain/)\n[![Python Version](https://img.shields.io/pypi/pyversions/datachain)](https://pypi.org/project/datachain)\n[![Codecov](https://codecov.io/gh/datachain-ai/datachain/graph/badge.svg?token=byliXGGyGB)](https://codecov.io/gh/datachain-ai/datachain)\n[![Tests](https://github.com/datachain-ai/datachain/actions/workflows/tests.yml/badge.svg)](https://github.com/datachain-ai/datachain/actions/workflows/tests.yml)\n[![DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/datachain-ai/datachain)\n\n**The model floor is the same for everyone. The context ceiling is yours.**\n\nYour data lives in object storage (millions of images, hours of video, documents) and databases (structured tables). Every chain a teammate or agent runs deposits a typed, versioned dataset into **Data Memory**: embeddings, classifications, joins, scores. At scale, those datasets are too expensive to recompute and too scattered to find on demand.\n\nDataChain is the Python library that runs your code over heavy files and tables in parallel and queries Data Memory at warehouse speed. Read from S3, GCS, or Azure, run your code, save as a Pydantic-typed dataset; the next pipeline or agent picks up from there.\n\n## 1. Why Data Memory\n\nClaude Code, Cursor, and Codex made AI good at code by giving it the repo context. Agents over your data need the same: a **data context layer** with schemas, lineage, and prior conclusions. **That layer is captured during production, not curated after.** Every DataChain pipeline run deposits a typed, versioned dataset into Data Memory; the Knowledge Base compiles those datasets into what agents read. Without production through DataChain, the layer has nothing structured to describe.\n\n## 2. 
Install\n\n```bash\npip install datachain\n```\n\nTo add the agent skill (Knowledge Base + code generation):\n\n```bash\ndatachain skill install --target claude     # also: --target cursor, --target codex\n```\n\nWorks with S3, GCS, Azure, and local filesystems.\n\n## 3. Quickstart: agent-driven pipeline\n\nTask: find dogs in S3 similar to a reference image, filtered by breed, mask availability, and image dimensions.\n\nGrab a reference image and run Claude Code (or other agent):\n```bash\ndatachain cp --anon s3://dc-readme/fiona.jpg .\n\nclaude\n```\n\nPrompt:\n```prompt\nFind dogs in s3://dc-readme/oxford-pets-micro/ similar to fiona.jpg:\n  - Pull breed metadata and mask files from annotations/\n  - Exclude images without mask\n  - Exclude Cocker Spaniels\n  - Only include images wider than 400px\n```\n\nResult:\n```\n  ┌──────┬───────────────────────────────────┬────────────────────────────┬──────────┐\n  │ Rank │               Image               │           Breed            │ Distance │\n  ├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤\n  │    1 │ shiba_inu_52.jpg                  │ shiba_inu                  │    0.244 │\n  ├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤\n  │    2 │ shiba_inu_53.jpg                  │ shiba_inu                  │    0.323 │\n  ├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤\n  │    3 │ great_pyrenees_17.jpg             │ great_pyrenees             │    0.325 │\n  └──────┴───────────────────────────────────┴────────────────────────────┴──────────┘\n\n  Fiona's closest matches are shiba inus (both top spots), which makes sense given her\n  tan coloring and pointed ears.\n```\n\nThe agent decomposed the task into steps - embeddings, breed metadata, mask join, quality filter - and saved each as a named, versioned dataset. 
Next time you ask a related question, it starts from what's already built.\n\nThe datasets are registered in a Knowledge Base optimized for both agents and humans:\n\n```bash\ndc-knowledge\n├── buckets\n│   └── s3\n│       └── dc_readme.md\n├── datasets\n│   ├── oxford_micro_dog_breeds.md\n│   ├── oxford_micro_dog_embeddings.md\n│   └── similar_to_fiona.md\n└── index.md\n```\n\nBrowse it as markdown files, navigate with wikilinks, or open in [Obsidian](https://obsidian.md/):\n\n![Visualize data Knowledge Base](docs/assets/readme_obsidian.gif)\n\n\n## 4. Data Harness\n\nCode harnesses (Claude Code, Cursor, Codex) give agents repo context, dedicated tools, and memory across sessions. DataChain adds the same for data: typed datasets the agent reads, chain operations the agent calls (`read_storage`, `map`, `save`), Data Memory where its results persist.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/harness.svg\" alt=\"DataChain as a data harness\" width=\"500\" /\u003e\n\u003c/p\u003e\n\nA **dataset** is the unit of work - a named, versioned result of a pipeline step like `pets_embeddings@1.0.0`. Every `.save()` registers one.\n\nFor the data-flow architecture (Python Data Engine, Data Memory, Query Engine, Knowledge Base) and how the components connect, see [Architecture](https://docs.datachain.ai/architecture/).\n\n\n## 5. Core concepts\n\n### 5.1. Dataset\n\nA dataset is a versioned data reasoning step - what was computed, from what input, producing what schema. DataChain indexes your storage into one: no data copied, just typed metadata and file pointers. 
Re-runs only process new or changed files.\n\nCreate a dataset manually in `create_dataset.py`:\n```python\nfrom PIL import Image\nimport io\nfrom pydantic import BaseModel\nimport datachain as dc\n\nclass ImageInfo(BaseModel):\n    width: int\n    height: int\n\ndef get_info(file: dc.File) -\u003e ImageInfo:\n    img = Image.open(io.BytesIO(file.read()))\n    return ImageInfo(width=img.width, height=img.height)\n\nds = (\n    dc.read_storage(\n        \"s3://dc-readme/oxford-pets-micro/images/**/*.jpg\",\n        anon=True,\n        update=True,\n        delta=True,         # re-runs skip unchanged files\n    )\n    .settings(prefetch=64)\n    .map(info=get_info)\n    .save(\"pets_images\")\n)\nds.show(5)\n```\n\n`pets_images@1.0.0` is now the shared reference to this data - schema, version, lineage, and metadata.\n\nEvery `.save()` registers the dataset in **Data Memory**, DataChain's persistent store for schemas, versions, lineage, and processing state, kept locally in a SQLite DB at `.datachain/db`. Pipelines reference datasets by name, not paths. When the code or input data changes, the next run bumps the dataset version.\n\nThis is what makes a **dataset a management unit:** owned, versioned, and queryable by everyone on the team.\n\n### 5.2. Schemas and types\n\nDataChain uses Pydantic to define the shape of every column. 
The return type of your UDF becomes the dataset schema - each field a queryable column in Data Memory.\n\n`show()` in the previous script renders nested fields as dotted columns:\n\n```bash\n                                          file    file  info   info\n                                          path    size width height\n0  oxford-pets-micro/images/Abyssinian_141.jpg  111270   461    500\n1  oxford-pets-micro/images/Abyssinian_157.jpg  139948   500    375\n2  oxford-pets-micro/images/Abyssinian_175.jpg   31265   600    234\n3  oxford-pets-micro/images/Abyssinian_220.jpg   10687   300    225\n4    oxford-pets-micro/images/Abyssinian_3.jpg   61533   600    869\n\n[Limited by 5 rows]\n```\n\n`.print_schema()` renders its schema:\n```bash\nfile: File@v1\n  source: str\n  path: str\n  size: int\n  version: str\n  etag: str\n  is_latest: bool\n  last_modified: datetime\n  location: Union[dict, list[dict], NoneType]\ninfo: ImageInfo\n  width: int\n  height: int\n```\n\nModels can be arbitrarily nested - a `BBox` inside an `Annotation`, a `List[Citation]` inside an LLM Response - and every leaf field stays queryable the same way. The schema lives in Data Memory and is enforced at dataset creation time.\n\nData Memory handles datasets of any size - 100 million files, hundreds of metadata rows - without loading anything into memory. **Pandas is limited by RAM; DataChain is not.** Export to pandas when you need it, on a filtered subset:\n\n```python\nimport datachain as dc\n\ndf = dc.read_dataset(\"pets_images\").filter(dc.C(\"info.width\") \u003e 500).to_pandas()\nprint(df)\n```\n\n### 5.3. 
Fast queries\n\nFilters, aggregations, and joins run as vectorized operations directly against Data Memory - metadata never leaves your machine, no files downloaded.\n\n```python\nimport datachain as dc\n\ncnt = (\n    dc.read_dataset(\"pets_images\")\n    .filter(\n        (dc.C(\"info.width\") \u003e 400) \u0026\n        ~dc.C(\"file.path\").ilike(\"%cocker_spaniel%\")   # case-insensitive\n    )\n    .count()\n)\nprint(f\"Large images without Cocker Spaniels: {cnt}\")\n```\n\nMilliseconds, even at 100M-file scale.\n```\nLarge images without Cocker Spaniels: 6\n```\n\n## 6. Resilient Pipelines\n\nWhen computation is expensive, bugs and new data are both inevitable. DataChain tracks processing state in Data Memory - so crashes and new data are handled automatically, without changing how you write pipelines.\n\n### 6.1. Data checkpoints\n\nSave to `embed.py`:\n```python\nimport open_clip, torch, io\nfrom PIL import Image\nimport datachain as dc\n\nmodel, _, preprocess = open_clip.create_model_and_transforms(\"ViT-B-32\", \"laion2b_s34b_b79k\")\nmodel.eval()\n\ncounter = 0\n\ndef encode(file: dc.File, model, preprocess) -\u003e list[float]:\n    global counter\n    counter += 1\n    if counter \u003e 236:                                    # ← bug: remove these two lines\n        raise Exception(\"some bug\")                      # ←\n    img = Image.open(io.BytesIO(file.read())).convert(\"RGB\")\n    with torch.no_grad():\n        return model.encode_image(preprocess(img).unsqueeze(0))[0].tolist()\n\n(\n    dc.read_dataset(\"pets_images\")\n    .settings(batch_size=100)\n    .setup(model=lambda: model, preprocess=lambda: preprocess)\n    .map(emb=encode)\n    .save(\"pets_embeddings\")\n)\n```\n\nIt fails due to a bug in the code:\n```\nException: some bug\n```\n\nRemove the two marked lines and re-run - DataChain resumes from image 201, the start of the last uncommitted batch (the first two batches of 100 had already completed):\n\n```\n$ python embed.py\nUDF 'encode': Continuing from 
checkpoint\n```\n\n### 6.2. Similarity search\n\nThe vectors live in Data Memory alongside all the metadata, stored as `list[float]` fields in the Pydantic schema. Querying them is instant: no files are re-read, and vector search can be combined with non-vector filters like `info.width`:\n\nPrepare data:\n```bash\ndatachain cp --anon s3://dc-readme/fiona.jpg .\n```\n\n`similar.py`:\n```python\nimport open_clip, torch, io\nfrom PIL import Image\nimport datachain as dc\n\nmodel, _, preprocess = open_clip.create_model_and_transforms(\"ViT-B-32\", \"laion2b_s34b_b79k\")\nmodel.eval()\n\nref_emb = model.encode_image(\n    preprocess(Image.open(\"fiona.jpg\")).unsqueeze(0)\n)[0].tolist()\n\n(\n    dc.read_dataset(\"pets_embeddings\")\n    .filter(dc.C(\"info.width\") \u003e 500)          # from pets_images - no re-read\n    .mutate(dist=dc.func.cosine_distance(dc.C(\"emb\"), ref_emb))\n    .order_by(\"dist\")\n    .limit(3)\n    .show()\n)\n```\n\nUnder a second - everything runs against Data Memory.\n\n\n### 6.3. Incremental updates\n\nThe bucket in this walkthrough is static, so there's nothing new to process. But in production - when new images land in your bucket - re-run the same scripts unchanged. `delta=True` in the original dataset ensures only new files are processed end to end, while the whole dataset is updated to `pets_images@1.0.1`:\n\n```\n$ python create_dataset.py   # 500 new images arrived\nSkipping 10,000 unchanged  ·  indexing 500 new\nSaved pets_images@1.0.1  (+500 records)\n\n# Next day:\n\n$ python create_dataset.py\nSkipping 10,000 unchanged  ·  processing 500 new\nSaved pets_images@1.0.2  (+500 records)\n```\n\n## 7. Knowledge Base\n\nDataChain maintains two layers. **Data Memory** is the ground truth: schemas, processing state, lineage, the vectors themselves. **The Knowledge Base** is derived from it: structured markdown for humans and agents to read. Because it's derived, it's always accurate. 
The Knowledge Base is stored in `dc-knowledge/`.\n\nAsk the agent to build it (from Claude Code, Codex, or Cursor):\n```bash\nclaude\n```\n\nPrompt:\n```prompt\nBuild a Knowledge Base for my current datasets\n```\n\nThe skill generates the `dc-knowledge/` directory from Data Memory - one file per dataset and bucket.\n\n\n## 8. AI-Generated Pipelines\n\nThe skill gives the agent data awareness: it reads `dc-knowledge/` to understand what datasets exist, their schemas, which fields can be joined - and the meaning of columns inferred from the code that produced them.\n\nSee section `3. Quickstart: agent-driven pipeline`. All the steps created manually in the sections above could instead be generated by the agent.\n\n\n## 9. Team and cloud: Studio\n\nData context built locally stays local. DataChain Studio makes it shared.\n\n```bash\ndatachain auth login\ndatachain job run --workers 20 --cluster gpu-pool caption.py\n# ✓ Job submitted → studio.datachain.ai/jobs/1042\n# Resuming from checkpoint (4,218 already done)...\n# Saved oxford-pets-caps@0.0.1  (3,182 processed)\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/studio_architecture.svg\" alt=\"DataChain Studio Architecture\" width=\"600\" /\u003e\n\u003c/p\u003e\n\nStudio adds: shared dataset registry, access control, UI for video/DICOM/NIfTI/point clouds, lineage graphs, reproducible runs.\n\nBring Your Own Cloud - all data and compute stay in your infrastructure. AWS, GCP, Azure, on-prem Kubernetes.\n\n→ [studio.datachain.ai](https://studio.datachain.ai)\n\n## 10. Contributing\n\nContributions are very welcome. To learn more, see the [Contributor Guide](https://docs.datachain.ai/contributing).\n\n## 11. 
Community and Support\n\n- [Report an issue](https://github.com/datachain-ai/datachain/issues) if you encounter any problems\n- [Docs](https://docs.datachain.ai/)\n- [Email](mailto:support@datachain.ai)\n- [Twitter](https://twitter.com/datachain_ai)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatachain-ai%2Fdatachain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatachain-ai%2Fdatachain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatachain-ai%2Fdatachain/lists"}