{"id":29029752,"url":"https://github.com/neuralinkcorp/datarepo","last_synced_at":"2025-08-17T18:33:27.098Z","repository":{"id":300684925,"uuid":"988483813","full_name":"neuralinkcorp/datarepo","owner":"neuralinkcorp","description":null,"archived":false,"fork":false,"pushed_at":"2025-07-25T19:33:55.000Z","size":12909,"stargazers_count":93,"open_issues_count":5,"forks_count":13,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-08-08T19:25:36.075Z","etag":null,"topics":["data-warehouse","datalake","datawarehouse","delta-lake"],"latest_commit_sha":null,"homepage":"https://data-repo.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neuralinkcorp.png","metadata":{"files":{"readme":"docs/README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-22T16:00:23.000Z","updated_at":"2025-08-05T14:04:05.000Z","dependencies_parsed_at":"2025-06-25T22:34:07.031Z","dependency_job_id":"2d150e47-fa0b-4d48-9623-b50acdc9913a","html_url":"https://github.com/neuralinkcorp/datarepo","commit_stats":null,"previous_names":["neuralinkcorp/neuralake","neuralinkcorp/datarepo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/neuralinkcorp/datarepo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralinkcorp%2Fdatarepo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralinkcorp%2Fdatarepo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralinkcorp%2Fdatarepo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/neuralinkcorp%2Fdatarepo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neuralinkcorp","download_url":"https://codeload.github.com/neuralinkcorp/datarepo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralinkcorp%2Fdatarepo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270891830,"owners_count":24663538,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-warehouse","datalake","datawarehouse","delta-lake"],"created_at":"2025-06-26T08:37:12.278Z","updated_at":"2025-08-17T18:33:27.090Z","avatar_url":"https://github.com/neuralinkcorp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- Using CSS to hide this on the site, as the logo is already on the nav.--\u003e\n\u003cdiv align=\"center\" class=\"github-only\"\u003e\n    \u003cimg src=\"images/banner_black.png\"\u003e\n    \u003cbr\u003e\n    \u003ca href=\"https://data-repo.io\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/DOCS-blue?style=for-the-badge\" alt=\"Documentation\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/data-repository/\"\u003e\n        \u003cimg 
src=\"https://img.shields.io/pypi/v/data-repository?style=for-the-badge\" alt=\"PyPI Version\"\u003e\n    \u003c/a\u003e\n\u003c/div\u003e\n\n# datarepo: a simple platform for complex data\n\n`datarepo` is a simple query interface for multimodal data at any scale.\n\nWith `datarepo`, you can define a catalog, databases, and tables to query any existing data source. Once you've defined your catalog, you can spin up a static site for easy browsing or a read-only API for programmatic access. No running servers or services!\n\nThe `datarepo` catalog has native, declarative connectors to [Delta Lake](https://delta.io/) and [Parquet](https://parquet.apache.org/) stores. `datarepo` also supports defining tables via custom Python functions, so you can connect to any data source!\n\nHere's an example catalog:\n\n\u003cdiv class=\"github-only\"\u003e\n    \u003cimg src=\"images/catalog.png\" /\u003e\n\u003c/div\u003e\n\n\u003c!-- The below comment is replaced by a MkDocs hook to insert an iFrame catalog --\u003e\n\u003c!-- this is done via hooks because we can't show the iFrame on GitHub, but want to show it on the static site. --\u003e\n\u003c!-- mkdocs:iframe --\u003e\n\n## Key features\n\n- **Unified interface**: Query data across different storage modalities (Parquet, Delta Lake, relational databases)\n- **Declarative catalog syntax**: Define catalogs in Python without running services\n- **Catalog site generation**: Generate a static site catalog for visual browsing\n- **Extensible**: Declare tables as custom Python functions for querying **any** data\n- **API support**: Generate a YAML config for querying with [ROAPI](https://github.com/roapi/roapi)\n- **Fast**: Uses Rust-native libraries such as [polars](https://github.com/pola-rs/), [delta-rs](https://github.com/delta-io/delta-rs), and [Apache DataFusion](https://github.com/apache/datafusion) for performant reads\n\n## Philosophy\nData engineering should be simple. That means:\n\n1. 
**Scale up and scale down** - tools should scale down to a developer's laptop and up to stateless clusters\n2. **Prioritize local development experience** - use composable libraries instead of distributed services\n3. **Code as a catalog** - define tables *in code*, generate a static site catalog and APIs without running services\n\n## Quick start\n\nInstall the latest version with:\n\n```bash\npip install data-repository\n```\n\n### Create a table and catalog\n\nFirst, create a module to define your tables (e.g., `tpch_tables.py`):\n\n```python\n# tpch_tables.py\nfrom datarepo.core import (\n    DeltalakeTable,\n    ParquetTable,\n    Filter,\n    table,\n    NlkDataFrame,\n    Partition,\n    PartitioningScheme,\n)\nimport pyarrow as pa\nimport polars as pl\n\n# Delta Lake backed table\npart = DeltalakeTable(\n    name=\"part\",\n    uri=\"s3://my-bucket/tpc-h/part\",\n    schema=pa.schema(\n        [\n            (\"p_partkey\", pa.int64()),\n            (\"p_name\", pa.string()),\n            (\"p_mfgr\", pa.string()),\n            (\"p_brand\", pa.string()),\n            (\"p_type\", pa.string()),\n            (\"p_size\", pa.int32()),\n            (\"p_container\", pa.string()),\n            (\"p_retailprice\", pa.decimal128(12, 2)),\n            (\"p_comment\", pa.string()),\n        ]\n    ),\n    docs_filters=[\n        Filter(\"p_partkey\", \"=\", 1),\n        Filter(\"p_brand\", \"=\", \"Brand#1\"),\n    ],\n    unique_columns=[\"p_partkey\"],\n    description=\"\"\"\n    Part information from the TPC-H benchmark.\n    Contains details about parts including name, manufacturer, brand, and retail price.\n    \"\"\",\n    table_metadata_args={\n        \"data_input\": \"Part catalog data from manufacturing systems, updated daily\",\n        \"latency_info\": \"Daily batch updates from manufacturing ERP system\",\n        \"example_notebook\": \"https://example.com/notebooks/part_analysis.ipynb\",\n    },\n)\n\n# Table defined as a function\n@table(\n    
data_input=\"Supplier master data from vendor management system \u003ccode\u003e/api/suppliers/master\u003c/code\u003e endpoint\",\n    latency_info=\"Updated weekly by the supplier_master_sync DAG on Airflow\",\n)\ndef supplier() -\u003e NlkDataFrame:\n    \"\"\"Supplier information from the TPC-H benchmark.\"\"\"\n    # Placeholder rows for illustration; every column must have the same length.\n    data = {\n        \"s_suppkey\": [1, 2, 3, 4, 5],\n        \"s_name\": [\n            \"Supplier#1\",\n            \"Supplier#2\",\n            \"Supplier#3\",\n            \"Supplier#4\",\n            \"Supplier#5\",\n        ],\n        \"s_address\": [\n            \"123 Main St\",\n            \"456 Oak Ave\",\n            \"789 Pine Rd\",\n            \"1011 Elm St\",\n            \"1213 Cedar Ln\",\n        ],\n        \"s_nationkey\": [1, 1, 1, 1, 1],\n        \"s_phone\": [\"555-0001\", \"555-0002\", \"555-0003\", \"555-0004\", \"555-0005\"],\n        \"s_acctbal\": [1000.00, 2000.00, 3000.00, 4000.00, 5000.00],\n        \"s_comment\": [\"Comment 1\", \"Comment 2\", \"Comment 3\", \"Comment 4\", \"Comment 5\"],\n    }\n    return pl.LazyFrame(data)\n\n```\n\n```python\n# tpch_catalog.py\nfrom datarepo.core import Catalog, ModuleDatabase\nimport tpch_tables\n\n# Create a catalog\ndbs = {\"tpc-h\": ModuleDatabase(tpch_tables)}\nTPCHCatalog = Catalog(dbs)\n```\n\n### Query the data\n\n```python\n\u003e\u003e\u003e from tpch_catalog import TPCHCatalog\n\u003e\u003e\u003e from datarepo.core import Filter\n\u003e\u003e\u003e\n\u003e\u003e\u003e # Get part and supplier information\n\u003e\u003e\u003e part_data = TPCHCatalog.db(\"tpc-h\").table(\n...     \"part\",\n...     (\n...         Filter('p_partkey', 'in', [1, 2, 3, 4]),\n...         Filter('p_brand', 'in', ['Brand#1', 'Brand#2', 'Brand#3']),\n...     ),\n... )\n\u003e\u003e\u003e\n\u003e\u003e\u003e supplier_data = TPCHCatalog.db(\"tpc-h\").table(\"supplier\")\n\u003e\u003e\u003e\n\u003e\u003e\u003e # Join part and supplier data and select specific columns\n\u003e\u003e\u003e joined_data = part_data.join(\n...     supplier_data,\n...     left_on=\"p_partkey\",\n...     right_on=\"s_suppkey\",\n... 
).select([\"p_name\", \"p_brand\", \"s_name\"]).collect()\n\u003e\u003e\u003e\n\u003e\u003e\u003e print(joined_data)\nshape: (4, 3)\n┌────────────┬────────────┬────────────┐\n│ p_name     │ p_brand    │ s_name     │\n│ ---        │ ---        │ ---        │\n│ str        │ str        │ str        │\n╞════════════╪════════════╪════════════╡\n│ Part#1     │ Brand#1    │ Supplier#1 │\n│ Part#2     │ Brand#2    │ Supplier#2 │\n│ Part#3     │ Brand#3    │ Supplier#3 │\n│ Part#4     │ Brand#1    │ Supplier#4 │\n└────────────┴────────────┴────────────┘\n```\n\n### Generate a static site catalog\nYou can export your catalog to a static site with a single function call:\n\n```python\n# export.py\nfrom pathlib import Path\n\nfrom datarepo.export.web import export_and_generate_site\nfrom tpch_catalog import TPCHCatalog\n\n# Directory to write the generated site to (placeholder path; adjust as needed)\noutput_dir = Path(\"catalog_site\")\n\n# Export and generate the site\nexport_and_generate_site(\n    catalogs=[(\"tpch\", TPCHCatalog)], output_dir=str(output_dir)\n)\n```\n\n\n### Generate an API\n\nYou can also generate a YAML configuration for [ROAPI](https://github.com/roapi/roapi):\n\n```python\nfrom datarepo.export import roapi\nfrom tpch_catalog import TPCHCatalog\n\n# Generate ROAPI config\nroapi.generate_config(TPCHCatalog, output_file=\"roapi-config.yaml\")\n```\n\n## About Neuralink\n\n`datarepo` is part of Neuralink's commitment to the open source community. By maintaining free and open source software, we aim to accelerate data engineering and biotechnology.\n\nNeuralink is creating a generalized brain interface to restore autonomy to those with unmet medical needs today, and to unlock human potential tomorrow.\n\nYou don't have to be a brain surgeon to work at Neuralink. We are looking for exceptional individuals from many fields, including software and data engineering. 
Learn more at [neuralink.com/careers](https://neuralink.com/careers/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuralinkcorp%2Fdatarepo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneuralinkcorp%2Fdatarepo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuralinkcorp%2Fdatarepo/lists"}