{"id":47938391,"url":"https://github.com/lancedb/hf-upload-demo","last_synced_at":"2026-04-04T07:55:23.806Z","repository":{"id":341500941,"uuid":"1169588375","full_name":"lancedb/hf-upload-demo","owner":"lancedb","description":"How to upload a Lance dataset to Hugging Face Hub and query it in LanceDB","archived":false,"fork":false,"pushed_at":"2026-03-02T03:04:37.000Z","size":2509,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-02T07:14:16.649Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lancedb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-28T22:52:21.000Z","updated_at":"2026-03-02T03:04:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lancedb/hf-upload-demo","commit_stats":null,"previous_names":["lancedb/hf-upload-demo"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lancedb/hf-upload-demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Fhf-upload-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Fhf-upload-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Fhf-upload-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/lancedb%2Fhf-upload-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lancedb","download_url":"https://codeload.github.com/lancedb/hf-upload-demo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Fhf-upload-demo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31392188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T04:26:24.776Z","status":"ssl_error","status_checked_at":"2026-04-04T04:23:34.147Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-04T07:55:23.173Z","updated_at":"2026-04-04T07:55:23.794Z","avatar_url":"https://github.com/lancedb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LanceDB Hugging Face Update Demo\n\nThis project demonstrates a simple staged workflow to manage your Lance datasets on [Hugging Face Hub](https://huggingface.co/datasets/lancedb/magical_kingdom).\n\nTo create a repo like this, first create an initial table using LanceDB on a local machine, and then upload it to the Hub via a CLI command.\n\nAs the dataset evolves, you can apply a one-time schema + data update, and upload the updated version of the data back to the Hub. 
Only the new data is uploaded, keeping things clean.\n\n## Setup\n\nUse `uv`, and export both `OPENAI_API_KEY` and `HF_TOKEN`.\n\n```bash\nuv sync\nexport OPENAI_API_KEY=...\nexport HF_TOKEN=hf_...\nhf auth login --token \"$HF_TOKEN\"\n```\n\nThe scripts look for a local file named `.env`, so to run any of them, you'll need to copy `.env.example` to a new file named `.env` and update the respective environment variables there.\n\n## Raw data layout\n\nSource JSON and generated portraits live under `raw_data/`:\n\n- `raw_data/magical_kingdom.json`\n- `raw_data/img/`\n- `raw_data/generate_images.py`\n\nSee `raw_data/README.md` for details on regenerating those assets.\n\n## Optional: Regenerate character portraits\n\nIf you want to regenerate the source images using an OpenAI model, run:\n\n```bash\nuv run python raw_data/generate_images.py\n```\n\n## Step 1: Create the initial Lance table\n\nStart clean, then build the Lance table locally.\nThis creates the `characters` table, computes embeddings in batches, and creates an FTS index.\n\n```bash\nrm -rf magical_kingdom\nuv run python create_dataset.py\n```\n\n## Step 2: Upload the Initial Snapshot to the Hub\n\nUpload the full `magical_kingdom` directory to `datasets/lancedb/magical_kingdom`.\n\n```bash\nhf upload-large-folder lancedb/magical_kingdom magical_kingdom \\\n  --repo-type dataset \\\n  --revision main\n```\n\n`hf upload-large-folder` uses a resumable multi-commit flow, which is more flexible and error-tolerant than `hf upload`, but it does not support custom commit messages.\n\n## Step 3: Update the dataset locally\n\nImagine a scenario where you want to add a new `category` column and backfill its values with a single `merge_insert` operation into your existing table.\n\nThis is **both a schema update and a data update**, which Lance excels at. Because Lance supports incremental [data evolution](https://lance.org/guide/data_evolution), it can add, remove and alter columns _without rewriting any data files_ in the 
existing dataset, making it very I/O-efficient when updating large tables.\n\n```bash\nuv run python update_dataset.py\n```\n\nOver time, you can run a [compaction](https://docs.lancedb.com/lance#data-compaction) job that calls `table.optimize()` to manage the number of manifests that are recorded in the history.\n\n## Step 4: Upload the updated version to the Hub\n\nUpload the same local directory again (now a new version of the dataset).\n\n```bash\nhf upload-large-folder lancedb/magical_kingdom magical_kingdom \\\n  --repo-type dataset \\\n  --revision main\n```\n\n## Step 5: Inspect versions and query on the Hub\n\n`inspect_dataset.py` reads from `hf://datasets/lancedb/magical_kingdom` and prints table versions.\n\n\n```bash\nuv run python inspect_dataset.py\n```\n\nIf you run `update_dataset.py` again without resetting, it will fail at `add_columns`\nbecause the `category` column already exists. If you want to upsert the column's data,\ncomment out the line that adds the `category` column.\n\n`query.py` also reads from the Hub and runs all five example queries.\n\n```bash\nuv run python query.py\n```\n\nExample:\n```python\nimport lancedb\n\n# Scan data directly from the Hugging Face Hub\n# (No need to download the dataset locally)\ndb = lancedb.connect(\"hf://datasets/lancedb/magical_kingdom\")\ntable = db.open_table(\"characters\")\n\nr = table.search() \\\n    .where(\"category = 'knight'\") \\\n    .select([\"name\", \"role\", \"stats.strength\"]) \\\n    .limit(4) \\\n    .to_polars() \\\n    .sort(\"stats.strength\", descending=True) \\\n    .head(1)\nprint(r)\n```\nThe character belonging to the `knight` category with the greatest strength is Sir Lancelot! 
🗡️\n\n```\n┌──────────────┬───────────────────────────┬────────────────┐\n│ name         ┆ role                      ┆ stats.strength │\n│ ---          ┆ ---                       ┆ ---            │\n│ str          ┆ str                       ┆ i8             │\n╞══════════════╪═══════════════════════════╪════════════════╡\n│ Sir Lancelot ┆ Knight of the Round Table ┆ 5              │\n└──────────────┴───────────────────────────┴────────────────┘\n```\n\n## Update the Dataset Card\n\nThe Hub dataset card allows you to communicate the schema and usage of the dataset to other developers.\nIt sits at the repo’s root in a file named `README.md` on the Hub.\nThis project keeps the source card text in `HF_DATASET_CARD.md`, so you can edit the card there\nand upload it as `README.md` using the HF CLI command below.\nThis requires a regular `hf upload` because it is a single-file upload to a specific target path, and `hf upload` also lets you add a custom commit message.\n\n```bash\nhf upload lancedb/magical_kingdom HF_DATASET_CARD.md README.md \\\n  --repo-type dataset \\\n  --commit-message \"Update dataset card\"\n```\n\n## Optional: Reset the Hub Repo\n\nIf you want to reproduce the full demo from scratch on the Hub, delete the existing repo and recreate it:\n\n```bash\nhf repos delete lancedb/magical_kingdom --repo-type dataset\nhf repos create lancedb/magical_kingdom --repo-type dataset\n```\n\nThen, work through the steps described above. Have fun uploading your Lance datasets to Hugging Face!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flancedb%2Fhf-upload-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flancedb%2Fhf-upload-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flancedb%2Fhf-upload-demo/lists"}