{"id":31959424,"url":"https://github.com/huggingface/pyspark_huggingface","last_synced_at":"2025-10-14T15:32:15.282Z","repository":{"id":273960085,"uuid":"877891911","full_name":"huggingface/pyspark_huggingface","owner":"huggingface","description":"PySpark custom data source for Hugging Face Datasets","archived":false,"fork":false,"pushed_at":"2025-08-12T17:23:08.000Z","size":224,"stargazers_count":17,"open_issues_count":0,"forks_count":5,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-30T18:02:30.715Z","etag":null,"topics":["datasets","datasource","huggingface","huggingface-datasets","spark","spark-datasource"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-24T12:27:30.000Z","updated_at":"2025-09-17T12:52:19.000Z","dependencies_parsed_at":"2025-01-24T02:22:20.041Z","dependency_job_id":"3bc873ba-0d9e-44c1-896c-2ff8648c54dc","html_url":"https://github.com/huggingface/pyspark_huggingface","commit_stats":null,"previous_names":["huggingface/pyspark_huggingface"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/pyspark_huggingface","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fpyspark_huggingface","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fpyspark_huggingface/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hug
gingface%2Fpyspark_huggingface/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fpyspark_huggingface/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/pyspark_huggingface/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fpyspark_huggingface/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019322,"owners_count":26086711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","datasource","huggingface","huggingface-datasets","spark","spark-datasource"],"created_at":"2025-10-14T15:30:26.976Z","updated_at":"2025-10-14T15:32:15.277Z","avatar_url":"https://github.com/huggingface.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"Hugging Face x Spark\" src=\"https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg\u0026name=large\" width=\"352\" style=\"max-width: 100%;\"\u003e\n  \u003cbr/\u003e\n  \u003cbr/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/huggingface/pyspark_huggingface/releases\"\u003e\u003cimg alt=\"GitHub release\" 
src=\"https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/datasets/\"\u003e\u003cimg alt=\"Number of datasets\" src=\"https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets\u0026color=brightgreen\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n# Spark Data Source for Hugging Face Datasets\n\nA Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):\n\n- Stream datasets from Hugging Face as Spark DataFrames\n- Select subsets and splits, apply projection and predicate filters\n- Save Spark DataFrames as Parquet files to Hugging Face\n- Fast deduped uploads\n- Fully distributed\n- Authentication via `huggingface-cli login` or tokens\n- Compatible with Spark 4 (with auto-import)\n- Backport for Spark 3.5, 3.4 and 3.3\n\n## Installation\n\n```\npip install pyspark_huggingface\n```\n\n## Usage\n\nLoad a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):\n\n```python\nimport pyspark_huggingface\ndf = spark.read.format(\"huggingface\").load(\"stanfordnlp/imdb\")\n```\n\nSave to Hugging Face:\n\n```python\n# Login with huggingface-cli login\ndf.write.format(\"huggingface\").save(\"username/my_dataset\")\n# Or pass a token manually\ndf.write.format(\"huggingface\").option(\"token\", \"hf_xxx\").save(\"username/my_dataset\")\n```\n\n## Advanced\n\nSelect a split:\n\n```python\ntest_df = (\n    spark.read.format(\"huggingface\")\n    .option(\"split\", \"test\")\n    .load(\"stanfordnlp/imdb\")\n)\n```\n\nSelect a subset/config:\n\n```python\ntest_df = (\n    spark.read.format(\"huggingface\")\n    .option(\"config\", \"sample-10BT\")\n    .load(\"HuggingFaceFW/fineweb-edu\")\n)\n```\n\nFilter columns and rows (especially efficient for Parquet datasets):\n\n```python\ndf = (\n    spark.read.format(\"huggingface\")\n    .option(\"filters\", '[(\"language_score\", \"\u003e\", 0.99)]')\n    
.option(\"columns\", '[\"text\", \"language_score\"]')\n    .load(\"HuggingFaceFW/fineweb-edu\")\n)\n```\n\n## Fast deduped uploads\n\nHugging Face uses Xet, a dedupe-based storage system that enables fast deduped uploads.\n\nUnlike with traditional remote storage, uploads to Xet are faster because duplicate data is only uploaded once.\nFor example, if some or all of the data already exists in other files on Xet, it is not uploaded again, saving bandwidth and speeding up uploads. Deduplication for Parquet is enabled through Content Defined Chunking (CDC).\n\nThanks to Parquet CDC and Xet deduplication, saving a dataset on Hugging Face is faster than on traditional remote storage.\n\nFor more information, see [https://huggingface.co/blog/parquet-cdc](https://huggingface.co/blog/parquet-cdc).\n\n## Backport\n\nWhile the Data Source API was introduced in Spark 4, this package includes a backport for older versions.\n\nImporting `pyspark_huggingface` patches the PySpark reader and writer to add the \"huggingface\" data source. 
It is compatible with PySpark 3.5, 3.4 and 3.3:\n\n```python\n\u003e\u003e\u003e import pyspark_huggingface\nhuggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)\n```\n\nThe import is only necessary on Spark 3.x to enable the backport.\nSpark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the \"huggingface\" data source.\n\n## Development\n\n[Install uv](https://docs.astral.sh/uv/getting-started/installation/) if you haven't already.\n\nThen, from the project root directory, sync dependencies and run tests:\n\n```\nuv sync\nuv run pytest\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fpyspark_huggingface","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fpyspark_huggingface","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fpyspark_huggingface/lists"}