{"id":13815110,"url":"https://github.com/xetdata/pyxet","last_synced_at":"2025-10-11T18:51:06.491Z","repository":{"id":154480527,"uuid":"629673310","full_name":"xetdata/pyxet","owner":"xetdata","description":"Python SDK for XetHub","archived":false,"fork":false,"pushed_at":"2024-10-16T17:58:41.000Z","size":1640,"stargazers_count":49,"open_issues_count":8,"forks_count":8,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-29T12:09:11.070Z","etag":null,"topics":["sdk-python","xethub"],"latest_commit_sha":null,"homepage":"https://xethub.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xetdata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-18T19:46:45.000Z","updated_at":"2024-12-11T17:50:09.000Z","dependencies_parsed_at":"2024-05-20T02:46:07.845Z","dependency_job_id":"ded41e12-7090-4587-8bb7-6c4f2e750712","html_url":"https://github.com/xetdata/pyxet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xetdata%2Fpyxet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xetdata%2Fpyxet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xetdata%2Fpyxet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xetdata%2Fpyxet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xetdata","download_url":"https://codeload.github.com/xetdata/pyxet/tar.gz/refs/h
eads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247339158,"owners_count":20923014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["sdk-python","xethub"],"created_at":"2024-08-04T04:02:58.317Z","updated_at":"2025-10-11T18:51:01.450Z","avatar_url":"https://github.com/xetdata.png","language":"Python","readme":"[![No Maintenance Intended](http://unmaintained.tech/badge.svg)](http://unmaintained.tech/)\n\n# [DEPRECATED] pyxet - The SDK for XetHub\n\n**_XetHub has joined [Hugging Face 🤗](https://huggingface.co/blog/xethub-joins-hf). Follow our work to improve large-scale collaboration on [Hugging Face Hub](https://huggingface.co/xet-team)._**\n\n----\n\npyxet is a Python library that provides a Pythonic interface for\n[XetHub](https://xethub.com/). XetHub is a simple Git-based system capable of\nstoring TBs of ML data and models in a single repository, with block-level \ndata deduplication that enables hundreds of versions of similar data to be\nstored without requiring much storage. \n\n## License\n\n[BSD 3](LICENSE)\n\n## Features\n\nPyxet has 3 components:\n\n1. A [fsspec](https://filesystem-spec.readthedocs.io)\ninterface that allows compatible libraries such as Pandas, Polars and DuckDB\nto directly access any version of any file in a Xet repository. See below\nfor some examples.\n\n2. A command line interface inspired by the AWS CLI that allows files to be \nuploaded to and downloaded from a Xet repository conveniently and efficiently.\n\n3. A file system mount mechanism that allows any version of any Xet repository\nto be mounted. 
This works on Mac, Linux, and Windows 11 Pro.\n\nFor API documentation and full examples, please see [here](https://pyxet.readthedocs.io/en/latest/).\n\n\n## Installation\n\nSet up your virtualenv with:\n\n```sh\n$ python -m venv .venv\n$ . .venv/bin/activate\n```\n\nThen, install pyxet with:\n\n```sh\n$ pip install pyxet\n```\n\n\n## Authentication\n\nSign up on [XetHub](https://xethub.com/user/sign_up) and obtain\na username and access token. Keep a record of both.\n\nThere are three ways to authenticate with XetHub:\n\n### Command Line\n\n```bash\nxet login -e \u003cemail\u003e -u \u003cusername\u003e -p \u003cpersonal_access_token\u003e\n```\n`xet login` writes authentication information to `~/.xetconfig`.\n\n### Environment Variable\nEnvironment variables are sometimes more convenient:\n```\nexport XET_USER_EMAIL=\u003cemail\u003e\nexport XET_USER_NAME=\u003cusername\u003e\nexport XET_USER_TOKEN=\u003cpersonal_access_token\u003e\n```\n\n### In Python\nFinally, in a notebook or other non-persistent environment, \nyou can authenticate directly from Python. Note that\nthis must be the first thing you run before any other operation:\n```python\nimport pyxet\npyxet.login(\u003cusername\u003e, \u003cpersonal_access_token\u003e, \u003cemail\u003e)\n```\n\n# Usage\n\nHere are a few basic usage examples. 
For complete documentation,\nplease see [here](https://pyxet.readthedocs.io/en/latest/).\n\nOur examples are based on a small Titanic dataset you can see and explore [here](https://xethub.com/xethub/titanic).\n\n## Reading Files and Accessing Repos\n\nA XetHub URL for pyxet is in the form:\n```\nxet://\u003cendpoint\u003e:\u003crepo_owner\u003e/\u003crepo_name\u003e/\u003cbranch\u003e/\u003cpath_to_file\u003e\n```\n\nUse our public `xethub.com` endpoint unless you're on a custom enterprise deployment.\n\nReading files with pyxet is easy: calling `pyxet.open` on a Xet path returns a\nPython file-like object that you can read from directly.\n\n```python\nimport pyxet\nprint(pyxet.open('xet://xethub.com:XetHub/titanic/main/README.md').readlines())\n```\n\n\n## Pandas Integration\n\nfsspec integration means that many libraries can read from XetHub\ndirectly. For instance, we can read the CSV file straight into a Pandas\ndataframe:\n\n```python\nimport pyxet            # make xet:// protocol available\nimport pandas as pd     # assumes pip install pandas has been run\n\ndf = pd.read_csv('xet://xethub.com:XetHub/titanic/main/titanic.csv')\ndf\n```\n\nThis should return something like:\n\n```\nOut[3]:\n     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked\n0              1         0       3  ...   7.2500   NaN         S\n1              2         1       1  ...  71.2833   C85         C\n2              3         1       3  ...   7.9250   NaN         S\n3              4         1       1  ...  53.1000  C123         S\n4              5         0       3  ...   8.0500   NaN         S\n..           ...       ...     ...  ...      ...   ...       ...\n886          887         0       2  ...  13.0000   NaN         S\n887          888         1       1  ...  30.0000   B42         S\n888          889         0       3  ...  23.4500   NaN         S\n889          890         1       1  ...  
30.0000  C148         C\n890          891         0       3  ...   7.7500   NaN         Q\n\n[891 rows x 12 columns]\n```\n\n## Working with a Blob Store\n\nThe `XetFS` object in Pyxet implements the full [fsspec](https://filesystem-spec.readthedocs.io/en/latest/)\nAPI. For instance, you can list folders with:\n```python\nimport pyxet\n\nfs = pyxet.XetFS()\nprint(fs.listdir('xethub/titanic/main'))\n```\n\nThis should output something like the following:\n```\n[{'name': 'xethub/titanic/main/.gitattributes', 'size': 79, 'type': 'file'},\n{'name': 'xethub/titanic/main/data', 'size': 0, 'type': 'directory'},\n{'name': 'xethub/titanic/main/readme.md', 'size': 58, 'type': 'file'},\n{'name': 'xethub/titanic/main/titanic.csv', 'size': 61194, 'type': 'file'},\n{'name': 'xethub/titanic/main/titanic.json', 'size': 165682, 'type': 'file'},\n{'name': 'xethub/titanic/main/titanic.parquet',\n'size': 27175,\n'type': 'file'}]\n```\n\nHere are some other simple ways to access information from an existing repository:\n\n```python\nimport pyxet\n\nfs = pyxet.XetFS()  # fsspec filesystem\n\nfs.info(\"xethub/titanic/main/titanic.csv\")\n# returns repo level info: {'name': 'https://xethub.com/xethub/titanic/titanic.csv', 'size': 61194, 'type': 'file'}\n\nfs.open(\"xethub/titanic/main/titanic.csv\", 'r').read(20)\n# returns first 20 characters: 'PassengerId,Survived'\n\nfs.get(\"xethub/titanic/main/data/\", \"data\", recursive=True)\n# download remote directory recursively into a local data folder\n\nfs.ls(\"xethub/titanic/main/data/\", detail=False)\n# returns ['data/titanic_0.parquet', 'data/titanic_1.parquet']\n```\n\nPyxet also allows you to write to repositories with Git versioning.\n\n## Writing files with Pyxet\n\nTo write files with pyxet, you first need a repository that you have write access to.\nAn easy way to get one is to fork the titanic repo. 
You can do so with:\n\n```bash\nxet repo fork xet://xethub.com:XetHub/titanic\n```\n(See the Xet CLI section below.)\n\nThis will create a private version of the titanic repository under `xet://xethub.com:\u003cusername\u003e/titanic`.\n\nUnlike typical blob stores, XetHub writes are *transactional*. This means that either the\nentire write succeeds or the entire write fails \n(there is a transaction limit of about 1024 files).\n\n```python\nimport pyxet\nfs = pyxet.XetFS()\nuser_name = \u003cuser_name\u003e\nwith fs.transaction as tr:\n    tr.set_commit_message(\"hello world\")\n    f = fs.open(f\"{user_name}/titanic/main/hello_world.txt\", 'w')\n    f.write(\"hello world\")\n    f.close()\n```\n\nIf you navigate to your titanic repository on XetHub, you'll see the new \n`hello_world.txt`.\n\n\n# Xet CLI\nThe Xet command line is the easiest way to interact with a Xet repository.\n\n## Listing and time travel\nYou can browse the repository with:\n```bash\nxet ls xet://xethub.com:\u003cusername\u003e/titanic/main\n```\n\nYou can even browse it at any point in history (say 5 minutes ago) with:\n```bash\nxet ls xet://xethub.com:\u003cusername\u003e/titanic/main@{5.minutes.ago}\n```\n\n## Downloading\nThis syntax works everywhere. You can download files with `xet cp`:\n```bash\n# syntax is similar to AWS CLI \nxet cp xet://xethub.com:\u003cusername\u003e/titanic/main/\u003cpath\u003e \u003clocal_path\u003e\nxet cp xet://xethub.com:\u003cusername\u003e/titanic/main@{5.minutes.ago}/\u003cpath\u003e \u003clocal_path\u003e\n```\n\n## Uploading\nYou can also use `xet cp` to upload files:\n```bash\nxet cp \u003cfile/directory\u003e xet://xethub.com:\u003cusername\u003e/titanic/main/\u003cpath\u003e\n```\nOf course, you cannot rewrite history, so uploading to `main@{5.minutes.ago}`\nis prohibited. 
\n\n## Branches\nYou can easily create branches for collaboration:\n```bash\nxet branch make xet://xethub.com:\u003cusername\u003e/titanic main another_branch\n```\nThis is fast regardless of the size of the repo.\n\n## Copying across repos and branches\nCopying across branches is efficient and can be used to restore a historical\ncopy of a file that you accidentally overwrote:\n\n```bash\n# copying across branches\nxet cp xet://xethub.com:\u003cusername\u003e/titanic/branch/\u003cfile\u003e xet://xethub.com:\u003cusername\u003e/titanic/main/\u003cfile\u003e\n# copying from history to current\nxet cp xet://xethub.com:\u003cusername\u003e/titanic/main@{5.minutes.ago}/\u003cfile\u003e xet://xethub.com:\u003cusername\u003e/titanic/main/\u003cfile\u003e\n```\n\n## S3, GCP, etc\nThe Xet CLI understands every protocol that fsspec does, so all the commands above\nwork with S3, GCP and many other protocols too. You can also use the Xet CLI to\ncopy data directly between S3 and XetHub:\n```\n$ xet cp xet://... s3://...\n$ xet cp s3://... xet://...\n```\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxetdata%2Fpyxet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxetdata%2Fpyxet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxetdata%2Fpyxet/lists"}