{"id":30264970,"url":"https://github.com/huggingface/aisheets","last_synced_at":"2025-10-14T15:33:06.161Z","repository":{"id":309112563,"uuid":"917782991","full_name":"huggingface/aisheets","owner":"huggingface","description":"Build, enrich, and transform datasets using AI models with no code","archived":false,"fork":false,"pushed_at":"2025-09-29T16:45:46.000Z","size":1767,"stargazers_count":1484,"open_issues_count":11,"forks_count":124,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-09-29T18:37:12.778Z","etag":null,"topics":["ai","llm-evaluation","llms","nocode","oss","synthetic-data"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/aisheets/sheets","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-16T16:30:46.000Z","updated_at":"2025-09-29T14:22:01.000Z","dependencies_parsed_at":"2025-08-09T23:50:37.061Z","dependency_job_id":"984a4e9c-4c92-40a1-8fce-5a66c6bb7960","html_url":"https://github.com/huggingface/aisheets","commit_stats":null,"previous_names":["huggingface/aisheets"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/aisheets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Faisheets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Faisheets/tags","releases_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Faisheets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Faisheets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/aisheets/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Faisheets/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019320,"owners_count":26086711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llm-evaluation","llms","nocode","oss","synthetic-data"],"created_at":"2025-08-15T22:02:44.701Z","updated_at":"2025-10-14T15:33:06.155Z","avatar_url":"https://github.com/huggingface.png","language":"TypeScript","funding_links":[],"categories":["ai","Repos"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🤗 Hugging Face AI Sheets\n\n*Build, enrich, and transform datasets using AI models with no code. 
Deploy locally or on the Hub with access to thousands of open models.*\n\n[Introduction](https://huggingface.co/blog/aisheets) • [Try it out](https://huggingface.co/spaces/aisheets/sheets)\n\n\u003cvideo width=\"400\" src=\"https://github.com/user-attachments/assets/a284e4d4-3c11-4885-96cc-2f6f0314f2a1\"\u003e\u003c/video\u003e\n\n\u003c/div\u003e\n\n## What's AI Sheets?\n\nHugging Face AI Sheets is an open-source tool for building, enriching, and transforming datasets using AI models with no code. The tool can be deployed locally or on the Hub. It lets you use thousands of open models from the Hugging Face Hub via Inference Providers or local models, including `gpt-oss` from OpenAI!\n\n## Quick Start\n\n### Using the AI Sheets Space\n\nTry it instantly at \u003chttps://huggingface.co/spaces/aisheets/sheets\u003e\n\n### Using Docker\n\nFirst, get your Hugging Face token from \u003chttps://huggingface.co/settings/tokens\u003e\n\n```bash\nexport HF_TOKEN=your_token_here\ndocker run -p 3000:3000 \\\n-e HF_TOKEN=$HF_TOKEN \\\naisheets/sheets\n```\n\nOpen `http://localhost:3000` in your browser.\n\n### Using pnpm\n\nFirst, [install pnpm](https://pnpm.io/installation) if you haven't already.\n\n```bash\ngit clone https://github.com/huggingface/aisheets.git\ncd aisheets\nexport HF_TOKEN=your_token_here\npnpm install --frozen-lockfile\npnpm dev\n```\n\nOpen `http://localhost:5173` in your browser.\n\n#### Building for production\n\nTo build the production application, run:\n\n```bash\npnpm build\n```\n\nThis will create a production build in the `dist` directory.\n\nThen, you can launch the built-in Express server to serve the production build:\n\n```bash\nexport HF_TOKEN=your_token_here\npnpm serve\n```\n\n## Running data generation scripts using HF Jobs\n\nIf you want to generate a larger dataset, you can run one of the data generation scripts with a config file as an HF Job, like this:\n\n```bash\nhf jobs uv run \\\n-s HF_TOKEN=$HF_TOKEN 
\\\nhttps://github.com/huggingface/aisheets/raw/refs/heads/main/scripts/extend_dataset/with_inference_client.py \\\nnvidia/Nemotron-Personas dvilasuero/nemotron-kimi-qa-distilled \\\n--config https://huggingface.co/datasets/dvilasuero/nemotron-personas-kimi-questions/raw/main/config.yml \\\n--num-rows 100\n```\n\nIn this command, the script URL is the script that runs the pipeline, `--config` is the config with prompts, and `--num-rows 100` limits the run to 100 rows; omit it to process the full dataset.\n\nAlternatively, you can use a script that uses vLLM inference instead of the inference client. This script helps you save on inference costs, but it requires you to set up a vLLM-compatible flavor when running the job:\n\n```bash\n# with_vllm.py is the script that runs the pipeline; --config is the config with prompts;\n# --num-rows 100 limits the run to 100 rows (omit it to process the full dataset).\nhf jobs uv run --flavor l4x1 \\\n-s HF_TOKEN=$HF_TOKEN \\\nhttps://github.com/huggingface/aisheets/raw/refs/heads/main/scripts/extend_dataset/with_vllm.py \\\nnvidia/Nemotron-Personas dvilasuero/nemotron-kimi-qa-distilled \\\n--config https://huggingface.co/datasets/dvilasuero/nemotron-personas-kimi-questions/raw/main/config.yml \\\n--num-rows 100 \\\n--vllm-model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\n```\n\n## Running AI Sheets with custom (and local) LLMs\n\nBy default, AI Sheets is configured to use the Hugging Face Inference Providers API to run inference on the latest open-source models. However, you can also run AI Sheets with your own custom LLMs, such as those hosted on your own infrastructure or other cloud providers. The only requirement is that your LLMs must support the [OpenAI API specification](https://platform.openai.com/docs/api-reference/introduction).\n\n### Steps\n\nWhen running AI Sheets with custom LLMs, you need to set some environment variables to point the inference calls to your custom LLMs. Here are the steps:\n\n1. **Set the `MODEL_ENDPOINT_URL` environment variable**: This variable should point to the base URL of your custom LLM's API endpoint. 
For example, if you are using Ollama to run your LLM locally, you would set it like this:\n\n```sh\nexport MODEL_ENDPOINT_URL=http://localhost:11434\n```\n\nSince Ollama starts a local server on port `11434` by default, this URL will point to your local Ollama instance.\n\n2. **Set the `MODEL_ENDPOINT_NAME` environment variable**: This variable should specify the name of the model you want to use. For example, if you are using the `llama3` model, you would set it like this:\n\n```sh\nexport MODEL_ENDPOINT_NAME=llama3\n```\n\nThis is a crucial step to conform to the OpenAI API specification. The model name is a required parameter in the [OpenAI API](https://platform.openai.com/docs/api-reference/responses/create#responses-create-model), and it is used to identify which model to use for inference.\n\n3. **Run the AI Sheets app**: After setting the environment variables, you can run the AI Sheets app as usual. The app will now use your custom LLM for inference by default, instead of the Hugging Face Inference Providers API. All the models provided by the Hugging Face Inference Providers API will still be available when selecting a model in the column settings.\n\n* Note: The text-to-image generation feature cannot be customized yet. It will always use the Hugging Face Inference Providers API to generate images. Take this into account when running AI Sheets with custom LLMs.\n\n### Example of running AI Sheets with Ollama\n\nTo run AI Sheets with Ollama, you can follow these steps:\n\n1. Start the Ollama server, and run the model of your choice:\n\n```sh\nexport OLLAMA_NOHISTORY=1\nollama serve\n```\n\n```sh\nollama run llama3\n```\n\n(Visit the Ollama [FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size) page to learn more about Ollama server configuration.)\n\n2. 
Set the environment variables:\n\n```sh\nexport MODEL_ENDPOINT_URL=http://localhost:11434\nexport MODEL_ENDPOINT_NAME=llama3\n```\n\n3. Run the AI Sheets app:\n\n```sh\npnpm serve\n```\n\nThis will start the AI Sheets app and use the `llama3` model running on your local Ollama instance for inference.\n\n## Advanced configuration\n\nAI Sheets defines some environment variables that can be used to customize the behavior of the application. The following sections describe the available environment variables and their usage.\n\n### Authentication\n\n* `OAUTH_CLIENT_ID`: The Hugging Face OAuth client ID for the application. If this variable is defined, users are authenticated via Hugging Face OAuth. (See how to set up Hugging Face OAuth [here](https://huggingface.co/blog/frascuchon/running-sheets-locally#oauth-authentication).)\n\n* `HF_TOKEN`: A Hugging Face token to use for authentication. If this variable is defined, it will be used for authenticated inference calls, instead of the OAuth token.\n\n* `OAUTH_SCOPES`: The scopes to request during OAuth authentication. The default value is `openid profile inference-api manage-repos`. This variable is used to request the necessary permissions for the application to function correctly, and normally does not need to be changed.\n\n### Inference\n\n* `DEFAULT_MODEL`: The default model id to use when calling the inference API for text generation. The default value is `meta-llama/Llama-3.3-70B-Instruct`. This variable can be used to change the default model used for text generation and must be a valid model id from the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation\u0026inference_provider=all\u0026sort=trending).\n\n* `DEFAULT_MODEL_PROVIDER`: The default model provider to use when calling the inference API for text generation. The default value is `nebius`. 
This variable can be used to change the default model provider used for text generation and must be a valid provider from the [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers/en/index).\n\n* `ORG_BILLING`: The organization to bill for inference calls. If this variable is defined, inference calls will be billed to the specified organization. This is useful for organizations that want to manage their inference costs and usage. Note that users must be members of the organization to use this feature; otherwise, an `HF_TOKEN` belonging to a member of the organization must be defined.\n\n* `MODEL_ENDPOINT_URL`: The URL of a custom inference endpoint to use for text generation. If this variable is defined, it will be used instead of the default Hugging Face Inference API. This is useful for using custom inference endpoints that are not hosted on the Hugging Face Hub, such as Ollama or LM Studio. The URL must be a valid endpoint that supports the [OpenAI API format](https://platform.openai.com/docs/api-reference/chat/create).\n\n* `MODEL_ENDPOINT_NAME`: The model id to use when calling the custom inference endpoint defined by `MODEL_ENDPOINT_URL`. This variable is required when `MODEL_ENDPOINT_URL` points to a custom inference endpoint that requires a model id, such as Ollama or LM Studio. The model id must correspond to the model deployed on the custom inference endpoint.\n\n* `NUM_CONCURRENT_REQUESTS`: The number of concurrent requests to allow when calling the inference API while generating column cells. The default value is `5`, and the maximum value is `10`. This is useful to control the number of concurrent requests made to the inference API and avoid hitting rate limits defined by the provider.\n\n### Miscellaneous\n\n* `DATA_DIR`: The directory where the application will store all its data. The default value is `./data`. This variable can be used to change the data directory used by the application. 
The directory must be writable by the application.\n\n* `SERPER_API_KEY`: The API key to use for the Serper web search API. If this variable is defined, it will be used to authenticate web search requests. If this variable is not defined, web search will be disabled. The Serper API key can be obtained from the [Serper website](https://serper.dev/).\n\n* `TELEMETRY_ENABLED`: A boolean value that indicates whether telemetry is enabled or not. The default value is `1`. This variable can be used to disable telemetry if desired. Telemetry is used to collect anonymous usage data to help improve the application.\n\n* `EXAMPLES_PROMPT_MAX_CONTEXT_SIZE`: The maximum context size (in characters) for the examples section in the prompt for text generation. The default value is `8192`. If the examples section exceeds this size, it will be truncated. This variable can be used when the examples section is too large and needs to be reduced to fit within the context size limits of the model.\n\n* `SOURCES_PROMPT_MAX_CONTEXT_SIZE`: The maximum context size (in characters) for the sources section in the prompt for text generation. The default value is `61440`. If the sources section exceeds this size, it will be truncated. This variable can be used when the sources section is too large and needs to be reduced to fit within the context size limits of the model.\n\n## Developer docs\n\n### Dev dependencies on your vscode\n\n#### vitest runner\n\n\u003chttps://marketplace.visualstudio.com/items?itemName=rluvaton.vscode-vitest\u003e\n\n#### biome\n\n\u003chttps://marketplace.visualstudio.com/items?itemName=biomejs.biome\u003e\n\n### Project Structure\n\nThis project is using Qwik with [QwikCity](https://qwik.dev/qwikcity/overview/). 
QwikCity is just an extra set of tools on top of Qwik to make it easier to build a full site, including directory-based routing, layouts, and more.\n\nInside your project, you'll see the following directory structure:\n\n```\n├── public/\n│   └── ...\n└── src/\n    ├── components/ --\u003e Stateless components\n    │   └── ...\n    ├── features/ --\u003e Components with business logic\n    │   └── ...\n    └── routes/\n        └── ...\n```\n\n* `src/routes`: Provides the directory-based routing, which can include a hierarchy of `layout.tsx` layout files, and an `index.tsx` file as the page. Additionally, `index.ts` files are endpoints. Please see the [routing docs](https://qwik.dev/qwikcity/routing/overview/) for more info.\n\n* `src/components`: Recommended directory for components.\n\n* `public`: Any static assets, like images, can be placed in the public directory. Please see the [Vite public directory](https://vitejs.dev/guide/assets.html#the-public-directory) for more info.\n\n### Development\n\nRun this in the project root folder:\n\n```sh\ntouch .env\n```\n\nAdd the following variable to your `.env` file:\n\n```\nHF_TOKEN=your_hugging_face_token\n```\n\nDevelopment mode uses [Vite's development server](https://vitejs.dev/). The `dev` command will server-side render (SSR) the output during development.\n\n```shell\npnpm dev\n```\n\n\u003e Note: during dev mode, Vite may request a significant number of `.js` files. This does not represent a Qwik production build.\n\n### Preview\n\nThe preview command will create a production build of the client modules, a production build of `src/entry.preview.tsx`, and run a local server. The preview server is only for convenience to preview a production build locally and should not be used as a production server.\n\n```shell\npnpm preview\n```\n\n### Production\n\nThe production build will generate client and server modules by running both client and server build commands. 
The build command will use TypeScript to run a type check on the source code.\n\n```shell\npnpm build\n```\n\n### Express Server\n\nThis app has a minimal [Express server](https://expressjs.com/) implementation. After running a full build, you can preview the build using the command:\n\n```\npnpm serve\n```\n\nThen visit [http://localhost:3000/](http://localhost:3000/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Faisheets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Faisheets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Faisheets/lists"}