{"id":50568739,"url":"https://github.com/shrut2702/upasak","last_synced_at":"2026-06-04T17:00:27.771Z","repository":{"id":327474942,"uuid":"1109418379","full_name":"shrut2702/upasak","owner":"shrut2702","description":"UI-based Fine-Tuning for Large Language Models (LLMs)","archived":false,"fork":false,"pushed_at":"2025-12-04T17:51:51.000Z","size":495,"stargazers_count":20,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-03T06:12:31.247Z","etag":null,"topics":["gemma","gemma3","largelanguagemodels","llm","llm-training","nlp","no-code-framework","open-source","pii-detection","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shrut2702.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-03T19:23:06.000Z","updated_at":"2026-01-27T22:24:15.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/shrut2702/upasak","commit_stats":null,"previous_names":["shrut2702/upasak"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/shrut2702/upasak","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shrut2702%2Fupasak","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shrut2702%2Fupasak/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shrut2702%2Fupasak/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shrut2702%2Fupasak/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shrut2702","download_url":"https://codeload.github.com/shrut2702/upasak/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shrut2702%2Fupasak/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33914548,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gemma","gemma3","largelanguagemodels","llm","llm-training","nlp","no-code-framework","open-source","pii-detection","transformers"],"created_at":"2026-06-04T17:00:27.087Z","updated_at":"2026-06-04T17:00:27.756Z","avatar_url":"https://github.com/shrut2702.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n# Upasak - UI-based Fine-Tuning for Large Language Models (LLMs)\n\n**Upasak** is a flexible, mindful to privacy, no-code/low-code framework for fine-tuning large language models, built around [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/index).\nIt features an easy-to-use Streamlit-based interface, multi-format dataset support, built-in PII and sensitive information sanitization, and a customizable training 
process.\n Whether you're experimenting, researching, or performing internal fine-tuning tasks, Upasak makes it easily accessible and compliant.\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/upasak/\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/v/upasak\" alt=\"PyPI Version\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/shrut2702/upasak/blob/main/LICENSE\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/shrut2702/upasak\" alt=\"License\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/shrut2702/upasak/refs/heads/main/assets/upasak_logo.png\" width=\"400\" /\u003e\n\u003c/p\u003e\n\n## **Key Features**\n\n### **LLM Fine-Tuning**\n* Developed on top of Hugging Face's Transformers library.\n* Supports Text-only models of Gemma-3 LLM family for instruction-tuning or domain adaptation.\n* Full-parameter fine-tuning or LoRA (Parameter-Efficient Fine-Tuning).\n* Future support planned for image-text-to-text Gemma-3 models, LLaMA, Qwen, Phi, Mixtral.\n\n### **Flexible Dataset Handling**\nUpload or import datasets in multiple file formats:\n\n* `.json`\n* `.jsonl`\n* `.csv`\n* `.zip` (containing `.txt`)\n\nOr select datasets directly from the **Hugging Face Hub**.\n\n\n### **Auto-Detection of Dataset Schema**\n\nUpasak intelligently identifies and structures your dataset into training-ready format.\nSupported schemas:\n\n| Schema              | Format                                            | Notes                                       |\n| ------------------- | -------------------------------------------------- | ------------------------------------------- |\n| **DAPT**            | `[{\"text\":\"...\"}]` or `text` column                                         | Document Adaptation / continued pretraining |\n| **ALPACA**          | `[{\"instruction\":\"...\", \"output\":\"...\"}]` (+ optional `\"input\"`) or 
`instruction`, `output`, `input` (optional) columns | Converted to user → assistant turns         |\n| **CHATML**          | `[{\"messages\":[{\"role\":\"...\", \"content\":\"...\"}]}]` or `messages` column                                     | Supports role/content pairs                 |\n| **SHARE_GPT**       | `[{\"conversations\":[{\"from\":\"...\", \"value\":\"...\"}]}]` or `conversations` column                                | Converts human ↔ model to user ↔ assistant  |\n| **PROMPT_RESPONSE** | `[{\"prompt\":\"...\", \"response\":\"...\"}]` or `prompt`, `response` columns                           | Simple instruction → answer                 |\n| **QA**              | `[{\"question\":\"\", \"answer\":\"\"}]` or `question`, `answer` columns                           | Q\u0026A format                          |\n| **QLA**             | `[{\"question\":\"...\", \"long_answer\":\"...\"}]` or `question`, `long_answer` columns                      | Long-form generation                        |\n\n### **Built-In PII \u0026 Sensitive Information Sanitization**\n\nUpasak ensures privacy compliance by:\n\n* Automatically detecting and redacting/masking PII\n* Using placeholder tokens to preserve dataset utility\n* Offering AI-assisted detection with manual review loops, which uses [GLiNER](https://huggingface.co/urchade/gliner_multi_pii-v1) (Named Entity Recognition) model.\n* Logging sanitization results for auditability\n\nUpasak automatically detects and redacts:\n* Personal names\n* Emails / phone numbers\n* IP addresses, IMEI\n* Credit card / bank details\n* National IDs (Aadhaar, PAN, Voter ID)\n* API keys\n* GitHub/GitLab tokens\n* Database credentials\n* Residential \u0026 workplace addresses\n\nTwo sanitization modes:\n\n1. **Rule-Based** (default)\n2. 
**Hybrid (Rule-Based + NER-based)**\n\n   * Optional human review\n   * Configure HITL ratio \u0026 max samples for human review\n   * Accept/reject uncertain detections directly in the UI\n   * Preview sanitized sample before training\n---\n\n## **Streamlit UI – No-Code Training Workflow**\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/shrut2702/upasak/refs/heads/main/assets/Screenshot-UI.png\" width=\"900\" /\u003e\n\u003c/p\u003e\n\nThe visual interface provides fully interactive control:\n\n### **1. Model Selection**\n\nChoose supported base models (currently Gemma-3 text-only).\nFuture updates will include LLaMA, Mixtral, Phi, Qwen and multimodal variants.\n\n### **2. HF Token Handling**\n\n* Read token for pulling models\n* Write token for pushing fine-tuned models back to HF Hub\n\n### **3. Dataset Input**\n\n* Upload dataset files\n* Or load from Hugging Face dataset list\n\n### **4. PII Sanitization Panel**\n\n* Enable/disable sanitization\n* Select detection method (rule-based / hybrid)\n* Enable Human Review \u0026 configure ratios\n* View uncertain detections and choose actions\n* Preview sanitized sample before training\n\n### **5. Hyperparameter Controls**\n\n#### **Basic Hyperparameters**\n\n* Learning rate\n* Batch size\n* Epochs\n* Max sequence length\n* Logging steps\n* LR scheduler\n\n#### **Advanced Hyperparameters**\n\n* Gradient accumulation\n* Gradient clipping\n* LR warmup ratio\n* Weight decay\n* Checkpoint save strategy\n* Evaluation strategy + steps\n* Validation split\n* Model tracker platform (Comet / WandB / none)\n* Tracker API keys\n\n### **6. LoRA Configuration**\n\n* LoRA rank\n* LoRA alpha\n* LoRA dropout\n* Target modules\n* Optional merging of LoRA adapters\n\n### **7. Training Control**\n\n* Start / Stop training\n* Live training metrics inside the app:\n\n  * Training loss\n  * Validation loss\n  * Token-level curves\n* Optional external tracking (Comet / WandB)\n\n### **8. 
Inference Script Generation**\nAfter training completes, Upasak automatically generates a customized inference.py script tailored to your training configuration.\n\n* **LoRA support** – Handles both scenarios:\n    * **LoRA + merged adapters** – Loads the fully merged model.\n    * **LoRA + unmerged adapters** – Loads base model + applies LoRA adapters at runtime.\n    * **Full fine-tune** – Standard model loading\n* **Ready to use** - Access it in your output directory\n\n**Usage**\n\n```bash\ncd path_to_output_dir\npython inference.py\n```\n\n\n### **9. Export \u0026 Push**\n\n* Output directory for checkpoints, final model, and merged model\n* Push to HF Hub (when write-enabled token is provided)\n\n---\n\n\n# **Installation**\n\n### **Install from PyPI (recommended)**\n\n```bash\npip install upasak\n```\n\n### **Or install from source**\n\n```bash\n# Clone this repo\ngit clone https://github.com/shrut2702/upasak\ncd upasak\n```\n```bash\n# optional\n\n## For Windows\npython -m venv vir_env\n./vir_env/scripts/activate\n\n## For macOS\npython -m venv vir_env\nsource vir_env/bin/activate\n```\n\n```bash\n# Install required dependencies\npip install -r requirements.txt\n```\n\n---\n\n\n## **Usage**\n\nUpasak is used as a Python-triggered Streamlit app.\n\n### **After installing the package:**\n\n#### **1. Create a Python launcher file**\n\nFor example: `run_upasak.py`\n\n```python\nfrom upasak import main\n\nif __name__ == \"__main__\":\n    main()\n```\n\n#### **2. Launch the Streamlit application**\n\n```bash\nstreamlit run run_upasak.py\n```\n\nor \n\n```bash\nstreamlit run run_upasak.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB\n```\n\nThis opens the Upasak UI in your browser.\n\n### **After installing from source**\n\n#### **1. 
Launch `app.py`**\n```bash\nstreamlit run app.py\n```\n\nor \n\n```bash\nstreamlit run app.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB\n```\n\n\n### **Reusability of Upasak Modules**\n\n\nAlthough Upasak provides a full end-to-end UI, **every internal component is designed to be reusable in isolation**.\nYou can import and use modules such as:\n\n* `TokenizerWrapper` → standalone tokenization \n* `TrainingEngine` + `TrainerConfig` → run full or LoRA fine-tuning programmatically\n* `PIISanitizer` → rule-based or hybrid PII detection/sanitization\n\nSee the [examples](https://github.com/shrut2702/upasak/tree/f4252b2e2072aad9e878005108abc564d8b670a0/examples) for more details.\n\nThis allows you to integrate Upasak **directly into custom pipelines**, backend services, notebooks, or data-processing workflows — **without launching the Streamlit UI**.\n\n---\n\n# **Use Cases**\n\n* Educational fine-tuning demonstrations \n* Rapid prototyping in quick-shipping environments\n* Dataset preparation and anonymization workflows\n* Internal LLM fine-tuning on sensitive or regulated data\n* Developers without domain expertise who want an LLM in their application\n\n\n---\n\n# **Contributing**\n\nContributions are welcome!\nPlease open an issue or submit a pull request for bug fixes, features, documentation, or dataset schema support.\n\n---\n\n# **Support**\n\nFor issues, questions, or feature requests, create a GitHub issue in this repository.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshrut2702%2Fupasak","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshrut2702%2Fupasak","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshrut2702%2Fupasak/lists"}