https://github.com/shrut2702/upasak

UI-based Fine-Tuning for Large Language Models (LLMs)

Host: GitHub
URL: https://github.com/shrut2702/upasak
Owner: shrut2702
License: mit
Created: 2025-12-03T19:23:06.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-12-04T17:51:51.000Z (6 months ago)
Last Synced: 2026-05-03T06:12:31.247Z (about 1 month ago)
Topics: gemma, gemma3, largelanguagemodels, llm, llm-training, nlp, no-code-framework, open-source, pii-detection, transformers
Language: Python
Homepage:
Size: 483 KB
Stars: 20
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Roadmap: ROADMAP.md

# Upasak - UI-based Fine-Tuning for Large Language Models (LLMs)

**Upasak** is a flexible, privacy-conscious, no-code/low-code framework for fine-tuning large language models, built around [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/index).
It features an easy-to-use Streamlit-based interface, multi-format dataset support, built-in PII and sensitive information sanitization, and a customizable training process.
Whether you're experimenting, researching, or fine-tuning models internally, Upasak makes the process accessible while helping keep data handling compliant.

## **Key Features**

### **LLM Fine-Tuning**
* Developed on top of Hugging Face's Transformers library.
* Supports the text-only models of the Gemma-3 family for instruction tuning or domain adaptation.
* Full-parameter fine-tuning or LoRA (Parameter-Efficient Fine-Tuning).
* Future support planned for image-text-to-text Gemma-3 models, LLaMA, Qwen, Phi, Mixtral.

### **Flexible Dataset Handling**
Upload or import datasets in multiple file formats:

* `.json`
* `.jsonl`
* `.csv`
* `.zip` (containing `.txt`)

Or select datasets directly from the **Hugging Face Hub**.

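As a rough illustration of how the file formats above could be read into one common list-of-records shape, the sketch below uses only the Python standard library. The `load_records` helper is hypothetical and is not part of Upasak's API; it also assumes that each `.txt` file inside a `.zip` becomes one `{"text": ...}` record.

```python
import csv
import json
import zipfile
from pathlib import Path


def load_records(path):
    """Read a dataset file into a list of dicts (hypothetical helper, not Upasak's API)."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".json":
        # A JSON file is expected to hold a list of objects.
        with path.open(encoding="utf-8") as f:
            return json.load(f)

    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    if suffix == ".csv":
        # The header row supplies the column names.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))

    if suffix == ".zip":
        # Assumption: every .txt member is one plain-text document.
        records = []
        with zipfile.ZipFile(path) as archive:
            for name in sorted(archive.namelist()):
                if name.lower().endswith(".txt"):
                    records.append({"text": archive.read(name).decode("utf-8")})
        return records

    raise ValueError(f"Unsupported dataset format: {suffix}")
```

Whatever the input format, the result is a list of dictionaries, which is the shape the schema table in the next section describes.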
### **Auto-Detection of Dataset Schema**

Upasak identifies the schema of your dataset from its keys or columns and restructures it into a training-ready format.
Supported schemas:

| Schema | Format | Notes |
| ------------------- | -------------------------------------------------- | ------------------------------------------- |
| **DAPT** | `[{"text":"..."}]` or `text` column | Domain-adaptive pretraining (continued pretraining) |
| **ALPACA** | `[{"instruction":"...", "output":"..."}]` (+ optional `"input"`) or `instruction`, `output`, `input` (optional) columns | Converted to user → assistant turns |
| **CHATML** | `[{"messages":[{"role":"...", "content":"..."}]}]` or `messages` column | Supports role/content pairs |
| **SHARE_GPT** | `[{"conversations":[{"from":"...", "value":"..."}]}]` or `conversations` column | Converts human ↔ model to user ↔ assistant |
| **PROMPT_RESPONSE** | `[{"prompt":"...", "response":"..."}]` or `prompt`, `response` columns | Simple instruction → answer |
| **QA** | `[{"question":"...", "answer":"..."}]` or `question`, `answer` columns | Q&A format |
| **QLA** | `[{"question":"...", "long_answer":"..."}]` or `question`, `long_answer` columns | Long-form generation |

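One way to picture the detection step is a key-based lookup followed by normalization into user/assistant turns. The following is a minimal sketch under that assumption, not Upasak's actual implementation: `detect_schema`, `to_messages`, the check order, and the ShareGPT role mapping are all hypothetical.

```python
# Required keys per schema, checked in order; names mirror the table above.
SCHEMA_KEYS = [
    ("CHATML", {"messages"}),
    ("SHARE_GPT", {"conversations"}),
    ("ALPACA", {"instruction", "output"}),
    ("PROMPT_RESPONSE", {"prompt", "response"}),
    ("QLA", {"question", "long_answer"}),
    ("QA", {"question", "answer"}),
    ("DAPT", {"text"}),
]

# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"human": "user", "model": "assistant", "gpt": "assistant"}

# Schemas that are a single user turn followed by a single assistant turn.
PAIR_KEYS = {
    "PROMPT_RESPONSE": ("prompt", "response"),
    "QA": ("question", "answer"),
    "QLA": ("question", "long_answer"),
}


def detect_schema(record):
    """Return the first schema whose required keys all appear in the record."""
    keys = set(record)
    for name, required in SCHEMA_KEYS:
        if required <= keys:
            return name
    return None


def to_messages(record):
    """Normalize a conversational record into a list of role/content messages."""
    schema = detect_schema(record)
    if schema == "CHATML":
        return record["messages"]
    if schema == "SHARE_GPT":
        return [
            {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
            for turn in record["conversations"]
        ]
    if schema == "ALPACA":
        # The optional "input" field is appended to the instruction.
        prompt = record["instruction"]
        if record.get("input"):
            prompt = f"{prompt}\n\n{record['input']}"
        return [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]},
        ]
    if schema in PAIR_KEYS:
        user_key, assistant_key = PAIR_KEYS[schema]
        return [
            {"role": "user", "content": record[user_key]},
            {"role": "assistant", "content": record[assistant_key]},
        ]
    raise ValueError("Record has no conversational schema (DAPT or unknown)")
```

In this sketch, DAPT records are deliberately left out of `to_messages`, since plain text for continued pretraining has no turns to convert.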
### **Built-In PII & Sensitive Information Sanitization**

Upasak ensures privacy compliance by:

* Automatically detecting and redacting/masking PII
* Using placeholder tokens to preserve dataset utility
* Offering AI-assisted detection with manual review loops, using the [GLiNER](https://huggingface.co/urchade/gliner_multi_pii-v1) named entity recognition (NER) model
* Logging sanitization results for auditability

Upasak automatically detects and redacts:
* Personal names
* Emails / phone numbers
* IP addresses, IMEI
* Credit card / bank details
* National IDs (Aadhaar, PAN, Voter ID)
* API keys
* GitHub/GitLab tokens
* Database credentials
* Residential & workplace addresses

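To make the idea of rule-based redaction with placeholder tokens concrete, here is a small sketch using regular expressions. The patterns, the `[LABEL]` placeholder format, and the `redact` function are illustrative assumptions and do not reflect Upasak's actual rules, which cover many more entity types.

```python
import re

# Illustrative patterns only; production PII rules need to be far more thorough.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "AADHAAR": re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
}


def redact(text):
    """Replace each match with a placeholder token and record what was found."""
    findings = []
    for label, pattern in PII_PATTERNS.items():

        def _mask(match, label=label):
            # Keep an audit trail of (entity type, original value).
            findings.append((label, match.group(0)))
            return f"[{label}]"

        text = pattern.sub(_mask, text)
    return text, findings


clean, found = redact("Mail alice@example.com from 10.0.0.1")
print(clean)  # Mail [EMAIL] from [IP_ADDRESS]
```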
Two sanitization modes:

1. **Rule-Based** (default)
2. **Hybrid (Rule-Based + NER-based)**

* Optional human review
* Configure the human-in-the-loop (HITL) ratio and the maximum number of samples for human review
* Accept/reject uncertain detections directly in the UI
* Preview sanitized sample before training

---

## **Streamlit UI – No-Code Training Workflow**

The visual interface provides fully interactive control:

### **1. Model Selection**

Choose supported base models (currently Gemma-3 text-only).
Future updates will include LLaMA, Mixtral, Phi, Qwen and multimodal variants.

### **2. HF Token Handling**

* Read token for pulling models
* Write token for pushing fine-tuned models back to HF Hub

### **3. Dataset Input**

* Upload dataset files
* Or load from Hugging Face dataset list

### **4. PII Sanitization Panel**

* Enable/disable sanitization
* Select detection method (rule-based / hybrid)
* Enable Human Review & configure ratios
* View uncertain detections and choose actions
* Preview sanitized sample before training

### **5. Hyperparameter Controls**

#### **Basic Hyperparameters**

* Learning rate
* Batch size
* Epochs
* Max sequence length
* Logging steps
* LR scheduler

#### **Advanced Hyperparameters**

* Gradient accumulation
* Gradient clipping
* LR warmup ratio
* Weight decay
* Checkpoint save strategy
* Evaluation strategy + steps
* Validation split
* Model tracker platform (Comet / WandB / none)
* Tracker API keys

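For readers who want to relate these controls to the underlying library, the sketch below lists example values under the corresponding Hugging Face `TrainingArguments` field names and derives a few quantities from them. The values are placeholders, `schedule_summary` is a hypothetical helper, and the step counts are approximations; how Upasak maps its UI fields internally is not shown here.

```python
import math

# Example values only; keys follow Hugging Face `TrainingArguments` field names.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "num_train_epochs": 3,
    "logging_steps": 10,
    "lr_scheduler_type": "cosine",
    "gradient_accumulation_steps": 8,
    "max_grad_norm": 1.0,       # gradient clipping threshold
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "save_strategy": "epoch",
    "eval_strategy": "steps",   # named `evaluation_strategy` in older releases
    "eval_steps": 100,
    "report_to": "none",        # or "wandb" / "comet_ml"
}


def schedule_summary(num_samples, validation_split, kwargs, num_devices=1):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    val_samples = round(num_samples * validation_split)
    train_samples = num_samples - val_samples
    # Samples consumed per optimizer update across all devices.
    effective_batch = (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    total_steps = steps_per_epoch * kwargs["num_train_epochs"]
    warmup_steps = math.ceil(total_steps * kwargs["warmup_ratio"])
    return {
        "train_samples": train_samples,
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


print(schedule_summary(10_000, 0.1, training_kwargs))
```

With 10,000 samples and a 10% validation split, the example settings give an effective batch size of 32, so gradient accumulation rather than the per-device batch size determines how often the weights are updated.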
### **6. LoRA Configuration**

* LoRA rank
* LoRA alpha
* LoRA dropout
* Target modules
* Optional merging of LoRA adapters

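The effect of rank and alpha can be worked out with simple arithmetic. In standard LoRA, a frozen weight matrix `W` of shape `d_out x d_in` is augmented with two small trainable matrices, `A` (`rank x d_in`) and `B` (`d_out x rank`), and the update is scaled by `alpha / rank`. The sketch below computes the resulting parameter counts; the layer size of 2048 is an arbitrary example, not a Gemma-3 dimension.

```python
def lora_summary(d_in, d_out, rank, alpha):
    """Parameter count and scaling factor for one LoRA-adapted linear layer."""
    full_params = d_in * d_out            # frozen weight W (d_out x d_in)
    lora_params = rank * (d_in + d_out)   # A (rank x d_in) plus B (d_out x rank)
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "trainable_fraction": lora_params / full_params,
        "scaling": alpha / rank,          # effective weight: W + (alpha / rank) * B @ A
    }


summary = lora_summary(d_in=2048, d_out=2048, rank=8, alpha=16)
print(summary["lora_params"])         # 32768
print(summary["trainable_fraction"])  # 0.0078125
print(summary["scaling"])             # 2.0
```

Merging the adapters amounts to adding the scaled product `B @ A` into `W` once, so the merged model needs no extra computation at inference time.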
### **7. Training Control**

* Start / Stop training
* Live training metrics inside the app:
  * Training loss
  * Validation loss
  * Token-level curves
* Optional external tracking (Comet / WandB)

### **8. Inference Script Generation**
After training completes, Upasak automatically generates a customized `inference.py` script tailored to your training configuration.

* **LoRA support** – Handles both scenarios:
  * **LoRA + merged adapters** – Loads the fully merged model.
  * **LoRA + unmerged adapters** – Loads base model + applies LoRA adapters at runtime.
* **Full fine-tune** – Standard model loading
* **Ready to use** – Access it in your output directory

**Usage**

```bash
cd path_to_output_dir
python inference.py
```

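The generated script itself is not reproduced here. As a rough idea of what the merged and unmerged loading paths involve, the following sketch uses the public Transformers and PEFT APIs. The function names, arguments, and directory layout are assumptions for illustration and will differ from the actual generated `inference.py`.

```python
def load_model(model_dir, base_model_id=None):
    """Load a fine-tuned model; pass base_model_id when model_dir holds unmerged LoRA adapters."""
    # Imports are local so this module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    if base_model_id is None:
        # Full fine-tune or merged LoRA: the directory holds complete weights.
        model = AutoModelForCausalLM.from_pretrained(model_dir)
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
    else:
        # Unmerged LoRA: load the base model, then attach the adapters at runtime.
        from peft import PeftModel

        base = AutoModelForCausalLM.from_pretrained(base_model_id)
        model = PeftModel.from_pretrained(base, model_dir)
        tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    return model, tokenizer


def generate(model, tokenizer, prompt, max_new_tokens=128):
    """Run one chat-style generation and return only the newly generated text."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the model's reply is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Calling `load_model("path_to_output_dir")` would cover the merged and full fine-tune cases, while passing a `base_model_id` as well would cover unmerged adapters; both paths are placeholders.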
### **9. Export & Push**

* Output directory for checkpoints, final model, and merged model
* Push to HF Hub (when write-enabled token is provided)

---

# **Installation**

### **Install from PyPI (recommended)**

```bash
pip install upasak
```

### **Or install from source**

```bash
# Clone this repo
git clone https://github.com/shrut2702/upasak
cd upasak
```
```bash
# Optional: create and activate a virtual environment

## For Windows
python -m venv vir_env
vir_env\Scripts\activate

## For macOS/Linux
python -m venv vir_env
source vir_env/bin/activate
```

```bash
# Install required dependencies
pip install -r requirements.txt
```

---

## **Usage**

Upasak runs as a Streamlit app that is launched from a Python entry point.

### **After installing the package:**

#### **1. Create a Python launcher file**

For example: `run_upasak.py`

```python
from upasak import main

if __name__ == "__main__":
    main()
```

#### **2. Launch the Streamlit application**

```bash
streamlit run run_upasak.py
```

```bash
streamlit run run_upasak.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB
```

This opens the Upasak UI in your browser.

### **After installing from source**

#### **1. Launch `app.py`**
```bash
streamlit run app.py
```

```bash
streamlit run app.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB
```

### **Reusability of Upasak Modules**

Although Upasak provides a full end-to-end UI, **every internal component is designed to be reusable in isolation**.
You can import and use modules such as:

* `TokenizerWrapper` → standalone tokenization
* `TrainingEngine` + `TrainerConfig` → run full or LoRA fine-tuning programmatically
* `PIISanitizer` → rule-based or hybrid PII detection/sanitization

Refer to the [examples](https://github.com/shrut2702/upasak/tree/f4252b2e2072aad9e878005108abc564d8b670a0/examples) for more details.

This allows you to integrate Upasak **directly into custom pipelines**, backend services, notebooks, or data-processing workflows — **without launching the Streamlit UI**.

---

# **Use Cases**

* Educational fine-tuning demonstrations
* Rapid prototyping in quick-shipping environments
* Dataset preparation and anonymization workflows
* Internal LLM fine-tuning on sensitive or regulated data
* Developers with no domain expertise who want an LLM in their application

---

# **Contributing**

Contributions are welcome!
Please open an issue or submit a pull request for bug fixes, features, documentation, or dataset schema support.

---

# **Support**

For issues, questions, or feature requests, create a GitHub issue in this repository.

---
