{"id":26536962,"url":"https://github.com/prithivsakthiur/auto-abliteration","last_synced_at":"2026-04-11T20:02:03.029Z","repository":{"id":283355916,"uuid":"951471557","full_name":"PRITHIVSAKTHIUR/Auto-Abliteration","owner":"PRITHIVSAKTHIUR","description":"modify a language model's behavior by abliterating its weights.","archived":false,"fork":false,"pushed_at":"2025-03-19T19:44:11.000Z","size":0,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-19T20:33:54.810Z","etag":null,"topics":["abliteration","gemma3","huggingface-transformers","llm","llms","ollama","streamlit","uncensored"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/prithivMLmods/Auto-Abliteration","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PRITHIVSAKTHIUR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-19T18:22:06.000Z","updated_at":"2025-03-19T19:44:15.000Z","dependencies_parsed_at":"2025-03-19T20:43:58.630Z","dependency_job_id":null,"html_url":"https://github.com/PRITHIVSAKTHIUR/Auto-Abliteration","commit_stats":null,"previous_names":["prithivsakthiur/auto-abliteration"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PRITHIVSAKTHIUR/Auto-Abliteration","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FAuto-Abliteration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FAuto-Abliteration/tags","releases_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FAuto-Abliteration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FAuto-Abliteration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PRITHIVSAKTHIUR","download_url":"https://codeload.github.com/PRITHIVSAKTHIUR/Auto-Abliteration/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FAuto-Abliteration/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31693275,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-11T13:07:20.380Z","status":"ssl_error","status_checked_at":"2026-04-11T13:06:47.903Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["abliteration","gemma3","huggingface-transformers","llm","llms","ollama","streamlit","uncensored"],"created_at":"2025-03-21T22:17:45.814Z","updated_at":"2026-04-11T20:02:02.995Z","avatar_url":"https://github.com/PRITHIVSAKTHIUR.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Auto Abliteration**\n\nhttps://github.com/user-attachments/assets/cd182313-dcb4-4d72-aee7-c324aca10091\n\nAuto Abliteration is a Streamlit-based application that enables you to modify a language model’s behavior by 
\"abliterating\" its weights. This tool is especially recommended for edge-device LLMs (e.g., 0.5B, 1B, 1.5B models). By orthogonalizing key model weights along a computed refusal direction, Auto Abliteration can subtly alter the model’s responses.\n\n## Features\n\n- **Customizable Abliteration:** Adjust key parameters including the target layer (by relative ratio) and refusal weight.\n- **Dataset Driven:** Uses a target dataset (for harmful behaviors) and a baseline dataset (for harmless behaviors) to compute a “refusal direction.”\n- **Dynamic Response Comparison:** Compare model responses before and after abliterating its weights.\n- **Hugging Face Integration:** Automatically push the modified model to the Hugging Face Hub (with the option for private upload).\n- **Edge-device Support:** Optimized for smaller models suitable for edge devices.\n\n## Architecture Overview\n\nThe application uses several helper functions:\n\n- **`load_instructions`:** Loads a specified number of instructions from a Hugging Face dataset.\n- **`generate_response`:** Generates a response from the model for a given prompt.\n- **`generate_outputs`:** Obtains hidden states for a series of instructions, which are later used to compute the refusal direction.\n- **`orthogonalize_matrix`:** Adjusts model weights by subtracting the projection of a given vector (refusal direction).\n\nBy processing instructions from both target and baseline datasets, the script calculates a normalized refusal direction. Then, it orthogonalizes the weights (e.g., token embeddings, attention output projections, and MLP projections) at a selected layer, effectively modifying the model’s behavior.\n\n## Installation\n\n1. **Clone the repository:**\n\n   ```bash\n   git clone https://github.com/PRITHIVSAKTHIUR/Auto-Abliteration.git\n   cd Auto-Abliteration\n   ```\n\n2. 
**Set up a virtual environment (optional but recommended):**\n\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. **Install the dependencies:**\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n   Make sure you have the following packages installed:\n   - `torch`\n   - `transformers`\n   - `datasets`\n   - `streamlit`\n   - `tqdm`\n\n## Usage\n\n1. **Configure Abliteration Parameters:**\n\n   When you run the app, use the sidebar to:\n   - Enter the model ID (e.g., `prithivMLmods/FastThink-0.5B-Tiny`).\n   - Specify the number of instructions to use.\n   - Adjust the target layer (as a relative ratio) and refusal weight.\n   - Input your Hugging Face token if accessing private or restricted models.\n   - Set the target and baseline prompts, along with their corresponding dataset IDs and column names.\n\n2. **Run the Streamlit App:**\n\n   Launch the app using:\n\n   ```bash\n   streamlit run \u003cyour_script_name\u003e.py\n   ```\n\n   Replace `\u003cyour_script_name\u003e.py` with the filename containing your code.\n\n3. 
**Workflow:**\n\n   - The app first loads the model and tokenizer.\n   - It generates an initial response for a sample prompt (e.g., \"How to write a computer virus?\").\n   - It then loads the target and baseline instructions and generates hidden states for both.\n   - The mean hidden states from each set are used to compute and normalize a refusal direction.\n   - Selected model weights (token embeddings, attention output projections, and MLP projections) are orthogonalized using this direction.\n   - A new response is generated to showcase the change.\n   - Finally, the modified model is optionally pushed to the Hugging Face Hub.\n\n## Example\n\nWhen you run the application, you might see two sections:\n\n- **Before Abliteration Response:** Shows the model’s original response.\n- **After Abliteration Response:** Shows the model’s response after its weights have been abliterated.\n\nAdditionally, debugging logs in the app provide details on each processing step.\n\n## Credits\n\n- Thanks to [Maxime Labonne](https://huggingface.co/mlabonne)\n- **Hugging Face:** Uses the `transformers` and `datasets` libraries from Hugging Face.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprithivsakthiur%2Fauto-abliteration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprithivsakthiur%2Fauto-abliteration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprithivsakthiur%2Fauto-abliteration/lists"}