{"id":48355183,"url":"https://github.com/taqsblaze/hush","last_synced_at":"2026-04-05T11:01:53.552Z","repository":{"id":343116531,"uuid":"1176339043","full_name":"TaqsBlaze/Hush","owner":"TaqsBlaze","description":"Hush: A lightweight, context-aware text toxicity classifier. Leveraging NLP and Random Forest ensemble learning to detect and mitigate harmful language in real-time. Built for efficiency, safety, and cleaner digital communication.","archived":false,"fork":false,"pushed_at":"2026-03-08T23:48:06.000Z","size":52,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-09T04:21:00.773Z","etag":null,"topics":["content-moderation","machine-learning","nlp","random-forest","safety-tools","scikit-learn","text-classification","toxicity-detection"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TaqsBlaze.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-08T23:17:42.000Z","updated_at":"2026-03-08T23:48:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/TaqsBlaze/Hush","commit_stats":null,"previous_names":["taqsblaze/hush"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/TaqsBlaze/Hush","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TaqsBlaze%2FHush","tags_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/TaqsBlaze%2FHush/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TaqsBlaze%2FHush/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TaqsBlaze%2FHush/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TaqsBlaze","download_url":"https://codeload.github.com/TaqsBlaze/Hush/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TaqsBlaze%2FHush/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31433044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T08:13:15.228Z","status":"ssl_error","status_checked_at":"2026-04-05T08:13:11.839Z","response_time":75,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["content-moderation","machine-learning","nlp","random-forest","safety-tools","scikit-learn","text-classification","toxicity-detection"],"created_at":"2026-04-05T11:01:49.276Z","updated_at":"2026-04-05T11:01:53.545Z","avatar_url":"https://github.com/TaqsBlaze.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![icon](https://raw.githubusercontent.com/TaqsBlaze/Hush/refs/heads/main/image/image.png)\n\n# Hush\n\nHush is a research-grade text classifier that flags toxic language in long-form messages by combining character-level TF-IDF extraction with a 
robust linear classifier. The project is tuned for clarity of metrics, reproducible training, and simple deployment, making it easy for moderators, educators, or open-source contributors to iterate on custom rules or datasets.\n\n## Highlights\n- **Character-aware features**: `TfidfVectorizer` runs on `char_wb` n-grams of 3 to 5 characters, so the model catches insults written with creative spellings or leetspeak.\n- **Balanced linear model**: `SGDClassifier` with `modified_huber` loss and balanced class weights keeps training fast, stable, and sensitive to the minority toxic class.\n- **Versioned artifacts**: Each training run writes timestamped models, vectorizers, and metadata plus `latest` copies for quick inference.\n\n## Datasets \u0026 Generated Content\n- `classification_data.csv` is the primary labeled corpus (toxic=1, non-toxic=0) that `trainer.py` consumes. The dataset mixes real and synthetic sentences; `trainer.py` splits it 80/20 into train and test sets at training time.\n- `classification_data-shona.csv` mirrors that labeling format but covers Shona-language statements to help evaluate multilingual generalization.\n- `generated_5000_dataset.csv` is produced by `data_generator.py`, which stitches together templates for supportive and toxic phrasing. Re-run the generator to refresh the synthetic pool when you need more training examples.\n- `classification_data-old.csv` and the metadata JSON files (e.g., `metadata_v20260311_011709.json`) document prior runs or auxiliary exports.\n\n## Getting Started\n1. Install the required packages:\n   ```\n   pip install pandas scikit-learn joblib\n   ```\n2. Adjust `classification_data.csv` (or swap in `generated_5000_dataset.csv`) as needed.\n3. Run the training script to produce fresh artifacts.\n\n## Training (`trainer.py`)\n```bash\npython trainer.py\n```\n- Loads the chosen CSV, drops rows with missing values, and makes a stratified 80/20 train/test split using `train_test_split(random_state=42)`.\n- Configures `TfidfVectorizer(analyzer=\"char_wb\", ngram_range=(3,5), max_features=50000)` and fits/transforms the text.\n- Fits `SGDClassifier(loss=\"modified_huber\", penalty=\"l2\", alpha=0.0001, class_weight=\"balanced\", random_state=42)` on the vectorized training set.\n- Computes accuracy and a classification report on the test split, then saves:\n  - `toxic_model_v\u003ctimestamp\u003e.hush`\n  - `vectorizer_v\u003ctimestamp\u003e.hush`\n  - `metadata_v\u003ctimestamp\u003e.json` (contains accuracy, precision/recall for the toxic label, and training params)\n  - `toxic_model_latest.hush` / `vectorizer_latest.hush` (overwritten by the newest run)\n\n## Evaluation (`test_model.py`)\n```bash\npython test_model.py\n```\n- Loads the versioned artifacts referenced at the top of the script (update the filenames if you retrain with new timestamps).\n- Runs through curated test cases that cover non-toxic, obviously toxic, subtle toxicity, and edge cases.\n- Prints a simple table with pass/fail status for each case, plus the overall percentage correct.\n\n## Inference (`model.py`)\n```bash\npython model.py \"Your message here\"\n# or run without arguments to use the interactive prompt\n```\n- Loads the artifacts hard-coded near the top of the script; swap in the new filenames after retraining.\n- Vectorizes the input text and prints whether Hush considers it toxic.\n\n## Supporting Scripts\n- `data_generator.py` regenerates a balanced (2,500/2,500) dataset of synthetic sentences with both polite and aggressive language. Run it to refresh `generated_5000_dataset.csv` or to seed new labels.\n- Keep `README.md`, `FIX.md`, and `metadata_*.json` up to date whenever you change the training pipeline so contributors can track regressions.\n\n## Artifact Reference\n| File | Purpose |\n| --- | --- |\n| `toxic_model_v\u003cTIMESTAMP\u003e.hush` | Versioned classifier for reproducibility. |\n| `vectorizer_v\u003cTIMESTAMP\u003e.hush` | Matching vectorizer used during training. |\n| `metadata_v\u003cTIMESTAMP\u003e.json` | Stores metrics, parameters, and dataset provenance. |\n| `toxic_model_latest.hush`, `vectorizer_latest.hush` | Handy shortcuts for inference. |\n| `generated_5000_dataset.csv` | Output of `data_generator.py`, useful as supplemental training data. |\n\n## License\nHush is MIT-licensed. See [LICENSE](LICENSE) for the full text.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaqsblaze%2Fhush","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaqsblaze%2Fhush","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaqsblaze%2Fhush/lists"}