https://github.com/c2ramel/autonomous-semantic-discovery
An unsupervised machine learning engine that utilizes Non-negative Matrix Factorization (NMF) to autonomously extract and visualize latent semantic topics from the 20 Newsgroups dataset.
data-visualization machine-learning nlp nmf python scikit-learn unsupervised-learning
- Host: GitHub
- URL: https://github.com/c2ramel/autonomous-semantic-discovery
- Owner: c2ramel
- Created: 2025-12-21T06:54:19.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-22T02:27:58.000Z (6 months ago)
- Last Synced: 2025-12-23T01:26:23.199Z (6 months ago)
- Topics: data-visualization, machine-learning, nlp, nmf, python, scikit-learn, unsupervised-learning
- Language: Python
- Homepage:
- Size: 1.36 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# The Autonomous Semantic Discovery Engine
### Unsupervised Machine Learning on the "20 Newsgroups" Dataset
**Author:** Jasper Kuo,
**Course:** Unsupervised Machine Learning,
**Status:** Complete (and surprisingly functional)
---
## 🍰 The Mission: "The Cake"
As Yann LeCun famously posited, if intelligence is a cake, unsupervised learning is the cake itself, while supervised learning is merely the icing. This project aims to eat the cake.
The objective was to ingest roughly **18,000 unstructured documents** (Usenet newsgroup posts from around 1993), treat them as unlabeled, and autonomously discover the latent thematic structures hidden within them using **Non-negative Matrix Factorization (NMF)**.
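For reference, NMF in its standard form approximates a non-negative document-term matrix by the product of two smaller non-negative factors. The symbols below ($V$, $W$, $H$, $n$, $m$, $k$) are notation chosen here for illustration, not taken from the repository, and the Frobenius loss is an assumption based on it being scikit-learn's default:

$$
V \approx WH, \qquad \min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert V - WH \rVert_F^2
$$

Here $V \in \mathbb{R}_{\ge 0}^{n \times m}$ holds one TF-IDF row per document, $W \in \mathbb{R}_{\ge 0}^{n \times k}$ gives each document's weight on the $k$ topics, and $H \in \mathbb{R}_{\ge 0}^{k \times m}$ gives each topic's weight on the $m$ vocabulary terms. Because every entry is non-negative, each topic can be read as an additive bundle of words.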
## 🛠 Tech Stack
* **Language:** Python 3.8+
* **Vectorization:** TF-IDF (Term Frequency-Inverse Document Frequency)
* **Dimensionality Reduction:** NMF & PCA
* **Visualization:** Matplotlib
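The stack above can be sketched end to end as follows. This is a minimal illustration, not the code in `src/engine.py`: the function names `fit_topics` and `project_2d`, the hyperparameters, and the choice to apply PCA to the document-topic matrix are all assumptions made for the example.

```python
from sklearn.decomposition import NMF, PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_topics(documents, n_topics=10, max_features=2000, seed=0):
    """Vectorize raw documents with TF-IDF, then factorize them with NMF.

    Returns (vectorizer, model, doc_topic), where doc_topic has shape
    (n_documents, n_topics) and contains non-negative topic weights.
    """
    # Hypothetical preprocessing choices; the repository may use others.
    vectorizer = TfidfVectorizer(stop_words="english", max_features=max_features)
    tfidf = vectorizer.fit_transform(documents)  # sparse, (n_docs, n_terms)

    # NNDSVD-based initialization gives deterministic, sparse starting factors.
    model = NMF(n_components=n_topics, init="nndsvda",
                random_state=seed, max_iter=400)
    doc_topic = model.fit_transform(tfidf)  # W; model.components_ is H
    return vectorizer, model, doc_topic


def project_2d(doc_topic, seed=0):
    """Project document-topic weights onto two principal components.

    Assumption: PCA is used only to obtain 2-D coordinates for plotting.
    """
    return PCA(n_components=2, random_state=seed).fit_transform(doc_topic)
```

On the real corpus, `documents` could come from `sklearn.datasets.fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data`, which downloads the dataset on first use. The two columns returned by `project_2d` can then be passed to Matplotlib's `plt.scatter`, colored by each document's strongest topic.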
## 📊 Key Results
The engine successfully identified 10 distinct semantic topics without human intervention.
* **Topic 2 (Religion):** `god`, `jesus`, `bible`, `faith`
* **Topic 4 (Hardware):** `drive`, `scsi`, `disk`, `controller`
* **Topic 7 (Sports):** `game`, `team`, `year`, `hockey`
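Word lists like these are typically read off the topic-term matrix by taking the highest-weighted terms in each row. The helper below is a hypothetical sketch of that step (the name `top_terms` and its parameters are not from the repository); it assumes a matrix shaped like scikit-learn's `NMF.components_` and a matching sequence of vocabulary terms.

```python
import numpy as np


def top_terms(components, feature_names, n_top=4):
    """Return the n_top highest-weighted terms for each topic row."""
    topics = []
    for weights in np.asarray(components):
        # argsort is ascending, so reverse it to rank the largest weights first.
        best = np.argsort(weights)[::-1][:n_top]
        topics.append([feature_names[i] for i in best])
    return topics
```

With a fitted model and vectorizer, a call such as `top_terms(model.components_, vectorizer.get_feature_names_out())` would yield one list of words per topic. The human-readable labels (Religion, Hardware, Sports) are still an interpretation applied afterwards.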
## 🚀 How to Run
1. Clone this repository.
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run the analysis engine:
```bash
python src/engine.py
```
## 📂 Project Structure
* `src/`: Contains the core NMF logic and visualization scripts.
* `docs/`: Includes the full project report and presentation slides.