https://github.com/blackaly/classify
A Python tool that scrapes news articles from websites and classifies them into categories using Natural Language Processing techniques.
nlp nlp-machine-learning nltk nltk-python pandas pandas-python python python3 sklearn
- Host: GitHub
- URL: https://github.com/blackaly/classify
- Owner: blackaly
- License: MIT
- Created: 2023-05-21T18:28:30.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2025-04-11T20:48:53.000Z (about 1 year ago)
- Last Synced: 2025-10-28T05:32:12.386Z (7 months ago)
- Topics: nlp, nlp-machine-learning, nltk, nltk-python, pandas, pandas-python, python, python3, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 2.01 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# News Article Classifier
A Python tool that scrapes news articles from websites and classifies them into categories using Natural Language Processing techniques.
## Features
- Web scraping from multiple news sources (currently supports NY Times and The Guardian)
- Text preprocessing with NLTK (tokenization, stopword removal, lemmatization)
- News article classification using Naive Bayes
- Command-line interface for training and analyzing articles
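
The classification step listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: `build_classifier`, the toy texts, and the labels are all made up for the example, and scikit-learn's built-in English stop word list stands in for the NLTK preprocessing (no lemmatization is performed here).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_classifier(use_tfidf: bool = True):
    """Return an untrained pipeline: text vectorizer followed by Naive Bayes.

    Hypothetical helper for illustration; the repository's own API may differ.
    """
    # Choose between TF-IDF weights and raw term counts.
    vectorizer_cls = TfidfVectorizer if use_tfidf else CountVectorizer
    # stop_words="english" uses scikit-learn's list as a stand-in for NLTK stopwords.
    return make_pipeline(vectorizer_cls(stop_words="english"), MultinomialNB())


# Toy, made-up training data: article text paired with a category label.
texts = [
    "The team won the championship game last night",
    "The striker scored two goals in the final match",
    "Parliament passed the new budget bill today",
    "The senator announced her election campaign",
]
labels = ["sports", "sports", "politics", "politics"]

model = build_classifier(use_tfidf=True)
model.fit(texts, labels)

# Classify an unseen headline.
prediction = model.predict(["The goalkeeper saved a penalty in the match"])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline means the same vocabulary is applied at training and prediction time, which is easy to get wrong when the two steps are handled separately.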
## Requirements
- Python 3.6+
- BeautifulSoup4
- NLTK
- Pandas
- Scikit-learn
- Requests
## Installation
```bash
# Clone the repository
git clone https://github.com/blackaly/classify
cd classify
# Install dependencies
pip install -r requirements.txt
```
## Usage
### Command Line Interface
Train the classifier:
```bash
python classify.py train /path/to/training_data.csv
```
Analyze a news article:
```bash
python classify.py analyze https://www.nytimes.com/path/to/article.html
```
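
For readers curious how two subcommands like these might be wired up, here is a hedged sketch using the standard library's `argparse`. The function name `build_parser` and the argument names `csv_path` and `url` are assumptions for illustration; the actual `classify.py` may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with hypothetical 'train' and 'analyze' subcommands."""
    parser = argparse.ArgumentParser(
        prog="classify.py",
        description="Train a news classifier or analyze an article.",
    )
    subparsers = parser.add_subparsers(dest="command")
    # Setting the attribute (rather than passing required=True) also works on Python 3.6.
    subparsers.required = True

    train = subparsers.add_parser("train", help="train the classifier from a CSV file")
    train.add_argument("csv_path", help="path to the training data CSV")

    analyze = subparsers.add_parser("analyze", help="classify the article at a URL")
    analyze.add_argument("url", help="URL of the news article")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["train", "/path/to/training_data.csv"])
print(args.command, args.csv_path)
```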
### As a Module
```python
from classify import NewsAnalyzer

analyzer = NewsAnalyzer()

# Train the classifier; the returned metrics include an 'accuracy' entry
metrics = analyzer.train_classifier("/path/to/training_data.csv")
# The '%' suffix assumes 'accuracy' is already expressed as a percentage
print(f"Accuracy: {metrics['accuracy']:.2f}%")

# Analyze an article by URL
result = analyzer.analyze_url("https://www.nytimes.com/path/to/article.html")
print(f"Category: {result['category']}")
```
## Improvements in the Refactored Code
1. **Modular Architecture**
   - Separated scraping, text processing, and classification into distinct classes
   - Applied object-oriented principles (inheritance, abstraction)
   - Implemented the Factory pattern for scraper creation
2. **Error Handling**
   - Added error handling throughout the code
   - Graceful degradation when network requests fail
3. **Performance Improvements**
   - Added a TF-IDF vectorization option (often performs better than the raw term counts produced by CountVectorizer)
   - Optimized the text processing pipeline
4. **New Features**
   - Command-line interface for easy usage
   - Extensible architecture for adding new scrapers
   - Improved results formatting
5. **Code Readability**
   - Comprehensive docstrings
   - Type hints for better IDE support and static checking
   - Consistent code style
   - Logical organization of functions and classes
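
To make the Factory pattern and the graceful handling of failed requests concrete, here is one way they could fit together. Every name in this sketch (`BaseScraper`, `NYTimesScraper`, `GuardianScraper`, `ScraperFactory`, the injected `fetch` callable) is hypothetical and chosen for illustration; the repository's actual classes may be organized differently.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional, Type
from urllib.parse import urlparse


class BaseScraper(ABC):
    """Hypothetical base class: each subclass extracts text for one news site."""

    def __init__(self, fetch: Callable[[str], str]):
        # The fetch function is injected so this sketch needs no network access.
        self._fetch = fetch

    def scrape(self, url: str) -> Optional[str]:
        try:
            html = self._fetch(url)
        except OSError as exc:
            # Network errors (requests' exceptions derive from OSError too)
            # degrade to None instead of crashing the caller.
            print(f"Could not fetch {url}: {exc}")
            return None
        return self.extract_text(html)

    @abstractmethod
    def extract_text(self, html: str) -> str:
        """Return the article text contained in the page HTML."""


class NYTimesScraper(BaseScraper):
    def extract_text(self, html: str) -> str:
        return html.strip()  # placeholder; real parsing would use BeautifulSoup


class GuardianScraper(BaseScraper):
    def extract_text(self, html: str) -> str:
        return html.strip()  # placeholder; real parsing would use BeautifulSoup


class ScraperFactory:
    """Pick a scraper class based on the URL's domain."""

    _registry: Dict[str, Type[BaseScraper]] = {
        "nytimes.com": NYTimesScraper,
        "theguardian.com": GuardianScraper,
    }

    @classmethod
    def create(cls, url: str, fetch: Callable[[str], str]) -> BaseScraper:
        host = urlparse(url).netloc.lower()
        for domain, scraper_cls in cls._registry.items():
            if host == domain or host.endswith("." + domain):
                return scraper_cls(fetch)
        raise ValueError(f"No scraper registered for {host!r}")


def failing_fetch(url: str) -> str:
    """Simulate a network outage."""
    raise ConnectionError("simulated network failure")


article_url = "https://www.nytimes.com/path/to/article.html"
scraper = ScraperFactory.create(article_url, failing_fetch)
result = scraper.scrape(article_url)  # the failure is reported, not raised
```

Under this design, supporting a new source would mean adding one subclass and one registry entry, leaving the calling code unchanged.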
## License
MIT License.