Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lostdir/news_scrapper
News website scraper using Python and BeautifulSoup4
- Host: GitHub
- URL: https://github.com/lostdir/news_scrapper
- Owner: lostdir
- License: mit
- Created: 2024-09-01T14:30:01.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-09-15T07:36:18.000Z (2 months ago)
- Last Synced: 2024-09-15T10:39:25.221Z (2 months ago)
- Language: Python
- Homepage:
- Size: 18.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: newscrapper.py
Awesome Lists containing this project
README
# News Scraper for Times of India
This project is a Python-based web scraper designed to extract news articles from the Times of India website. It allows you to scrape a random selection of articles from a specified range of years and save the extracted data (title, date, content, and URL) to a CSV file.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Functionality](#functionality)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)

## Features
- **Scrape Articles**: Extracts news articles from the Times of India archive pages.
- **Error Handling**: Includes retry mechanisms and error handling to manage network issues and timeouts (see the sketch after this list).
- **Random Article Selection**: Selects random articles to scrape within a specified date range.
- **Save to CSV**: Saves scraped articles to a CSV file for further analysis or use.
- **Progress Bar**: Displays a progress bar to track the scraping process.
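The retry behavior isn't shown in the README itself; the sketch below is one minimal way such a wrapper could be written with `requests`. The function name `fetch_with_retries`, the timeout, and the backoff schedule are illustrative assumptions, not the repository's actual code.

```python
import time

import requests

def fetch_with_retries(url, max_retries=3, timeout=10):
    """Fetch a URL, retrying on network errors with a simple backoff.

    Hypothetical helper -- not taken from this repository.
    """
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries; let the caller handle it
            time.sleep(2 * attempt)  # wait a little longer each attempt
```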
## Installation

To use this scraper, you need to have Python installed. Follow the steps below to set up the project:
1. **Clone the repository:**
```bash
git clone https://github.com/lostdir/news_scrapper.git
cd news_scrapper
```

2. **Create a virtual environment:**
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. **Install the required dependencies:**
```bash
pip install -r requirements.txt
```

## Usage
To run the scraper, execute the following command:
```bash
python scraper.py
```

You can modify parameters such as the range of years and the number of articles to scrape by editing the `scrape_random_articles` call in the `if __name__ == '__main__':` block.
### Example Usage
```python
scrape_random_articles(2017, 2024, num_articles=10)
```

This example will scrape 10 random articles between the years 2017 and 2024.
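For context, a hypothetical entry-point block using this call might look as follows; the actual `__main__` block in the script may differ:

```python
if __name__ == '__main__':
    # Assumes scrape_random_articles is defined above in the script.
    # Scrape 10 random articles from the 2017-2024 archive pages;
    # adjust the year range and num_articles to suit your dataset.
    scrape_random_articles(2017, 2024, num_articles=10)
```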
## Functionality
### Functions
- **`scrape_archive_page(archive_url, max_retries=3)`**: Scrapes all article links from a single archive page, with retries in case of failures (sketched after this list).
- **`scrape_article(article_url, max_retries=3)`**: Scrapes the title, content, and date from an individual article page.
- **`scrape_random_articles(start_year, end_year, num_articles=65)`**: Scrapes a random selection of articles from a range of dates.
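As an illustration of what `scrape_archive_page` might do, here is a minimal sketch using `requests` and BeautifulSoup. The `articleshow` URL filter and the parser choice are assumptions about the Times of India archive markup, not code from this repository:

```python
import requests
from bs4 import BeautifulSoup

def scrape_archive_page(archive_url, max_retries=3):
    """Collect article links from one archive page (illustrative version)."""
    response = None
    for attempt in range(max_retries):
        try:
            response = requests.get(archive_url, timeout=10)
            response.raise_for_status()
            break
        except requests.RequestException:
            if attempt == max_retries - 1:
                return []  # give up on this page after the final retry
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumed pattern: article pages are linked via 'articleshow' URLs.
    return [a['href'] for a in soup.find_all('a', href=True)
            if 'articleshow' in a['href']]
```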
### How It Works

1. **Scrape Archive Pages**: The script navigates through the Times of India archive pages and extracts article URLs.
2. **Scrape Article Content**: For each article URL, it scrapes the title, content, and date.
3. **Random Selection**: The script selects random articles to ensure a diverse dataset.
4. **Save to CSV**: The scraped data is saved in a CSV file named `times_of_india_articles.csv`, as sketched below.
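Step 4 amounts to a single `pandas` call; a minimal sketch, assuming each article is collected as a dict with the fields listed above:

```python
import pandas as pd

# Illustrative shape of the scraped data; the real script's column
# names and ordering may differ.
articles = [
    {'title': 'Example headline', 'date': '2024-09-01',
     'content': 'Article text...', 'url': 'https://example.com/article'},
]

df = pd.DataFrame(articles)
df.to_csv('times_of_india_articles.csv', index=False, encoding='utf-8')
```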
## Dependencies

The scraper relies on the following Python libraries:
- `requests`: To send HTTP requests to the Times of India server.
- `beautifulsoup4`: To parse and extract information from HTML pages.
- `pandas`: To handle data and save it to a CSV file.
- `tqdm`: To display a progress bar for the scraping process.

Install all dependencies using:
```bash
pip install -r requirements.txt
```

## Contributing
Contributions are welcome! If you have any improvements or new features to suggest, please open an issue or submit a pull request.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---