https://github.com/mandarwagh9/web-scraper

This project is a Flask-based web application designed to scrape various types of content from a specified URL.

Host: GitHub
URL: https://github.com/mandarwagh9/web-scraper
Owner: mandarwagh9
Created: 2024-08-20T11:24:07.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-08-20T11:40:15.000Z (11 months ago)
Last Synced: 2024-12-02T04:08:58.136Z (7 months ago)
Topics: flask, flask-application, pytho, webscraper, webscraping
Language: Python
Homepage: https://youtu.be/5jNEx0zlp20
Size: 31.3 KB
Stars: 11
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Web Scraper

## Overview

This project is a web scraping tool with two interfaces: a Flask-based web application and a terminal-based script. Both scrape selected types of content from a specified URL and save the results to folders you choose.

## Features

- **Web Application**:
  - Scrapes text from `<p>` and header tags (`<h1>`, `<h2>`, etc.)
  - Extracts all hyperlinks from the page
  - Downloads images, videos, audio files, and documents based on user selection
  - Saves scraped content to specified folders
  - Provides a user-friendly interface for scraping configuration

- **Terminal Script**:
  - Allows scraping of text, links, and media (images, videos, audio files) directly from the terminal
  - Asks for the URL, types of content to scrape, and the locations to save the scraped content
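
The text and link extraction listed above can be sketched roughly as follows. This is a minimal illustration using BeautifulSoup4 from the prerequisites, not the project's actual code: the function name `extract_text_and_links` and the sample HTML are assumptions made for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

HEADER_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def extract_text_and_links(html, base_url):
    """Return (text_blocks, links) found in an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect text from paragraphs and headers, in document order.
    text_blocks = []
    for tag in soup.find_all(["p"] + HEADER_TAGS):
        text = tag.get_text().strip()
        if text:
            text_blocks.append(text)

    # Resolve every href against the page URL so relative links become absolute.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return text_blocks, links


sample = """
<h1>Title</h1>
<p>First paragraph with a <a href="/about">link</a>.</p>
<p><a href="https://example.com/doc.pdf">Report</a></p>
"""
texts, links = extract_text_and_links(sample, "https://example.com/index.html")
print(texts)
print(links)
```

In a real run the HTML would come from something like `requests.get(url).text` rather than an inline string.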

## Setup

### Prerequisites

- Python 3.x
- Flask
- Requests
- BeautifulSoup4

### Installation

1. Clone the repository:

```bash
git clone https://github.com/mandarwagh9/web-scraper.git
cd web-scraper
```

2. Create and activate a virtual environment (optional but recommended):

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. Install the required packages:

```bash
pip install -r requirements.txt
```

### Requirements

The install step above reads `requirements.txt`. If the file is missing from your checkout, create it with the following content before running `pip install`:

```
Flask==2.2.2
requests==2.28.2
beautifulsoup4==4.12.2
```

## Usage

### Web Application

1. Run the Flask application:

```bash
python app.py
```

2. Open your web browser and navigate to:

```
http://127.0.0.1:5000
```

3. Use the form to specify the URL and choose the types of content you want to scrape. You can select:
   - Text
   - Links
   - Images
   - Videos
   - Audio
   - Documents

4. Specify the folders where you want to save each type of content. Leave an optional field blank to skip saving that type of content.

5. Click the "Start Scraping" button to begin the scraping process. A popup will appear to notify you when the scraping is complete.
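
One way to think about steps 3 and 4 is as a mapping from content type to destination folder, where a blank folder means "skip". The sketch below illustrates that idea with the standard library only. The extension table, `classify`, and `plan_downloads` are hypothetical names chosen for this example, and the application's real logic may differ.

```python
import os
from urllib.parse import urlparse

# Hypothetical extension table; the project's actual list may differ.
EXTENSIONS = {
    "images": {".jpg", ".jpeg", ".png", ".gif", ".webp"},
    "videos": {".mp4", ".webm", ".mov"},
    "audio": {".mp3", ".wav", ".ogg"},
    "documents": {".pdf", ".docx", ".xlsx", ".txt"},
}


def classify(url):
    """Return the content type for a URL based on its file extension, or None."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    for kind, exts in EXTENSIONS.items():
        if ext in exts:
            return kind
    return None


def plan_downloads(urls, folders):
    """Map each URL to a local path, skipping types whose folder is blank."""
    plan = {}
    for url in urls:
        kind = classify(url)
        folder = folders.get(kind, "").strip() if kind else ""
        if folder:
            filename = os.path.basename(urlparse(url).path)
            plan[url] = os.path.join(folder, filename)
    return plan


urls = [
    "https://example.com/a/photo.PNG?raw=true",
    "https://example.com/clip.mp4",
    "https://example.com/page.html",
]
# Images go to "imgs"; the blank videos folder means videos are not saved.
print(plan_downloads(urls, {"images": "imgs", "videos": ""}))
```

Parsing the URL before taking the extension keeps query strings such as `?raw=true` from hiding the file type.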

### Terminal Script

1. Run the terminal script:

```bash
python terminal_scraper.py
```

2. Follow the prompts in the terminal:
   - Enter the URL to scrape
   - Choose the types of content to scrape (text, links, media) by entering the corresponding numbers
   - Provide the folder names for each type of content (if applicable)

3. The script will process the URL and save the scraped content to the specified folders. You will be notified once the scraping is complete.
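
The prompt flow above could look something like the sketch below. It is an illustration only: the menu numbering, `parse_choices`, and `prompt_user` are assumptions, and `terminal_scraper.py` may number or word its prompts differently.

```python
# Hypothetical menu; the real script's numbering and labels may differ.
MENU = {"1": "text", "2": "links", "3": "media"}


def parse_choices(raw):
    """Turn input such as "1, 3" into an ordered list of content types."""
    selected = []
    for token in raw.replace(",", " ").split():
        kind = MENU.get(token)
        if kind is None:
            raise ValueError(f"Unknown option: {token!r}")
        if kind not in selected:
            selected.append(kind)
    return selected


def prompt_user(ask=input):
    """Collect the URL, the content types, and one folder per selected type."""
    url = ask("URL to scrape: ").strip()
    kinds = parse_choices(ask("Content types (1=text, 2=links, 3=media): "))
    folders = {kind: ask(f"Folder for {kind}: ").strip() for kind in kinds}
    return url, folders
```

Calling `prompt_user()` with no arguments reads from the terminal. Passing a different `ask` callable makes the flow easy to exercise without typing.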

## Example

![Example Screenshot](https://github.com/mandarwagh9/web-scraper/blob/main/webscraper.PNG?raw=true)

## Contributing

Feel free to submit issues or pull requests. Please make sure to follow the coding style and include relevant tests with your contributions.

1. Fork the repository
2. Create a new branch (`git checkout -b feature-branch`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature-branch`)
5. Create a new Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

