https://github.com/danhilse/web-scraper
A versatile Python-based web scraper that extracts content from single URLs or entire sitemaps, organizing data into structured text files. Features include sitemap parsing, content grouping by URL structure, and an easy-to-use command-line interface. Ideal for data extraction, content analysis, and web research tasks.
beautifulsoup cli-tool data-extraction python sitemap-parser web-scraping
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/danhilse/web-scraper
- Owner: danhilse
- Created: 2024-06-21T17:14:18.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2024-06-21T19:26:50.000Z (7 months ago)
- Last Synced: 2024-06-23T09:53:44.888Z (7 months ago)
- Topics: beautifulsoup, cli-tool, data-extraction, python, sitemap-parser, web-scraping
- Language: Python
- Homepage:
- Size: 4.28 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web Scraper
![Python](https://img.shields.io/badge/Python-3.9%2B-blue)
![License](https://img.shields.io/badge/License-MIT-green)
![Last Commit](https://img.shields.io/github/last-commit/danhilse/web-scraper)

A command-line web scraper that extracts content from websites and saves it to organized text files.
![Web Scraper Demo](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExOTZtMG91aXkyMnJ5eXo5NXB0cTk3dnA5Z3Nwa3gyZ3dldGthbTBnNSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/LL6x8hTiAExvxVbe3q/giphy.gif)
## Features
- Scrape content from a single URL or an entire sitemap
- Group scraped content into separate files based on URL structure
- Output content to multiple text files, organized by website sections
- Executable file for easy use without Python installation

## Installation
1. Clone this repository:

       git clone https://github.com/danhilse/web-scraper.git

2. Install the required dependencies:

       pip install -r requirements.txt

## Usage
To scrape a single URL:

    python web_scraper.py https://example.com

To scrape an entire sitemap:

    python web_scraper.py https://example.com --sitemap

## Project Structure
- `web_scraper.py`: Main script containing the web scraper logic
- `requirements.txt`: List of Python dependencies

## Executable
A pre-built executable is available in the `dist` folder. You can download and run it directly without needing to install Python or any dependencies.
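
## Example: Grouping Sitemap URLs

The sitemap parsing and URL-based grouping listed under Features can be illustrated with a minimal sketch. This is not the code in `web_scraper.py`. The function names `parse_sitemap` and `group_by_section`, the `index` fallback name, and the rule of grouping by the first path segment are all assumptions made for illustration. The sketch also uses only the Python standard library (`xml.etree.ElementTree` and `urllib.parse`) rather than BeautifulSoup, and it works on an inline sample sitemap instead of fetching one over the network.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the page URLs found in the <loc> elements of a sitemap (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def group_by_section(urls):
    """Group URLs by first path segment (assumed rule); root pages go under 'index'."""
    groups = defaultdict(list)
    for url in urls:
        segments = [part for part in urlparse(url).path.split("/") if part]
        section = segments[0] if segments else "index"
        groups[section].append(url)
    return dict(groups)


# Inline sample so the sketch runs without network access.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/blog/second-post</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

if __name__ == "__main__":
    grouped = group_by_section(parse_sitemap(SAMPLE_SITEMAP))
    for section, pages in grouped.items():
        # One output file per section is an assumption about the naming scheme.
        print(f"{section}.txt <- {len(pages)} page(s)")
```

Under these assumptions, each section name would map to one output text file, so both `/blog/...` pages would end up in the same file. The actual tool may use a different grouping rule or file naming scheme.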
## License
This project is open source and available under the [MIT License](LICENSE).