https://github.com/hasnocool/web-harvester
A GUI-based web crawler application that harvests data from websites according to specified parameters.
- Host: GitHub
- URL: https://github.com/hasnocool/web-harvester
- Owner: hasnocool
- Created: 2024-05-11T10:40:11.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-09-18T10:39:54.000Z (over 1 year ago)
- Last Synced: 2025-02-17T02:44:38.485Z (12 months ago)
- Topics: console, gui, multi, output, pyqt5, python, requests, scraping, threading, urllib, web
- Language: Python
- Size: 15.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
**WebCrawler: A PyQt5-Based Web Harvester**
=====================================
**Project Title:** WebCrawler
-----------------------------
A simple yet powerful web crawler built using PyQt5, designed to extract data from websites and save it to a CSV file.
**Description:**
---------------
Are you tired of manually copying website data? Do you want to automate the process and focus on more exciting tasks? Look no further! I built this to make web harvesting easier, faster, and more efficient. With WebCrawler, you can:
* Explore websites with ease
* Extract relevant information from web pages
* Save your findings to a CSV file for further analysis (sketched below)
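Under the hood, this kind of harvesting boils down to a breadth-first loop over a URL frontier. Here is a minimal sketch of that loop using `requests` plus the standard library; the function name, CSV columns, and `results.csv` path are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch only -- not the repository's actual code.
import csv
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth, out_path="results.csv"):
    seen = set()
    queue = deque([(start_url, 0)])  # breadth-first frontier of (url, depth)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url", "depth", "status"])
        while queue:
            url, depth = queue.popleft()
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # skip unreachable pages
            writer.writerow([url, depth, resp.status_code])
            parser = LinkParser()
            parser.feed(resp.text)
            for href in parser.links:
                queue.append((urljoin(url, href), depth + 1))


crawl("https://example.com", max_depth=2)
```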
**Features:**
-------------
### Key Features
* **User-Friendly Interface**: A clean and intuitive PyQt5-based GUI makes it easy to input website URLs, set crawling depths, and select filters.
* **Flexible Filtering**: Define URL filters using regex-like syntax to exclude unwanted pages from the crawl results (see the filtering sketch after this list).
* **Progress Monitoring**: Track the crawler's progress in real-time through a live console output window.
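The filter syntax is described only as "regex-like", so here is a plain Python `re` version of the idea; the `should_skip` helper and the example patterns are assumptions for illustration, not the project's actual filter format:

```python
import re

# Hypothetical exclusion patterns -- the project's real syntax may differ.
EXCLUDE_PATTERNS = [
    re.compile(r"\.(jpg|png|gif|css|js)$", re.IGNORECASE),  # static assets
    re.compile(r"/login|/logout"),                          # auth pages
]


def should_skip(url: str) -> bool:
    """Return True if the URL matches any exclusion pattern."""
    return any(p.search(url) for p in EXCLUDE_PATTERNS)
```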
### One Cool Feature: **Stop and Resume**
Want to stop the crawler mid-crawl or resume it later? No problem! Simply click the "Stop Crawling" button, and WebCrawler will save your progress. When you're ready to continue, just restart the crawler with the same settings.
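A common way to get both the live console output and a responsive "Stop Crawling" button in PyQt5 is a `QThread` worker that emits log lines through a signal and checks a stop flag between pages. The class below is a generic sketch of that pattern, not the repository's actual implementation:

```python
from PyQt5.QtCore import QThread, pyqtSignal


class CrawlerWorker(QThread):
    """Runs the crawl off the GUI thread; a generic sketch."""
    log = pyqtSignal(str)            # connect to a console widget's append()
    finished_urls = pyqtSignal(list)  # URLs completed before stop/finish

    def __init__(self, urls):
        super().__init__()
        self.urls = urls
        self._stop_requested = False

    def stop(self):
        """Called from the GUI's 'Stop Crawling' button."""
        self._stop_requested = True

    def run(self):
        done = []
        for url in self.urls:
            if self._stop_requested:
                self.log.emit("Stopped; progress saved.")
                break
            self.log.emit(f"Fetching {url} ...")
            # ... fetch and parse the page here ...
            done.append(url)
        self.finished_urls.emit(done)
```

In the main window you would connect `worker.log` to something like `QTextEdit.append` and wire the Stop button to `worker.stop()`, so the flag is honored the next time the loop comes around.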
**Installation:**
-----------------
To get started with WebCrawler, follow these steps:
1. Clone this repository using `git clone https://github.com/hasnocool/web-harvester.git`
2. Install required dependencies by running `pip install -r requirements.txt` in your terminal
3. Run the project using `python web_crawler.py`
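The repository's topics list `pyqt5` and `requests` (threading and `urllib` ship with Python), so `requirements.txt` presumably contains something close to the following; check the file itself for the authoritative list:

```
PyQt5
requests
```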
**Usage:**
-------------
1. Launch WebCrawler and input the website URL you want to crawl.
2. Set the crawling depth, select filters (if needed), and click "Start Crawling".
3. Monitor progress in real-time through the live console output window.
**Contributing:**
----------------
I'm always open to contributions! If you'd like to enhance WebCrawler with new features or improve existing functionality, please feel free to fork this repository and submit a pull request.
**License:**
------------
WebCrawler is released under the MIT License. See LICENSE.txt for details.
**Tags/Keywords:** web-crawler, PyQt5, web-harvester, website-extraction, csv-output, filters, real-time-progress-monitoring, user-friendly-interface