https://github.com/miiraak/scrapc
C# WinForms - Crawler & Scraper Web content
crawler csharp html scraper url web windows-forms
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/miiraak/scrapc
- Owner: Miiraak
- License: mit
- Created: 2024-07-23T17:24:37.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-10-21T13:34:40.000Z (over 1 year ago)
- Last Synced: 2025-03-25T18:21:26.000Z (about 1 year ago)
- Topics: crawler, csharp, html, scraper, url, web, windows-forms
- Language: C#
- Homepage:
- Size: 92.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Scrapc
## Description
A Windows Forms app that lets you recursively crawl and scrape websites to extract different kinds of content.
---
## Features
| Label | Desc |
|---|---|
| **Crawling** | Recursively collect URLs from given web pages. |
| **Scraping content** | Extract and save the web page content. |
| **Scraping HTML** | Extract and save the HTML of web pages. |
| **Scraping images** | Extract and save images from web pages. |
| **Scraping URLs** | Extract and save all URLs encountered on web pages. |
| **URL limit** | Choose the maximum number of URLs to scrape. |
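
The crawling and URL limit features can be illustrated with a minimal sketch. This is not the code used by Scrapc: the `BoundedCrawler` class, the `getLinks` callback, and the same-host filter are all assumptions made for illustration. The callback stands in for whatever downloads a page and returns the absolute links found on it.

```csharp
using System;
using System.Collections.Generic;

public static class BoundedCrawler
{
    // Breadth-first crawl from `start`, collecting at most `maxUrls` distinct URLs.
    // `getLinks` is a hypothetical callback that returns the absolute links of a page.
    public static List<Uri> Crawl(Uri start, int maxUrls, Func<Uri, IEnumerable<Uri>> getLinks)
    {
        var collected = new List<Uri>();
        if (maxUrls <= 0) return collected;

        var seen = new HashSet<Uri> { start };
        var queue = new Queue<Uri>();
        collected.Add(start);
        queue.Enqueue(start);

        while (queue.Count > 0 && collected.Count < maxUrls)
        {
            Uri page = queue.Dequeue();
            foreach (Uri link in getLinks(page))
            {
                if (collected.Count >= maxUrls) break;  // URL limit reached
                if (link.Host != start.Host) continue;  // assumption: stay on the start site
                if (!seen.Add(link)) continue;          // already collected
                collected.Add(link);
                queue.Enqueue(link);
            }
        }
        return collected;
    }
}
```

Keeping a `HashSet` of visited URLs prevents the crawl from looping when pages link back to each other, and checking the limit inside the inner loop stops the collection as soon as the chosen maximum is reached.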
---
## Disclaimer
Warning: The use of this application must be done in a responsible and legal way.
- Compliance with the Terms of Use: Make sure you comply with the terms of use of the websites you crawl. Many websites limit the frequency of requests, or explicitly prohibit scraping or access to certain resources. (Sorry Wikipedia, it was not intended 😅🙏)
- Distributed Denial of Service (DDoS): Improper use of this application can generate a large number of simultaneous requests, potentially causing an unintended denial of service. Limit the number of simultaneous requests and the frequency of requests to avoid this.
- Prohibited Content: Do not crawl websites containing illegal content or sensitive information.
The author of this software is not responsible for any damages or legal consequences resulting from improper or illegal use of this application.
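
One way to limit both the number of simultaneous requests and their frequency is sketched below. It is only an illustration under stated assumptions: `PoliteThrottle` is a hypothetical helper, not part of Scrapc, and suitable values for the concurrency cap and the pause depend on the target site.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps concurrent requests and pauses after each one.
public sealed class PoliteThrottle : IDisposable
{
    private readonly SemaphoreSlim _slots;
    private readonly TimeSpan _pause;

    public PoliteThrottle(int maxConcurrent, TimeSpan pauseAfterEach)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        _pause = pauseAfterEach;
    }

    // Waits for a free slot, runs `work`, then keeps the slot during the pause
    // so that each slot issues at most one request per (work time + pause).
    public async Task<T> RunAsync<T>(Func<Task<T>> work)
    {
        await _slots.WaitAsync();
        try
        {
            T result = await work();
            await Task.Delay(_pause);
            return result;
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose()
    {
        _slots.Dispose();
    }
}
```

With an `HttpClient` instance named `client`, a download could then be wrapped as `await throttle.RunAsync(() => client.GetStringAsync(url))`, so that every request goes through the same limiter.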
---
## Prerequisites
Before running the project, make sure you have the following installed:
- [.NET Framework](https://dotnet.microsoft.com/fr-fr/download/dotnet-framework)
- [HtmlAgilityPack](https://github.com/zzzprojects/html-agility-pack)
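
As a rough idea of what HtmlAgilityPack is used for in a scraper, the sketch below extracts link and image URLs from an HTML string. The `PageExtractor` class and its method are hypothetical and are not taken from the Scrapc source; only the HtmlAgilityPack calls (`HtmlDocument.LoadHtml`, `SelectNodes`, `GetAttributeValue`) are part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PageExtractor
{
    // Returns absolute http(s) URLs read from `attribute` on every `tag` element,
    // e.g. ("a", "href") for links or ("img", "src") for images.
    public static List<Uri> ExtractUrls(string html, Uri pageUrl, string tag, string attribute)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var found = new List<Uri>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//" + tag + "[@" + attribute + "]");
        if (nodes == null) return found;

        foreach (HtmlNode node in nodes)
        {
            string raw = node.GetAttributeValue(attribute, string.Empty);
            if (raw.Length == 0) continue;

            // Resolve relative paths against the page URL and keep only web URLs.
            Uri absolute;
            if (Uri.TryCreate(pageUrl, raw, out absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                found.Add(absolute);
            }
        }
        return found;
    }
}
```

For example, `PageExtractor.ExtractUrls(html, pageUrl, "img", "src")` would list the images of a page, while passing `"a"` and `"href"` would list its links.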
---
## Usage
- Start the app.
- Select the content you want to extract.

- Enter a valid URL in the right field.
_You can try it with: [Books to Scrape](https://books.toscrape.com/) (Thanks to them 🫀)_
- Choose the maximum number of URLs you want to crawl.
- Click on `crawl` to start the gathering.

- Use `URLs ?` to show the gathered URLs. (optional)

- Next, click `Scrap` to extract and save the chosen content, selected in step 2, from the crawled pages.
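
Saving scraped content means giving each crawled page its own file. How Scrapc names its output files is not described here, so the following is only a sketch of one possible scheme; the `OutputNames` class and `ToFileName` method are hypothetical.

```csharp
using System;
using System.Text;

public static class OutputNames
{
    // Builds a file name from the host and path of a page URL.
    // Characters other than letters, digits, '.' and '-' become '_'.
    // Assumption: the query string is ignored, so URLs that differ
    // only by their query would map to the same name.
    public static string ToFileName(Uri pageUrl, string extension)
    {
        string raw = pageUrl.Host + pageUrl.AbsolutePath;
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            bool keep = char.IsLetterOrDigit(c) || c == '.' || c == '-';
            sb.Append(keep ? c : '_');
        }
        return sb.ToString().Trim('_') + extension;
    }
}
```

The result can be combined with an output folder through `System.IO.Path.Combine` before the content is written with `System.IO.File.WriteAllText`.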
---
## Contributing
Contributions are welcome! To contribute to this project, please follow these steps:
1. Fork the repository.
2. Create a new branch for your feature (`git checkout -b my-new-feature`).
3. Make your changes.
4. Commit your changes (`git commit -m 'Add my new feature'`).
5. Push your branch (`git push origin my-new-feature`).
6. Open a Pull Request.
---
## Issues and Suggestions
If you encounter any issues or have suggestions for improving the project, please use the [GitHub issue tracker](https://github.com/miiraak/scrapc/issues).
---
## License
This project is licensed under the MIT License. See the [LICENSE](./LICENSE.txt) file for more details.
---
## Authors
**[Miiraak](https://github.com/miiraak)** - *Lead Developer*