https://github.com/jaypyles/Scraperr
Self-hosted webscraper.
- Host: GitHub
- URL: https://github.com/jaypyles/Scraperr
- Owner: jaypyles
- License: MIT
- Created: 2024-07-06T16:55:49.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-11-17T03:03:08.000Z (12 months ago)
- Last Synced: 2024-11-17T03:27:15.658Z (12 months ago)
- Topics: opensource, self-hosted, webscraper
- Language: TypeScript
- Homepage: https://scraperr-docs.pages.dev/
- Size: 1.85 MB
- Stars: 1,020
- Watchers: 7
- Forks: 44
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome - jaypyles/Scraperr - Self-hosted webscraper. (TypeScript)
- my-awesome-github-stars - jaypyles/Scraperr - Self-hosted webscraper. (TypeScript)
- Awesome-NAS-Docker - jaypyles/Scraperr (Pinned / 1. AI Application Ecosystem)
README
**A powerful self-hosted web scraping solution**
## 📋 Overview
Scrape websites without writing a single line of code.
> 📚 **[Check out the docs](https://scraperr-docs.pages.dev)** for a comprehensive quickstart guide and detailed information.
## ✨ Key Features
- **XPath-Based Extraction**: Precisely target page elements
- **Queue Management**: Submit and manage multiple scraping jobs
- **Domain Spidering**: Option to scrape all pages within the same domain
- **Custom Headers**: Add JSON headers to your scraping requests
- **Media Downloads**: Automatically download images, videos, and other media
- **Results Visualization**: View scraped data in a structured table format
- **Data Export**: Export your results in Markdown and CSV formats
- **Notification Channels**: Send completion notifications through various channels
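
For readers new to XPath, the sketch below illustrates the kind of targeting the XPath-based extraction feature refers to. It is not Scraperr's implementation: it uses only Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset, and the sample page, the `extract` helper, and the class names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not valid
# XML and would need an HTML-tolerant parser instead of ElementTree.
SAMPLE_PAGE = """
<html>
  <body>
    <h1 class="title">Example Store</h1>
    <ul>
      <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
      <li class="product"><span class="name">Gadget</span><span class="price">19.50</span></li>
    </ul>
  </body>
</html>
"""


def extract(page: str, xpath: str) -> list[str]:
    """Return the stripped text of every element matching `xpath`."""
    root = ET.fromstring(page.strip())
    return [(el.text or "").strip() for el in root.findall(xpath)]


# Each expression selects elements by tag name and attribute value.
names = extract(SAMPLE_PAGE, ".//li[@class='product']/span[@class='name']")
prices = extract(SAMPLE_PAGE, ".//span[@class='price']")

print(names)   # ['Widget', 'Gadget']
print(prices)  # ['9.99', '19.50']
```

In Scraperr itself, XPath expressions are entered through the web interface rather than in code, so a snippet like this is mainly useful for trying out an expression before submitting a job.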
## 🚀 Getting Started
### Docker
```bash
make up
```
### Helm
> Refer to the docs for helm deployment: https://scraperr-docs.pages.dev/guides/helm-deployment
## ⚖️ Legal and Ethical Guidelines
When using Scraperr, please remember to:
1. **Respect `robots.txt`**: Always check a website's `robots.txt` file to verify which pages permit scraping
2. **Terms of Service**: Adhere to each website's Terms of Service regarding data extraction
3. **Rate Limiting**: Implement reasonable delays between requests to avoid overloading servers
> **Disclaimer**: Scraperr is intended for use only on websites that explicitly permit scraping. The creator accepts no responsibility for misuse of this tool.
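
The first and third guidelines above can be checked programmatically. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; it is independent of Scraperr, and the robots.txt content, the `example-bot` user agent, the URLs, and the helper names are assumptions made for illustration. The robots.txt text is parsed from a string so the example needs no network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, urls, user_agent="example-bot", default_delay=1.0):
    """Return the URLs the agent may fetch and the delay to use between them."""
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, float(delay)


def run_politely(urls, fetch, delay, sleep=time.sleep):
    """Call `fetch` for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


parser = build_parser(SAMPLE_ROBOTS)
candidates = [
    "https://example.com/products",
    "https://example.com/private/data",
    "https://example.com/about",
]
allowed, delay = polite_plan(parser, candidates)

# Stand-in fetch and no-op sleep keep this demo instant and offline.
results = run_politely(allowed, lambda url: f"fetched {url}", delay, sleep=lambda s: None)
print(allowed)
print(delay)
```

Passing `sleep` as a parameter keeps the pacing logic testable; in real use the default `time.sleep` would apply the delay. A robots.txt check does not replace reading a site's Terms of Service.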
## 💬 Join the Community
Get support, report bugs, and chat with other users and contributors.
👉 [Join the Scraperr Discord](https://discord.gg/89q7scsGEK)
## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## 👏 Contributions
Development made easier with the [webapp template](https://github.com/jaypyles/webapp-template).
To get started, simply run `make build up-dev`.