https://github.com/spider-rs/web-crawling-guides
How to guides on web-crawling or scraping
agents ai-agents ai-scraping clean-markdown crawler fast-webcrawler html-to-markdown llm-webcrawler scraper web-scraping
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/spider-rs/web-crawling-guides
- Owner: spider-rs
- Created: 2024-06-16T21:13:38.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-26T13:29:55.000Z (about 1 year ago)
- Last Synced: 2025-06-25T01:54:16.369Z (12 months ago)
- Topics: agents, ai-agents, ai-scraping, clean-markdown, crawler, fast-webcrawler, html-to-markdown, llm-webcrawler, scraper, web-scraping
- Homepage: https://spider.cloud/guides
- Size: 7.85 MB
- Stars: 20
- Watchers: 2
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Spider Web Crawling and Scraping Guides
This repo contains a collection of guides on how to effectively use the Spider service to crawl or scrape. Contributors are welcome! 😁
## Collection
- [Using the Spider API](spider-api.md)
- [How to Use Proxy Mode](proxy-mode.md)
- [LangChain + Groq + Spider = 🚀 (Integration Guide)](langchain-groq.md)
- [CrewAI Spider Stock Research](crewai-spider-research-agent.md)
- [Extracting Contacts](extracting-contacts.md)
- [Automated Cold Email Outreach Using Spider](auto-email-response-outreach.md)
- [How to Archive Full Website](website-archiving.md)
- Building A Speedy Resilient Web Scraper for RAG AI ([Part 1](building-a-speedy-resilient-web-scraper-for-rag-ai-part1-preparing.md), [Part 2](building-a-speedy-resilient-web-scraper-for-rag-ai-part2-scaling-up.md))
- [Agents from Scratch](ai-agent-from-scratch.md)
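
As a rough orientation for the guides above, here is a minimal sketch of what building a crawl request for the Spider service might look like. The endpoint URL, the payload fields (`url`, `limit`, `return_format`), and the `SPIDER_API_KEY` environment variable are assumptions for illustration and are not taken from this README; the [Using the Spider API](spider-api.md) guide is the place to confirm the actual interface. The helper name `build_crawl_request` is hypothetical.

```python
import json
import os
import urllib.request

# Assumed endpoint; confirm against the Spider API guide before use.
SPIDER_CRAWL_ENDPOINT = "https://api.spider.cloud/crawl"


def build_crawl_request(url: str, api_key: str, limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request asking for a crawl of `url`."""
    # Field names below are assumptions, labeled as such in the lead-in.
    payload = {
        "url": url,
        "limit": limit,
        "return_format": "markdown",
    }
    return urllib.request.Request(
        SPIDER_CRAWL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_crawl_request(
        "https://example.com", os.environ.get("SPIDER_API_KEY", "")
    )
    print(request.get_method(), request.full_url)
    # To actually send it (needs a valid key and network access):
    # with urllib.request.urlopen(request) as response:
    #     pages = json.load(response)
```

Building the request separately from sending it keeps the sketch runnable without credentials and makes the payload easy to inspect before any network call is made.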
## Anti-Bot Detection
Spider, combined with the [`headless-browser`](https://github.com/spider-rs/headless-browser) repo, achieves **full stealth** against leading bot detection services — even when running fully headless.
Our techniques make Spider the most powerful crawling stack available today, providing an invisible footprint while scraping at scale.
The screenshots below show Spider's results against major bot detectors:
| Detector | Screenshot |
| :--------------------------------------- | :------------------------------------------------------------------------------------------------------- |
| BrowserScan.net Bot Detection | ✅ [View Screenshot](images/anti_bot/www_browserscan_net_bot_detection.png) |
| Bot Detector Rebrowser | ✅ [View Screenshot](images/anti_bot/bot_detector_rebrowser_net.png) |
| Sannysoft Bot (bot.sannysoft.com) | ✅ [View Screenshot](images/anti_bot/bot_sannysoft_com.png) |
| Device and Browser Info (Are You a Bot?) | ✅ [View Screenshot](images/anti_bot/deviceandbrowserinfo_com_are_you_a_bot.png) |
| Fingerprint.com Playground | ✅ [View Screenshot](images/anti_bot/demo_fingerprint_com_playground.png) |
| Device and Browser Info - Device Test | ✅ [View Screenshot](images/anti_bot/deviceandbrowserinfo_com_info_device.png) |
| CreepJS - Device Test | ✅ [View Screenshot](images/anti_bot/abrahamjuliot_github_io_creepjs.png) |
Spider is designed for **extreme evasion**, **high concurrency**, and **human-like behavior**, allowing you to crawl even heavily protected websites.
## Contribute
We're happy to accept requests in the issue tracker, improvements to the content, and additional guides.