https://github.com/lukas-bear/awesome-web-scraping
Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web
https://github.com/lukas-bear/awesome-web-scraping
List: awesome-web-scraping
anti-bot bot captcha crawler go java javascript network nodejs perl php proxies proxy proxy-server python ruby rust tools webscraping xml
Last synced: 3 months ago
JSON representation
Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web
- Host: GitHub
- URL: https://github.com/lukas-bear/awesome-web-scraping
- Owner: lukas-bear
- Created: 2025-01-30T16:40:44.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-02-07T16:32:16.000Z (4 months ago)
- Last Synced: 2025-03-28T23:01:39.881Z (3 months ago)
- Topics: anti-bot, bot, captcha, crawler, go, java, javascript, network, nodejs, perl, php, proxies, proxy, proxy-server, python, ruby, rust, tools, webscraping, xml
- Homepage:
- Size: 40 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
Awesome Lists containing this project
- ultimate-awesome - awesome-web-scraping - Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web. (Other Lists / Julia Lists)
README
# Awesome Web Scraping
A comprehensive collection of web scraping resources, tools, and libraries.
## Contents
- [Core Libraries](#core-libraries)
- [Specialized Tools](#specialized-tools)
- [Network Utilities](#network-utilities)
- [HTML/XML Processing](#htmlxml-processing)
- [Text Processing](#text-processing)
- [Data Formats](#data-formats)
- [Browser Automation](#browser-automation)
- [Headless Browsers](#headless-browsers)
- [Testing Frameworks](#testing-frameworks)
- [Browser Extensions](#browser-extensions)
- [Anti-detect Browsers](#anti-detect-browsers)
- [Anti-Bot Solutions](#anti-bot-solutions)
- [Proxy Services](#proxy-services)
- [CAPTCHA Solvers](#captcha-solvers)
- [Browser Fingerprinting](#browser-fingerprinting)
- [Data Processing](#data-processing)
- [Natural Language Processing](#natural-language-processing)
- [Data Cleaning](#data-cleaning)
- [Data Storage](#data-storage)
- [Best Practices](#best-practices)
- [Rate Limiting](#rate-limiting)
- [Error Handling](#error-handling)
- [Data Management](#data-management)
- [Resources](#resources)
- [Documentation](#documentation)
- [Tutorials](#tutorials)
- [Community](#community)
- [How to Contribute](#Contributing)## Core Libraries
* [Go](go.md) - Collection of modern libraries like Colly, Chromedp, Arachnid, and Soup, with built-in concurrent processing support
* [Java](java.md) - Comprehensive set of tools including JSoup, Selenium WebDriver, Apache HttpComponents and Heritrix for enterprise crawling
* [JavaScript/Node.js](javascript.md) - Features Puppeteer, Cheerio, Playwright, and Axios, with strong HTTP clients and browser automation capabilities
* [Perl](perl.md) - Established libraries like WWW::Mechanize, HTML::Parser, LWP, and Mojo for text processing and web scraping
* [PHP](php.md) - Includes Goutte, Symfony DomCrawler, PHP Simple HTML DOM Parser, and Guzzle for web scraping and automation
* [Python](python.md) - Rich ecosystem featuring Scrapy, pyspider, BeautifulSoup, lxml, and Selenium, with extensive text processing and automation tools
* [R](r.md) - Data-focused tools including rvest, httr, xml2, and RSelenium, with strong integration to the tidyverse ecosystem
* [Ruby](ruby.md) - Features Nokogiri, Mechanize, Kimurai framework, and HTTParty, with elegant APIs for web scraping and parsing
* [Rust](rust.md) - Modern tooling with reqwest, scraper, tokio, and tungstenite for high-performance async scraping## Specialized Tools
### Network Utilities
* [mitmproxy](https://mitmproxy.org/) - Interactive HTTPS proxy
* [Charles Proxy](https://www.charlesproxy.com/) - Web debugging proxy
* [Fiddler](https://www.telerik.com/fiddler) - Web debugging proxy
* [Proxychains](https://github.com/haad/proxychains) - Proxy chaining tool### HTML/XML Processing
* [XPath](https://www.w3.org/TR/xpath-31/) - XML path language
* [CSS Selectors](https://www.w3.org/TR/selectors-4/) - Pattern matching syntax
* [html5lib](https://github.com/html5lib/) - HTML parser and serializer
* [xmltodict](https://github.com/martinblech/xmltodict) - XML to Python dict converter### Text Processing
* [Dateparser](https://github.com/scrapinghub/dateparser) - Date parsing library
* [Ftfy](https://github.com/LuminosoInsight/python-ftfy) - Text encoding fixer
* [Price-parser](https://github.com/scrapinghub/price-parser) - Price extraction
* [Phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) - Phone number parsing## Browser Automation
### Headless Browsers
* [Chrome](https://www.google.com/chrome/browser/) - Most widely supported
* [Firefox](https://www.mozilla.org/firefox/) - Open-source alternative
* [PhantomJS](https://phantomjs.org/) - Scriptable headless WebKit### Testing Frameworks
* [Selenium](https://www.selenium.dev/) - Browser automation standard
* [Playwright](https://playwright.dev/) - Modern web testing
* [Cypress](https://www.cypress.io/) - JavaScript testing framework### Anti-detect Browsers
* [Multilogin](https://multilogin.com)
* [AdsPower ](https://www.adspower.com)
* [GoLogin](https://gologin.com)
* [Incogniton](https://incogniton.com)
* [Dolphin Anty](https://dolphin-anty.com)
* [MoreLogin](https://www.morelogin.com)
* [Lalicat](https://www.lalicat.com)
* [HideMyAcc](https://hidemyacc.com)
* [BitBrowser](https://www.bitbrowser.net)
* [Ghost Browser](https://ghostbrowser.com)## Anti-Bot Solutions
### Proxy Services
* [anyIP.io](https://anyip.io/) - Reliable proxy solutions, solid mobile proxies
* [Bright Data](https://brightdata.com/) - Enterprise proxy network
* [Oxylabs](https://oxylabs.io/) - Proxy and scraping solutions
* [ScraperAPI](https://www.scraperapi.com/) - Proxy API service
* [IPRotate](https://www.iprotatepro.com/) - IP rotation service
* [Smartproxy](https://smartproxy.com/) – Residential and datacenter proxies
* [SOAX](https://soax.com/) – Rotating residential and mobile proxies
* [ProxyEmpire](https://proxyempire.io/) – Ok residential and mobile proxies
* [NetNut](https://netnut.io/) – ISP proxies with high uptime### CAPTCHA Solvers
* [2captcha](https://2captcha.com/) - Human captcha solving
* [Anti-Captcha](https://anti-captcha.com/) - Automated solving
* [DeathByCaptcha](https://deathbycaptcha.com/) - API-based solving### Browser Fingerprinting
* [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)
* [selenium-stealth](https://github.com/diprajpatra/selenium-stealth)
* [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)## Data Processing
### Natural Language Processing
* [NLTK](https://www.nltk.org/) - Natural Language Toolkit
* [spaCy](https://spacy.io/) - Industrial-strength NLP
* [TextBlob](https://textblob.readthedocs.io/) - Simplified text processing
* [langdetect](https://github.com/Mimino666/langdetect) - Language detection### Data Storage
* [MongoDB](https://www.mongodb.com/) - Document database
* [Elasticsearch](https://www.elastic.co/) - Search and analytics
* [PostgreSQL](https://www.postgresql.org/) - Relational database
* [Redis](https://redis.io/) - In-memory data store## Best Practices
### Rate Limiting
* Implement exponential backoff
* Respect robots.txt directives
* Use delays between requests
* Monitor response codes### Error Handling
* Implement retry logic
* Log errors comprehensively
* Handle timeouts gracefully
* Monitor scraping health### Data Management
* Validate extracted data
* Remove duplicates
* Store raw and processed data
* Document data schema## Resources
### Documentation
* [Scrapy Documentation](https://docs.scrapy.org/)
* [Selenium Documentation](https://selenium.dev/documentation/)
* [Puppeteer Documentation](https://pptr.dev/)
* [Playwright Documentation](https://playwright.dev/docs/intro)### Tutorials
* [Web Scraping Best Practices](https://www.scrapehero.com/web-scraping-best-practices/)
* [Scraping with Python](https://realpython.com/web-scraping-101-with-python/)
* [JavaScript Web Scraping Guide](https://www.browserless.io/blog/web-scraping-in-nodejs/)
* [Anti-Bot Bypass Techniques](https://medium.com/@selvaganesh93/how-to-bypass-anti-bot-protection-while-web-scraping-14bb87d1c326)### Community
* [Stack Overflow](https://stackoverflow.com/questions/tagged/web-scraping)
* [Reddit r/webscraping](https://reddit.com/r/webscraping)
* [Scrapy Community](https://scrapy.org/community/)---
## Contributing
Check the [Contribution Guidelines](CONTRIBUTING.md) before sending any updates.
You can [open an issue](https://github.com/lukas-bear/awesome-web-scraping/issues) or [create a new PR](https://github.com/lukas-bear/awesome-web-scraping/pulls) with your additions.
I'll make sure to check them quickly!