awesome-web-scraping
Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web.
https://github.com/lukas-bear/awesome-web-scraping
-
Anti-Bot Solutions
-
Proxy Services
- ScraperAPI - Proxy API service
- IPRotate - IP rotation service
- Smartproxy
- anyIP.io - Reliable proxy solutions, solid mobile proxies
- Bright Data - Enterprise proxy network
- Oxylabs - Proxy and scraping solutions
- ProxyEmpire
- NetNut
- SOAX
-
CAPTCHA Solvers
- 2captcha - Human captcha solving
- Anti-Captcha - Automated solving
- DeathByCaptcha - API-based solving
-
Browser Fingerprinting
-
-
Data Processing
-
Natural Language Processing
- NLTK - Natural Language Toolkit
- spaCy - Industrial-strength NLP
- langdetect - Language detection
-
Data Storage
- MongoDB - Document database
- Elasticsearch - Search and analytics
- PostgreSQL - Relational database
- Redis - In-memory data store
-
-
Resources
-
Specialized Tools
-
HTML/XML Processing
- html5lib - HTML parser and serializer
- XPath - XML path language
- CSS Selectors - Pattern matching syntax
- xmltodict - XML to Python dict converter
-
Network Utilities
- mitmproxy - Interactive HTTPS proxy
- Charles Proxy - Web debugging proxy
- Fiddler - Web debugging proxy
- Proxychains - Proxy chaining tool
-
Text Processing
- Ftfy - Text encoding fixer
- Dateparser - Date parsing library
- Price-parser - Price extraction
- Phonenumbers - Phone number parsing
-
-
Browser Automation
-
Anti-detect Browsers
-
Testing Frameworks
- Playwright - Modern web testing
- Cypress - JavaScript testing framework
- Selenium - Browser automation standard
-
Headless Browsers
-
-
Core Libraries
-
JavaScript/Node.js
- Cheerio - Fast jQuery-like parsing
- Axios - Promise based HTTP client
- node-crawler - Web crawler with jQuery
- Crawlee - Web scraping and browser automation
- Puppeteer - Chrome automation API
-
Java
- JSoup - HTML parsing and manipulation
- Apache HttpClient - HTTP client library
- webmagic - Distributed crawler framework
-
Go
- Colly
- Fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
- Goquery - jQuery-like API for parsing and manipulating HTML documents.
- Rod - High-level browser automation framework powered by Chromium DevTools.
- Playwright-go - headless browser automation.
- Gocrawl - Polite, slim and concurrent web crawler.
-
PHP
-
Python
- Scrapy - Comprehensive web scraping framework
- Beautiful Soup - HTML/XML parsing library
- requests - HTTP library for humans
- aiohttp - Asynchronous HTTP client/server
- pyspider - Web crawler with GUI interface
- MechanicalSoup - Web automation library
-
Ruby
-
Programming Languages
Categories
Sub Categories
- Anti-detect Browsers (16)
- Proxy Services (10)
- Go (6)
- Python (6)
- HTML/XML Processing (5)
- Ruby (5)
- JavaScript/Node.js (5)
- Text Processing (4)
- Tutorials (4)
- Network Utilities (4)
- Data Storage (4)
- Community (4)
- Documentation (3)
- Natural Language Processing (3)
- Testing Frameworks (3)
- Java (3)
- CAPTCHA Solvers (3)
- PHP (3)
- Browser Fingerprinting (3)
- Headless Browsers (2)
Keywords
- crawler (9)
- scraping (6)
- automation (5)
- python (5)
- scraper (5)
- headless-chrome (4)
- web (4)
- crawling (4)
- ruby (3)
- headless (3)
- web-scraping (3)
- nodejs (3)
- javascript (3)
- jquery (3)
- selenium (3)
- testing (3)
- framework (3)
- golang (3)
- go (3)
- http (2)
- requests (2)
- firefox (2)
- chromium (2)
- chrome (2)
- http-client (2)
- xml (2)
- parser (2)
- playwright (2)
- html (2)
- spider (2)
- dom (2)
- robots-txt (2)
- cheerio (2)
- python-library (1)
- pypi (1)
- developer-tools (1)
- web-scraping-python (1)
- apify (1)
- npm (1)
- puppeteer (1)
- mechanicalsoup (1)
- beautifulsoup (1)
- http-server (1)
- asyncio (1)
- async (1)
- aiohttp (1)
- promise (1)
- node-module (1)
- selector (1)
- htmlparser2 (1)