Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-web-scraping
Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web.
https://github.com/lukas-bear/awesome-web-scraping
Last synced: 1 day ago
-
Specialized Tools
-
Network Utilities
- mitmproxy - Interactive HTTPS proxy
- Charles Proxy - Web debugging proxy
- Fiddler - Web debugging proxy
- Proxychains - Proxy chaining tool
-
HTML/XML Processing
- XPath - XML path language
- CSS Selectors - Pattern matching syntax
- html5lib - HTML parser and serializer
- xmltodict - XML to Python dict converter
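
As a rough illustration of what path-based selection looks like in practice, here is a minimal Python sketch that uses only the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The sample document and the helper names `extract_names` and `find_price` are made up for this example and do not come from any of the tools listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """<catalog>
  <item id="1"><name>Widget</name><price>9.99</price></item>
  <item id="2"><name>Gadget</name><price>19.50</price></item>
</catalog>"""


def extract_names(xml_text):
    """Return the text of every <name> element directly under an <item>."""
    root = ET.fromstring(xml_text)
    # "./item/name" is a path expression relative to the root element.
    return [el.text for el in root.findall("./item/name")]


def find_price(xml_text, item_id):
    """Return the price of the item with the given id, or None if absent."""
    root = ET.fromstring(xml_text)
    # The [@id='...'] predicate filters <item> elements by attribute value.
    el = root.find(f"./item[@id='{item_id}']/price")
    return float(el.text) if el is not None else None


if __name__ == "__main__":
    print(extract_names(SAMPLE))
    print(find_price(SAMPLE, "2"))
```

Full XPath 1.0 features such as axes and functions are outside what `ElementTree` supports, so a dedicated parser would likely be needed for more complex queries.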
-
Text Processing
- Dateparser - Date parsing library
- Ftfy - Text encoding fixer
- Price-parser - Price extraction
- Phonenumbers - Phone number parsing
-
-
Browser Automation
-
Headless Browsers
- PhantomJS - Scriptable headless WebKit
-
Testing Frameworks
- Playwright - Modern web testing
- Cypress - JavaScript testing framework
- Selenium - Browser automation standard
-
Anti-detect Browsers
-
-
Anti-Bot Solutions
-
Proxy Services
- anyIP.io - Reliable proxy solutions, solid mobile proxies
- Oxylabs - Proxy and scraping solutions
- ScraperAPI - Proxy API service
- Bright Data - Enterprise proxy network
- IPRotate - IP rotation service
- SOAX
- ProxyEmpire
- NetNut
- Smartproxy
-
CAPTCHA Solvers
- 2captcha - Human captcha solving
- Anti-Captcha - Automated solving
- DeathByCaptcha - API-based solving
-
Browser Fingerprinting
-
-
Resources
-
Data Processing
-
Natural Language Processing
- NLTK - Natural Language Toolkit
- spaCy - Industrial-strength NLP
- langdetect - Language detection
-
Data Storage
- MongoDB - Document database
- Elasticsearch - Search and analytics
- PostgreSQL - Relational database
- Redis - In-memory data store
-
-
Core Libraries
-
Python
- MechanicalSoup - Web automation library
- Scrapy - Comprehensive web scraping framework
- Beautiful Soup - HTML/XML parsing library
- requests - HTTP library for humans
- aiohttp - Asynchronous HTTP client/server
- pyspider - Web crawler with GUI interface
-
JavaScript/Node.js
- Puppeteer - Chrome automation API
- Cheerio - Fast jQuery-like parsing
- Axios - Promise based HTTP client
- node-crawler - Web crawler with jQuery
- Crawlee - Web scraping and browser automation
- Nightmare - High-level browser automation
-
Java
- JSoup - HTML parsing and manipulation
- Apache HttpClient - HTTP client library
- crawler4j - Multithreaded crawler
- webmagic - Distributed crawler framework
- Selenium WebDriver - Browser automation
-
Go
- Colly
- Fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
- Goquery - jQuery-like API for parsing and manipulating HTML documents.
- Rod - High-level browser automation framework powered by Chromium DevTools.
- Playwright-go - headless browser automation.
- Gocrawl - Polite, slim and concurrent web crawler.
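
The "polite" behaviour mentioned for some of these crawlers (following robots.txt policies and crawl delays) can be sketched in a few lines. This is a minimal Python illustration using the standard library's `urllib.robotparser`, not the implementation of any crawler listed here. The robots.txt content, the `example-bot` user agent, and the helper names `build_policy` and `is_allowed` are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def build_policy(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, agent="example-bot"):
    """Return True if the given user agent may fetch the URL under the policy."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    policy = build_policy(ROBOTS_TXT)
    print(is_allowed(policy, "https://example.com/public/page"))
    print(is_allowed(policy, "https://example.com/private/data"))
    # Seconds a polite crawler would wait between requests, if the site declares one.
    print(policy.crawl_delay("example-bot"))
```

A crawler would typically check `is_allowed` before each request and sleep for the declared crawl delay between requests to the same host.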
-
Ruby
- Nokogiri - HTML/XML parsing
- Mechanize - Automated web interaction
- Kimurai - Modern scraping framework
- Watir - Ruby browser automation
- Anemone - Web spider framework
-
PHP
- DiDOM - A blazing-fast and easy-to-use HTML parser.
- Crawler - A powerful library for rapid web scraping and crawling development.
- Goutte - A lightweight PHP web scraper for effortless data extraction.
-
Programming Languages