awesome-web-scraping

Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web.
https://github.com/lukas-bear/awesome-web-scraping

Last synced: 4 days ago

Specialized Tools
- Network Utilities
  - mitmproxy - Interactive HTTPS proxy
  - Charles Proxy - Web debugging proxy
  - Fiddler - Web debugging proxy
  - Proxychains - Proxy chaining tool
- HTML/XML Processing
  - XPath - XML path language
  - CSS Selectors - Pattern matching syntax
  - html5lib - HTML parser and serializer
  - xmltodict - XML to Python dict converter
- Text Processing
  - Dateparser - Date parsing library
  - Ftfy - Text encoding fixer
  - Price-parser - Price extraction
  - Phonenumbers - Phone number parsing
Browser Automation
- Headless Browsers
  - PhantomJS - Scriptable headless WebKit
- Testing Frameworks
  - Playwright - Modern web testing
  - Cypress - JavaScript testing framework
  - Selenium - Browser automation standard
- Anti-detect Browsers
  - AdsPower
  - Lalicat
  - HideMyAcc
  - BitBrowser
  - Dolphin Anty
  - MoreLogin
  - Ghost Browser
Anti-Bot Solutions
- Proxy Services
  - anyIP.io - Reliable proxy solutions, solid mobile proxies
  - Oxylabs - Proxy and scraping solutions
  - ScraperAPI - Proxy API service
  - Bright Data - Enterprise proxy network
  - IPRotate - IP rotation service
  - SOAX
  - ProxyEmpire
  - NetNut
  - Smartproxy
- CAPTCHA Solvers
  - 2captcha - Human captcha solving
  - Anti-Captcha - Automated solving
  - DeathByCaptcha - API-based solving
- Browser Fingerprinting
Resources
- Documentation
  - Puppeteer Documentation
  - Playwright Documentation
- Tutorials
- Community
Data Processing
- Natural Language Processing
  - NLTK - Natural Language Toolkit
  - spaCy - Industrial-strength NLP
  - langdetect - Language detection
- Data Storage
  - MongoDB - Document database
  - Elasticsearch - Search and analytics
  - PostgreSQL - Relational database
  - Redis - In-memory data store
Core Libraries
- Python
  - MechanicalSoup - Web automation library
  - Scrapy - Comprehensive web scraping framework
  - Beautiful Soup - HTML/XML parsing library
  - requests - HTTP library for humans
  - aiohttp - Asynchronous HTTP client/server
  - pyspider - Web crawler with GUI interface
- JavaScript/Node.js
  - Puppeteer - Chrome automation API
  - Cheerio - Fast jQuery-like parsing
  - Axios - Promise based HTTP client
  - node-crawler - Web crawler with jQuery
  - Crawlee - Web scraping and browser automation
- Java
  - JSoup - HTML parsing and manipulation
  - Apache HttpClient - HTTP client library
  - crawler4j - Multithreaded crawler
  - webmagic - Distributed crawler framework
  - Selenium WebDriver - Browser automation
- Go
  - Colly
  - Fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  - Goquery - jQuery-like API for parsing and manipulating HTML documents.
  - Rod - High-level browser automation framework powered by Chromium DevTools.
  - Playwright-go - headless browser automation.
  - Gocrawl - Polite, slim and concurrent web crawler.
- Ruby
  - Nokogiri - HTML/XML parsing
  - Mechanize - Automated web interaction
  - Kimurai - Modern scraping framework
  - Watir - Ruby browser automation
  - Anemone - Web spider framework
- PHP
  - DiDOM - A blazing-fast and easy-to-use HTML parser.
  - Crawler - A powerful library for rapid web scraping and crawling development.
  - Goutte - A lightweight PHP web scraper for effortless data extraction.

Programming Languages

Python 18 Go 12 Ruby 8 TypeScript 8 PHP 3 C 3 Java 2 JavaScript 2
