awesome-web-scraping

Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web.
https://github.com/lukas-bear/awesome-web-scraping

Last synced: 4 days ago

Anti-Bot Solutions
- Proxy Services
  - ScraperAPI - Proxy API service
  - IPRotate - IP rotation service
  - Smartproxy
  - anyIP.io - Reliable proxy solutions, solid mobile proxies
  - Bright Data - Enterprise proxy network
  - Oxylabs - Proxy and scraping solutions
  - ProxyEmpire
  - NetNut
  - SOAX
- CAPTCHA Solvers
  - 2captcha - Human captcha solving
  - Anti-Captcha - Automated solving
  - DeathByCaptcha - API-based solving
- Browser Fingerprinting
Data Processing
- Natural Language Processing
  - NLTK - Natural Language Toolkit
  - spaCy - Industrial-strength NLP
  - langdetect - Language detection
- Data Storage
  - MongoDB - Document database
  - Elasticsearch - Search and analytics
  - PostgreSQL - Relational database
  - Redis - In-memory data store
Resources
- Documentation
- Community
- Tutorials
Specialized Tools
- HTML/XML Processing
  - html5lib - HTML parser and serializer
  - XPath - XML path language
  - CSS Selectors - Pattern matching syntax
  - xmltodict - XML to Python dict converter
- Network Utilities
  - mitmproxy - Interactive HTTPS proxy
  - Charles Proxy - Web debugging proxy
  - Fiddler - Web debugging proxy
  - Proxychains - Proxy chaining tool
- Text Processing
  - Ftfy - Text encoding fixer
  - Dateparser - Date parsing library
  - Price-parser - Price extraction
  - Phonenumbers - Phone number parsing
Browser Automation
- Anti-detect Browsers
  - AdsPower
  - Lalicat
  - HideMyAcc
  - BitBrowser
  - Dolphin Anty
  - MoreLogin
  - Multilogin
  - GoLogin
  - Incogniton
  - Ghost Browser
- Testing Frameworks
  - Playwright - Modern web testing
  - Cypress - JavaScript testing framework
  - Selenium - Browser automation standard
- Headless Browsers
  - Chrome - Most widely supported
  - PhantomJS - Scriptable headless WebKit
Core Libraries
- JavaScript/Node.js
  - Cheerio - Fast jQuery-like parsing
  - Axios - Promise based HTTP client
  - node-crawler - Web crawler with jQuery
  - Crawlee - Web scraping and browser automation
  - Puppeteer - Chrome automation API
- Java
  - JSoup - HTML parsing and manipulation
  - Apache HttpClient - HTTP client library
  - webmagic - Distributed crawler framework
- Go
  - Colly
  - Fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  - Goquery - jQuery-like API for parsing and manipulating HTML documents.
  - Rod - High-level browser automation framework powered by Chromium DevTools.
  - Playwright-go - Headless browser automation.
  - Gocrawl - Polite, slim and concurrent web crawler.
- PHP
  - DiDOM - A blazing-fast and easy-to-use HTML parser.
  - Crawler - A powerful library for rapid web scraping and crawling development.
  - Goutte - A lightweight PHP web scraper for effortless data extraction.
- Python
  - Scrapy - Comprehensive web scraping framework
  - Beautiful Soup - HTML/XML parsing library
  - requests - HTTP library for humans
  - aiohttp - Asynchronous HTTP client/server
  - pyspider - Web crawler with GUI interface
  - MechanicalSoup - Web automation library
- Ruby
  - Nokogiri - HTML/XML parsing
  - Mechanize - Automated web interaction
  - Kimurai - Modern scraping framework
  - Watir - Ruby browser automation
  - Anemone - Web spider framework

Programming Languages

Python 12, Go 6, Ruby 4, TypeScript 4, PHP 2, C 2, Java 1, JavaScript 1

Categories

Core Libraries 28, Browser Automation 21, Anti-Bot Solutions 16, Specialized Tools 13, Resources 11, Data Processing 7

Sub Categories

Anti-detect Browsers 16, Proxy Services 10, Go 6, Python 6, HTML/XML Processing 5, Ruby 5, JavaScript/Node.js 5, Text Processing 4, Tutorials 4, Network Utilities 4, Data Storage 4, Community 4, Documentation 3, Natural Language Processing 3, Testing Frameworks 3, Java 3, CAPTCHA Solvers 3, PHP 3, Browser Fingerprinting 3, Headless Browsers 2

Keywords

crawler 9, scraping 6, automation 5, python 5, scraper 5, headless-chrome 4, web 4, crawling 4, ruby 3, headless 3, web-scraping 3, nodejs 3, javascript 3, jquery 3, selenium 3, testing 3, framework 3, golang 3, go 3, http 2, requests 2, firefox 2, chromium 2, chrome 2, http-client 2, xml 2, parser 2, playwright 2, html 2, spider 2, dom 2, robots-txt 2, cheerio 2, python-library 1, pypi 1, developer-tools 1, web-scraping-python 1, apify 1, npm 1, puppeteer 1, mechanicalsoup 1, beautifulsoup 1, http-server 1, asyncio 1, async 1, aiohttp 1, promise 1, node-module 1, selector 1, htmlparser2 1
