awesome-web-scraping

Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web.
https://github.com/lukas-bear/awesome-web-scraping

Specialized Tools
- Network Utilities
  - Charles Proxy - Web debugging proxy
  - Fiddler - Web debugging proxy
  - Proxychains - Proxy chaining tool
  - mitmproxy - Interactive HTTPS proxy
- HTML/XML Processing
  - XPath - XML path language
  - xmltodict - XML to Python dict converter
  - CSS Selectors - Pattern matching syntax
  - html5lib - HTML parser and serializer
- Text Processing
  - Phonenumbers - Phone number parsing
  - Price-parser - Price extraction
  - Dateparser - Date parsing library
  - Ftfy - Text encoding fixer
Data Processing
- Data Storage
  - MongoDB - Document database
  - PostgreSQL - Relational database
  - Redis - In-memory data store
  - Elasticsearch - Search and analytics
- Natural Language Processing
  - NLTK - Natural Language Toolkit
  - spaCy - Industrial-strength NLP
  - langdetect - Language detection
  - TextBlob - Simplified text processing
Browser Automation
- Testing Frameworks
  - Cypress - JavaScript testing framework
  - Playwright - Modern web testing
  - Selenium - Browser automation standard
- Headless Browsers
  - PhantomJS - Scriptable headless WebKit
- Anti-detect Browsers
Core Libraries
- Java
  - Apache HttpClient - HTTP client library
  - JSoup - HTML parsing and manipulation
  - webmagic - Distributed crawler framework
- Python
  - Beautiful Soup - HTML/XML parsing library
  - aiohttp - Asynchronous HTTP client/server
  - MechanicalSoup - Web automation library
  - pyspider - Web crawler with GUI interface
  - Scrapy - Comprehensive web scraping framework
  - requests - HTTP library for humans
- JavaScript/Node.js
  - Puppeteer - Chrome automation API
  - Cheerio - Fast jQuery-like parsing
  - Axios - Promise based HTTP client
  - Crawlee - Web scraping and browser automation
  - node-crawler - Web crawler with jQuery
- Go
  - Colly
  - Rod - High-level browser automation framework powered by Chromium DevTools.
  - Goquery - jQuery-like API for parsing and manipulating HTML documents.
  - Gocrawl - Polite, slim and concurrent web crawler.
  - Fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  - Playwright-go - Headless browser automation.
- Ruby
  - Mechanize - Automated web interaction
  - Kimurai - Modern scraping framework
  - Nokogiri - HTML/XML parsing
  - Watir - Ruby browser automation
  - Anemone - Web spider framework
- PHP
  - Goutte - A lightweight PHP web scraper for effortless data extraction.
  - DiDOM - A blazing-fast and easy-to-use HTML parser.
  - Crawler - A powerful library for rapid web scraping and crawling development.
Resources
- Documentation
- Tutorials
  - Anti-Bot Bypass Techniques
- Community
Anti-Bot Solutions
- Proxy Services
  - ScraperAPI - Proxy API service
  - Bright Data - Enterprise proxy network
  - anyIP.io - Reliable proxy solutions, solid mobile proxies
  - Oxylabs - Proxy and scraping solutions
  - IPRotate - IP rotation service
  - Smartproxy
  - SOAX
  - ProxyEmpire
  - NetNut
- CAPTCHA Solvers
  - Anti-Captcha - Automated solving
  - 2captcha - Human captcha solving
  - DeathByCaptcha - API-based solving
- Browser Fingerprinting

Programming Languages

Python (12), Go (6), Ruby (4), TypeScript (4), PHP (2), C (2), Java (1), JavaScript (1)

Categories

Core Libraries (28), Anti-Bot Solutions (15), Browser Automation (12), Specialized Tools (12), Data Processing (8), Resources (7)

Sub Categories

Proxy Services (9), Anti-detect Browsers (8), Go (6), Python (6), Ruby (5), JavaScript/Node.js (5), Text Processing (4), Natural Language Processing (4), HTML/XML Processing (4), Network Utilities (4), Data Storage (4), Documentation (3), Testing Frameworks (3), Java (3), CAPTCHA Solvers (3), Community (3), PHP (3), Browser Fingerprinting (3), Headless Browsers (1), Tutorials (1)

Keywords

crawler (9), scraping (6), automation (5), python (5), scraper (5), headless-chrome (4), web (4), crawling (4), ruby (3), headless (3), web-scraping (3), nodejs (3), javascript (3), jquery (3), selenium (3), testing (3), framework (3), golang (3), go (3), http (2), requests (2), firefox (2), chromium (2), chrome (2), http-client (2), xml (2), parser (2), playwright (2), html (2), spider (2), dom (2), robots-txt (2), cheerio (2), python-library (1), pypi (1), developer-tools (1), web-scraping-python (1), apify (1), npm (1), puppeteer (1), mechanicalsoup (1), beautifulsoup (1), http-server (1), asyncio (1), async (1), aiohttp (1), promise (1), node-module (1), selector (1), htmlparser2 (1)
