https://github.com/lukas-bear/awesome-web-scraping

Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web
List: awesome-web-scraping

Host: GitHub
URL: https://github.com/lukas-bear/awesome-web-scraping
Owner: lukas-bear
Created: 2025-01-30T16:40:44.000Z (5 months ago)
Default Branch: main
Last Pushed: 2025-02-07T16:32:16.000Z (4 months ago)
Last Synced: 2025-03-28T23:01:39.881Z (3 months ago)
Topics: anti-bot, bot, captcha, crawler, go, java, javascript, network, nodejs, perl, php, proxies, proxy, proxy-server, python, ruby, rust, tools, webscraping, xml
Homepage:
Size: 40 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md

Awesome Lists containing this project

ultimate-awesome - awesome-web-scraping - Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web. (Other Lists / Julia Lists)

README

# Awesome Web Scraping

A comprehensive collection of web scraping resources, tools, and libraries.

## Contents

- [Core Libraries](#core-libraries)

- [Specialized Tools](#specialized-tools)

  - [Network Utilities](#network-utilities)

  - [HTML/XML Processing](#htmlxml-processing)

  - [Text Processing](#text-processing)

- [Browser Automation](#browser-automation)

  - [Headless Browsers](#headless-browsers)

  - [Testing Frameworks](#testing-frameworks)

  - [Anti-detect Browsers](#anti-detect-browsers)

- [Anti-Bot Solutions](#anti-bot-solutions)

  - [Proxy Services](#proxy-services)

  - [CAPTCHA Solvers](#captcha-solvers)

  - [Browser Fingerprinting](#browser-fingerprinting)

- [Data Processing](#data-processing)

  - [Natural Language Processing](#natural-language-processing)

  - [Data Storage](#data-storage)

- [Best Practices](#best-practices)

  - [Rate Limiting](#rate-limiting)

  - [Error Handling](#error-handling)

  - [Data Management](#data-management)

- [Resources](#resources)

  - [Documentation](#documentation)

  - [Tutorials](#tutorials)

  - [Community](#community)

- [Contributing](#contributing)

## Core Libraries

* [Go](go.md) - Collection of modern libraries like Colly, Chromedp, Arachnid, and Soup, with built-in concurrent processing support

* [Java](java.md) - Comprehensive set of tools including jsoup, Selenium WebDriver, Apache HttpComponents, and Heritrix for enterprise crawling

* [JavaScript/Node.js](javascript.md) - Features Puppeteer, Cheerio, Playwright, and Axios, with strong HTTP clients and browser automation capabilities

* [Perl](perl.md) - Established libraries like WWW::Mechanize, HTML::Parser, LWP, and Mojo for text processing and web scraping

* [PHP](php.md) - Includes Goutte, Symfony DomCrawler, PHP Simple HTML DOM Parser, and Guzzle for web scraping and automation

* [Python](python.md) - Rich ecosystem featuring Scrapy, pyspider, BeautifulSoup, lxml, and Selenium, with extensive text processing and automation tools

* [R](r.md) - Data-focused tools including rvest, httr, xml2, and RSelenium, with strong integration to the tidyverse ecosystem

* [Ruby](ruby.md) - Features Nokogiri, Mechanize, Kimurai framework, and HTTParty, with elegant APIs for web scraping and parsing

* [Rust](rust.md) - Modern tooling with reqwest, scraper, tokio, and tungstenite for high-performance async scraping

## Specialized Tools

### Network Utilities

* [mitmproxy](https://mitmproxy.org/) - Interactive HTTPS proxy

* [Charles Proxy](https://www.charlesproxy.com/) - Web debugging proxy

* [Fiddler](https://www.telerik.com/fiddler) - Web debugging proxy

* [Proxychains](https://github.com/haad/proxychains) - Proxy chaining tool

### HTML/XML Processing

* [XPath](https://www.w3.org/TR/xpath-31/) - XML path language

* [CSS Selectors](https://www.w3.org/TR/selectors-4/) - Pattern matching syntax

* [html5lib](https://github.com/html5lib/) - HTML parser and serializer

* [xmltodict](https://github.com/martinblech/xmltodict) - XML to Python dict converter
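
As a small illustration of path-based selection, the sketch below uses the limited XPath subset supported by Python's standard-library `xml.etree.ElementTree`. The XML snippet and its element names are made up for this example. Real-world HTML is rarely well-formed XML, which is why a dedicated parser such as html5lib is usually needed before queries like these can run.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XML snippet used only for illustration.
DOCUMENT = """
<catalog>
  <item class="book"><title>Dune</title><price>9.99</price></item>
  <item class="film"><title>Alien</title><price>4.50</price></item>
  <item class="book"><title>Emma</title><price>3.25</price></item>
</catalog>
"""

root = ET.fromstring(DOCUMENT)

# ".//item[@class='book']/title" selects <title> children of every <item>
# whose class attribute equals "book", at any depth below the root.
book_titles = [node.text for node in root.findall(".//item[@class='book']/title")]

# findtext() returns the text of the first match, or the default when nothing matches.
first_price = root.findtext(".//item/price", default="")
```

A roughly equivalent CSS selector for the first query would be `item.book > title`. ElementTree does not evaluate CSS selectors, so that form needs a third-party library.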

### Text Processing

* [Dateparser](https://github.com/scrapinghub/dateparser) - Date parsing library

* [Ftfy](https://github.com/LuminosoInsight/python-ftfy) - Text encoding fixer

* [Price-parser](https://github.com/scrapinghub/price-parser) - Price extraction

* [Phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) - Phone number parsing

## Browser Automation

### Headless Browsers

* [Chrome](https://www.google.com/chrome/browser/) - Most widely supported

* [Firefox](https://www.mozilla.org/firefox/) - Open-source alternative

* [PhantomJS](https://phantomjs.org/) - Scriptable headless WebKit (development suspended since 2018)

### Testing Frameworks

* [Selenium](https://www.selenium.dev/) - Browser automation standard

* [Playwright](https://playwright.dev/) - Modern web testing

* [Cypress](https://www.cypress.io/) - JavaScript testing framework

### Anti-detect Browsers

* [Multilogin](https://multilogin.com)

* [AdsPower](https://www.adspower.com)

* [GoLogin](https://gologin.com)

* [Incogniton](https://incogniton.com)

* [Dolphin Anty](https://dolphin-anty.com)

* [MoreLogin](https://www.morelogin.com)

* [Lalicat](https://www.lalicat.com)

* [HideMyAcc](https://hidemyacc.com)

* [BitBrowser](https://www.bitbrowser.net)

* [Ghost Browser](https://ghostbrowser.com)

## Anti-Bot Solutions

### Proxy Services

* [anyIP.io](https://anyip.io/) - Reliable proxy solutions, solid mobile proxies

* [Bright Data](https://brightdata.com/) - Enterprise proxy network

* [Oxylabs](https://oxylabs.io/) - Proxy and scraping solutions

* [ScraperAPI](https://www.scraperapi.com/) - Proxy API service

* [IPRotate](https://www.iprotatepro.com/) - IP rotation service

* [Smartproxy](https://smartproxy.com/) - Residential and datacenter proxies

* [SOAX](https://soax.com/) - Rotating residential and mobile proxies

* [ProxyEmpire](https://proxyempire.io/) - Residential and mobile proxies

* [NetNut](https://netnut.io/) - ISP proxies with high uptime

### CAPTCHA Solvers

* [2captcha](https://2captcha.com/) - Human captcha solving

* [Anti-Captcha](https://anti-captcha.com/) - Automated solving

* [DeathByCaptcha](https://deathbycaptcha.com/) - API-based solving

### Browser Fingerprinting

* [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)

* [selenium-stealth](https://github.com/diprajpatra/selenium-stealth)

* [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)

## Data Processing

### Natural Language Processing

* [NLTK](https://www.nltk.org/) - Natural Language Toolkit

* [spaCy](https://spacy.io/) - Industrial-strength NLP

* [TextBlob](https://textblob.readthedocs.io/) - Simplified text processing

* [langdetect](https://github.com/Mimino666/langdetect) - Language detection

### Data Storage

* [MongoDB](https://www.mongodb.com/) - Document database

* [Elasticsearch](https://www.elastic.co/) - Search and analytics

* [PostgreSQL](https://www.postgresql.org/) - Relational database

* [Redis](https://redis.io/) - In-memory data store

## Best Practices

### Rate Limiting

* Implement exponential backoff

* Respect robots.txt directives

* Use delays between requests

* Monitor response codes
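
The points above can be sketched with the Python standard library alone. This is a minimal illustration, not a complete rate limiter: the function names, the `example-bot` user agent, and the sample robots.txt content are all assumptions made for this example, and the robots.txt text is parsed from a string so that nothing is fetched over the network.

```python
import random
from urllib.robotparser import RobotFileParser


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return the exponential backoff ceiling for a 0-based attempt number."""
    return min(cap, base * (2 ** attempt))


def jittered_delay(attempt, base=1.0, cap=60.0):
    """Pick a random delay in [0, ceiling] so that clients do not retry in lockstep."""
    return random.uniform(0, backoff_delay(attempt, base, cap))


def build_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_back_off(status_code):
    """Treat 429 and 5xx responses as a signal to slow down."""
    return status_code == 429 or 500 <= status_code < 600


# Hypothetical robots.txt content and user agent, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = build_robots(EXAMPLE_ROBOTS)
allowed = robots.can_fetch("example-bot", "https://example.com/products")
blocked = robots.can_fetch("example-bot", "https://example.com/private/data")
polite_delay = robots.crawl_delay("example-bot")
```

In a crawl loop, one possible arrangement is to sleep for `polite_delay` seconds between ordinary requests and to switch to `time.sleep(jittered_delay(attempt))` whenever `should_back_off` returns true for the last response.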

### Error Handling

* Implement retry logic

* Log errors comprehensively

* Handle timeouts gracefully

* Monitor scraping health
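
A minimal sketch of retry logic with logging, assuming the caller supplies its own `fetch` callable that raises on timeouts or other transient failures. `TransientError`, `fetch_with_retries`, and the parameter names are hypothetical and not taken from any library listed here.

```python
import logging
import time

logger = logging.getLogger("scraper")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, resets)."""


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    `fetch` is any callable supplied by the caller; it is expected to raise
    TransientError or TimeoutError when a retry makes sense.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except (TransientError, TimeoutError) as exc:
            logger.warning("attempt %d/%d for %s failed: %r",
                           attempt, max_attempts, url, exc)
            if attempt == max_attempts:
                # Out of attempts: record the failure and let the caller decide.
                logger.error("giving up on %s", url)
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists so that tests can pass a stub instead of waiting. Any other exception type propagates immediately, which keeps programming errors from being retried silently.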

### Data Management

* Validate extracted data

* Remove duplicates

* Store raw and processed data

* Document data schema
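
One way these four points might fit together is sketched below. The `Product` schema, its fields, and the helper names are assumptions made for illustration; a real project would define its own record type and validation rules.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Product:
    """Hypothetical schema for one scraped record; the dataclass doubles as documentation."""
    url: str
    title: str
    price: float


def validate(raw):
    """Return a Product, or raise ValueError if a field is missing or malformed."""
    try:
        product = Product(url=str(raw["url"]).strip(),
                          title=str(raw["title"]).strip(),
                          price=float(raw["price"]))
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"invalid record: {raw!r}") from exc
    if (not product.url.startswith(("http://", "https://"))
            or not product.title or product.price < 0):
        raise ValueError(f"invalid record: {raw!r}")
    return product


def fingerprint(product):
    """Stable hash of the cleaned record, used as a deduplication key."""
    payload = json.dumps(asdict(product), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def process(raw_records):
    """Leave raw input untouched; return (clean_unique_products, rejected_raw_records)."""
    seen, clean, rejected = set(), [], []
    for raw in raw_records:
        try:
            product = validate(raw)
        except ValueError:
            rejected.append(raw)
            continue
        key = fingerprint(product)
        if key not in seen:
            seen.add(key)
            clean.append(product)
    return clean, rejected
```

Because `process` never mutates its input, the raw records can be stored as scraped alongside the cleaned output, and rejected records remain available for later inspection.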

## Resources

### Documentation

* [Scrapy Documentation](https://docs.scrapy.org/)

* [Selenium Documentation](https://selenium.dev/documentation/)

* [Puppeteer Documentation](https://pptr.dev/)

* [Playwright Documentation](https://playwright.dev/docs/intro)

### Tutorials

* [Web Scraping Best Practices](https://www.scrapehero.com/web-scraping-best-practices/)

* [Scraping with Python](https://realpython.com/web-scraping-101-with-python/)

* [JavaScript Web Scraping Guide](https://www.browserless.io/blog/web-scraping-in-nodejs/)

* [Anti-Bot Bypass Techniques](https://medium.com/@selvaganesh93/how-to-bypass-anti-bot-protection-while-web-scraping-14bb87d1c326)

### Community

* [Stack Overflow](https://stackoverflow.com/questions/tagged/web-scraping)

* [Reddit r/webscraping](https://reddit.com/r/webscraping)

* [Scrapy Community](https://scrapy.org/community/)

---

## Contributing

Check the [Contribution Guidelines](CONTRIBUTING.md) before sending any updates.

You can [open an issue](https://github.com/lukas-bear/awesome-web-scraping/issues) or [create a new PR](https://github.com/lukas-bear/awesome-web-scraping/pulls) with your additions.

I'll make sure to check them quickly!
