{"id":84170,"url":"https://github.com/lukas-bear/awesome-web-scraping","name":"awesome-web-scraping","description":"Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web","projects_count":62,"last_synced_at":"2026-04-05T15:00:28.224Z","repository":{"id":275007783,"uuid":"924783833","full_name":"lukas-bear/awesome-web-scraping","owner":"lukas-bear","description":"Best scraping tools collection in town. Find everything you need for scraping, crawling, and processing data from the web","archived":false,"fork":false,"pushed_at":"2025-02-07T16:32:16.000Z","size":41,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-22T23:02:21.201Z","etag":null,"topics":["anti-bot","bot","captcha","crawler","go","java","javascript","network","nodejs","perl","php","proxies","proxy","proxy-server","python","ruby","rust","tools","webscraping","xml"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lukas-bear.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-30T16:40:44.000Z","updated_at":"2026-03-03T21:17:16.000Z","dependencies_parsed_at":"2025-03-03T20:01:39.601Z","dependency_job_id":"eb38cbb6-dcd5-41bd-83be-0ec21e73a723","html_url":"https://github.com/lukas-bear/awesome-web-scraping","commit_stats":null,"previous_names":["lukas-bear/awesome-web-scraping"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lukas-bear/a
wesome-web-scraping","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukas-bear%2Fawesome-web-scraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukas-bear%2Fawesome-web-scraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukas-bear%2Fawesome-web-scraping/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukas-bear%2Fawesome-web-scraping/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lukas-bear","download_url":"https://codeload.github.com/lukas-bear/awesome-web-scraping/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukas-bear%2Fawesome-web-scraping/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31439442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T13:13:19.330Z","status":"ssl_error","status_checked_at":"2026-04-05T13:13:17.778Z","response_time":75,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"readme":"# Awesome Web Scraping\n\nA comprehensive collection of web scraping resources, tools, and libraries.\n\n## Contents\n\n- [Core Libraries](#core-libraries)\n- [Specialized Tools](#specialized-tools)\n  - [Network Utilities](#network-utilities)\n  - [HTML/XML Processing](#htmlxml-processing)\n  - [Text Processing](#text-processing)\n  
- [Data Formats](#data-formats)\n- [Browser Automation](#browser-automation)\n  - [Headless Browsers](#headless-browsers)\n  - [Testing Frameworks](#testing-frameworks)\n  - [Anti-detect Browsers](#anti-detect-browsers)\n- [Anti-Bot Solutions](#anti-bot-solutions)\n  - [Proxy Services](#proxy-services)\n  - [CAPTCHA Solvers](#captcha-solvers)\n  - [Browser Fingerprinting](#browser-fingerprinting)\n- [Data Processing](#data-processing)\n  - [Natural Language Processing](#natural-language-processing)\n  - [Data Storage](#data-storage)\n- [Best Practices](#best-practices)\n  - [Rate Limiting](#rate-limiting)\n  - [Error Handling](#error-handling)\n  - [Data Management](#data-management)\n- [Resources](#resources)\n  - [Documentation](#documentation)\n  - [Tutorials](#tutorials)\n  - [Community](#community)\n- [How to Contribute](#contributing)\n\n## Core Libraries\n* [Go](go.md) - Collection of modern libraries like Colly, Chromedp, Arachnid, and Soup, with built-in concurrent processing support\n* [Java](java.md) - Comprehensive set of tools including JSoup, Selenium WebDriver, Apache HttpComponents, and Heritrix for enterprise crawling\n* [JavaScript/Node.js](javascript.md) - Features Puppeteer, Cheerio, Playwright, and Axios, with strong HTTP clients and browser automation capabilities\n* [Perl](perl.md) - Established libraries like WWW::Mechanize, HTML::Parser, LWP, and Mojo for text processing and web scraping\n* [PHP](php.md) - Includes Goutte, Symfony DomCrawler, PHP Simple HTML DOM Parser, and Guzzle for web scraping and automation\n* [Python](python.md) - Rich ecosystem featuring Scrapy, pyspider, BeautifulSoup, lxml, and Selenium, with extensive text processing and automation tools\n* [R](r.md) - Data-focused tools including rvest, httr, xml2, and RSelenium, integrating tightly with the tidyverse ecosystem\n* [Ruby](ruby.md) - Features Nokogiri, Mechanize, Kimurai framework, 
and HTTParty, with elegant APIs for web scraping and parsing\n* [Rust](rust.md) - Modern tooling with reqwest, scraper, tokio, and tungstenite for high-performance async scraping\n\n## Specialized Tools\n\n### Network Utilities\n* [mitmproxy](https://mitmproxy.org/) - Interactive HTTPS proxy\n* [Charles Proxy](https://www.charlesproxy.com/) - Web debugging proxy\n* [Fiddler](https://www.telerik.com/fiddler) - Web debugging proxy\n* [Proxychains](https://github.com/haad/proxychains) - Proxy chaining tool\n\n### HTML/XML Processing\n* [XPath](https://www.w3.org/TR/xpath-31/) - XML path language\n* [CSS Selectors](https://www.w3.org/TR/selectors-4/) - Pattern matching syntax\n* [html5lib](https://github.com/html5lib/) - HTML parser and serializer\n* [xmltodict](https://github.com/martinblech/xmltodict) - XML to Python dict converter\n\n### Text Processing\n* [Dateparser](https://github.com/scrapinghub/dateparser) - Date parsing library\n* [Ftfy](https://github.com/LuminosoInsight/python-ftfy) - Text encoding fixer\n* [Price-parser](https://github.com/scrapinghub/price-parser) - Price extraction\n* [Phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) - Phone number parsing\n\n## Browser Automation\n\n### Headless Browsers\n* [Chrome](https://www.google.com/chrome/browser/) - Most widely supported\n* [Firefox](https://www.mozilla.org/firefox/) - Open-source alternative\n* [PhantomJS](https://phantomjs.org/) - Scriptable headless WebKit (development suspended since 2018)\n\n### Testing Frameworks\n* [Selenium](https://www.selenium.dev/) - Browser automation standard\n* [Playwright](https://playwright.dev/) - Modern web testing\n* [Cypress](https://www.cypress.io/) - JavaScript testing framework\n\n### Anti-detect Browsers\n* [Multilogin](https://multilogin.com)\n* [AdsPower](https://www.adspower.com)\n* [GoLogin](https://gologin.com)\n* [Incogniton](https://incogniton.com)\n* [Dolphin Anty](https://dolphin-anty.com)\n* [MoreLogin](https://www.morelogin.com)\n* 
[Lalicat](https://www.lalicat.com)\n* [HideMyAcc](https://hidemyacc.com)\n* [BitBrowser](https://www.bitbrowser.net)\n* [Ghost Browser](https://ghostbrowser.com)\n\n## Anti-Bot Solutions\n\n### Proxy Services\n* [anyIP.io](https://anyip.io/) - Mobile and residential proxy provider\n* [Bright Data](https://brightdata.com/) - Enterprise proxy network\n* [Oxylabs](https://oxylabs.io/) - Proxy and scraping solutions\n* [ScraperAPI](https://www.scraperapi.com/) - Proxy API service\n* [IPRotate](https://www.iprotatepro.com/) - IP rotation service\n* [Smartproxy](https://smartproxy.com/) - Residential and datacenter proxies\n* [SOAX](https://soax.com/) - Rotating residential and mobile proxies\n* [ProxyEmpire](https://proxyempire.io/) - Residential and mobile proxies\n* [NetNut](https://netnut.io/) - ISP proxies with high uptime\n\n### CAPTCHA Solvers\n* [2captcha](https://2captcha.com/) - Human captcha solving\n* [Anti-Captcha](https://anti-captcha.com/) - Automated solving\n* [DeathByCaptcha](https://deathbycaptcha.com/) - API-based solving\n\n### Browser Fingerprinting\n* [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)\n* [selenium-stealth](https://github.com/diprajpatra/selenium-stealth)\n* [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)\n\n## Data Processing\n\n### Natural Language Processing\n* [NLTK](https://www.nltk.org/) - Natural Language Toolkit\n* [spaCy](https://spacy.io/) - Industrial-strength NLP\n* [TextBlob](https://textblob.readthedocs.io/) - Simplified text processing\n* [langdetect](https://github.com/Mimino666/langdetect) - Language detection\n\n### Data Storage\n* [MongoDB](https://www.mongodb.com/) - Document database\n* [Elasticsearch](https://www.elastic.co/) - Search and analytics\n* [PostgreSQL](https://www.postgresql.org/) - Relational database\n* [Redis](https://redis.io/) - In-memory data store\n\n## Best 
Practices\n\n### Rate Limiting\n* Implement exponential backoff\n* Respect robots.txt directives\n* Use delays between requests\n* Monitor response codes\n\n### Error Handling\n* Implement retry logic\n* Log errors comprehensively\n* Handle timeouts gracefully\n* Monitor scraping health\n\n### Data Management\n* Validate extracted data\n* Remove duplicates\n* Store raw and processed data\n* Document data schema\n\n## Resources\n\n### Documentation\n* [Scrapy Documentation](https://docs.scrapy.org/)\n* [Selenium Documentation](https://selenium.dev/documentation/)\n* [Puppeteer Documentation](https://pptr.dev/)\n* [Playwright Documentation](https://playwright.dev/docs/intro)\n\n### Tutorials\n* [Web Scraping Best Practices](https://www.scrapehero.com/web-scraping-best-practices/)\n* [Scraping with Python](https://realpython.com/web-scraping-101-with-python/)\n* [JavaScript Web Scraping Guide](https://www.browserless.io/blog/web-scraping-in-nodejs/)\n* [Anti-Bot Bypass Techniques](https://medium.com/@selvaganesh93/how-to-bypass-anti-bot-protection-while-web-scraping-14bb87d1c326)\n\n### Community\n* [Stack Overflow](https://stackoverflow.com/questions/tagged/web-scraping)\n* [Reddit r/webscraping](https://reddit.com/r/webscraping)\n* [Scrapy Community](https://scrapy.org/community/)\n\n---\n\n## Contributing\n\nCheck the [Contribution Guidelines](CONTRIBUTING.md) before sending any updates.\n\nYou can [open an issue](https://github.com/lukas-bear/awesome-web-scraping/issues) or [create a new PR](https://github.com/lukas-bear/awesome-web-scraping/pulls) with your additions.\nI'll make sure to check them quickly!\n","created_at":"2025-02-07T00:16:45.916Z","updated_at":"2026-04-05T15:00:28.225Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["Specialized Tools","Data Processing","Browser Automation","Core Libraries","Resources","Anti-Bot Solutions"],"sub_categories":["Network Utilities","Data Storage","Testing Frameworks","Natural Language 
Processing","Java","Documentation","Proxy Services","Python","Headless Browsers","CAPTCHA Solvers","HTML/XML Processing","JavaScript/Node.js","Go","Text Processing","Ruby","PHP","Browser Fingerprinting","Anti-detect Browsers","Tutorials","Community"],"projects_url":"https://awesome.ecosyste.ms/api/v1/lists/lukas-bear%2Fawesome-web-scraping/projects"}