https://github.com/chiemekaifemegbulem/useful_tools
Advanced Web Scraping
- Host: GitHub
- URL: https://github.com/chiemekaifemegbulem/useful_tools
- Owner: chiemekaifemegbulem
- License: mit
- Created: 2025-03-01T01:01:19.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-03-14T17:04:59.000Z (3 months ago)
- Last Synced: 2025-03-14T17:37:55.464Z (3 months ago)
- Topics: automation, beautifulsoup, captcha-solving, data-analysis, data-extraction, data-science, proxy-rotation, python, scraping-bots, selenium, tor-network, web-scraping, web-scraping-python, webscraping
- Language: Python
- Homepage:
- Size: 11.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Advanced Web Scraper
## Overview
This is an advanced web scraping tool built with Python. It is designed to extract structured data from web pages efficiently while handling common obstacles such as CAPTCHAs, dynamic content, and IP blocking.
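One common way to cope with IP blocking and flaky connections is to switch proxy and user agent on every attempt and retry with a growing delay. The following is only a minimal sketch of that idea, not the code in `scraper.py`: the function name `fetch_with_rotation`, the `send` callable, and the proxy and user-agent values are all hypothetical. The HTTP call is passed in as a function so the sketch runs without network access.

```python
import itertools
import random
import time

# Hypothetical values for illustration; not taken from scraper.py.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
USER_AGENTS = ["ExampleAgent/1.0", "ExampleAgent/2.0"]


def fetch_with_rotation(url, send, proxies=PROXIES, user_agents=USER_AGENTS,
                        retries=3, backoff=0.5, sleep=time.sleep):
    """Try `url` up to `retries` times, changing proxy and user agent each time.

    `send(url, proxy, headers)` performs the real HTTP request and raises
    OSError on connection failure; it is injected so this sketch runs offline.
    """
    proxy_cycle = itertools.cycle(proxies)  # round-robin over the proxy list
    last_error = None
    for attempt in range(retries):
        proxy = next(proxy_cycle)
        headers = {"User-Agent": random.choice(user_agents)}
        try:
            return send(url, proxy, headers)
        except OSError as exc:
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)  # exponential backoff between tries
    raise RuntimeError(f"all {retries} attempts failed for {url}") from last_error
```

With the `requests` library, `send` could be a small wrapper around `requests.get(url, proxies={"http": proxy, "https": proxy}, headers=headers, timeout=10)`; exceptions raised by `requests` derive from `OSError`, so the retry loop above would catch them.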
## Features
- **Proxy Support:** Uses a rotating list of proxies to prevent IP bans.
- **User-Agent Rotation:** Randomized user agents to mimic real users.
- **CAPTCHA Solving:** Integrates with 2Captcha for bypassing CAPTCHA challenges.
- **Tor Integration:** Optionally routes requests through the Tor network.
- **Retries and Timeout Handling:** Ensures resilience against connection failures.
- **JavaScript Rendering Support:** Uses Selenium with undetected ChromeDriver for scraping JavaScript-heavy websites.
- **Multi-threaded Scraping:** Utilizes threading for faster data extraction.
- **Automatic Data Deduplication:** Ensures only new data is stored.
- **JSON Storage:** Saves extracted data in JSON format.
## Installation
### Prerequisites
Ensure you have Python 3 installed. Then, install the required dependencies:
```sh
pip install requests beautifulsoup4 fake-useragent undetected-chromedriver selenium twocaptcha stem
```
### Setting Up Tor (Optional for Anonymity)
1. Install Tor and run it.
2. Add your Tor password to the script.
3. Ensure Tor's control port (`ControlPort`) is listening on port 9051.
## Usage
Run the script with:
```sh
python scraper.py
```
Modify the `base_url` in `main()` to scrape different websites.
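For orientation, a `main()` that takes a `base_url`, scrapes pages in threads, drops records it has already stored, and writes JSON might look roughly like the sketch below. This is an assumption-laden illustration rather than the repository's actual code: the helper names (`merge_new_records`, `save_records`), the `scrape_page` callable, the `/page/N` URL pattern, the `url` deduplication key, and the `data.json` file name are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def merge_new_records(existing, scraped, key="url"):
    """Return `existing` plus only those scraped records whose key is unseen."""
    seen = {record[key] for record in existing}
    merged = list(existing)
    for record in scraped:
        if record[key] not in seen:
            seen.add(record[key])
            merged.append(record)
    return merged


def save_records(path, scraped, key="url"):
    """Load the JSON file at `path` (if any), add new records, write it back.

    Returns the number of records that were actually new.
    """
    path = Path(path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    merged = merge_new_records(existing, scraped, key)
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return len(merged) - len(existing)


def main(scrape_page, base_url="https://example.com", pages=3, out="data.json"):
    # base_url is the value the README says to edit; the URL pattern,
    # the scrape_page callable, and the output file name are assumptions.
    urls = [f"{base_url}/page/{n}" for n in range(1, pages + 1)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(scrape_page, urls)  # each call returns a list of dicts
        scraped = [record for page in results for record in page]
    return save_records(out, scraped)
```

Here `scrape_page` stands in for whatever function fetches and parses one page. Running `main` twice against the same pages should add nothing on the second run, because every record's key is already in the stored file.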
## Contributing
We welcome contributions! Here’s how you can help:
- Improve proxy rotation mechanisms.
- Enhance JavaScript rendering efficiency.
- Add support for more CAPTCHA-solving services.
- Extend the scraper to handle more complex data structures.
### How to Contribute
1. Fork this repository.
2. Create a feature branch:
```sh
git checkout -b feature-name
```
3. Commit your changes:
```sh
git commit -am 'Add new feature'
```
4. Push to the branch:
```sh
git push origin feature-name
```
5. Open a Pull Request.
## License
This project is open-source under the MIT License.
## Contact
For suggestions or issues, please open an issue or reach out on GitHub.