https://github.com/biraj21/web-wanderer

A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.
data-extraction multithreading python web-crawler webcrawler

Host: GitHub
URL: https://github.com/biraj21/web-wanderer
Owner: biraj21
License: mit
Created: 2023-07-29T08:37:10.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-11-30T15:18:20.000Z (over 1 year ago)
Last Synced: 2024-11-30T16:24:45.450Z (over 1 year ago)
Topics: data-extraction, multithreading, python, web-crawler, webcrawler
Language: Python
Homepage:
Size: 207 KB
Stars: 20
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Web Wanderer

Web Wanderer is a multi-threaded web crawler written in Python, utilizing `concurrent.futures.ThreadPoolExecutor` and Playwright to efficiently crawl and download web pages. This web crawler is designed to handle dynamically rendered websites, making it capable of extracting content from modern web applications.

![Screenshot](images/ss.png)

## How to Use

First install the [required dependencies](#dependencies).

Then you can use it either as a CLI tool or as a library.

### 1. As a Command-Line Interface

```bash
python src/main.py https://python.langchain.com/en/latest/
```

### 2. As a Library

To start crawling, instantiate the `MultithreadedCrawler` class with the seed URL (and any optional parameters), then call `start()`:

```python
from crawlers import MultithreadedCrawler

crawler = MultithreadedCrawler("https://python.langchain.com/en/latest/")
crawler.start()
```

The `MultithreadedCrawler` class is initialized with the following parameters:

- `seed_url` (str): The URL from which the crawling process will begin.

- `output_dir` (str): The directory where the downloaded pages will be stored. Defaults to `web-wanderer/downloads/`, inside which the pages are saved in a folder named after the base URL of the seed.

- `num_threads` (int): The number of threads the crawler should use. This determines the level of concurrency during the crawling process. Defaults to `8`.

- `done_callback` (Callable | None): A callback function that will be called after crawling is successfully done.
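
The parameters above can be collected as keyword arguments, as in the minimal sketch below. It rests on several assumptions that the README does not confirm: that the documented names work as keyword arguments, that the callback is called with no arguments, and that "base URL" means the host part of the seed. `default_output_dir` and `on_done` are hypothetical names used only for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def default_output_dir(seed_url: str, root: str = "web-wanderer/downloads") -> Path:
    """Hypothetical helper: a folder under `root` named after the seed's host."""
    host = urlparse(seed_url).netloc
    return Path(root) / host


def on_done() -> None:
    """Hypothetical callback; assumed here to take no arguments."""
    print("Crawling finished.")


seed = "https://python.langchain.com/en/latest/"

# Keyword names follow the parameter list documented above.
crawler_kwargs = {
    "seed_url": seed,
    "output_dir": str(default_output_dir(seed)),
    "num_threads": 4,
    "done_callback": on_done,
}

# With the project on the import path, this would presumably be used as:
# from crawlers import MultithreadedCrawler
# MultithreadedCrawler(**crawler_kwargs).start()
```

The last two lines are left commented out so the snippet can be read and run without the project installed.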

## Features

- **Multi-Threaded:** Web Wanderer employs multi-threading using the `ThreadPoolExecutor`, which allows for concurrent fetching of web pages, making the crawling process faster and more efficient.

- **Dynamic Website Support:** The integration of Playwright enables Web Wanderer to handle dynamically rendered websites, extracting content from modern web applications that rely on JavaScript for rendering.

- **Queue-Based URL Management:** URLs to be crawled are managed using a shared queue, ensuring efficient and organized distribution of tasks among threads.

- **Done Callback:** You have the option to set a callback function that will be executed after the crawling process is successfully completed, allowing you to perform specific actions or analyze the results.
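
To make the multi-threading and shared-queue features above concrete, here is a minimal sketch of the general pattern: worker threads from a `ThreadPoolExecutor` pull URLs from a `queue.Queue` and push newly discovered links back onto it. This is not Web Wanderer's actual code. `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical stand-ins, and an in-memory link graph replaces real page downloads so the example runs without a network.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_SITE.get(url, [])


def crawl(seed, num_threads=4):
    """Visit every URL reachable from `seed` and return the set of URLs seen."""
    frontier = queue.Queue()      # shared queue of URLs waiting to be crawled
    seen = {seed}                 # URLs already queued, to avoid repeats
    seen_lock = threading.Lock()  # protects `seen` across threads
    frontier.put(seed)

    def worker():
        while True:
            url = frontier.get()
            if url is None:       # sentinel: no more work
                frontier.task_done()
                return
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    frontier.put(link)
            finally:
                frontier.task_done()

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for _ in range(num_threads):
            pool.submit(worker)
        frontier.join()           # blocks until every queued URL is processed
        for _ in range(num_threads):
            frontier.put(None)    # one sentinel per worker
    return seen
```

Calling `crawl("/")` walks the whole fake site. New links are queued before `task_done()` is called for the current URL, so `frontier.join()` cannot return while work is still being discovered.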

## Dependencies

Web Wanderer relies on the following libraries:

- `playwright`: To handle dynamically rendered websites and interact with web pages.
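
As an illustration of what Playwright provides here, the sketch below loads a page in headless Chromium and returns the HTML after JavaScript has run. It uses Playwright's synchronous API. The function name `fetch_rendered_html`, the `timeout_ms` parameter, and the choice to wait for `"networkidle"` are assumptions for this example, not necessarily what Web Wanderer does internally.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Return the page's HTML after client-side rendering (hypothetical helper)."""
    # Imported lazily so this module can be loaded where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so scripts can finish rendering.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Running it requires the browsers installed by `playwright install`. Playwright's synchronous API is not thread-safe, so in a multi-threaded crawler each thread would likely need its own Playwright instance, as this function creates on every call.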

## Getting Started with Development

_Note: This project has only been tested with **Python 3.11.4**._

1. Clone the repository:

```bash
git clone https://github.com/biraj21/web-wanderer.git
cd web-wanderer
```

2. Install and set up [pipenv](https://pypi.org/project/pipenv/)

3. Activate the virtual environment

```bash
pipenv shell
```

4. Install dependencies

```bash
pipenv install
```

5. Install the headless browsers with `playwright`

```bash
playwright install
```

## Planned things

- Replace `pipenv` with `poetry` cuz `pipenv` is shit

- `asyncio` crawler

- `trio` crawler (cuz why not)

- Allow choosing between HTML engine (requests/aiohttp) & JavaScript engine (Playwright)

Will do it when I get time.

List created on 30th Nov, 2024

Happy web crawling with Web Wanderer! 🕸️🚀
