# rescale_crawler

This project implements a web crawler that logs and follows links.

# Philosophy

I tried to keep the implementation as simple as possible. Most of the code is written in a somewhat functional style, with plenty of composition along the way. The file structure is also intentionally flat.

# Content

There are six Python files in this project:

1. crawler.py
2. main.py
3. utils.py
4. test_get_absolute_links.py
5. LRUCache.py
6. test_LRUCache.py

### crawler.py
Contains two crawlers, one serial and one parallel. I decided to keep the serial crawler to show the evolution between the two versions.

The serial crawler implements a recursive depth-first search with an optional depth argument, which defaults to 3 and can be changed in the code. Once the crawler reaches the provided depth, it stops.
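A minimal sketch of that recursive, depth-limited approach is shown below. The function and parameter names are illustrative, not necessarily those used in crawler.py, and the link-extraction function is passed in as an argument just to keep the sketch self-contained:

```python
# Illustrative sketch of a depth-limited recursive crawl; names and details
# are assumptions, not the actual code in crawler.py.
def crawl_serial(url, get_absolute_links, visited=None, depth=3):
    """Recursively follow links starting at `url`, stopping at the given depth."""
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return
    visited.add(url)
    print(url)  # the real crawler logs each visited URL
    for link in get_absolute_links(url):
        crawl_serial(link, get_absolute_links, visited, depth - 1)
```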

The parallel crawler is very similar, but uses threads to increase throughput. It does not have a stop condition.

### main.py
The main entry point of the project. Parses the command-line arguments and starts the crawler. Refer to the Usage section for more info.
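The command-line interface described in the Usage section might be parsed roughly like this; `argparse` and all the internal names here are assumptions about how main.py could do it, not its actual code:

```python
# Sketch of the CLI from the Usage section; names and error handling are assumptions.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Web crawler that logs and follows links.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-p", dest="parallel", action="store_true", help="run the parallel crawler")
    group.add_argument("-s", dest="serial", action="store_true", help="run the serial crawler")
    parser.add_argument("url", help="absolute http(s) URL to start crawling from")
    return parser.parse_args()
```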

### utils.py
Utility functions used by the crawlers and the main function. URL fetching, parsing, etc. are done in this file.
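A minimal sketch of the kind of utility this file provides, using `requests` and `BeautifulSoup` as listed under Extra libraries; the function name and details are assumptions:

```python
# Sketch of a link-extraction utility; not necessarily the exact code in utils.py.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_absolute_links(url):
    """Fetch a page and return every link on it as an absolute URL."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
```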

### test_get_absolute_links.py
Contains a unit test that validates extracting links from a URL.

### LRUCache.py
Contains an LRU cache used to keep track of visited URLs.
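A minimal sketch of such a cache built on `collections.OrderedDict`; the interface is an assumption, not necessarily the one in LRUCache.py:

```python
# Sketch of an LRU cache for visited URLs; the interface is assumed.
from collections import OrderedDict


class LRUCache:
    """Bounded set of visited URLs that evicts the least recently used entry."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._items = OrderedDict()

    def put(self, url):
        self._items[url] = True
        self._items.move_to_end(url)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used URL

    def __contains__(self, url):
        if url in self._items:
            self._items.move_to_end(url)  # a lookup counts as a use
            return True
        return False
```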

### test_LRUCache.py
Contains unit tests for the LRUCache.
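A sketch of what one such test could look like, reusing the illustrative cache interface assumed above:

```python
# Sketch of a unit test for the eviction behaviour; the module and class
# layout is an assumption.
import unittest

from LRUCache import LRUCache  # assumed module/class names


class TestLRUCache(unittest.TestCase):
    def test_oldest_entry_is_evicted_when_capacity_is_exceeded(self):
        cache = LRUCache(capacity=2)
        cache.put("https://a.example")
        cache.put("https://b.example")
        cache.put("https://c.example")  # third insert evicts the oldest entry
        self.assertNotIn("https://a.example", cache)
        self.assertIn("https://c.example", cache)


if __name__ == "__main__":
    unittest.main()
```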

# Parallelization
The parallelization is achieved with Python's built-in `ThreadPoolExecutor`, which provides a thread pool. Worker threads fetch a URL, retrieve its links, send the next batch of URLs back to the main thread, and submit logging data to a daemon thread.
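A rough sketch of that structure; the names, the queue-based logging, and the batch handling are assumptions about the approach rather than the project's actual code:

```python
# Sketch of a thread-pool crawl loop with a daemon logging thread; all
# names and details are assumptions, not the exact code in crawler.py.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def crawl_parallel(start_url, get_absolute_links, max_workers=8):
    log_queue = queue.Queue()

    def logger():
        while True:
            print(log_queue.get())  # daemon thread drains the logging data

    threading.Thread(target=logger, daemon=True).start()

    visited = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # worker threads fetch each URL in the batch and return its links
            results = pool.map(get_absolute_links, frontier)
            next_batch = []
            for url, links in zip(frontier, results):
                log_queue.put(url)  # hand logging data to the daemon thread
                next_batch.extend(link for link in links if link not in visited)
                visited.update(links)
            frontier = next_batch
```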

I chose the thread pool over the `multiprocessing` library because this program is mostly I/O bound.

# Extra libraries
Three extra libraries are used:

1. `BeautifulSoup` is used for parsing HTML
2. `black` is used for formatting
3. `requests` is used for network requests

# Installation
I'm making the assumption that this will be run on a Linux/macOS environment and that the machine has git and Python 3 installed.

Step 1: Clone the Repository

`git clone https://github.com/esaatci/rescale_crawler.git`

Step 2: Go into the cloned directory

`cd path/to/the/cloned/repo`

Step 3: Create a virtual environment

`python3 -m venv .`

Step 4: Activate the virtual environment

`source bin/activate`

Step 5: Install the dependencies

`pip install -r requirements.txt`

# Usage
After the installation is complete, if you are not inside the project directory with the virtual environment activated, perform Step 2 and Step 4 from the Installation section.

To run the crawler use the following command:

`python3 main.py -p https://theUrlYouWantToStartFrom.com`

The URL must be in absolute form and use either http or https; otherwise the program will exit.
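A sketch of the kind of check this implies, using `urllib.parse`; the project's actual validation may differ:

```python
# Sketch of the validation implied above; the real check may be different.
import sys
from urllib.parse import urlparse


def validate_start_url(url):
    """Exit unless the URL is absolute and uses http or https."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        sys.exit("Start URL must be absolute and use http or https.")
```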

The `-p` flag signifies that the program will run in parallel mode. To run in serial mode, replace `-p` with `-s`:

`python3 main.py -s https://theUrlYouWantToStartFrom.com`

To stop the program:

`ctrl + c`

To run the unit tests:

`python3 -m unittest`

# Further Improvements

There are a couple of things that can be done to improve this project in the future.

1. Adding more unit tests
2. Adding support for writing output to a file
3. Adding stop conditions to the parallel crawler, e.g. time-based, depth-based, or link-count based
4. Better command-line argument support: passing stop conditions, thread count, etc.
5. Dockerizing the project
6. Different ways of parallelizing