Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Python web crawler designed to scrape websites
- Host: GitHub
- URL: https://github.com/moe131/webcrawler
- Owner: Moe131
- Created: 2024-04-25T04:48:16.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-07-23T17:34:52.000Z (6 months ago)
- Last Synced: 2024-11-05T21:14:22.119Z (2 months ago)
- Topics: crawler, crawling-python, python, python-crawler, scraping, simhash, web-crawler
- Language: Python
- Homepage:
- Size: 3.52 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Web Crawler - UCI Project
This Python web crawler crawls the web by extracting new links from web pages and downloading the content of each page it visits. It respects a per-site politeness delay and checks robots.txt to confirm that crawling is allowed. It also finds the top 50 most frequently occurring words in the crawled content and saves them in `summary.txt`. Each crawled page is saved as a JSON file in the `data` folder on your system.
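The robots.txt check and the per-site politeness delay can be pictured with a minimal sketch along these lines (an illustration only, not the project's actual code; the helper name and the caching scheme are assumptions):
```
import time
from urllib import robotparser
from urllib.parse import urlparse

# Per-host caches for robots.txt parsers and last-request times
_robots = {}
_last_request = {}

def allowed_and_polite(url, politeness=0.5, user_agent="*"):
    """ Return True once the URL may be fetched: robots.txt permits it and
    at least `politeness` seconds have passed since the last request to the host. """
    host = urlparse(url).netloc
    if host not in _robots:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()
        except Exception:
            rp = None  # robots.txt unreachable: this sketch simply allows the fetch
        _robots[host] = rp
    rp = _robots[host]
    if rp is not None and not rp.can_fetch(user_agent, url):
        return False
    # Enforce the politeness delay for this host
    elapsed = time.time() - _last_request.get(host, 0.0)
    if elapsed < politeness:
        time.sleep(politeness - elapsed)
    _last_request[host] = time.time()
    return True
```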
## Features
- Crawls the web and downloads each web page.
- Processes the content to extract new links.
- Avoids traps and loops, and does not download duplicate pages.
- Counts the occurrence of each word.
- Identifies the top 50 most frequent words.
- Outputs the results to `summary.txt` (see the word-count sketch below).
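The word counting behind `summary.txt` can be sketched with `collections.Counter` (an illustration, not the project's actual implementation; the tokenization rule used here is an assumption):
```
import re
from collections import Counter

def top_words(texts, n=50):
    """ Count word occurrences across crawled page texts and
    return the n most frequent words with their counts. """
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase alphanumeric sequences
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts.most_common(n)

# Example usage
print(top_words(["Web crawlers crawl the web", "the web is large"], n=5))
```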
# Dependencies

To install the dependencies for this project, run the following two commands after making sure pip is installed for the version of Python you are using. Admin privileges might be required to execute the commands. Also make sure that the terminal is at the root folder of this project.
```
python -m pip install packages/spacetime-2.1.1-py3-none-any.whl
python -m pip install -r packages/requirements.txt
```

# Run the crawler
## Step 1: Configure the crawler by updating the `config.ini` file
- Provide your seed URLs, separated by commas:
```
SEEDURL = https://www.ics.uci.edu,https://www.cs.uci.edu
```
- If you want to allow all URLs to be crawled, set `CRAWLALL` to `TRUE` inside `config.ini`. Otherwise, only URLs that begin with the seed URLs will be crawled:
```
CRAWLALL = TRUE
```
- You can change the wait time in seconds between each request:
```
POLITENESS = 0.5
```
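Putting the options above together, the relevant part of `config.ini` would look something like this (the file may contain other settings that are not shown here):
```
SEEDURL = https://www.ics.uci.edu,https://www.cs.uci.edu
CRAWLALL = TRUE
POLITENESS = 0.5
```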
## Step 2: Run the crawler with the following command
```
python3 launch.py
```
If you wish to restart the crawler, run:
```
python3 launch.py --restart
```

# Custom Parsing
You can customize how each web page is parsed by modifying the `parse()` method inside `scraper.py`:
```
def parse(resp):
    """ Parse the web page """
    # Your custom parsing
```
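As an example, a customized `parse()` that collects hyperlinks could look roughly like the sketch below. This is only an illustration: the attributes `resp.url` and `resp.content` are assumptions, since the actual fields of the response object depend on the crawler framework used by this project.
```
from urllib.parse import urljoin, urldefrag
import re

def parse(resp):
    """ Parse the web page: collect absolute, fragment-free links. """
    # Assumed fields: resp.url (the page URL) and resp.content (raw HTML bytes)
    html = resp.content.decode("utf-8", errors="ignore")
    links = []
    for href in re.findall(r'href="([^"]+)"', html):
        absolute = urljoin(resp.url, href)          # resolve relative links
        absolute, _fragment = urldefrag(absolute)   # strip #fragments
        links.append(absolute)
    return links
```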