Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Python web crawler designed to scrape websites
- Host: GitHub
- URL: https://github.com/moe131/webcrawler
- Owner: Moe131
- Created: 2024-04-25T04:48:16.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-07-23T17:34:52.000Z (6 months ago)
- Last Synced: 2024-11-05T21:14:22.119Z (2 months ago)
- Topics: crawler, crawling-python, python, python-crawler, scraping, simhash, web-crawler
- Language: Python
- Homepage:
- Size: 3.52 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Web Crawler - UCI Project
This Python web crawler crawls the web by extracting new links from web pages and downloading the content of each page it visits. It respects a per-site politeness delay and checks robots.txt to confirm that crawling is allowed. It also finds the top 50 most frequently occurring words in the crawled content and saves them in `summary.txt`. Each crawled page is saved as a JSON file in the `data` folder on your system.
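The robots.txt check and the per-site politeness delay can be pictured with a minimal sketch along these lines (an illustration only, not the project's actual code; the helper name and the caching scheme are assumptions):
```
import time
from urllib import robotparser
from urllib.parse import urlparse

# Per-host caches for robots.txt parsers and last-request times
_robots = {}
_last_request = {}

def allowed_and_polite(url, politeness=0.5, user_agent="*"):
    """ Return True once the URL may be fetched: robots.txt permits it and
    at least `politeness` seconds have passed since the last request to the host. """
    host = urlparse(url).netloc
    if host not in _robots:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()
        except Exception:
            rp = None  # robots.txt unreachable: this sketch simply allows the fetch
        _robots[host] = rp
    rp = _robots[host]
    if rp is not None and not rp.can_fetch(user_agent, url):
        return False
    # Enforce the politeness delay for this host
    elapsed = time.time() - _last_request.get(host, 0.0)
    if elapsed < politeness:
        time.sleep(politeness - elapsed)
    _last_request[host] = time.time()
    return True
```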
## Features
- Crawls the web and downloads each web page.
- Processes the content to extract new links.
- Avoids traps and loops, and does not download duplicate pages.
- Counts the occurrence of each word.
- Identifies the top 50 most frequent words.
- Outputs the results to `summary.txt` (see the word-count sketch below).
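The word counting behind `summary.txt` can be sketched with `collections.Counter` (an illustration, not the project's actual implementation; the tokenization rule used here is an assumption):
```
import re
from collections import Counter

def top_words(texts, n=50):
    """ Count word occurrences across crawled page texts and
    return the n most frequent words with their counts. """
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase alphanumeric sequences
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts.most_common(n)

# Example usage
print(top_words(["Web crawlers crawl the web", "the web is large"], n=5))
```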
# Dependencies

To install the dependencies for this project, run the following two commands after making sure pip is installed for the version of Python you are using. Admin privileges might be required to execute the commands. Also make sure that the terminal is at the root folder of this project.
```
python -m pip install packages/spacetime-2.1.1-py3-none-any.whl
python -m pip install -r packages/requirements.txt
```

# Run the crawler
## Step 1: Configure the crawler by updating the `config.ini` file
- Provide your seed URLs, separated by commas:
```
SEEDURL = https://www.ics.uci.edu,https://www.cs.uci.edu
```
- If you want to allow all URLs to be crawled, set `CRAWLALL` to `TRUE` inside `config.ini`. Otherwise, only URLs that begin with the seed URLs will be crawled:
```
CRAWLALL = TRUE
```
- You can change the wait time in seconds between each request:
```
POLITENESS = 0.5
```
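Putting the options above together, the relevant part of `config.ini` would look something like this (the file may contain other settings that are not shown here):
```
SEEDURL = https://www.ics.uci.edu,https://www.cs.uci.edu
CRAWLALL = TRUE
POLITENESS = 0.5
```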
## Step 2: Run the crawler with the following command
```
python3 launch.py
```
If you wish to restart the crawler, run:
```
python3 launch.py --restart
```

# Custom Parsing
You can customize how each web page is parsed by modifying the `parse()` method inside `scraper.py`:
```
def parse(resp):
    """ Parse the web page """
    # Your custom parsing
```
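As an example, a customized `parse()` that collects hyperlinks could look roughly like the sketch below. This is only an illustration: the attributes `resp.url` and `resp.content` are assumptions, since the actual fields of the response object depend on the crawler framework used by this project.
```
from urllib.parse import urljoin, urldefrag
import re

def parse(resp):
    """ Parse the web page: collect absolute, fragment-free links. """
    # Assumed fields: resp.url (the page URL) and resp.content (raw HTML bytes)
    html = resp.content.decode("utf-8", errors="ignore")
    links = []
    for href in re.findall(r'href="([^"]+)"', html):
        absolute = urljoin(resp.url, href)          # resolve relative links
        absolute, _fragment = urldefrag(absolute)   # strip #fragments
        links.append(absolute)
    return links
```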