https://github.com/nava45/simplempcrawler

Simple Multiprocessing Crawler in python

crawler multiprocessing python

Host: GitHub
URL: https://github.com/nava45/simplempcrawler
Owner: nava45
Created: 2017-04-01T19:47:19.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2017-04-01T19:55:49.000Z (about 9 years ago)
Last Synced: 2025-02-24T01:38:00.143Z (over 1 year ago)
Topics: crawler, multiprocessing, python
Language: Python
Size: 2.93 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## SimpleMPCrawler
Simple multiprocessing crawler in Python

### Problem statement

Using Python's multiprocessing together with either the threading or the gevent module, the task is to write a web scraper that takes a large input file (1 million rows) containing one URL per line.
The scraper then uses BeautifulSoup to parse each page and checks whether the content contains "jquery.js". If it does, the URL is written to "accepted.csv"; otherwise it is written to "rejected.csv".
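
The approach described above can be sketched roughly as follows. This is not the repository's `crawler.py`, only a minimal illustration under several assumptions: "contains jquery.js" is read as "some `<script src=...>` value contains that string", a failed fetch counts as rejected, and the standard library's `html.parser` stands in for BeautifulSoup so the sketch has no third-party dependencies. All function names, worker counts, and chunk sizes are hypothetical.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from itertools import islice
from urllib.request import urlopen


class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def contains_jquery(html, needle="jquery.js"):
    """Return True if any script src contains the needle (assumed reading of the task)."""
    parser = ScriptSrcCollector()
    parser.feed(html)
    return any(needle in src.lower() for src in parser.sources)


def fetch(url, timeout=10):
    """Download a page and decode it leniently."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def classify(url, fetcher=fetch):
    """Return (url, accepted). Any fetch or parse error counts as rejected (assumption)."""
    try:
        return url, contains_jquery(fetcher(url))
    except Exception:
        return url, False


def classify_chunk(urls, threads=20):
    """Runs inside one worker process; threads overlap the network waits."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(classify, urls))


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def crawl(urls, processes=4, chunk_size=500):
    """Spread chunks of URLs over worker processes and yield (url, accepted) pairs."""
    with multiprocessing.Pool(processes) as pool:
        for results in pool.imap_unordered(classify_chunk, chunked(urls, chunk_size)):
            yield from results
```

Processes spread the parsing work across cores, while the threads inside each process keep many slow HTTP requests in flight at once. `crawl(...)` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. With BeautifulSoup, the check in `contains_jquery` would typically become a loop over `soup.find_all("script", src=True)`.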

### install
```
virtualenv env/
source env/bin/activate
pip install -r requirements.txt
python crawler.py
```

### test
```
python test_crawler.py
```

### steps

`urls.csv` is the input file; it lists the URLs to be processed, one per line.

The output files `accepted.csv` and `rejected.csv` are created, and each URL is written to whichever file matches its result.
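
A minimal sketch of this input and output handling, assuming each row of `urls.csv` holds the URL in its first column and that results arrive as `(url, accepted)` pairs. The function names `read_urls` and `write_results` are hypothetical and are not taken from the repository.

```python
import csv


def read_urls(path="urls.csv"):
    """Yield one URL per non-empty row, reading lazily so the whole file is never loaded."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row and row[0].strip():
                yield row[0].strip()


def write_results(results, accepted_path="accepted.csv", rejected_path="rejected.csv"):
    """Route each (url, accepted) pair to one of the two files; return (accepted, rejected) counts."""
    counts = {True: 0, False: 0}
    with open(accepted_path, "w", newline="") as acc, \
            open(rejected_path, "w", newline="") as rej:
        writers = {True: csv.writer(acc), False: csv.writer(rej)}
        for url, accepted in results:
            key = bool(accepted)
            writers[key].writerow([url])
            counts[key] += 1
    return counts[True], counts[False]
```

Doing all writes in a single process, as here, avoids several workers appending to the same file at once; whether the original crawler does the same is not stated in this README.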
