https://github.com/nava45/simplempcrawler
Simple Multiprocessing Crawler in python
- Host: GitHub
- URL: https://github.com/nava45/simplempcrawler
- Owner: nava45
- Created: 2017-04-01T19:47:19.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-04-01T19:55:49.000Z (about 9 years ago)
- Last Synced: 2025-02-24T01:38:00.143Z (over 1 year ago)
- Topics: crawler, multiprocessing, python
- Language: Python
- Size: 2.93 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## SimpleMPCrawler
A simple multiprocessing crawler in Python.
### Problem statement
Using Python's `multiprocessing` module together with either `threading` or `gevent`, write a web scraper that takes a large input file (about 1 million rows) with one URL per line.
The scraper uses BeautifulSoup to parse each page and checks whether the content contains "jquery.js". If it does, the URL is written to "accepted.csv"; otherwise it is written to "rejected.csv".
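The approach described above could be sketched roughly as follows. This is not the code in `crawler.py`; it is a minimal illustration under several assumptions. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies, it treats "contains jquery.js" as "some `<script>` tag has a `src` that includes `jquery.js`", and it counts any fetch failure as rejected. All function names, batch sizes, and worker counts here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from http.client import HTTPException
from multiprocessing import Pool
from urllib.request import urlopen


class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> tag (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def contains_jquery(html, needle="jquery.js"):
    """Assumed rule: accept when any script src mentions the needle."""
    parser = ScriptSrcCollector()
    parser.feed(html)
    return any(needle in src for src in parser.sources)


def classify(url, timeout=10):
    """Fetch one URL and return (url, accepted). Failures count as rejected (assumption)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, HTTPException):
        return url, False
    return url, contains_jquery(html)


def classify_batch(urls, threads=20):
    """Runs inside one worker process: fetch a batch of URLs concurrently with threads."""
    with ThreadPoolExecutor(max_workers=threads) as executor:
        return list(executor.map(classify, urls))


def batches(lines, size):
    """Yield lists of at most `size` stripped, non-empty lines."""
    batch = []
    for line in lines:
        url = line.strip()
        if url:
            batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def crawl(lines, processes=4, batch_size=500):
    """Yield (url, accepted) pairs as worker processes finish their batches."""
    with Pool(processes=processes) as pool:
        for results in pool.imap_unordered(classify_batch, batches(lines, batch_size)):
            yield from results
```

Processes spread the parsing work across CPU cores, while the threads inside each process overlap the network waits. A caller would iterate over `crawl(open("urls.csv"))` from inside an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that spawn new interpreters.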
### install
```
virtualenv env/
source env/bin/activate
pip install -r requirements.txt
python crawler.py
```
### test
```
python test_crawler.py
```
### steps
`urls.csv` is the input file; it lists the URLs to process, one per line.
The output files `accepted.csv` and `rejected.csv` are created, and each URL is written to the file that matches its result.
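
The input and output handling described above might look like the sketch below. It assumes each classification result arrives as a `(url, accepted)` pair; the helper names `read_urls` and `write_results` are hypothetical and are not taken from `crawler.py`.

```python
import csv
from pathlib import Path


def read_urls(path="urls.csv"):
    """Yield one URL per non-empty line of the input file."""
    with open(path, newline="") as handle:
        for line in handle:
            url = line.strip()
            if url:
                yield url


def write_results(results, out_dir="."):
    """Route (url, accepted) pairs into accepted.csv or rejected.csv and return the counts."""
    out_dir = Path(out_dir)
    counts = {"accepted": 0, "rejected": 0}
    with open(out_dir / "accepted.csv", "w", newline="") as accepted_file, \
            open(out_dir / "rejected.csv", "w", newline="") as rejected_file:
        writers = {
            True: csv.writer(accepted_file),
            False: csv.writer(rejected_file),
        }
        for url, accepted in results:
            writers[bool(accepted)].writerow([url])
            counts["accepted" if accepted else "rejected"] += 1
    return counts
```

Both files are opened once and written from a single place, so concurrent workers never write to the same file handle; they only hand their results back to the caller.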