https://github.com/samparsky/web-crawler
This is a site crawler built with Scrapy that stores the crawled data in MongoDB.
JSON representation
- Host: GitHub
- URL: https://github.com/samparsky/web-crawler
- Owner: samparsky
- Created: 2017-03-06T09:03:04.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-03-06T10:39:13.000Z (about 8 years ago)
- Last Synced: 2023-10-26T11:51:41.954Z (over 1 year ago)
- Topics: extruct, scrapy, scrapy-demo
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
## Simple Web Crawler
-----------------
This is a simple web crawler that crawls [this Mommy Poppins events page](https://mommypoppins.com/events?area%5B%5D=118&field_event_date_value%5B%5D=03-04-2017&event_end=2017-04-07) and parses the results page. It is built on the [Scrapy](https://scrapy.org/) crawling engine. It uses Extruct to parse the application/ld+json content of the pages to retrieve the basic content, and XPath to query the rest.

### To start
------------
```sh
pip install -r requirements.txt
```
### To run the crawler
```sh
cd
scrapy crawl wizard
```
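As described above, the crawl pulls application/ld+json blocks out of each results page; the project uses Extruct for this. A stdlib-only sketch of that extraction idea (the regex, helper name, and sample HTML below are illustrative, not taken from the repo):

```python
# Illustrative sketch: find <script type="application/ld+json"> blocks in a
# page and load each one as a dict. The real project uses Extruct instead of
# a hand-rolled regex; this just shows the underlying idea.
import json
import re

LDJSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_ld_json(html):
    """Return a list of dicts parsed from the page's JSON-LD blocks."""
    items = []
    for block in LDJSON_RE.findall(html):
        try:
            items.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
    return items

# Made-up sample page for illustration.
html = '''<html><head>
<script type="application/ld+json">
{"@type": "Event", "name": "Story Time", "startDate": "2017-03-04"}
</script>
</head></html>'''

events = extract_ld_json(html)
```

Scrapy would hand each downloaded response body to a step like this, after which the remaining fields are filled in with XPath queries.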
### MongoDB
The MongoDB collection schema is as follows:
```python
event_name
description
age_group
location
price
link
event_link
date
```

The MongoDB database is `mommy` and the collection is `crawl`.
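A minimal sketch of a document matching this schema (the helper function and all sample values below are hypothetical, not taken from the repo):

```python
# Hypothetical helper showing the document shape implied by the schema above.
# Field names match the README; the values are made up for illustration.
def build_event_item(**fields):
    schema = ("event_name", "description", "age_group", "location",
              "price", "link", "event_link", "date")
    missing = [f for f in schema if f not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {f: fields[f] for f in schema}

item = build_event_item(
    event_name="Story Time",
    description="Weekly toddler story hour",
    age_group="2-4",
    location="Brooklyn Public Library",
    price="Free",
    link="https://mommypoppins.com/events",      # hypothetical listing URL
    event_link="https://example.com/story-time",  # hypothetical event URL
    date="03-04-2017",
)
```

A Scrapy item pipeline would insert dicts of this shape into the `mommy.crawl` collection.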
To view the crawled data, run the following commands in the mongo shell:

```sh
> use mommy
> db.crawl.find()
```