https://github.com/samparsky/web-crawler
A site crawler built with Scrapy that stores the generated data in MongoDB.
- Host: GitHub
- URL: https://github.com/samparsky/web-crawler
- Owner: samparsky
- Created: 2017-03-06T09:03:04.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-03-06T10:39:13.000Z (about 9 years ago)
- Last Synced: 2025-03-26T23:18:37.953Z (about 1 year ago)
- Topics: extruct, scrapy, scrapy-demo
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
## Simple Web Crawler
-----------------
This is a simple web crawler that crawls [this events listing](https://mommypoppins.com/events?area%5B%5D=118&field_event_date_value%5B%5D=03-04-2017&event_end=2017-04-07) and parses the results page. It is built on the [Scrapy](https://scrapy.org/) crawling engine. It uses Extruct to parse the `application/ld+json` content of each page for the basic event details, and XPath to query the rest of the page content.
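As a sketch of what the Extruct step accomplishes (the HTML and event below are made-up examples, not the actual mommypoppins.com markup, and the real project uses the extruct library inside a Scrapy spider rather than this hand-rolled parser): JSON-LD metadata lives in `<script type="application/ld+json">` tags, which can be pulled out and decoded like so.

```python
import json
import re

# Illustrative page snippet carrying an event as JSON-LD (not real site markup).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Event", "name": "Story Time", "startDate": "2017-04-03"}
</script>
</head></html>
"""

def extract_json_ld(html):
    """Return every decoded application/ld+json block found in the page."""
    pattern = re.compile(
        r'<script type="application/ld\+json">(.*?)</script>',
        re.DOTALL,
    )
    return [json.loads(block) for block in pattern.findall(html)]

events = extract_json_ld(SAMPLE_HTML)
print(events[0]["name"])  # the event name carried in the JSON-LD block
```

Extruct does the same job more robustly (handling malformed HTML, multiple syntaxes such as microdata and RDFa, and nested scripts), which is why the project relies on it instead of regex parsing.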
### To start
------------
```sh
pip install -r requirements.txt
```
### To run the crawler
```sh
cd web-crawler
scrapy crawl wizard
```
### MongoDB
The MongoDB collection schema is as follows:
```
event_name
description
age_group
location
price
link
event_link
date
```
The MongoDB database is `mommy` and the collection is `crawl`.
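A Scrapy item pipeline is the usual place to do this insert. The sketch below is illustrative (the class name, default URI, and field handling are assumptions, not the project's actual pipeline code) and assumes pymongo is installed:

```python
# Sketch of a Scrapy item pipeline writing each crawled event into the
# `mommy` database's `crawl` collection. Names here are illustrative.

# Fields stored for each event, matching the collection schema above.
FIELDS = (
    "event_name", "description", "age_group", "location",
    "price", "link", "event_link", "date",
)

class MongoPipeline:
    def __init__(self, mongo_uri="mongodb://localhost:27017"):
        self.mongo_uri = mongo_uri
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # pymongo is assumed to be available; imported here so the
        # sketch stays importable without a MongoDB driver installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client["mommy"]["crawl"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Keep only the schema fields, then insert one document per event.
        doc = {field: item.get(field) for field in FIELDS}
        self.collection.insert_one(doc)
        return item
```

Such a pipeline would be enabled via `ITEM_PIPELINES` in the project's `settings.py`; Scrapy then calls `process_item` once per scraped event.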
To view the crawled data, run the following commands in the mongo shell:
```sh
> use mommy
> db.crawl.find()
```