https://github.com/samparsky/web-crawler
A site crawler built with Scrapy that stores the generated data in MongoDB.
- Host: GitHub
- URL: https://github.com/samparsky/web-crawler
- Owner: samparsky
- Created: 2017-03-06T09:03:04.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-03-06T10:39:13.000Z (about 9 years ago)
- Last Synced: 2025-03-26T23:18:37.953Z (about 1 year ago)
- Topics: extruct, scrapy, scrapy-demo
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
## Simple Web Crawler
-----------------
This is a simple web crawler that crawls [this events listing](https://mommypoppins.com/events?area%5B%5D=118&field_event_date_value%5B%5D=03-04-2017&event_end=2017-04-07) and parses the results page. It is built on the [Scrapy](https://scrapy.org/) crawling engine. It uses Extruct to parse the `application/ld+json` content of each page for the basic event details, and XPath to query the rest of the page content.
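As a sketch of what the Extruct step accomplishes (the HTML and event below are made-up examples, not the actual mommypoppins.com markup, and the real project uses the extruct library inside a Scrapy spider rather than this hand-rolled parser): JSON-LD metadata lives in `<script type="application/ld+json">` tags, which can be pulled out and decoded like so.

```python
import json
import re

# Illustrative page snippet carrying an event as JSON-LD (not real site markup).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Event", "name": "Story Time", "startDate": "2017-04-03"}
</script>
</head></html>
"""

def extract_json_ld(html):
    """Return every decoded application/ld+json block found in the page."""
    pattern = re.compile(
        r'<script type="application/ld\+json">(.*?)</script>',
        re.DOTALL,
    )
    return [json.loads(block) for block in pattern.findall(html)]

events = extract_json_ld(SAMPLE_HTML)
print(events[0]["name"])  # the event name carried in the JSON-LD block
```

Extruct does the same job more robustly (handling malformed HTML, multiple syntaxes such as microdata and RDFa, and nested scripts), which is why the project relies on it instead of regex parsing.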
### To start
------------
```sh
pip install -r requirements.txt
```
### To run the crawler
```sh
cd web-crawler
scrapy crawl wizard
```
### MongoDB
The MongoDB collection schema is as follows:
```
event_name
description
age_group
location
price
link
event_link
date
```
The MongoDB database is `mommy` and the collection is `crawl`.
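A Scrapy item pipeline is the usual place to do this insert. The sketch below is illustrative (the class name, default URI, and field handling are assumptions, not the project's actual pipeline code) and assumes pymongo is installed:

```python
# Sketch of a Scrapy item pipeline writing each crawled event into the
# `mommy` database's `crawl` collection. Names here are illustrative.

# Fields stored for each event, matching the collection schema above.
FIELDS = (
    "event_name", "description", "age_group", "location",
    "price", "link", "event_link", "date",
)

class MongoPipeline:
    def __init__(self, mongo_uri="mongodb://localhost:27017"):
        self.mongo_uri = mongo_uri
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # pymongo is assumed to be available; imported here so the
        # sketch stays importable without a MongoDB driver installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client["mommy"]["crawl"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Keep only the schema fields, then insert one document per event.
        doc = {field: item.get(field) for field in FIELDS}
        self.collection.insert_one(doc)
        return item
```

Such a pipeline would be enabled via `ITEM_PIPELINES` in the project's `settings.py`; Scrapy then calls `process_item` once per scraped event.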
To view the crawled data, run the following commands in the mongo shell:
```sh
> use mommy
> db.crawl.find()
```