https://github.com/samparsky/nodejs-crawler-demo
This is a NodeJS crawler built using request and xpath npm modules.
- Host: GitHub
- URL: https://github.com/samparsky/nodejs-crawler-demo
- Owner: samparsky
- Created: 2017-03-12T08:33:41.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-03-12T09:17:09.000Z (about 8 years ago)
- Last Synced: 2023-10-26T11:51:41.553Z (over 1 year ago)
- Language: JavaScript
- Size: 11.7 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Metadata Files:
  - Readme: readme.md
README
## Simple Web Crawler
This is a simple web crawler that crawls this
[link](https://mommypoppins.com/events?area%5B%5D=118&field_event_date_value%5B%5D=03-04-2017&event_end=2017-04-07)
and parses the results page. It uses XPath to parse the HTML document.
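The crawling step itself is not reproduced on this page; a minimal sketch of the request + xpath approach described above could look like the following (the `xmldom` dependency, the XPath expression, and the variable names are assumptions for illustration, not taken from the repository):

```js
// Minimal sketch of fetching a page with `request` and querying it with `xpath`.
// NOTE: xmldom (used to build a DOM from the HTML) and the XPath expression
// below are assumptions; the repository's actual selectors will differ.
const request = require('request');
const xpath = require('xpath');
const { DOMParser } = require('xmldom');

const url =
  'https://mommypoppins.com/events?area%5B%5D=118&field_event_date_value%5B%5D=03-04-2017&event_end=2017-04-07';

request(url, (err, res, body) => {
  if (err) throw err;

  // Parse the fetched HTML into a DOM document; real-world HTML is rarely
  // well-formed XML, so parser warnings are silenced here.
  const doc = new DOMParser({ errorHandler: () => {} }).parseFromString(body);

  // Example query: collect every href on the results page.
  const hrefs = xpath.select('//a/@href', doc).map(attr => attr.value);
  console.log(hrefs);
});
```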
### To start

```sh
$ npm install
```
### To run the crawler

To start the crawler:

```sh
cd <project-directory>
npm run crawl
```
### MongoDB
The MongoDB collection schema is as follows:
```js
{
  "_id": "58c4f2882b8daa24cd185054",
  "link": "https://mommypoppins.com/new-york-city-kids/event/indoor/treehouse-shakers-olive-pearl-a-magical-story-of-home-family-show",
  "event_link": "http://www.flushingtownhall.org/event/c17de736955ca330f80eb072a5aefefe",
  "name": "Treehouse Shakers' Olive & Pearl",
  "description": "Shows at 11:00 am and 2:15 pm",
  "location": "Flushing Town Hall",
  "age_group": "2-5",
  "price": " $13 for adults, $8 for children ",
  "date": "Saturday, March 18, 2017 ",
  "__v": 0
}
```

The MongoDB database is `mommy` and the collection is `crawl`.
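The `__v` field in the document above is what Mongoose adds by default, so the records are presumably written through a Mongoose model. A minimal sketch of a model matching that shape (field names come from the document above; the connection URL, model name, and file layout are assumptions):

```js
// Sketch of a Mongoose model matching the document shape shown above.
// Connection URL and model name are assumptions, not taken from the repo.
const mongoose = require('mongoose');

mongoose.connect('mongodb://localhost:27017/mommy');

const crawlSchema = new mongoose.Schema({
  link: String,        // mommypoppins event page
  event_link: String,  // external/venue event page
  name: String,
  description: String,
  location: String,
  age_group: String,
  price: String,
  date: String,
});

// Pass the collection name explicitly so documents land in `crawl`
// rather than Mongoose's default pluralised collection (`crawls`).
module.exports = mongoose.model('Crawl', crawlSchema, 'crawl');
```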
To view the crawled data, run the following commands in the mongo shell:

```sh
> use mommy
> db.crawl.find()
```
The application also exposes an HTTP interface to access the crawled contents:

```sh
$ npm start
```
Then access `http://localhost:3000/crawl` in your browser or Postman. It returns a JSON response of the crawled contents.
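The route handler itself is not shown on this page; a minimal sketch of how the `/crawl` endpoint could be served with Express and the model sketched earlier (the route path and port come from the README, everything else is assumed):

```js
// Hypothetical Express app exposing the crawled documents as JSON.
// The require path to the model is an assumption for illustration.
const express = require('express');
const Crawl = require('./models/crawl');

const app = express();

// GET /crawl -> all crawled event documents as a JSON array.
app.get('/crawl', (req, res, next) => {
  Crawl.find({})
    .then(events => res.json(events))
    .catch(next);
});

app.listen(3000, () => console.log('Listening on http://localhost:3000'));
```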