# ecommerce-crawler

Parallel ecommerce crawler using Docker and Puppeteer on GCP
- Host: GitHub
- URL: https://github.com/oscarnevarezleal/ecommerce-crawler
- Owner: oscarnevarezleal
- License: mit
- Created: 2018-07-17T15:21:05.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-03-22T01:22:51.000Z (almost 6 years ago)
- Last Synced: 2024-08-09T13:16:12.760Z (6 months ago)
- Topics: crawler, gcp, nodejs, pubnub, puppeteer
- Language: JavaScript
- Homepage: https://oscarnevarezleal.github.io/ecommerce-crawler/
- Size: 38.1 KB
- Stars: 5
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## README
## Before you start
- This is mostly a proof of concept and has tons of room for improvement. It goes without saying that this is not a production-ready project.
- Although it would be easy to port this to AWS, that is not planned for the near future. You can always fork and send a PR.
- Note: If you don't have gcloud installed, please refer to the [gcloud SDK installation guide](https://cloud.google.com/storage/docs/gsutil_install).
- Note: After you configure your project in GCP, make sure you have your own ```service-key.json``` file.
- Note: If this is your first time using GCP, you need to authenticate your machine to use GCP services by running ```gcloud auth login``` from the command line.

## Overview
This project takes a list of URLs and transforms them into [```Jobs```](#Jobs). These [```Jobs```](#Jobs) are published to [Pub/Sub](https://cloud.google.com/pubsub/docs/overview) and wait there until they're read by the [```Worker```](#Worker).
When the [```Worker```](#Worker) becomes aware of a Job, it spawns a new crawler using [Puppeteer](https://github.com/GoogleChrome/puppeteer). After the content has been grabbed, the result is persisted in [Datastore](https://cloud.google.com/datastore/docs/) and the worker moves on to the next Job in line. This process repeats as long as there are Jobs.
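As a minimal sketch of this flow using the [@google-cloud/pubsub](https://github.com/googleapis/nodejs-pubsub) and [Puppeteer](https://github.com/GoogleChrome/puppeteer) clients (the topic name ```jobs```, subscription name ```jobs-sub```, and message shape are illustrative assumptions, not the project's actual identifiers):

```
// Hypothetical sketch of the Job flow; names and payload shape are assumptions
const { PubSub } = require('@google-cloud/pubsub');
const puppeteer = require('puppeteer');

const pubsub = new PubSub();

// Producer side: turn a URL into a Job message
async function publishJob(url) {
  await pubsub.topic('jobs').publishMessage({ json: { url } });
}

// Worker side: read Jobs one by one and crawl each URL with Puppeteer
async function startWorker() {
  const browser = await puppeteer.launch();
  pubsub.subscription('jobs-sub').on('message', async (message) => {
    const { url } = JSON.parse(message.data.toString());
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'load' });
    console.log(url, await page.title()); // the real worker persists results in Datastore
    await page.close();
    message.ack(); // remove the Job from the queue
  });
}
```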
## Usage

### Environment
The following environment variables must be set before testing (see the example after the list):
- ```GAE_APPLICATION``` - the name of your GCP application
- ```GOOGLE_APPLICATION_CREDENTIALS``` - the path to your ```service-key.json```
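For local testing they can be exported in your shell; the values below are placeholders:

```
export GAE_APPLICATION=my-gcp-project
export GOOGLE_APPLICATION_CREDENTIALS=./service-key.json
```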
### Steps

Follow these steps to run the application locally:
- Clone this repository
- Run ```npm install```
- Rename ```urls.sample.js``` to ```urls.js``` and include the urls you want to crawl
- Rename ```config.sample.js``` to ```config.js``` and edit the ```gcp``` and ```descriptor``` sections

```
# generate the messages
node index.js
# when finished run the worker
node src/worker.js
```

## Config
### Descriptor
The descriptor object is the backbone of this crawler; here you specify each of the things you want to grab from a page (see the sketch after the table).

| Property | Type | Comments |
|--- |--- |--- |
| name | String | |
| primary | Boolean | Whether this is a primary attribute (think about the saving process) |
| required | Boolean | If set to true, execution will stop when the element is not found |
| selector | String | A valid CSS selector |
| attribute | String | Optional attribute to grab from the selector |
| format | Function | A callback to format the grabbed value |
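As an illustration, a descriptor entry for a product price might look like this (the selector, attribute, and format logic are made-up examples, not the project's defaults):

```
// Hypothetical descriptor entry for a product price
{
  name: 'price',
  primary: true,          // a primary attribute (relevant to the saving process)
  required: true,         // stop execution if the element is not found
  selector: '.product-price',
  attribute: 'data-price',              // optional attribute to grab
  format: (value) => parseFloat(value)  // normalize the grabbed string
}
```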
### Puppeteer

Puppeteer configuration (example below):

| Property | Default | Comments |
|--- |--- |--- |
| waitUntil | load | When to consider navigation succeeded; defaults to ```load```. Given an array of event strings, navigation is considered successful after all events have fired. [See docs](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md) |
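For example, waiting for both the ```load``` event and network idle could look like this; the option is passed through to Puppeteer's ```page.goto```:

```
// Navigation succeeds only after both events have fired
await page.goto(url, { waitUntil: ['load', 'networkidle0'] });
```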
### Aggregates

Aggregation is a process that occurs after all elements have been grabbed (see the sketch after the table).

| Property | Type | Comments |
|--- |--- |--- |
| name | String | |
| source | Function | Receives the descriptor object containing all elements grabbed in the first step |
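A hypothetical aggregate deriving a new value from previously grabbed elements (the field names are illustrative):

```
// The source callback receives the descriptor with all grabbed elements
{
  name: 'discount',
  source: (descriptor) => descriptor.originalPrice - descriptor.price
}
```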
## Saving strategies

- GCP Datastore (see the sketch after this list)
- Others [pending documentation]
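As a sketch of the Datastore strategy using the [@google-cloud/datastore](https://github.com/googleapis/nodejs-datastore) client (the kind name ```CrawlResult``` and the entity shape are assumptions):

```
const { Datastore } = require('@google-cloud/datastore');

const datastore = new Datastore();

// Persist one crawl result; kind and fields are illustrative
async function saveResult(url, data) {
  const key = datastore.key(['CrawlResult']);
  await datastore.save({ key, data: { url, ...data, crawledAt: new Date() } });
}
```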
## Publish

```
docker build -t gcr.io/$(gcloud config get-value project)/worker .
gcloud docker -- push gcr.io/$(gcloud config get-value project)/worker
```

## Final notes
- Shutting down your cluster after the workload has finished is strongly recommended, to avoid incurring unnecessary charges.
- If you think this project suits your needs but needs a little tweak, send me a message and I'd be happy to talk about it.