https://github.com/fschaeffler/lambda-web-scrapper


Host: GitHub
URL: https://github.com/fschaeffler/lambda-web-scrapper
Owner: fschaeffler
Created: 2019-07-30T08:10:28.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2019-08-03T21:23:31.000Z (almost 7 years ago)
Last Synced: 2025-08-23T04:45:20.604Z (10 months ago)
Language: JavaScript
Size: 99.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# AWS Lambda Web Scrapper

This web scraper runs on AWS Lambda, using a headless Chromium browser driven by Puppeteer.

## Development Setup

- install Node.js 8.10, the version used by the Serverless service

- install yarn from https://yarnpkg.com

- install dependencies via `yarn install`

- install serverless globally via `yarn global add serverless`

## Run (Local)

- run the service via `yarn serve`

## Deploy (Production)

- deploy the Serverless service via `yarn deploy`

- note down the created API-key from the deployment output

## Usage (Production)

The actual endpoint URL is displayed in the deployment output. It looks like `https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape`.

To ensure that only authorized clients can call this API, every call requires an API key. Pass the key in the `x-api-key` HTTP request header.

An example call to the endpoint looks like this.

Content of `request.json`:

```
{
  "url": "https://www.sfit.services",
  "xpathes": [
    "//*/div/div/div/a",
    "//*/div/div/div"
  ],
  "uniqueResults": true
}
```

`curl -X POST -H "x-api-key: API-SECRET-KEY" -d @request.json https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape`
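The same call can be sketched in JavaScript. This is a minimal illustration, not part of the project: `buildScrapeRequest` and `scrape` are hypothetical helper names, and the sketch assumes a runtime with a global `fetch` (Node.js 18 or later, or a browser), which is newer than the Node.js 8.10 used for development. The response format is not documented here, so the body is returned as raw text.

```javascript
// Hypothetical helper (not part of this project): builds the arguments
// for a fetch() call that mirrors the curl command above.
function buildScrapeRequest(endpoint, apiKey, options) {
  return {
    url: endpoint,
    init: {
      method: 'POST',
      // The API key travels in the x-api-key header, as described above.
      // The curl command sets no explicit Content-Type, so none is added here.
      headers: { 'x-api-key': apiKey },
      // Same JSON structure as request.json.
      body: JSON.stringify(options),
    },
  };
}

// Hypothetical wrapper that sends the request.
// Assumes a global fetch() is available (Node.js 18+ or a browser).
async function scrape(endpoint, apiKey, options) {
  const { url, init } = buildScrapeRequest(endpoint, apiKey, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Scrape request failed with HTTP ${response.status}`);
  }
  // The response format is not documented, so return the raw body.
  return response.text();
}
```

Calling `scrape('https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape', 'API-SECRET-KEY', { url: 'https://www.sfit.services', xpathes: ['//*/div/div/div/a'], uniqueResults: true })` would send the same request as the curl command, with `GATEWAY-ID` and `API-SECRET-KEY` replaced by the values from the deployment output.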

### Request Options

The request options are passed as JSON data, which must adhere to the following structure.

- `url`: [string] The URL to scrape

- `xpathes`: [array of strings] A list of XPath expressions whose matching content should be retrieved

- `uniqueResults` (optional): [boolean] When set to `true`, only unique results are returned
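The structure above can be checked on the client side before sending a request. The following is a minimal sketch: `validateRequestOptions` is a hypothetical helper, not part of the service, and whether the service itself performs equivalent checks is not stated.

```javascript
// Hypothetical client-side check (not part of the service): returns a list
// of problems found in a request options object, or an empty list if the
// object matches the documented structure.
function validateRequestOptions(options) {
  if (options === null || typeof options !== 'object' || Array.isArray(options)) {
    return ['request options must be a JSON object'];
  }

  const errors = [];

  // `url` is required and must be a string.
  if (typeof options.url !== 'string') {
    errors.push('`url` must be a string');
  }

  // `xpathes` is required and must be an array of strings.
  if (!Array.isArray(options.xpathes) ||
      !options.xpathes.every((xpath) => typeof xpath === 'string')) {
    errors.push('`xpathes` must be an array of strings');
  }

  // `uniqueResults` is optional, but must be a boolean when present.
  if (options.uniqueResults !== undefined &&
      typeof options.uniqueResults !== 'boolean') {
    errors.push('`uniqueResults` must be a boolean');
  }

  return errors;
}
```

For the `request.json` shown earlier, this check returns an empty list. An object without `xpathes` yields a single error for that field.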
